The OANC data is distributed with the following annotations:
- Structural markup (sections, chapters, etc.) down to the level of paragraph
- Sentence boundaries
- Words (tokens) with lemmas and part-of-speech annotations using the Penn tagset
- Noun chunks
- Verb chunks
- Named Entities (Person, Location, Organization, Date)
All annotations were originally produced automatically using our enhanced versions of GATE's ANNIE system. Some of the texts in the OANC include manually validated sentence boundaries (the list of texts validated for sentence boundaries is here). Note that the validated sentence boundaries are not included in the ANC Second Release.
In addition to the annotations distributed with the OANC, we distribute contributed annotations of the OANC, including BBN named entities and several different syntactic parses. Please consult the contributed annotations page.
All ANC annotations are in stand-off format: that is, each annotation type is stored in a separate file and linked to the primary data, which is contained in a plain text (UTF-8) file. Annotations are represented as a graph of feature structures according to the specifications of the ISO Linguistic Annotation Format (LAF) (ISO 24612). Please download the LAF/GrAF standard specification; see also Ide and Suderman 2012, Ide and Suderman 2007, Ide and Romary 2007, and Ide and Suderman 2006.
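The stand-off idea can be illustrated with a minimal Python sketch. It does not read the actual GrAF files; the `Annotation` class, the `resolve` helper, the sample sentence, and the use of character offsets as anchors are all assumptions made for illustration. The point is only that the primary text stays untouched while each annotation layer lives in its own structure and points back into the text.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: a span plus a feature structure."""
    start: int   # offset of the first character (assumed anchoring scheme)
    end: int     # offset one past the last character
    label: str
    features: dict = field(default_factory=dict)


def resolve(text: str, ann: Annotation) -> str:
    """Return the stretch of primary text that an annotation points to."""
    return text[ann.start:ann.end]


# Primary data: plain text, never modified by annotation.
primary_text = "Mary visited Boston."

# Each annotation type is kept in a separate layer (a separate file in the OANC).
tokens = [
    Annotation(0, 4, "tok", {"pos": "NNP", "lemma": "Mary"}),
    Annotation(5, 12, "tok", {"pos": "VBD", "lemma": "visit"}),
    Annotation(13, 19, "tok", {"pos": "NNP", "lemma": "Boston"}),
    Annotation(19, 20, "tok", {"pos": ".", "lemma": "."}),
]
entities = [
    Annotation(0, 4, "Person"),
    Annotation(13, 19, "Location"),
]

for ann in tokens + entities:
    print(ann.label, repr(resolve(primary_text, ann)), ann.features)
```

Because the layers are independent, a new annotation type (for example, a contributed parse) can be added without touching the primary text or the existing layers.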
A version of all or part of the ANC data, with annotations merged in-line, can be generated using ANC2Go. Several output options are provided, including XML and non-XML formats that can be input to a variety of other software. In addition, GrAF annotations can be loaded into annotation tools such as GATE and UIMA; see the tools page for details.
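As a rough picture of what merging stand-off annotations in-line means, the following Python sketch wraps annotated spans in XML-style tags. It is not ANC2Go and does not reproduce its output formats; the `merge_inline` function, the `(start, end, tag)` span tuples, and the sample sentence are hypothetical, and the sketch handles only non-overlapping spans.

```python
from xml.sax.saxutils import escape


def merge_inline(text, spans):
    """Wrap each (start, end, tag) span of text in <tag>...</tag>.

    Illustrative only: spans must not overlap, and offsets are assumed
    to be character offsets into the primary text.
    """
    out = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        out.append(escape(text[cursor:start]))                    # untagged text
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")   # tagged span
        cursor = end
    out.append(escape(text[cursor:]))                             # trailing text
    return "".join(out)


text = "Mary visited Boston."
spans = [(13, 19, "Location"), (0, 4, "Person")]
inline = merge_inline(text, spans)
print(inline)  # <Person>Mary</Person> visited <Location>Boston</Location>.
```

Overlapping or nested annotations (for example, a named entity inside a noun chunk) need a real merging strategy, which is one reason to keep the annotations stand-off and generate in-line versions on demand.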