The Open ANC includes over 14 million words from the Second Release that can be freely distributed. Please see the OANC license for more details.
The OANC includes the following data from the ANC Second Release:
Spoken
Name | Domain | No. files | No. words |
charlotte | face to face | 93 | 198,295 |
switchboard | telephone | 2,307 | 3,019,477 |
Spoken Totals | | 2,410 | 3,217,772 |
Written
Name | Domain | No. files | No. words |
911 report | government, technical | 17 | 281,093 |
berlitz | travel guides | 179 | 1,012,496 |
biomed | technical | 837 | 3,349,714 |
eggan | fiction | 1 | 61,746 |
icic | letters | 245 | 91,318 |
oup | non-fiction | 45 | 330,524 |
plos | technical | 252 | 409,280 |
slate | journal | 4,531 | 4,238,808 |
verbatim | journal | 32 | 582,384 |
web data | government | 285 | 1,048,792 |
Written Totals | | 6,424 | 11,406,155 |
Corpus Totals | | 8,832 | 14,623,927 |
The file organization and encoding conventions for the OANC are the same as in the ANC Second Release. Please consult the Second Release document encoding conventions for a full description.
The OANC data is distributed with the following annotations:
All annotations were originally produced automatically using our enhancements to GATE's ANNIE system. Some of the texts in the OANC include manually validated sentence boundaries (the list of texts validated for sentence boundaries is here). Note that the validated sentence boundaries are not included in the ANC Second Release.
In addition to the annotations distributed with the OANC, we distribute contributed annotations of the OANC, including BBN named entities and several different syntactic parses. Please consult the annotations page.
All ANC annotations are in stand-off format--that is, each annotation type is stored in a separate file and linked to the primary data, which is contained in a plain text (UTF-8) file. Annotations are represented as a graph of feature structures according to the specifications of the ISO Linguistic Annotation Format (LAF) (ISO 24612). Please download the LAF/GrAF standard specification; see also Ide and Suderman 2012, Ide and Suderman 2007, Ide and Romary 2007, and Ide and Suderman 2006.
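To illustrate the stand-off idea, here is a minimal Python sketch. It does not use the GrAF file format itself: the text, the annotation records, and the `covered_text` helper are all hypothetical, and the sketch assumes annotations refer to the primary text by character offsets (start inclusive, end exclusive) into the decoded UTF-8 string.

```python
# Hypothetical primary text; in practice this would be read from a UTF-8 text file.
primary_text = "The OANC is free. It has over 14 million words."

# Hypothetical stand-off annotations kept apart from the text. Each record
# points into the primary text by character offsets and carries its features.
sentences = [
    {"id": "s0", "start": 0, "end": 17, "features": {"type": "sentence"}},
    {"id": "s1", "start": 18, "end": 47, "features": {"type": "sentence"}},
]


def covered_text(text, annotation):
    """Return the span of the primary text that an annotation refers to."""
    return text[annotation["start"]:annotation["end"]]


spans = [covered_text(primary_text, s) for s in sentences]
for sentence, span in zip(sentences, spans):
    print(sentence["id"], repr(span))
```

Because the primary text is never modified, any number of annotation layers (sentences, tokens, named entities, parses) can point into the same file without interfering with one another.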
A version of all, or part, of the ANC data with annotations merged in-line can be generated using ANC2Go. Several output options are provided, including XML and non-XML formats that can be input to a variety of other software. In addition, GrAF annotations can be loaded into annotation tools such as GATE and UIMA; see the tools page for details.
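The general idea of merging stand-off annotations in-line can be sketched as follows. This is not how ANC2Go is implemented; it is an illustrative Python example under simplifying assumptions: annotations are hypothetical records with character offsets and a tag name, and spans do not overlap or nest.

```python
from xml.sax.saxutils import escape


def merge_inline(text, annotations):
    """Wrap each annotated span in an XML element.

    Assumes the spans do not overlap; text outside and inside the spans
    is XML-escaped so the result stays well-formed.
    """
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        pieces.append(escape(text[cursor:ann["start"]]))  # text before the span
        pieces.append(f'<{ann["tag"]}>')
        pieces.append(escape(text[ann["start"]:ann["end"]]))
        pieces.append(f'</{ann["tag"]}>')
        cursor = ann["end"]
    pieces.append(escape(text[cursor:]))  # trailing text after the last span
    return "".join(pieces)


# Hypothetical text and token annotations (start inclusive, end exclusive).
text = "Slate & Verbatim"
annotations = [
    {"start": 0, "end": 5, "tag": "tok"},
    {"start": 8, "end": 16, "tag": "tok"},
]
merged = merge_inline(text, annotations)
print(merged)
```

Overlapping or nested annotation layers need more care than this sketch provides, which is one reason a dedicated tool is useful for producing merged output.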
The OANC is a community resource that is freely available for download and use for research and development, including commercial development.
We ask that you provide us with any of the following that may have resulted from your use of the OANC, which we will make freely available to the user community on this website:
PREVIOUS VERSIONS
Download the Open ANC in the original XML format as a zip file. (326 MB)
Download the Open ANC in the original XML format as a self-installing jar file. (316 MB) See below for installation instructions.
The OANC will unpack to approximately 4.8 GB.
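Since the unpacked corpus is much larger than the download, it can be worth checking free disk space before extracting the zip file. The following Python sketch shows one way to do that with the standard library. The function names and the threshold constant are illustrative assumptions, and the 4.8 GB figure is treated only as a rough estimate.

```python
import shutil
import zipfile
from pathlib import Path

# Approximate unpacked size stated above, used as a rough threshold.
REQUIRED_BYTES = int(4.8 * 1024**3)


def has_room(dest_dir, required=REQUIRED_BYTES):
    """Return True if the filesystem holding dest_dir has enough free space."""
    return shutil.disk_usage(dest_dir).free >= required


def unpack(archive_path, dest_dir, required=REQUIRED_BYTES):
    """Extract a zip archive into dest_dir after a free-space check."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    if not has_room(dest, required):
        raise OSError(f"Not enough free space in {dest} to unpack the corpus")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    return dest
```

A call such as `unpack("OANC.zip", "corpora/oanc")` would then extract the archive; both the archive name and the destination directory here are placeholders for whatever paths you actually use.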
The Java installers are executable jar files that can be used to install the Open ANC and the ANC Tool. On most operating systems you should be able to double-click the .jar file. If that does not work, open a command prompt (Windows), shell (Linux), or terminal window (Mac OS X) and run the command:
java -jar OANC-installer.jar
Installation Notes
File dialog boxes in Java are implemented slightly differently on different platforms. For instance, the "Open File" dialog box in Mac OS X does not allow the user to create a directory from within the dialog. Therefore on Mac OS X, users must do one of the following: