Annotations of the ANC data contributed by members of the community are made available in the format in which they were contributed. They are also available, or will be made available, in a GrAF format compatible with the OANC.
Contents
- BBN Named Entities (inline format)
- Syntactic parses (various formats)
- Slate coreference (anaphora) annotations
- CLAWS part of speech tags
Annotations
BBN Named Entity Annotation
The entire OANC has been automatically annotated for named entities using the BBN NE Tagger, contributed by Sameer Pradhan. The download contains the OANC texts with inline entity annotations. In the near future we will provide a standoff version in GrAF format so that these annotations can be merged with other OANC annotations. A document describing the BBN entity types is also included.
DOWNLOAD OANC with inline BBN NE annotations: tgz (40.5 MB) | zip (50.1 MB)
Syntactic parses
Over 11 million words of the OANC have been parsed automatically using the Charniak constituency-based parser (Charniak & Johnson, 2005), the LTH dependency converter (Johansson & Nugues, 2007), and MaltParser (Nivre et al., 2007). The output of the Charniak parser is in the inline Penn Treebank format, and the dependency parses are in CoNLL format. In the near future these annotations will be transduced to the standoff GrAF format for compatibility with the other OANC annotations. The annotations were contributed by Rasul Kalajahi. An overview of the annotations is here.
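As a rough illustration of how the dependency output might be read, the sketch below parses sentences from CoNLL-style lines. It assumes the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; the files in the download may use a different column layout, so consult the overview of the annotations first. The function name and the sample sentence are hypothetical and are not taken from the OANC.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X layout (an assumption about the download).
CONLL_X_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        sentence.append(dict(zip(CONLL_X_FIELDS, values)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Hypothetical two-token sentence used only to exercise the reader.
sample = [
    "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n",
    "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n",
    "\n",
]
sentences = list(read_conll(sample))
for tok in sentences[0]:
    print(tok["id"], tok["form"], "->", tok["head"], tok["deprel"])
```

To read a real file, pass an open file object to read_conll in place of the sample list.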
DOWNLOAD THE PARSES: tgz (170.1 MB) | zip (170.5 MB)
Slate Coreference
Shane Bergsma of the University of Alberta has annotated a subset of the Slate data for coreference (anaphora). The annotations consist of pronoun-antecedent pairs in 118 documents (128717 words) from the Slate data of the ANC/OANC. The data include a training set and a test set: there are 1398 labelled pronouns in 78 documents in the training set and 1381 labelled pronouns in 40 documents in the test set. Most of the Slate documents are “gist” articles, which provide factual background information for stories currently in the news. Only pronouns that refer to noun phrases given previously in the text are annotated. Pronouns that refer to implicit entities not specifically mentioned are labelled and ignored; this includes cataphora (e.g., “After he was elected, President Clinton…”) and pleonastic pronouns without an antecedent (e.g., “it is raining”). Of the 2779 total pronouns labelled, 219 are so identified.
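The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the description; the variable names are illustrative.

```python
# Figures quoted in the description of the Slate coreference data.
train_docs, train_pronouns = 78, 1398
test_docs, test_pronouns = 40, 1381
excluded = 219  # labelled pronouns with no preceding antecedent

total_docs = train_docs + test_docs
total_pronouns = train_pronouns + test_pronouns
with_antecedent = total_pronouns - excluded

print(total_docs)        # 118
print(total_pronouns)    # 2779
print(with_antecedent)   # 2560
print(f"{excluded / total_pronouns:.1%}")  # 7.9%
```

Roughly 8% of the labelled pronouns therefore carry no antecedent annotation.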
The coreference annotations are provided in standoff format. At present this is the standoff XCES format used for the ANC First and Second releases and for the early version of the OANC. They are packaged as a separate corpus, which includes the 118 OANC texts, all annotations from the OANC corpus (in XCES format), and the coreference annotations. If you have the ANC or OANC on your machine, do not install the coreference corpus into the ANC home directory. Installation creates three folders: test (containing the test set), training (containing the training set), and Uninstaller (a program to uninstall the download).
DOWNLOAD THE SLATE COREFERENCE INSTALLER
Installation
The coreference annotations are packaged in an executable jar file. To install the annotations, run the jar file by double-clicking on it, or open a command prompt (Windows) or a shell (Unix/Linux/MacOSX) and run the command:
java -jar Slate-coref-install.jar
Once the annotations have been installed you may want to process the files with the ANC Tool.
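One way to confirm that the installation completed is to check for the three folders named in the description (test, training, and Uninstaller). The following is a minimal sketch; the install location "slate-coref" is a hypothetical placeholder for whatever directory you chose, and the function name is illustrative.

```python
from pathlib import Path
from typing import List

# Folder names taken from the description of the installed corpus.
EXPECTED_FOLDERS = ("test", "training", "Uninstaller")


def missing_folders(install_dir: Path) -> List[str]:
    """Return the expected folder names that are absent under install_dir."""
    return [name for name in EXPECTED_FOLDERS
            if not (install_dir / name).is_dir()]


# Hypothetical install location; replace with the directory you selected.
target = Path("slate-coref")
missing = missing_folders(target)
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All expected folders are present.")
```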
CLAWS
Please Note: The CLAWS part of speech annotations in the ANC Second Release may not be usable with the Open ANC, since some OANC texts have been modified as a result of manual validation. Therefore, the CLAWS stand-off annotations may contain invalid offsets. These annotations can be used with the ANC Second Release available through the Linguistic Data Consortium. Versions of the CLAWS annotations may be made available for the Open ANC in the future.
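A basic sanity check for stand-off offsets can be sketched as follows. It assumes the annotations have already been reduced to (start, end) character offsets into the text; how those offsets are read from the annotation files is not shown, and the function name and sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def invalid_spans(text: str, spans: Iterable[Span]) -> List[Span]:
    """Return spans that cannot address a region of text.

    A span is treated as valid when 0 <= start <= end <= len(text).
    """
    limit = len(text)
    return [(s, e) for s, e in spans if not (0 <= s <= e <= limit)]


# Hypothetical text and offsets used only to exercise the check.
text = "The cat sat."
spans = [(0, 3), (4, 7), (8, 20), (5, 2)]
print(invalid_spans(text, spans))  # [(8, 20), (5, 2)]
```

This check only catches offsets that fall outside the text or are reversed. Offsets that are in range but shifted by an edit to the text will pass it, so spot-checking the covered strings against the expected tokens is still advisable.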
The written portion of the ANC has been tagged for part of speech by the University of Lancaster using the C5 tagset (the tagset used in the BNC) and the C7 tagset. The two sets of annotations have been packaged separately so that users can install portions of each tagset; for example, it is possible to install the C5 tags for the Slate corpus and the C7 tags for the New York Times corpus.
Each set of annotations can be installed on your system using either of two installers:
- Web installer: downloads the annotations from the ANC web site at installation time. Use the web installer if you plan on installing a small subset of the available annotations, to avoid downloading the entire ANC.
- Standalone installer: includes annotations for the entire ANC in one large file. This installer can be used without internet access.
C5 Annotations
C7 Annotations
Installation
The installation process is the same regardless of the installer you use. Each installer is an executable jar file. On most systems either installer can be run simply by double-clicking on the installer’s jar file. If that does not work, open a command prompt (Windows) or a shell (Unix/Linux/MacOSX) and run the command:
java -jar installer.jar
where installer.jar is the name of the installer you downloaded. For example, to run the web installer for the C7 annotations, the command would be:
java -jar C7-web.jar
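If double-clicking fails, it can help to confirm that a Java runtime is on the PATH and that the jar is where you expect before running the command. The sketch below does only that and prints the command to run; it assumes the jar sits in the current directory, and the helper name is illustrative.

```python
import shutil
from pathlib import Path
from typing import List


def installer_command(jar: Path) -> List[str]:
    """Build the `java -jar` argument list for an installer jar."""
    return ["java", "-jar", str(jar)]


# File name from the example above; assumed to be in the current directory.
jar = Path("C7-web.jar")
if shutil.which("java") is None:
    print("No java executable found on PATH.")
elif not jar.is_file():
    print(f"{jar} not found in the current directory.")
else:
    print("Run:", " ".join(installer_command(jar)))
```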
Installation Notes
1. If you use the web installer, please note that the installer displays messages indicating that it is “connecting to the internet” while it is downloading the various packages. For this reason it is recommended that you use the stand-alone installer unless only a small subset of the annotations will be installed.
2. When you select the $ANC_HOME directory, the installer will warn you that the directory already exists and ask if you are sure you want to overwrite its contents. Select “Yes”.
3. The installers assume that the ANC directory structure has been preserved as it is on the DVD distributed by the LDC. The expected directory structure is shown below.
\---data
    +---spoken
    |   +---academic-discourse
    |   |   \---micase
    |   +---face-to-face
    |   |   \---charlotte
    |   \---telephone
    |       +---callhome
    |       \---switchboard
    +---written_1
    |   +---fiction
    |   |   +---eggan
    |   |   \---hargrave
    |   +---journal
    |   |   +---slate
    |   |   \---verbatim
    |   +---leisure
    |   |   \---blog
    |   \---letters
    |       \---icic
    \---written_2
        +---newspapers
        |   \---nytimes
        +---non-fiction
        |   \---OUP
        +---technical
        |   +---911report
        |   +---biomed
        |   +---government
        |   \---plos
        \---travel_guides
            +---berlitz1
            \---berlitz2
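Before running an installer, the expected layout can be checked with a short script. This is a sketch under two assumptions: that the ANC home directory is available in an environment variable named ANC_HOME (falling back to the current directory otherwise), and that checking the leaf directories of the tree above is sufficient. The function name is illustrative.

```python
import os
from pathlib import Path
from typing import List

# Leaf directories of the expected tree, relative to the ANC home directory.
EXPECTED_DIRS = [
    "data/spoken/academic-discourse/micase",
    "data/spoken/face-to-face/charlotte",
    "data/spoken/telephone/callhome",
    "data/spoken/telephone/switchboard",
    "data/written_1/fiction/eggan",
    "data/written_1/fiction/hargrave",
    "data/written_1/journal/slate",
    "data/written_1/journal/verbatim",
    "data/written_1/leisure/blog",
    "data/written_1/letters/icic",
    "data/written_2/newspapers/nytimes",
    "data/written_2/non-fiction/OUP",
    "data/written_2/technical/911report",
    "data/written_2/technical/biomed",
    "data/written_2/technical/government",
    "data/written_2/technical/plos",
    "data/written_2/travel_guides/berlitz1",
    "data/written_2/travel_guides/berlitz2",
]


def missing_dirs(anc_home: Path) -> List[str]:
    """Return the expected directories that do not exist under anc_home."""
    return [d for d in EXPECTED_DIRS if not (anc_home / d).is_dir()]


# ANC_HOME as an environment variable is an assumption of this sketch.
anc_home = Path(os.environ.get("ANC_HOME", "."))
for d in missing_dirs(anc_home):
    print("missing:", d)
```

An empty result suggests the directory structure matches what the installers expect.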