Penn Treebank Syntax
Penn Treebank Syntax: syntax annotations for the entire 500K words of MASC in the original PTB (bracketed) format.
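For readers new to the bracketed format, the following is a minimal sketch of reading one PTB-style tree into nested tuples. It is not part of the MASC distribution: the function names `parse_ptb` and `leaves` and the example sentence are hypothetical, and real treebank files may contain traces, comments, or multiple trees per file that this sketch does not handle.

```python
def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Pad parentheses with spaces so a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some files wrap each tree in an extra pair with an empty label.
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical example sentence, not taken from MASC.
example = "(S (NP (DT The) (NN corpus)) (VP (VBZ is) (ADJP (JJ open))) (. .))"
tree = parse_ptb(example)
print(tree[0], leaves(tree))
```

For anything beyond a quick inspection, an established treebank reader is likely a better choice than a hand-written parser.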
CoInCo (“Concepts in Context”) is a lexical substitution corpus based on contiguous texts from MASC. It contains substitute words collected via crowdsourcing for every content word in selected (complete) text files.
ANC2Go is a web service that allows users to select the texts and annotations they want and obtain them in any of several different formats. ANC2Go is currently available for MASC data only; OANC data will be available soon.
MASC in LDC
The full 500,000-word MASC with annotations is now available for download. See the MASC project page for details.
The full 15-million-word OANC is now available in GrAF format. GrAF is the ISO standard serialization format for standoff annotations over linguistic data. GrAF annotations can be loaded into annotation tools such as GATE and UIMA, or transduced to other formats using ANC2Go. Please consult ISO 24612: Linguistic Annotation Framework for details about GrAF.
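The idea behind standoff annotation can be sketched as follows: the primary text is never modified, and each annotation layer is kept separately as records that point into the text by character offset. This is only an illustration of the general principle, not GrAF's actual XML schema; the `Annotation` class, the helper functions, the layer names, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # inclusive character offset into the primary text
    end: int     # exclusive character offset
    label: str   # e.g. a part-of-speech tag or chunk type
    layer: str   # which annotation layer the record belongs to


def covered_text(text, ann):
    """Return the span of primary text an annotation points to."""
    return text[ann.start:ann.end]


def merge_layers(*layers):
    """Combine independent layers, ordered by start offset, longest span first."""
    merged = [ann for layer in layers for ann in layer]
    return sorted(merged, key=lambda a: (a.start, -a.end))


# Hypothetical primary text and two independently produced layers.
text = "The parser reads the open corpus."
pos_layer = [Annotation(4, 10, "NN", "pos"), Annotation(11, 16, "VBZ", "pos")]
chunk_layer = [Annotation(0, 10, "NP", "chunk"), Annotation(17, 32, "NP", "chunk")]

for ann in merge_layers(pos_layer, chunk_layer):
    print(ann.layer, ann.label, repr(covered_text(text, ann)))
```

Because every layer refers to the same untouched text, layers produced by different tools can be combined without conflicting edits to the text itself, which is what makes merging annotations from different contributors practical.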
Inline named entity annotations produced by the BBN tagger are now available. A rendering of the annotations in GrAF to enable merging with other OANC annotations is forthcoming. Contributed by Sameer Pradhan.
Three syntactic parses of 11 million words of the OANC, produced with the Charniak & Johnson (2005) parser, MaltParser, and the LTH dependency converter, have been contributed by Rasul Kalajahi.