Penn Treebank Syntax

Syntax annotations for the entire 500K words of MASC in the original PTB (bracketed) format.
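For readers unfamiliar with the bracketed format, the following is a minimal sketch of how a single PTB-style tree could be read into a nested structure. It is illustrative only: the function names `parse_ptb` and `leaves` and the example sentence are assumptions made here, not part of the MASC distribution, and the sketch ignores details such as traces and escaped brackets.

```python
def parse_ptb(text):
    """Parse one PTB-style bracketed tree into nested (label, children) tuples.

    Hypothetical helper for illustration; not part of any MASC tooling.
    """
    # Pad the brackets with spaces so that a plain split() tokenizes the tree.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}, got {tokens[pos]!r}")
        pos += 1  # consume "("
        # PTB files often wrap each tree in an extra unlabeled bracket: "( (S ...))".
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Made-up example sentence, not taken from MASC.
example = "(S (NP (DT The) (NN corpus)) (VP (VBZ is) (ADJP (JJ free))))"
tree = parse_ptb(example)
words = leaves(tree)
```

Each node is a `(label, children)` pair, so the part-of-speech tag of a word is the label of the node directly above it.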


CoInCo (“Concepts in Context”) is a lexical substitution corpus based on contiguous texts from MASC. It contains substitute words collected via crowdsourcing for every content word in selected (complete) text files.


MASC-NEWS provides automatic annotations of MASC for named entities and word senses based on BabelNet 2.0.1.


ANC2Go is a web service that allows users to select the texts and annotations they want and obtain them in any of several different formats. ANC2Go is currently available for MASC data only; OANC data will be available soon.


MASC is available through the Linguistic Data Consortium. See the LDC catalogue entry for details.


The full 500,000-word MASC with annotations is now available for download. See the MASC project page for details.


With funding from an IBM UIMA Innovation Award, we have developed tools to enable import and export of OANC and MASC annotations in GrAF format in UIMA.

OANC in GrAF format

The full 15-million-word OANC is now available in GrAF format. GrAF is the ISO standard serialization format for standoff annotations over linguistic data. GrAF annotations can be loaded into annotation tools such as GATE and UIMA and/or transduced to other formats using ANC2Go. Please consult ISO 24612: Linguistic Annotation Framework for details about GrAF.
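The idea behind standoff annotation can be sketched as follows: the primary text is left untouched, and each annotation points into it by character offsets, so several annotation layers can coexist and be merged. This is a simplified illustration only, not the GrAF XML schema or any ANC tool; the `Annotation` class, the `spans` function, the layer names, and the labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One standoff annotation: a labeled character span in a named layer.

    Hypothetical structure for illustration; real GrAF uses a graph of
    nodes, edges, and regions serialized as XML.
    """
    start: int   # offset of the first character (inclusive)
    end: int     # offset past the last character (exclusive)
    label: str
    layer: str


def spans(text, annotations, layer):
    """Resolve one layer's annotations against the primary text."""
    selected = [a for a in annotations if a.layer == layer]
    selected.sort(key=lambda a: a.start)
    return [(text[a.start:a.end], a.label) for a in selected]


# The primary text is never modified; annotations live beside it.
primary = "MASC is free."

# Made-up annotations in two independent layers over the same text.
annotations = [
    Annotation(0, 4, "NNP", "pos"),
    Annotation(5, 7, "VBZ", "pos"),
    Annotation(8, 12, "JJ", "pos"),
    Annotation(0, 4, "ORG", "ne"),
]

pos_layer = spans(primary, annotations, "pos")
ne_layer = spans(primary, annotations, "ne")
```

Because the layers share only offsets into the same text, a named-entity layer from one tool can be combined with a part-of-speech layer from another without either needing to know about the other.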

BBN Named Entity annotations of the OANC

Inline named entity annotations produced by the BBN tagger are now available. A rendering of the annotations in GrAF, to enable merging with other OANC annotations, is forthcoming. Contributed by Sameer Pradhan.

Syntactic parses of 11 million words of OANC data

Three syntactic parses of 11 million words of the OANC, produced with the Charniak & Johnson (2005) parser, MaltParser, and the LTH dependency converter, have been contributed by Rasul Kalajahi.