Data

The ANC Project distributes the following:

Corpora and annotations

  • The Open American National Corpus (OANC), consisting of approximately 15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities. Portions of the corpus are automatically annotated for additional phenomena.
  • The Manually Annotated Sub-Corpus (MASC), 500,000 words of American English mostly drawn from the OANC, with manually produced or hand-validated annotations for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), named entities, Penn Treebank syntax, and co-reference. Portions are also annotated for FrameNet frame elements, PropBank predicate arguments, MPQA Opinion, ISO-TimeML, and several other phenomena.
  • The ANC Second Release, a superset of the OANC including an additional 800,000 words of licensed data.

MASC Sentence Corpus

  • A corpus consisting of approximately 110,000 sentences drawn from MASC and the OANC that have been manually annotated with WordNet 3.1 senses for 114 words, together with detailed inter-annotator agreement statistics.

Derived Data