MASC I

MASC I contains about 82K words of data. It is the most heavily annotated portion of the sub-corpus.

Validated or manually generated annotations for *all* MASC I texts include the following:

  1. Token

  2. Part of speech

  3. Sentence boundary

  4. Shallow parse (noun chunk, verb chunk)

  5. Named entities (person, location, organization, date)

  6. Penn Treebank syntax

In addition, MASC I contains over 10K words of full-text annotation for FrameNet frame elements, as well as WordNet sense annotations for every occurrence of the 100 words under study in the WordNet-FrameNet harmonization project.

MASC I includes 40K words of text annotated by the Unified Linguistic Annotation Project, which has generated or is generating annotations for PropBank, TimeML, and the Pittsburgh Opinion Annotation Project. These annotations are either available now or will be contributed and made available soon.

MASC I also contains about half of the 10K words in the Language Understanding (LU) Corpus, which was heavily annotated by several projects in the US. LU Corpus texts that were not originally written in English, or that carry licensing restrictions, were excluded. At present, only those LU annotations that were relatively easy to convert to MASC format are available.


The complete contents of MASC I, with information about text length, genre, etc., together with a list of the additional annotations available for portions of the data, are listed here.