Full MASC

 

The full Manually Annotated Sub-Corpus (MASC) consists of 504,299 words of written and spoken text, with roughly 25,000 words drawn from each of 20 different genres. MASC is the first multi-genre corpus to carry a variety of linguistic annotations, and it is intended to support machine learning, genre-based analyses, and similar work.


The following table summarizes the MASC contents:

All 500K words of text are being manually annotated or validated for a variety of linguistic phenomena, including:

  1. Token

  2. Part of speech

  3. Sentence boundary

  4. Shallow parse (noun chunk, verb chunk)

  5. Named entities (person, location, organization, date)
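
To make the five annotation types above concrete, here is a minimal Python sketch of how such layers could be held as labelled character spans over one sentence. It is an illustration only: the `Span` class, the `layer` helper, the example sentence, and the tag values are all assumptions made for this sketch, not MASC's actual file format or tag inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation record: a labelled character range."""
    layer: str  # "token", "pos", "sentence", "chunk", or "entity"
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # layer-specific value (illustrative only)


text = "Ada Lovelace visited London."

# (start, end, part-of-speech tag) for each token; tags are illustrative.
token_info = [(0, 3, "NNP"), (4, 12, "NNP"), (13, 20, "VBD"),
              (21, 27, "NNP"), (27, 28, ".")]

annotations = []
for start, end, tag in token_info:
    annotations.append(Span("token", start, end, "tok"))  # 1. token
    annotations.append(Span("pos", start, end, tag))      # 2. part of speech

annotations += [
    Span("sentence", 0, 28, "s"),           # 3. sentence boundary
    Span("chunk", 0, 12, "noun chunk"),     # 4. shallow parse
    Span("chunk", 13, 20, "verb chunk"),
    Span("chunk", 21, 27, "noun chunk"),
    Span("entity", 0, 12, "person"),        # 5. named entities
    Span("entity", 21, 27, "location"),
]


def layer(name):
    """Return (covered text, label) pairs for one annotation layer."""
    return [(text[s.start:s.end], s.label)
            for s in annotations if s.layer == name]


for name in ("token", "pos", "sentence", "chunk", "entity"):
    print(name, layer(name))
```

Keeping each layer as separate spans over the same text means a chunk or entity can cover several tokens without changing the text itself, which is one common way to store several kinds of annotation side by side.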

Some MASC texts include additional types of annotations; see the MASC1 web page for a description.