Full MASC

The full Manually Annotated Sub-Corpus (MASC) consists of 504,299 words of written and spoken text, with roughly 25,000 words drawn from each of 20 genres. MASC is the first multi-genre corpus to carry a variety of linguistic annotations, and it is intended to support machine learning, genre-based analyses, and similar work.

The following table summarizes the MASC contents:

All 500K words of text are being manually annotated or validated for a variety of linguistic phenomena, including:

• Token
• Part of speech
• Sentence boundary
• Shallow parse (noun chunk, verb chunk)
• Named entities (person, location, organization, date)
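
To make the layered annotations listed above concrete, here is a minimal Python sketch of one way such layers could be held as character-offset spans over a text. Everything in it is an assumption made for illustration: the `Span` class, the layer names, the example sentence, and the labels (Penn Treebank-style part-of-speech tags are used only as a familiar example). It does not reproduce MASC's actual file format or tag sets.

```python
from dataclasses import dataclass

# Hypothetical example sentence; not taken from MASC.
TEXT = "Alice visited Paris."


@dataclass(frozen=True)
class Span:
    """One annotation: a labeled character range in TEXT (hypothetical schema)."""
    layer: str  # e.g. "sentence", "token", "pos", "chunk", "entity"
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # layer-specific value


# Each layer annotates the same text independently of the others.
ANNOTATIONS = [
    Span("sentence", 0, 20, "s"),
    Span("token", 0, 5, "tok"),
    Span("token", 6, 13, "tok"),
    Span("token", 14, 19, "tok"),
    Span("token", 19, 20, "tok"),
    Span("pos", 0, 5, "NNP"),
    Span("pos", 6, 13, "VBD"),
    Span("pos", 14, 19, "NNP"),
    Span("pos", 19, 20, "."),
    Span("chunk", 0, 5, "noun chunk"),
    Span("chunk", 6, 13, "verb chunk"),
    Span("chunk", 14, 19, "noun chunk"),
    Span("entity", 0, 5, "person"),
    Span("entity", 14, 19, "location"),
]


def layer_view(spans, layer, text=TEXT):
    """Return (covered text, label) pairs for one layer, in text order."""
    selected = sorted((s for s in spans if s.layer == layer),
                      key=lambda s: s.start)
    return [(text[s.start:s.end], s.label) for s in selected]


if __name__ == "__main__":
    for name in ("sentence", "token", "pos", "chunk", "entity"):
        print(name, layer_view(ANNOTATIONS, name))
```

Keeping each layer as separate spans over unmodified text means a layer can be added, validated, or ignored without touching the others, which suits a corpus where only some texts carry additional annotation types.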

Some MASC texts include additional types of annotations; see the MASC1 web page for a description.