Full MASC
The full Manually Annotated Sub-Corpus (MASC) consists of 504,299 words of written and spoken text. Roughly 25,000 words are drawn from each of 20 genres. MASC is the first multi-genre corpus to carry a variety of linguistic annotations, and is intended to support machine learning, genre-based analyses, and similar work.
The following table summarizes the MASC contents:
All 500K words of text are being manually annotated or validated for a variety of linguistic phenomena, including:
• Token
• Part of speech
• Sentence boundary
• Shallow parse (noun chunk, verb chunk)
• Named entities (person, location, organization, date)
Some MASC texts include additional types of annotations; see the MASC1 web page for a description.