MASC | Open American National Corpus

MASC: An Open Language Data Community Resource

MASC is an open language data resource that can be downloaded by anyone for any purpose, under the conditions of the Creative Commons Attribution 3.0 United States License.

NEW MASC ANNOTATIONS

The annotations below are currently downloadable separately. They will be included in GrAF format in the next release of MASC.

Penn Treebank Syntax: syntax annotations for the entire 500K words of MASC in the original PTB (bracketed) format.

MASC-NEWS: automatic annotation of MASC for named entities and word senses, based on BabelNet.

CoInCo: the “Concepts in Context” lexical substitution corpus, based on contiguous texts from MASC. It contains substitute words, collected via crowdsourcing, for every content word in selected (complete) text files.

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary written and spoken American English drawn from the Open American National Corpus (OANC).

All of MASC includes manually validated annotations for sentence boundaries; tokens, lemmas, and part-of-speech (POS) tags; noun and verb chunks; named entities (person, location, organization, date); Penn Treebank syntax; coreference; and discourse structure. The MASC project has also produced or validated additional manual annotations for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a corpus of more than 100K sentences with WordNet 3.1 sense tags, one-tenth of which are also annotated for FrameNet frame elements. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects, including PropBank, TimeBank, Pittsburgh opinion, and several others.

Unlike most freely available corpora that include a wide variety of linguistic annotations, MASC contains a balanced selection of texts from a broad range of genres.

MASC is a COLLABORATIVE COMMUNITY RESOURCE that will ultimately be sustained by community contributions of annotations and derived data.

WE SOLICIT ANNOTATIONS OF ANY PORTION OF MASC DATA FOR LINGUISTIC PHENOMENA OF ANY TYPE, IN ANY FORMAT. WE ALSO SOLICIT CONTRIBUTIONS OF DERIVED DATA SUCH AS FREQUENCY LISTS, NGRAM DATA, ETC.


MASC development is supported by the US National Science Foundation.