Data Download | Open American National Corpus

MASC data and annotations can be obtained in two ways:

- Use the ANCTool to select portions of the corpus and annotations and receive a "customized" corpus containing only your selections, in one of the following output formats:
  - in-line XML (XCES), suitable for use with the BNC's XAIRA search and access interface and other XML-aware software
  - token / part of speech, a common input format for general-purpose concordance software
  - a format readable by the Natural Language Toolkit (NLTK) using a TaggedCorpusReader
  - CoNLL IOB format
- Download the data, alone or with all available annotations in the ANC format, below.
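
As a minimal sketch of the NLTK option above: assuming the customized corpus is a set of plain-text files whose tokens are written as word/TAG and separated by whitespace, NLTK's `TaggedCorpusReader` can load it as shown here. The sample text, the `.txt` file pattern, the `/` separator, and the helper name `read_tagged` are illustrative assumptions, not taken from the ANCTool documentation.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample in word/TAG form; the real ANCTool output may differ.
SAMPLE = "The/DT corpus/NN is/VBZ free/JJ ./.\nDownload/VB it/PRP ./.\n"


def read_tagged(root, pattern=r".*\.txt", sep="/"):
    """Return (word, tag) pairs from every matching file under root."""
    reader = TaggedCorpusReader(str(root), pattern, sep=sep)
    # Materialize the lazy corpus view so the files can be closed afterwards.
    return list(reader.tagged_words())


# Write the sample to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.txt").write_text(SAMPLE, encoding="utf-8")
    pairs = read_tagged(tmp)

print(pairs[:3])
```

To use this on a real customized corpus, point `read_tagged` at the directory holding the ANCTool output and adjust `pattern` and `sep` to match the files you received.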

You may also use the GATE Tools and UIMA Tools to read MASC data and annotations into these applications.

Quick Data Download

If you know what you are looking for, you can download it directly from the following list. Otherwise, continue reading below.

MASC data and annotations (v 3.0.0)	zip	tgz
MASC (500K) – data only	zip	tgz
MASC I (v. 1.0.3)	zip	tgz
MASC Sentence Corpus	zip	tgz
MASC Sentence Corpus (tab-separated format)	–	tgz
MASC AMT Word Sense Annotations	zip	tgz
Mini-MASC	zip	tgz
40K MASC1 data in CoNLL format	zip	tgz
PropBank annotations of 88K of MASC data, in original PropBank format (original Penn Treebank annotations included)	zip	tgz
Penn Treebank constituency annotation of entire MASC in original PTB bracket format	zip	tgz
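
Once one of the archives listed above has been downloaded, it can be unpacked with Python's standard library. The following is a minimal sketch, not part of the MASC distribution: the helper name `unpack_corpus` and the local paths are assumptions for illustration.

```python
import shutil
from pathlib import Path


def unpack_corpus(archive_path, dest_dir):
    """Unpack a downloaded .zip or .tgz archive and list its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # shutil chooses the unpacker from the file extension (.zip, .tgz, .tar.gz).
    shutil.unpack_archive(str(archive_path), str(dest))
    return sorted(p.name for p in dest.iterdir())
```

For example, `unpack_corpus("MASC-1.0.3.tgz", "masc")` would unpack that archive into a local `masc` directory, assuming the file has been saved in the current working directory.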

MASC 3.0.0

Over 500K words of written and spoken data, including 25K words from each of 19 genres, with all or part of the data annotated for 17 different annotation types. See this page for details.

DOWNLOAD DATA AND STANDOFF ANNOTATIONS

MASC data and annotations (v 3.0.0): zip | tgz

DOWNLOAD DATA ONLY (500K words UTF-8 textfiles)

MASC (500K) – data only: zip | tgz

MASC I

80K words of data with validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, and Penn Treebank syntax; and full-text FrameNet annotation for seventeen texts. This portion of the corpus contains 40K of texts annotated by the Unified Linguistic Annotation Project and about 5000 words of license-free English language data from the Language Understanding Corpus.

DOWNLOAD DATA AND STANDOFF ANNOTATIONS

Date	Version	Release notes	Download
2010-09-20	1.0.3	1.0.3_notes	MASC-1.0.3.zip | MASC-1.0.3.tgz
2010-07-23	1.0.2	1.0.2_notes	MASC-1.0.2.zip | MASC-1.0.2.tgz
2010-05-17	1.0.1	–	MASC1.zip | MASC1.tgz

MASC SENTENCE CORPUS

One thousand occurrences of each of 114 words chosen by the FrameNet-WordNet harmonization effort, manually annotated with WordNet 3.1 senses. For each word, the sentences containing 100 of these occurrences have also been annotated for FrameNet frame elements. The data and annotations are distributed as a separate corpus. See the MASC Sentence Corpus page for more information.

DOWNLOAD SENTENCE CORPUS WITH STANDOFF ANNOTATIONS, DOCUMENTATION, AND INTER-ANNOTATOR AGREEMENT DATA

masc_wordsense.zip | masc_wordsense.tgz

MINI-MASC

A selection of five thousand words of MASC1 data from diverse genres, intended to support small annotation tasks and small supplements to larger annotation tasks. The data include four written and four spoken files, each roughly 500 words in length. Mini-MASC was originally conceived at the Copenhagen Dependency Treebank Workshop in August 2010.

DOWNLOAD MINI-MASC

Mini-MASC.zip | Mini-MASC.tgz

MASC-CONLL

A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank, in CoNLL IOB format. This data set was used in the CoNLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.

DOWNLOAD MASC-CONLL

masc-conll.zip | masc-conll.tgz

MASC-PROPBANK-ORIG

An 88K subset of MASC data with PropBank annotations in their original format, together with the Penn Treebank annotations on which they rely. The PropBank data will be released in GrAF format so as to be compatible with other MASC annotations.

DOWNLOAD PROPBANK-ORIG

Propbank-original-format.zip | Propbank-original-format.tgz