MASC data and annotations can be obtained in two ways:
- Use the ANCTool to select portions of the corpus and annotations and receive a “customized” corpus including only your selections in one of the following output formats:
  - in-line XML (XCES), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software
  - token / part of speech, a common input format for general-purpose concordance software
  - a format readable by the Natural Language Toolkit (NLTK) using a TaggedCorpusReader
  - CONLL IOB format
- Download the data, alone or with all available annotations in the ANC format, below.
You may also use the GATE Tools and UIMA Tools to read MASC data and annotations into these applications.
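As a rough illustration of the token / part of speech output listed above, the sketch below splits a line of `token/TAG` items into pairs using only the Python standard library. The `/` separator and the whitespace-delimited layout are assumptions here (they match the default conventions of NLTK's TaggedCorpusReader, but the exact layout the ANCTool produces should be checked against its output), and the helper name `parse_tagged_line` is hypothetical.

```python
def parse_tagged_line(line, sep="/"):
    """Split a whitespace-separated line of token/TAG items into (token, tag) pairs.

    Assumption: each item is written as token<sep>TAG. The split is made on the
    LAST separator so that tokens containing the separator (e.g. "1/2") survive.
    """
    pairs = []
    for item in line.split():
        head, found, tail = item.rpartition(sep)
        if found:
            pairs.append((head, tail))
        else:
            # No separator present: keep the token and leave the tag unknown.
            pairs.append((tail, None))
    return pairs


if __name__ == "__main__":
    for token, tag in parse_tagged_line("The/DT corpus/NN is/VBZ free/JJ ./."):
        print(token, tag)
```

Splitting on the last separator rather than the first is a deliberate choice: part-of-speech tags rarely contain `/`, while tokens such as fractions or dates sometimes do.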
Quick Data Download
If you know what you are looking for, you can download it directly from the following list. Otherwise, continue reading below.
MASC data and annotations (v 3.0.0) | | |
MASC (500K) – data only | | |
MASC I (v. 1.0.3) | zip | tgz |
MASC Sentence Corpus | | |
MASC Sentence Corpus (tab-separated format) | – | |
MASC AMT Word Sense Annotations | zip | |
Mini-MASC | | |
40K MASC1 data in CONLL format | | |
Propbank annotations of 88K of MASC data, in original PB format (original Penn Treebank annotations included) | | |
Penn Treebank constituency annotation of entire MASC in original PTB bracket format | | |
MASC 3.0.0
Over 500K words of written and spoken data, including 25K words from each of 19 genres, with all or part of the data annotated for 17 different annotation types. See this page for details.
DOWNLOAD DATA AND STANDOFF ANNOTATIONS
MASC data and annotations (v 3.0.0)
DOWNLOAD DATA ONLY (500K words UTF-8 textfiles)
MASC (500K) – data only
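The downloads in this section pair the primary text with standoff annotations, that is, annotations stored apart from the text that point back into it. The sketch below illustrates the general idea only, using character offsets into a plain string. It is not the ANC or GrAF file format, and the `Span` type, the `resolve` helper, the sample sentence, and the labels are all hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical standoff annotation: a label attached to a character range."""
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    label: str


def resolve(text, spans):
    """Look up the text each annotation points at, leaving the text itself untouched."""
    return [(span.label, text[span.start:span.end]) for span in spans]


# The primary data is never modified; annotations live in a separate structure.
text = "MASC includes written and spoken data."
spans = [Span(0, 4, "NNP"), Span(5, 13, "VBZ"), Span(26, 32, "JJ")]

if __name__ == "__main__":
    for label, covered in resolve(text, spans):
        print(label, covered)
```

Because the text stays unchanged, several independent annotation layers can refer to the same offsets without interfering with one another, which is why a UTF-8 data-only download is useful on its own.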
MASC I
80K words of data with validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, and Penn Treebank syntax, plus full-text FrameNet annotation for seventeen texts. This portion of the corpus contains 40K words of text annotated by the Unified Linguistic Annotation Project and about 5000 words of license-free English language data from the Language Understanding Corpus.
DOWNLOAD DATA AND STANDOFF ANNOTATIONS
Date | Version | Release notes | Download
2010-09-20 | 1.0.3 | 1.0.3_notes | MASC-1.0.3.zip, MASC-1.0.3.tgz
2010-07-23 | 1.0.2 | 1.0.2_notes | MASC-1.0.2.zip, MASC-1.0.2.tgz
2010-05-17 | 1.0.1 | | MASC1.zip, MASC1.tgz
MASC SENTENCE CORPUS
One thousand occurrences of each of 114 words chosen by the FrameNet-WordNet harmonization effort, manually annotated with WordNet 3.1 senses. For each word, the sentences containing 100 of its occurrences have also been annotated for FrameNet frame elements. The data and annotations are distributed as a separate corpus. See the MASC Sentence Corpus page for more information.
DOWNLOAD SENTENCE CORPUS WITH STANDOFF ANNOTATIONS, DOCUMENTATION, AND INTER-ANNOTATOR AGREEMENT DATA
masc_wordsense.zip | masc_wordsense.tgz
MINI-MASC
A selection of five thousand words of MASC1 data from diverse genres, intended to support small annotation tasks and small supplements to larger annotation tasks. The data include four written and four spoken files, each roughly 500 words in length. Mini-MASC was originally conceived at the Copenhagen Dependency Treebank Workshop in August 2010.
DOWNLOAD Mini-MASC
MASC-CONLL
A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format. This data set was used in the CONLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.
DOWNLOAD MASC-CONLL
masc-conll.zip | masc-conll.tgz
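For readers unfamiliar with IOB tagging, the sketch below groups tokens into chunks from `B-`, `I-`, and `O` tags, which is the general convention behind the name. It makes no claim about the column layout of the MASC-CONLL files, which should be taken from the distribution itself. The helper name `iob_to_chunks` and the sample sentence are hypothetical.

```python
def iob_to_chunks(tagged):
    """Group (token, iob_tag) pairs into (chunk_type, [tokens]) spans.

    B-X opens a chunk of type X, I-X continues it, and O marks a token
    outside any chunk. An I-X tag with no matching open chunk is treated
    leniently as the start of a new chunk.
    """
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in tagged:
        prefix, _, chunk_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if prefix == "O" or starts_new:
            # Close whatever chunk was open before this token.
            if current_tokens:
                chunks.append((current_type, current_tokens))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = chunk_type
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, current_tokens))
    return chunks


if __name__ == "__main__":
    sample = [("The", "B-NP"), ("open", "I-NP"), ("corpus", "I-NP"),
              ("contains", "B-VP"), ("spoken", "B-NP"), ("data", "I-NP"),
              (".", "O")]
    print(iob_to_chunks(sample))
```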
MASC-PROPBANK-ORIG
An 88K subset of MASC data with PropBank annotations in their original format, together with the Penn Treebank annotations on which they rely. The PropBank data will also be released in GrAF format for compatibility with other MASC annotations.
DOWNLOAD PROPBANK-ORIG