MASC Downloads

 

MASC is a community resource that is freely available for download and use. In return, we ask that you send us any of the following that result from your use of MASC data or annotations, which we will make freely available to the user community:

  1. errors or problems

  2. corrections/validations of any part of MASC or the OANC, both text and annotations

  3. additional annotations in any format

  4. derived resources, including word lists, frequency lists, n-grams, extracted entities or other knowledge, statistics of any kind, etc.

Please send reports or comments to anc@anc.org.

DATA DOWNLOAD


MASC data and annotations can be obtained in two ways:

  1. Use ANC2Go to select portions of the corpus and annotations and receive a “customized” corpus including only your selections, in one of the following output formats:

     a. in-line XML (XCES), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software

     b. token / part of speech, a common input format for general-purpose concordance software such as MonoConc, as well as the Natural Language Toolkit (NLTK)

     c. CONLL IOB format

  2. Download the data, alone or with all available annotations in the ANC format, below.
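
For the token / part of speech output, a reader along the following lines may be enough for quick experiments. This is a minimal sketch that assumes each token is written as `word/TAG` and that tokens are separated by whitespace, which is the layout NLTK's tagged-text tools conventionally read. The exact layout that ANC2Go emits is not described on this page, so check a sample file first. The function name `parse_tagged_line` is hypothetical.

```python
def parse_tagged_line(line, sep="/"):
    """Split a whitespace-separated line of word/TAG items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST separator so tokens that contain "/" (e.g. "1/2") survive.
        word, found, tag = item.rpartition(sep)
        if not found:
            # No separator at all: keep the token and leave the tag unknown.
            word, tag = item, None
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # Illustrative input only; not taken from MASC.
    sample = "The/DT price/NN rose/VBD 1/2/CD point/NN ./."
    for word, tag in parse_tagged_line(sample):
        print(word, tag)
```

Splitting on the last separator rather than the first is a deliberate choice: it keeps tokens such as fractions or dates intact when they themselves contain a slash.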

The “core” MASC corpus is divided into three sets:


MASC I                            *** Data and annotations available ***

80K words of data with validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, and Penn Treebank syntax; and full-text FrameNet annotation for seventeen texts. This portion of the corpus contains 40K words of text annotated by the Unified Linguistic Annotation Project and about 5000 words of license-free English language data from the Language Understanding Corpus.


DOWNLOAD DATA ONLY (82K words UTF-8 textfiles)

masc1_data-only.zip    |    masc1_data-only.tgz


DOWNLOAD DATA AND STANDOFF ANNOTATIONS

Date          Version    Release notes    Download

2010-09-20    1.0.3      1.0.3_notes      MASC-1.0.3.zip  |  MASC-1.0.3.tgz

2010-07-23    1.0.2      1.0.2_notes      MASC-1.0.2.zip  |  MASC-1.0.2.tgz

2010-05-17    1.0.1                       MASC1.zip       |  MASC1.tgz

MASC II                              *** Data available ***

120K words of additional data from a range of genres. Annotations produced within the MASC project (token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, FrameNet, plus WordNet sense annotations) will be released in fall, 2010.


DOWNLOAD DATA ONLY (140K words UTF-8 textfiles)

masc2_data-only.zip    |   masc2_data-only.tgz


MASC III                             *** Data available ***

280K words of additional data, filling out the 500K sub-corpus and rounding out the genre distribution. Annotations produced within the MASC project (token, part of speech, sentence boundary, noun chunks, verb chunks, named entities, FrameNet, plus WordNet sense annotations) will be released in early 2012.


DOWNLOAD DATA ONLY (280K words UTF-8 textfiles)

masc3_data-only.zip    |   masc3_data-only.tgz



FULL 500K MASC                             *** Data available ***

Over 500K words of written and spoken data, including 25K words from each of 20 genres. See the Full MASC webpage for details.


DOWNLOAD DATA ONLY (500K words UTF-8 textfiles)

masc_500k_texts.zip    |   masc_500k_texts.tgz
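
Once a data-only archive has been downloaded, a quick sanity check is to list its text files and count their words. The sketch below uses only the Python standard library and rests on some assumptions: the archive is the `.zip` variant and is already on local disk, the text members end in `.txt`, and the files are UTF-8 as stated above. The function name `count_words_in_zip` is hypothetical. Counts obtained by splitting on whitespace will not match the figures quoted on this page exactly, since those depend on the project's own tokenization.

```python
import zipfile


def count_words_in_zip(archive_path, encoding="utf-8"):
    """Return {member name: whitespace-delimited word count} for each .txt member."""
    counts = {}
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue  # skip directories and any non-text members
            text = archive.read(name).decode(encoding)
            counts[name] = len(text.split())
    return counts
```

Calling `count_words_in_zip` with the path to the downloaded archive returns a dictionary whose values can be summed to get a rough total for the whole download.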

WORDNET SENSE ANNOTATIONS

One thousand occurrences of each of 100 words chosen by the FrameNet-WordNet harmonization effort have been manually annotated with WordNet 3.1 senses. For each word, the sentences containing 100 of those occurrences have also been annotated for FrameNet frame elements. The data and annotations are distributed as a separate corpus. See WordNet - FrameNet Annotations for more information.


DOWNLOAD SENTENCE CORPUS WITH STANDOFF ANNOTATIONS, DOCUMENTATION, AND INTER-ANNOTATOR AGREEMENT DATA

masc_wordsense.zip     |     masc_wordsense.tgz

TOOL DOWNLOAD


The ANC project has not developed project-specific software for MASC and OANC data. Instead, our approach is to provide the data and annotations in formats compatible with a wide variety of existing applications and frameworks.


  1. For XML-aware tools and applications, the BNC’s XAIRA, concordancing software such as MonoConc, and NLTK (token / part of speech only), use ANC2Go to generate the corpora and annotations in the appropriate format. Output in CONLL IOB format will be available in early October.


  2. To use MASC/OANC data and annotations in the General Architecture for Text Engineering (GATE) and/or output annotations created in GATE in GrAF format, DOWNLOAD THE ANC/GrAF GATE PLUGINS. Installation and use instructions are available here.


  3. To use MASC/OANC data and annotations in the Unstructured Information Management Architecture (UIMA), DOWNLOAD ANC UIMAUtils.jar. Installation and use instructions are available here.


  4. Available Spring 2011: To use MASC/OANC data and annotations in the Natural Language Toolkit (NLTK), DOWNLOAD NLTK CORPUS READER. Installation and use instructions are available here.


  5. To access and manipulate GrAF annotations directly from Java programs, USE THE GrAF API. The GrAF API also provides a renderer that generates input to the open source GraphViz graph visualization application.

Quick Data Download

If you know what you are looking for, you can download directly from the following list. Otherwise, continue reading below.

MINI-MASC

A selection of five thousand words of MASC1 data from diverse genres, intended to support small annotation tasks and small supplements to larger annotation tasks. The data include four written and four spoken files, each roughly 500 words in length. Mini-MASC was originally conceived at the Copenhagen Dependency Treebank Workshop in August, 2010.


DOWNLOAD Mini-MASC

Mini-MASC.zip    |   Mini-MASC.tgz

MASC-CONLL

A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format. This data set was used in the CONLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.


DOWNLOAD MASC-CONLL

masc-conll.zip    |   masc-conll.tgz
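
Files from CoNLL shared tasks are conventionally laid out with one token per line, whitespace-separated columns, and a blank line between sentences. Assuming MASC-CONLL follows that convention (this page does not spell out the layout), a reader could be sketched as below. The function name `read_conll_sentences` is hypothetical, and the sketch deliberately makes no assumption about what each column means; consult the documentation shipped with the download for that.

```python
def read_conll_sentences(lines):
    """Yield sentences as lists of rows; each row is a list of column strings."""
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence (repeated blanks are ignored).
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    if sentence:
        # The final sentence may not be followed by a blank line.
        yield sentence
```

Because the function accepts any iterable of lines, an open file object can be passed directly, and sentences are produced one at a time without loading the whole file into memory.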

Dec. 22, 2010

MASC I is now also available from the Linguistic Data Consortium.

Please note that additional licensing conditions apply to the LDC version. Consult the LDC Catalog entry (Catalog ID LDC2010T22) for more information.

MASC-PROPBANK-ORIG

A 40K subset of MASC1 data with annotations for PropBank in their original format, together with the Penn Treebank annotations upon which they rely. The PropBank data will be released later this year in GrAF format so as to be compatible with other MASC annotations.


DOWNLOAD PROPBANK-ORIG

Propbank-original-format.zip    |   Propbank-original-format.tgz