ABOUT MASC

 

PURPOSE

The Manually Annotated Sub-Corpus (MASC) project has been established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project is providing appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. The MASC project's aim is to offset some of the high costs of producing high-quality linguistic annotations by distributing the effort, and to solve some of the usability problems for annotations produced at different sites by harmonizing their representation formats. It also provides data from a much wider variety of genres than existing multiply-annotated corpora of English, and all of the data in the corpus are drawn from current American English so as to be most useful for NLP applications used in the web-based environment. Perhaps most importantly, the MASC project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create a much-needed language resource for NLP.

THE CORPUS

MASC is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word (and growing) corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions. 

Where licensing permits, data for inclusion in MASC is drawn from sources that have already been heavily annotated by others. So far, the first 80K increment of MASC data includes a 40K subset consisting of OANC data that has been previously annotated for PropBank predicate argument structures, Pittsburgh Opinion annotation (opinions, evaluations, sentiments, etc.), TimeML time and events, and several other linguistic phenomena. It also includes about 5K from the 10K Language Understanding (LU) Corpus that has been annotated by multiple groups for a wide variety of phenomena, including events and committed belief. The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project. The remaining 280K of the corpus fills out the genres that are under-represented in the first portion and includes a few additional genres such as blogs and tweets.

ANNOTATIONS

The MASC project is itself producing annotations for portions of the corpus for WordNet senses and FrameNet frames and frame elements. To derive maximal benefit from the semantic information provided by these resources, the entire corpus is also annotated and manually validated for shallow parses (noun and verb chunks) and named entities (person, location, organization, date and time). MASC I (82K) is also annotated in its entirety with Penn Treebank syntax. Several additional types of annotation have either been contracted by the MASC project or contributed from other sources. MASC II (available mid-summer 2012) includes seventeen different types of linguistic annotation:

FORMAT

All MASC annotations, whether contributed or produced in-house, are transduced to the Graph Annotation Framework (GrAF) defined by ISO TC37 SC4's Linguistic Annotation Framework (LAF). GrAF is an XML serialization of the LAF abstract model of annotations, which consists of a directed graph decorated with feature structures providing the annotation content. GrAF's primary role is to serve as a “pivot” format for transducing among annotations represented in different formats. However, because the underlying data structure is a graph, the GrAF representation itself can serve as the basis for analysis via application of graph-analytic algorithms such as common sub-tree detection.
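To make the abstract model concrete, the following is a minimal sketch of a directed graph whose nodes carry feature structures. It is an illustration only and is not the GrAF XML serialization or the ANC project's API: the class names (Node, Graph), the method names, and the node identifiers are all hypothetical, and feature structures are simplified to flat Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    # Feature structures attached to this node, e.g. {"pos": "NN"}.
    annotations: list = field(default_factory=list)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: dict = field(default_factory=dict)  # node_id -> list of target ids

    def add_node(self, node_id, **features):
        # Create the node if needed and attach one feature structure to it.
        node = self.nodes.setdefault(node_id, Node(node_id))
        if features:
            node.annotations.append(dict(features))
        return node

    def add_edge(self, source, target):
        # Directed edge from source to target, e.g. phrase -> token.
        self.edges.setdefault(source, []).append(target)

    def descendants(self, node_id):
        # Depth-first walk over outgoing edges, in insertion order.
        seen = []
        stack = list(reversed(self.edges.get(node_id, [])))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            stack.extend(reversed(self.edges.get(current, [])))
        return seen


# Hypothetical example: a noun chunk dominating two tokens.
g = Graph()
g.add_node("np1", cat="NP")
g.add_node("t1", pos="DT")
g.add_node("t2", pos="NN")
g.add_edge("np1", "t1")
g.add_edge("np1", "t2")
print(g.descendants("np1"))
```

Because the structure is an ordinary graph, traversals such as the depth-first walk above apply directly, which is the property that makes graph-analytic algorithms usable on the annotations.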

The layering of annotations over MASC texts dictates the use of a stand-off annotation representation format, in which each annotation is contained in a separate document linked to the primary data. Each text in the corpus is provided in UTF-8 character encoding in a separate file, which includes no annotation or markup of any kind.
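The idea of stand-off annotation can be sketched as follows. The example assumes that annotations point into the primary text by character offsets; the record layout, the layer names, and the resolve function are hypothetical and are used here only to show how several layers can refer to the same unmodified text.

```python
# Primary data: plain text with no markup of any kind.
primary_text = "MASC is drawn from the OANC."

# Hypothetical stand-off records: (start, end, layer, features).
# Offsets are assumed to be character positions into primary_text,
# with the end offset exclusive.
standoff = [
    (0, 4, "ne", {"type": "org"}),
    (0, 4, "pos", {"tag": "NNP"}),
    (5, 7, "pos", {"tag": "VBZ"}),
    (23, 27, "ne", {"type": "org"}),
]


def resolve(text, records, layer):
    """Return (covered text, features) pairs for one annotation layer."""
    return [
        (text[start:end], features)
        for start, end, rec_layer, features in records
        if rec_layer == layer
    ]


for covered, features in resolve(primary_text, standoff, "ne"):
    print(covered, features)
```

Note that the two layers annotate the same span ("MASC") independently, and that adding a further layer requires no change to the primary text or to the existing layers.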

Each file is associated with a set of GrAF standoff files, one for each annotation type, containing the annotations for that text. In addition to the annotation types listed above, a document containing annotation for logical structure (titles, headings, sections, etc. down to the level of paragraph) is included.  Each text is also associated with a header document that provides appropriate metadata together with machine-processable information about associated annotations and inter-relations among the annotation layers. Contributed annotations are also included in their original format, where possible.

WORDNET SENSE ANNOTATIONS

A focus of the MASC project is to provide corpus evidence to support an effort to harmonize sense distinctions in WordNet and FrameNet. The WordNet and FrameNet teams have selected for this purpose 100 common polysemous words whose senses they will study in detail, and the MASC team is annotating occurrences of these words in the MASC.

As a first step, fifty occurrences of each word are annotated using the WordNet 3.0 inventory and analyzed for problems in sense assignment, after which the WordNet team may make modifications to the inventory if needed. The revised inventory (which will be released as WordNet 3.1) is then used to annotate 1000 occurrences. Because of its small size, MASC typically contains fewer than 1000 occurrences of a given word; the remaining occurrences are therefore drawn from the 15 million words of the OANC. Furthermore, the FrameNet team is also annotating one hundred of the 1000 sentences for each word with FrameNet frames and frame elements, providing direct comparisons of WordNet and FrameNet sense assignments in attested sentences. Note that several MASC texts have been fully annotated for FrameNet frames and frame elements, in addition to the WordNet-tagged sentences.
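The way the 1000 occurrences are filled out can be sketched as below. This is an illustration under stated assumptions: the function name select_occurrences and the data layout are hypothetical, and the sketch simply takes OANC occurrences in the order given, whereas the actual selection procedure used by the project is not described here.

```python
def select_occurrences(masc_hits, oanc_hits, target=1000):
    """Take occurrences from MASC first, then fill the shortfall from the OANC."""
    selected = list(masc_hits[:target])
    shortfall = target - len(selected)
    if shortfall > 0:
        # Assumption: OANC occurrences are taken in the order supplied.
        selected.extend(oanc_hits[:shortfall])
    return selected


# Hypothetical counts: 300 occurrences of a word in MASC, 5000 in the OANC.
masc_hits = [("masc", i) for i in range(300)]
oanc_hits = [("oanc", i) for i in range(5000)]
sample = select_occurrences(masc_hits, oanc_hits)
print(len(sample))
```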

For convenience, the annotated sentences are provided as a stand-alone corpus, with the WordNet and FrameNet annotations represented in standoff files. Each sentence in this corpus is linked to its occurrence in the original text, so that the context and other annotations associated with the sentence may be retrieved. The sense annotation exercise is also being used as a base for an extensive inter-annotator agreement study.

AVAILABILITY AND DISTRIBUTION

MASC is distributed without license or other restrictions from the American National Corpus website. It is also freely available from the Linguistic Data Consortium (LDC).

In addition to enabling download of the entire MASC, we provide a web application that allows users to select some or all parts of the corpus and choose among the available annotations via a web interface. Once generated, the corpus and annotation bundle is made available to the user for download. Thus, the MASC user need never deal directly with or see the underlying representation of the stand-off annotations, but gains all the advantages that representation offers. The following output formats are currently available:

•    in-line XML (XCES), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software;

•    token / part of speech, a common input format for general-purpose concordance software such as MonoConc, as well as the Natural Language Toolkit (NLTK);

•    CoNLL IOB format, used in the Conference on Computational Natural Language Learning (CoNLL) shared tasks.
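As an illustration of the last two formats, the sketch below renders a tokenized sentence as token / part-of-speech pairs and as IOB-tagged lines. It is not the code used by the web application: the function names are hypothetical, the "word/TAG" delimiter is an assumption (one common convention), and the column layout of the IOB lines follows the general word, tag, chunk pattern of the shared tasks rather than any layout confirmed for MASC.

```python
def to_token_pos(tokens):
    """tokens: list of (word, pos). Assumes the word/TAG convention."""
    return " ".join(f"{word}/{pos}" for word, pos in tokens)


def to_iob(tokens, chunks):
    """tokens: list of (word, pos); chunks: list of (start, end, label)
    over token indices, with end exclusive. Returns one line per token."""
    tags = ["O"] * len(tokens)  # tokens outside any chunk stay "O"
    for start, end, label in chunks:
        tags[start] = "B-" + label  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens inside the chunk
    return [f"{word} {pos} {tag}" for (word, pos), tag in zip(tokens, tags)]


# Hypothetical sentence with noun and verb chunks.
tokens = [("The", "DT"), ("corpus", "NN"), ("includes", "VBZ"),
          ("blogs", "NNS"), (".", ".")]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "NP")]

print(to_token_pos(tokens))
print("\n".join(to_iob(tokens, chunks)))
```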

The ANC project also provides plugins for the General Architecture for Text Engineering (GATE) to input and/or output annotations in GrAF format; a “CAS Consumer” to enable using GrAF annotations in the Unstructured Information Management Architecture (UIMA), together with a module to output UIMA annotations in GrAF; and a corpus reader to import MASC/OANC data and annotations into the Natural Language Toolkit (NLTK).

Finally, the ANC project provides an API that enables access and manipulation of GrAF annotations directly from Java programs. The API also provides a renderer for GrAF annotations that generates input to the open source GraphViz graph visualization application.

HISTORY AND PARTICIPANTS

The mandate for MASC was established at an NSF-sponsored workshop in 2006, which was attended by computational linguistics researchers from the US and Europe and major funders. The workshop report identified the need for a corpus containing a wide variety of genres and including reliable annotations for a range of linguistic phenomena. As a result of the workshop, a proposal was submitted to the Computing Resource Infrastructure (CRI) program of the National Science Foundation to create a manually annotated sub-corpus drawn from the American National Corpus. The project, funded in 2008, involves Vassar College (PI: Nancy Ide, ANC), Columbia University (Rebecca Passonneau), and the International Computer Science Institute (ICSI) (Collin Baker, FrameNet). The WordNet project (Christiane Fellbaum) provides consulting to the project.