The Manually Annotated Sub-Corpus (MASC) project has been established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project is providing appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. The MASC project’s aim is to offset some of the high costs of producing high-quality linguistic annotations by distributing the effort, and to solve some of the usability problems for annotations produced at different sites by harmonizing their representation formats. It also provides data from a much wider variety of genres than existing multiply-annotated corpora of English, and all of the data in the corpus are drawn from current American English so as to be most useful for NLP applications used in the web-based environment. Perhaps most importantly, the MASC project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create a much-needed language resource for NLP.
HISTORY AND PARTICIPANTS
The mandate for MASC was established at an NSF-sponsored workshop in 2006, which was attended by computational linguistics researchers from the US and Europe and by major funders. The workshop report identified the need for a corpus containing a wide variety of genres and including reliable annotations for a range of linguistic phenomena. As a result of the workshop, a proposal was submitted to the Computing Resource Infrastructure (CRI) program of the National Science Foundation to create a manually annotated sub-corpus drawn from the American National Corpus. The project, funded in 2008, involves Vassar College (PI: Nancy Ide, ANC), Columbia University (Rebecca Passonneau), the International Computer Science Institute (ICSI) (Collin Baker, FrameNet), and Princeton University (Christiane Fellbaum, WordNet).
AVAILABILITY AND DISTRIBUTION
In addition to enabling download of the entire MASC, we provide the ANC2Go web service, which allows users to select some or all parts of the corpus and choose among the available annotations via a web interface. Once generated, the corpus and annotation bundle is made available to the user for download. Thus, the MASC user need never deal directly with or see the underlying representation of the stand-off annotations, but gains all the advantages that representation offers. The following output formats are currently available:
- in-line XML (XCES), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software;
- token / part of speech, a common input format for general-purpose concordance software such as MonoConc, as well as the Natural Language Toolkit (NLTK);
- CoNLL IOB format, used in the Conference on Computational Natural Language Learning (CoNLL) shared tasks;
- W3C’s Resource Description Framework (RDF/OWL).
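As an illustration of how two of the formats listed above are typically consumed, here is a minimal Python sketch. It is not ANC2Go code, and the exact layout of ANC2Go output is an assumption: token / part of speech is taken to be whitespace-separated `token/TAG` items, and IOB data is taken to be rows of a token plus a `B-`, `I-`, or `O` tag. The function names `parse_tagged` and `iob_chunks` are hypothetical.

```python
def parse_tagged(text, sep="/"):
    """Split whitespace-separated 'token/TAG' items into (token, tag) pairs.

    Assumption: the tag follows the LAST separator, so a token that itself
    contains the separator (e.g. "1/2/CD") is still split correctly.
    """
    pairs = []
    for item in text.split():
        token, found, tag = item.rpartition(sep)
        if not found:
            # No separator present: keep the token and leave it untagged.
            token, tag = item, None
        pairs.append((token, tag))
    return pairs


def iob_chunks(rows):
    """Group (token, iob_tag) rows into (label, [tokens]) chunks.

    'B-X' starts a chunk of type X, 'I-X' continues it, 'O' is outside
    any chunk. An 'I-X' with no open chunk of type X starts a new one.
    """
    chunks = []
    current = None
    for token, tag in rows:
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-"):
            if current is None or current[0] != tag[2:]:
                current = (tag[2:], [token])
                chunks.append(current)
            else:
                current[1].append(token)
        else:
            current = None
    return chunks


tagged = parse_tagged("The/DT corpus/NN is/VBZ open/JJ ./.")
rows = [
    ("Vassar", "B-ORG"),
    ("College", "I-ORG"),
    ("is", "O"),
    ("in", "O"),
    ("Poughkeepsie", "B-LOC"),
]
chunks = iob_chunks(rows)
```

Real downloads may use a different separator or additional columns, so the parsing would need to be adjusted to the files actually produced.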
The ANC project also provides plugins for the General Architecture for Text Engineering (GATE) to input and/or output annotations in GrAF format; a “CAS Consumer” to enable using GrAF annotations in the Unstructured Information Management Architecture (UIMA), together with a module to output UIMA annotations in GrAF; and a corpus reader to import MASC/OANC data and annotations into the Natural Language Toolkit (NLTK).
Finally, the ANC project provides an API that enables access and manipulation of GrAF annotations directly from Java programs. The API also provides a renderer for GrAF annotations that generates input to the open source GraphViz graph visualization application.
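To give a sense of what rendering an annotation graph for GraphViz involves, the sketch below writes a tiny graph in GraphViz's DOT language. It does not use the ANC API, which is a Java library; the function `to_dot`, the node identifiers, and the labels are all hypothetical, and the graph is a toy stand-in for GrAF data, not its actual model.

```python
def to_dot(nodes, edges, name="annotations"):
    """Render a small directed graph as GraphViz DOT text.

    nodes: dict mapping a node id to its display label
    edges: iterable of (source_id, target_id) pairs
    """
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        # Escape backslashes and double quotes so the label stays valid DOT.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  {node_id} [label="{escaped}"];')
    for src, dst in edges:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: one noun-phrase annotation over two tokens.
nodes = {"n1": "NP", "t1": "the", "t2": "corpus"}
edges = [("n1", "t1"), ("n1", "t2")]
dot_text = to_dot(nodes, edges)
print(dot_text)
```

The resulting text can be saved to a file and passed to the GraphViz `dot` command, for example `dot -Tpng graph.dot -o graph.png`.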