We now provide a web application (soon to be a web service) that allows users to select the texts and annotations they want and obtain them in any of several different formats.
MASC I consists of approximately 82,000 words drawn from the OANC. The corpus includes manual annotations for WordNet senses and full-text FrameNet frame annotations, and validated annotations for token and sentence boundaries, part of speech, noun chunks, verb chunks, named entities, and Penn Treebank syntactic annotations. The corpus includes texts from the Language Understanding Corpus, and many of the LU Corpus annotations are also included in MASC. In addition, about half of the corpus was annotated in the Unified Linguistic Annotation (ULA) project, and annotations for opinion, PropBank, and TimeML are either included in MASC I or forthcoming. All annotations, both in-house and contributed, are in LAF/GrAF format and can therefore be merged or combined using the ANC Tool and transduced to other formats using ANC2Go.
A beta version of the OANC Ngram Search Engine, created by Satoshi Sekine using his Linguistic Knowledge Discovery Tool, is available. We will be porting the engine to its permanent home on the ANC server this summer.
With funding from an IBM UIMA Innovation Award, we have developed tools to enable import and export of annotations in GrAF format in UIMA.
The ANC has been written up in national newspapers.
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.
The ANC has so far released 22 million words of American English, which are available from the Linguistic Data Consortium; please consult the LDC Catalog entry. The ANC has also released an "Open" portion of the full ANC (the OANC) consisting of approximately 15 million words, which is freely available for download. All ANC and OANC data include annotations for word and sentence boundaries, part of speech (4 tagsets), and noun and verb chunks. Parts of the corpus are annotated for additional linguistic features.
The ANC is actively soliciting contributions of written texts and spoken transcripts in American English that were produced in or after 1990, to be included in the ANC and OANC.
Native speakers of American English who have produced documents of any kind (including college student essays, blogs, poetry, fiction, email, etc.) are invited to become a part of linguistic history by contributing their texts to the ANC!
Authors can consult the frequently asked questions page to learn more about how the data will be used and why they should consider contributing their work to the ANC.
Those who have developed corpora of post-1989 American English for any purpose are also encouraged to contribute their unrestricted data.
We also seek annotations for linguistic features of any kind on all or part of the ANC/OANC and linguistic information (word lists, etc.) derived from it, for free distribution and use.
ANC annotations in the Linguistic Annotation Format (LAF/GrAF) developed by ISO TC37 SC4, and a version of the ANC Tool that handles data in this format.
Named entity annotations for the entire OANC produced by the BBN tagger.
The American National Corpus project has received support from the ANC Consortium, the TalkBank project, the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and the National Science Foundation.
The ANC also acknowledges the following, who have provided software and/or support for ANC development: