
What's New

ANC2Go

We now provide a web application (soon to be a web service) that allows users to select the texts and annotations they want and obtain them in any of several different formats.

The First Release of the Manually Annotated Sub-Corpus (MASC)

MASC I consists of approximately 82,000 words drawn from the OANC. The corpus includes manual annotations for WordNet senses and full-text FrameNet frame annotations, and validated annotations for token and sentence boundaries, part of speech, noun chunks, verb chunks, named entities, and Penn Treebank syntactic annotations. The corpus includes texts from the Language Understanding Corpus, and many of the LU Corpus annotations are also included in MASC. In addition, about half of the corpus was annotated in the Unified Linguistic Annotation (ULA) project, and annotations for opinion, PropBank, and TimeML are either included in MASC I or forthcoming. All annotations, both in-house and contributed, are in LAF/GrAF format and can therefore be merged or combined using the ANC Tool and transduced to other formats using ANC2Go.

OANC NGram Search Engine

A beta version of the OANC Ngram Search Engine, created by Satoshi Sekine using his Linguistic Knowledge Discovery Tool, is available. We will be porting the engine to its permanent home on the ANC server this summer.

Tools to use OANC and MASC in UIMA

With funding from an IBM UIMA Innovation Award, we have developed tools to enable import and export of annotations in GrAF format in UIMA.

Become an ANC FACEBOOK fan!


ANC in the News

The ANC has been written up in national newspapers.

The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.

When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.

ANC Status

The ANC has so far released 22 million words of American English, which are available from the Linguistic Data Consortium; please consult the LDC Catalog entry. The ANC has also released an "Open" portion of the full ANC consisting of approximately 15 million words, which is freely available for download. All ANC and OANC data include annotations for word and sentence boundaries, part of speech (4 tagsets), and noun and verb chunks. Parts of the corpus are annotated for additional linguistic features.

Contribute Data and Annotations to the ANC

CONTRIBUTE TEXTS

The ANC is actively soliciting contributions of written texts and spoken transcripts in American English that were produced in or after 1990, to be included in the ANC and OANC.

Native speakers of American English who have produced documents of any kind (including college student essays, blogs, poetry, fiction, email, etc.) are invited to become a part of linguistic history by contributing their texts to the ANC!

Authors can consult the frequently asked questions page to learn more about how the data will be used and why they should consider contributing their work to the ANC.

Those who have developed corpora of post-1989 American English for any purpose are also encouraged to contribute their unrestricted data.

CONTRIBUTE ANNOTATIONS AND DERIVED DATA

We also seek annotations for linguistic features of any kind on all or part of the ANC/OANC and linguistic information (word lists, etc.) derived from it, for free distribution and use.

Coming Soon

ANC annotations in Linguistic Annotation Format (LAF/GrAF) developed by ISO TC37 SC4, and a version of the ANC Tool that handles data in this format.

Named entity annotations for the entire OANC produced by the BBN tagger.

Acknowledgements

The American National Corpus project has received support from the ANC Consortium, the TalkBank project, the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and the National Science Foundation.

The ANC also acknowledges the following, who have provided software and/or support for ANC development:

GATE