The ANC Project


The American National Corpus (ANC) project is fostering the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English because of the numerous differences in usage between the two varieties.

The availability of a corpus of American English will contribute significantly to language and linguistic research, to the development of language-understanding computer applications (e.g., language translation and search-and-retrieval software), and to the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

The ANC will contain a core corpus of at least 100 million words, including both written and spoken (transcribed) data, comparable across genres to the BNC. The genres in the ANC will be expanded to include "new" types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. Beyond the core 100 million words, the ANC will include an additional component of potentially several hundred million words, chosen to provide both the broadest and largest selection of data possible.

A consortium of publishers of American English dictionaries and companies with interests in language processing was formed in 1999. Consortium members are providing materials for inclusion in the corpus, and provided initial financial support for the project.

In fall 2003, the ANC produced its First Release of over 11 million words of American English. This and all future releases of ANC data are distributed by the Linguistic Data Consortium (LDC).

All ANC data is distributed by the LDC for non-commercial research purposes at a nominal charge ($75). Commercial use is limited to members of the ANC Consortium (ANCC) until fall 2008. New commercial members can join the ANCC at any time.