The American National Corpus (ANC) project is fostering the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English because of the numerous differences in usage between the two varieties.
Since the ANC Second Release of 22 million words of data through the Linguistic Data Consortium (LDC) in 2005, the ANC project has committed to including only fully open data in the corpus and distributing all data and annotations freely from our website, as well as through the LDC. A fifteen-million-word subset of the ANC Second Release now constitutes the Open American National Corpus (OANC), which is downloadable for any use from this website. The ANC project currently holds about 40 million additional words of open data, which will be processed for inclusion in the OANC when funding for its production becomes available.
The OANC is a collaborative development project that relies on contributions of data and annotations from the linguistics and natural language processing communities as well as the public at large.
The goal for the OANC is to contain a core corpus of at least 100 million words, including both written and spoken (transcribed) data comparable across genres to the BNC. The genres in the OANC also include “new” types of language data that have become available in recent years, such as blogs and web pages, tweets, chats, email, and rap music lyrics. In addition to the core 100 million words, the OANC will include an additional component of potentially several hundred million words, chosen to provide both the broadest and largest selection of data possible.
Unlike the BNC, the OANC is annotated for multiple linguistic phenomena, including logical structure, word and sentence boundaries, lemma and part-of-speech (for several different tag sets), shallow parse (noun and verb chunks), and named entities (person, organization, location, date). All annotations are automatically produced and unvalidated. A 500,000-word subset of the OANC, the Manually Annotated Sub-Corpus (MASC), includes these and other annotations for a wide range of linguistic phenomena that have been either manually produced or hand-validated.
A consortium of publishers of American English dictionaries and companies with interests in language processing provided a set of materials for inclusion in the ANC First and Second Releases, as well as initial financial support for the project. The ANC project has also received support from the National Science Foundation, the TalkBank Project, and the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and technical support from the developers of the General Architecture for Text Engineering (GATE).