The American National Corpus now owns the anc.org domain name! Our web address is now www.anc.org.
We would like to thank the Animal News Center for transferring the domain to us. In gratitude, the American National Corpus project has made a donation to the Humane Society of the United States in the name of the "other" ANC.
October 22, 2008: Version 1.2.5 of the ANC Tool is now available. The new version fixes a problem that prevented it from starting on Mac OS X.
July 24, 2008: Version 1.2.3 of the ANC Tool is now available. The new version includes better support for selecting the Unicode character encoding, a few bug fixes, and (experimental) NLTK output.
The open portion of the ANC (approximately 15 million words of text, with annotations) is now available for download.
Frequency counts for the second release are now available and can be downloaded here.
Both sets of annotations can be downloaded from our annotations page.
The ANC, in collaboration with the FrameNet project, WordNet, and Columbia University, has received a grant from the National Science Foundation to produce a balanced sub-corpus of the ANC that is manually annotated for WordNet senses and FrameNet frames, and manually validated for word and sentence boundaries, part of speech, noun chunks, and verb chunks.
The ANC has been awarded an IBM UIMA Innovation Grant to port the ANC to UIMA and provide information with all ANC annotations that conform to UIMA Type Definitions.
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.
The ANC has so far released 22 million words of American English, which are available from the Linguistic Data Consortium; please consult the LDC Catalog entry. The ANC has also released an "Open" portion of the full ANC (the OANC) consisting of approximately 15 million words, which is freely available for download. All ANC and OANC data include annotations for word and sentence boundaries, part of speech (4 tagsets), and noun and verb chunks. Parts of the corpus are annotated for additional linguistic features.
The ANC is actively soliciting contributions of written texts and spoken transcripts in American English that were produced in or after 1990, to be included in the ANC and OANC.
Those who have developed corpora of post-1989 American English for any purpose are encouraged to contribute their unrestricted data to be included in the ANC. Authors can consult the frequently asked questions page to learn more about how the data will be used and why they should consider contributing their work to the ANC.
We also seek annotations for linguistic features of any kind on all or part of the ANC/OANC and linguistic information (word lists, etc.) derived from it, for free distribution and use.
ANC annotations in Linguistic Annotation Format (LAF/GrAF) developed by ISO TC37 SC4, and a version of the ANC Tool that handles data in this format.
New output options for the ANC Tool, including UIMA.
The first release of the Manually Annotated Sub-Corpus (MASC) is scheduled for the end of 2008. The corpus consists of approximately 120,000 words drawn from the OANC and from data that will be included in the next release of OANC data. The latter data include the publicly available portions of the Language Understanding Corpus, which has been annotated by several projects and will be distributed by the LDC. About half of the corpus will also carry the following annotations from the Unified Linguistic Annotation (ULA) project, as they become available: Penn Treebank-style syntactic annotations, PropBank, NomBank, TimeML, and opinion annotations. All annotations, both in-house and contributed, will be in LAF/GrAF format and can therefore be merged or combined using the ANC Tool.
The ANC has been written up in national newspapers.
The American National Corpus project has received support from the ANC Consortium, the TalkBank project, the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and the National Science Foundation.
The ANC also acknowledges the following, who have provided software and/or support for ANC development: