American National Corpus Project
The first release of the American National Corpus contains over 10,000,000 words of written and spoken American English, annotated for lemma and part of speech. It is available for research and education for a nominal licensing fee from the Linguistic Data Consortium. Commercial users can obtain the corpus and gain rights to use it in commercial products by joining the ANC Consortium.
Please consult the LDC Catalog entry for the ANC First Release.
The First Release of the ANC is a beta version
The texts included in the first 10 million words of the ANC are those that were first received. Therefore the corpus is not balanced. There has been no hand-validation of the XML tagging or the part-of-speech annotation tags. Headers are minimal, although they contain fairly complete information concerning domain, subdomain, subject, audience, and medium. Check the list of known bugs and caveats for a description of the limitations we are currently aware of.
One of the aims of releasing this first 10 million words is to get feedback from the community about its structure and annotation, so that modifications can be made, if necessary, for the final release of the full 100 million words. We therefore invite comments and bug reports from the community of ANC users. Please contact firstname.lastname@example.org.