Contributing Annotations to the ANC

Annotations contributed to the ANC will be rendered into stand-off format, with links into the original data. The stand-off documents will be freely downloadable from the ANC website. The ANC data itself must be acquired from the Linguistic Data Consortium (LDC). Scripts to merge the annotations with the ANC data are available on this website and on the CD containing the ANC distributed by LDC.
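To illustrate what "stand-off format" means in practice, here is a minimal Python sketch. It is not the ANC's merge script and does not use the ANC's actual file format: the Annotation class, the merge function, the <tok> element, and the label values are all hypothetical names chosen for this example. It assumes only that each annotation points into the original text by character offsets.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation: it stores offsets, not text."""
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the end of the span
    label: str   # illustrative label, e.g. a part-of-speech tag


def merge(text, annotations):
    """Return the text with each annotated span wrapped in an inline tag.

    Assumes the annotations do not overlap; raises ValueError otherwise.
    """
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor or ann.end < ann.start or ann.end > len(text):
            raise ValueError(f"bad or overlapping span: {ann}")
        # Copy the unannotated text between the previous span and this one.
        pieces.append(escape(text[cursor:ann.start]))
        span = escape(text[ann.start:ann.end])
        pieces.append(f"<tok label={quoteattr(ann.label)}>{span}</tok>")
        cursor = ann.end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


# The original text is kept untouched; the annotations live separately.
text = "Dogs bark."
annotations = [Annotation(0, 4, "NNS"), Annotation(5, 9, "VBP")]
merged = merge(text, annotations)
print(merged)
```

Because the annotations hold only offsets and labels, they can be distributed freely while the text they point into is obtained separately, which is the arrangement described above.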

Derived data, such as word lists or databases, will be freely distributed on this site.

To contribute annotations or other data derived from the ANC, contact anc@cs.vassar.edu.

Contributing Documents to the ANC

The ANC will provide a massive body of language data in contemporary American English, similar to the British National Corpus (BNC) produced ten years ago. This corpus will enable dictionary makers, linguists, and developers of language understanding software to analyze the ways in which Americans typically use the English language; to represent that usage appropriately in dictionaries, other reference works, and academic studies of linguistic phenomena; and to handle American usage in web search engines, machine translation systems, and other language processing software.

To this end, the ANC invites contributions of language data, including published and unpublished written and spoken (i.e., transcribed) documents of all genres: fiction, non-fiction, poetry, newspapers, magazines, journals, pamphlets, diaries, etc., as well as web-based language data such as blogs, web pages, and email, and other less common genres such as rap lyrics.

Note that the ANC project has not enjoyed the funding and contributions of language data that projects such as the BNC relied on for their completion. Instead, our success depends on contributions from individuals like you to provide us with enough data to construct a representative sample of English as written and spoken by Americans today. In turn, your contribution will help to define "American English" for decades to come.

Criteria

The American National Corpus includes written and spoken (i.e., transcriptions) materials that fulfill the following requirements:

If you have any doubts or questions about the suitability of your contribution, please do not hesitate to contact us and we will get back to you right away.

Contributing a document to the ANC does not guarantee that it will be included in the final corpus. Furthermore, it is likely that no document will be included in its entirety, because the final corpus is intended to provide a representative sample of different genres. Relatively lengthy texts will therefore be sampled by extracting three or four non-contiguous segments, for example, chapters 1-2, 4-5, and 8-9 from a book. We may choose not to include a document in the corpus at all if there is some doubt that the author is a native speaker of American English, or if we are unable, for technical reasons, to extract meaningful information from the document (more on this below).

Although we may not include your document in the ANC in its entirety, we prefer that you grant us rights to reproduce the entire content in the corpus, especially if it is short. We do, however, enable contributors to specify that a contributed document cannot be included in its entirety. Please consult the Frequently Asked Questions page to learn why granting us the right to include your entire document does not put you in danger of others reproducing or "stealing" your work.

If you still have questions or concerns, do not hesitate to contact us.

Document Format

We accept documents in almost any format. However, because of the massive amount of data we are processing, it is essential that we process documents automatically rather than by hand. In our case, "processing" means rendering the document in an XML format in which, ideally, titles, headings, words in italics, etc. are marked with specific tags identifying them as such. So, in addition to needing texts that are easy to process, we prefer texts in which things such as titles and italicized words are clearly identified. Any document produced with a word processor or marked up in HTML as a web page will usually contain this information, provided that (1) the markup is, where possible, descriptive rather than presentational (i.e., tags say what the content is rather than how it should look, as when you use <em> (emphasis) instead of <i> (italic)); and (2) the markup is used consistently.
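Contributors who want to check an HTML document for presentational markup before submitting it could use something like the following Python sketch. It is only an illustration, not a tool the ANC provides or requires: the MarkupAudit class, the audit function, and the PRESENTATIONAL table are hypothetical, and the pairing of <b> with <strong> is an assumption added here by analogy with the <i> and <em> example above.

```python
from collections import Counter
from html.parser import HTMLParser

# Presentational tag -> descriptive tag to consider instead.
# <i>/<em> comes from the text above; <b>/<strong> is an added assumption.
PRESENTATIONAL = {"i": "em", "b": "strong"}


class MarkupAudit(HTMLParser):
    """Counts how often each presentational tag is opened."""

    def __init__(self):
        super().__init__()
        self.presentational = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in PRESENTATIONAL:
            self.presentational[tag] += 1


def audit(html):
    """Return a Counter of presentational tags found in the HTML string."""
    parser = MarkupAudit()
    parser.feed(html)
    parser.close()
    return parser.presentational


sample = "<h1>Title</h1><p>An <i>italic</i> word and an <em>emphasized</em> one.</p>"
counts = audit(sample)
for tag, n in counts.items():
    print(f"<{tag}> used {n} time(s); consider <{PRESENTATIONAL[tag]}>")
```

A document that yields no counts is not guaranteed to be easy to process, but a large number of presentational tags is a hint that the markup describes appearance rather than content.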

The following are some rules of thumb concerning formats. In general, the easier a document is to process, the more likely it is that we will be able to use it in the ANC. Documents that are very difficult to process automatically will likely not be included in the ANC, so if you have a choice, please submit your document in a format as near the top of the following list as possible:

Your document(s) will be very easy to process if

Your document(s) will be relatively easy to process if

Your document(s) will be harder to process if

Your document(s) will be hard for us to process if

Procedure

Once you have ascertained that your document(s) satisfy the criteria for inclusion in the ANC, do the following: