
The Open American National Corpus

The Open American National Corpus (OANC) is a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use.

Available Data and Annotations

OANC : 15 million words of contemporary American English with automatically produced annotations for a variety of linguistic phenomena.

MASC : 500,000 words of OANC data equally distributed over 19 genres of American English, with manually produced or validated annotations for several layers of linguistic phenomena.



Contribute Text, Annotations, and Derived Data

OANC and MASC are collaboratively developed resources that rely on contributions of data and annotations from the linguistics and natural language processing communities as well as the public at large.

We solicit contributions of written texts and spoken transcripts in American English that were produced in or after 1990 to be included in the OANC and/or MASC.

Native speakers of American English (Am I a Native Speaker?) who have produced documents of any kind (including college student essays, blogs, poetry, fiction, and email) are invited to become a part of linguistic history by contributing these materials to the OANC/MASC. Authors can consult the frequently asked questions page to learn more about how the data will be used and why they should consider contributing their work to the OANC.

Those who have developed corpora of post-1989 American English for any purpose are also encouraged to contribute their unrestricted data. We also ask users to contribute annotations for linguistic features of any kind on all or part of the OANC and/or MASC, and to contribute data derived from OANC/MASC, such as word lists, for free distribution and use.