ANC2Go | Open American National Corpus

ANC2Go is a web service that allows users to create a “customized corpus” from OANC and MASC data and annotations.

ANC2Go allows the user to specify the following:

corpus content: all or a subset of the texts included in the OANC and MASC.
annotations: all or a subset of available annotations.
output format: one of the following formats for the corpus:
- XML: Annotations with inline XML tags.
- word+POS: Produces word [separator] POS-tag (user can specify the separator) for Token annotations. Suitable for use with concordancing software and for input to tools such as the Stanford Parser, OpenNLP tools, etc.
- NLTK tagged format: Produces word [separator] POS-tag for Token annotations, readable with the NLTK tagged corpus reader.
- CoNLL: IOB format used in shared tasks by the Conference on Natural Language Learning.
- UIMA CAS: Input to the UIMA system.
- COMING SOON: RDF: Linked data format for the Semantic Web.
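
As an illustration of the word+POS and NLTK tagged formats, here is a minimal Python sketch that reads such output back into (word, tag) pairs. It assumes tokens are separated by whitespace and that "/" is the chosen separator; the exact file layout ANC2Go produces is not specified here, and `parse_tagged_line` is a hypothetical helper name, not part of ANC2Go or NLTK.

```python
def parse_tagged_line(line, sep="/"):
    """Split a line of word<sep>POS tokens into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and each token looks
    like word<sep>tag. The real ANC2Go layout may differ.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST separator so that words which themselves
        # contain the separator (e.g. "1/2") stay intact.
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag))
        else:
            # No separator in this token: keep the word, tag unknown.
            pairs.append((token, None))
    return pairs


sample = "The/DT corpus/NN is/VBZ open/JJ ./."
for word, tag in parse_tagged_line(sample):
    print(word, tag)
```

If you chose a different separator when generating the corpus, pass it as `sep`. NLTK users would more likely point the library's tagged corpus reader at the downloaded files instead of parsing lines by hand.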

Using ANC2Go

To generate a custom corpus, do the following:

Access the ANC2Go interface.
Enter your email address in the boxes provided. When processing is complete, an email containing the URI where your custom corpus can be downloaded is sent to this address.
In the left panel, select the subset of texts you want to include in your corpus, or click Select All at the bottom of the list for the entire corpus. Note: currently, only MASC is available through ANC2Go; OANC will be available soon.
Choose the desired output format by clicking on one of the tabs across the top of the right pane.
Choose the desired annotation(s) from the provided lists.

Note that the MASC annotations include three different tokenizations:

Tokens produced by GATE’s ANNIE tokenizer, which are the base for most MASC annotations.
Tokens produced by the Penn Treebank project, which form the base for the Penn Treebank syntax annotations.
Tokens produced by the FrameNet project, which form the base of the FrameNet semantic role annotations.

The annotations you choose dictate the tokenization that will be included in the output. For example, if you choose Penn Treebank syntax you will get Penn Treebank tokens by default. Because the various annotations are dependent on the underlying tokenization, you cannot, for example, use GATE tokens with Penn Treebank syntax, or FrameNet tokens with verb chunks.

The NLTK and Word+POS formats allow you to choose the tokenization you prefer. If you want only tokens and POS tags in any of the other formats, select the radio button for Tokens under the tab for that format.

Click Start at the bottom left of the page.
You should receive an email within a few minutes (it could be longer if you request a large corpus and multiple annotations) providing a download link for obtaining your corpus.
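
The dependency between annotations and tokenizations described in the steps above can be sketched as a simple lookup. This is only an illustration of the rule, not how ANC2Go is implemented: the identifiers in `REQUIRED_TOKENIZATION` are hypothetical labels, and the mapping of verb chunks to GATE tokens is inferred from the statement that GATE tokens are the base for most MASC annotations.

```python
# Hypothetical labels; ANC2Go's own annotation names may differ.
REQUIRED_TOKENIZATION = {
    "penn_treebank_syntax": "penn_treebank",
    "framenet_semantic_roles": "framenet",
    "verb_chunks": "gate",  # inferred: GATE tokens underlie most annotations
}


def default_tokenization(annotation):
    """Return the tokenization that an annotation is built on."""
    return REQUIRED_TOKENIZATION[annotation]


def is_compatible(annotation, tokenization):
    """True if the annotation can be combined with the given tokens."""
    return REQUIRED_TOKENIZATION[annotation] == tokenization


print(default_tokenization("penn_treebank_syntax"))
print(is_compatible("penn_treebank_syntax", "gate"))
print(is_compatible("verb_chunks", "framenet"))
```

Choosing Penn Treebank syntax yields Penn Treebank tokens, and the two incompatible combinations named above (GATE tokens with Penn Treebank syntax, FrameNet tokens with verb chunks) are both rejected.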

Caveats:

At present, tokens are always included in the output, regardless of which additional annotations are selected. The ability to exclude tokens will be included in a future release.
The XML output is not guaranteed to be well-formed if the selected annotations produce overlapping XML elements.
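
To see why overlapping annotations are a problem for inline XML, consider two annotations whose text spans cross: neither can be nested inside the other, so their tags cannot be closed in a well-formed order. The following Python sketch finds such crossing pairs. It assumes each annotation is represented as a (start, end) character-offset pair with end greater than start; that representation and the function name `find_crossing_spans` are assumptions made for illustration.

```python
def find_crossing_spans(spans):
    """Return pairs of (start, end) spans that partially overlap.

    Nested or identical spans are fine for inline XML; only spans
    that cross (one starts inside the other and ends outside it)
    are reported.
    """
    crossing = []
    ordered = sorted(spans)
    for i, (s1, e1) in enumerate(ordered):
        for s2, e2 in ordered[i + 1:]:
            if s2 >= e1:
                # Sorted by start: no later span can overlap this one.
                break
            # Here s1 <= s2 < e1. The spans cross only if the second
            # starts strictly inside the first and ends after it.
            if s1 < s2 and e2 > e1:
                crossing.append(((s1, e1), (s2, e2)))
    return crossing


spans = [(0, 10), (5, 15), (2, 4)]
print(find_crossing_spans(spans))
```

In this example, (2, 4) nests cleanly inside (0, 10), while (0, 10) and (5, 15) cross and would be reported. If you need well-formed XML, one option is to select annotation sets whose spans nest.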