 American National Corpus Project

AMERICAN NATIONAL CORPUS FIRST RELEASE

Known Bugs and Caveats



Conformance to standards and best practice

The ANC has been created with the intention of adhering, to the extent possible, to existing and emerging standards and "best practices" for markup and the representation of language resources and their annotations. These include W3C markup and data representation standards and the recommendations of the International Organization for Standardization (ISO) sub-committee for language resources (ISO TC37 SC4). While this ensures that the ANC reflects the state of the art and (we hope) will be processable by a wide variety of available tools and web-based applications, it also means that we are relying on recommendations that have yet to be finalized (and therefore might change) and on processing capabilities that are not yet widely implemented. For this reason, a number of encoding choices have been made with an eye toward enabling immediate use of the ANC, while at the same time providing for adaptation to standards and practices that emerge in the future:

The original goal when creating the schemas was to define a series of model groups, use these groups to define types, and then declare elements using those types. Schemas would then be modified and customized by redefining the model groups, which would in turn change the types and, therefore, the element definitions. For example, the only real difference between the xcesDoc.xsd schema and the xcesMerged.xsd schema is that xcesMerged.xsd allows <tok> elements in the model group for sentence-level content. Using xsd:redefine, the xcesMerged schema should only need to add a definition of the <tok> element and redefine the model group for sentence-level content, rather than copying the entire xcesDoc schema.

It is expected that in the next version, the XCES schemas will use xsd:redefine.
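Such a merged schema might look like the following sketch. The group name sentenceContent and the content model shown here are illustrative assumptions, not the actual XCES definitions:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of xcesMerged.xsd built with xsd:redefine.
     The group name "sentenceContent" and the <tok> content model are
     assumptions for illustration, not the actual XCES definitions. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:redefine schemaLocation="xcesDoc.xsd">
    <!-- Extend the sentence-level model group to also admit <tok>;
         the self-reference pulls in the original group's content. -->
    <xsd:group name="sentenceContent">
      <xsd:choice>
        <xsd:group ref="sentenceContent"/>
        <xsd:element ref="tok"/>
      </xsd:choice>
    </xsd:group>
  </xsd:redefine>
  <!-- The one genuinely new declaration in the merged schema -->
  <xsd:element name="tok" type="xsd:string"/>
</xsd:schema>
```

Everything else in xcesDoc.xsd would be inherited unchanged, so a fix in the base schema would propagate to the merged schema automatically.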

Validation

The ANC First Release data has been validated using the XSV schema validator. However, it is possible that a few invalid files escaped notice.

Automatic processing

All of the markup and annotation in the first release of the ANC was produced entirely automatically from data in a variety of formats:

For gross logical structure (down to the level of paragraph), text parts such as title, author, section head, footnotes, quotations, lists, etc. may or may not be marked as such, depending on whether this information was differentiated in some systematic way in the original format. In general, our algorithm assumed that the presence of a carriage return signalled the beginning of a paragraph unless otherwise indicated; therefore, elements unrecognizable as any other type of element are typically enclosed in <p> tags.
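As a rough sketch of that heuristic (hypothetical code, not the actual conversion script), any line not already recognized as some other element type defaults to a <p>:

```python
def wrap_paragraphs(lines):
    """Rough sketch of the paragraph heuristic described above;
    hypothetical code, not the ANC's actual conversion script."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # blank lines separate blocks
        if line.startswith("<"):          # assume already marked up
            out.append(line)
        else:                             # unrecognized: default to <p>
            out.append("<p>%s</p>" % line)
    return out
```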

At this time, sentence markup is included in the primary texts (although this may be changed in the final version of the corpus). All sentence markup was produced automatically by the sentence splitter included in the Gate system, and there are occasional errors, usually due to the presence of abbreviations in mid-sentence. The sentence splitter also places punctuation appearing after the terminating period (e.g., a closing quotation mark or closing parenthesis) outside the sentence boundary.
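The abbreviation failure mode is easy to reproduce with a naive splitter (illustrative only; this is not the Gate sentence splitter):

```python
import re

def naive_split(text):
    """Split after sentence-final punctuation followed by whitespace.
    Illustrative only; this is not the Gate sentence splitter."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# A mid-sentence abbreviation triggers a spurious boundary:
naive_split("Dr. Smith arrived.")
# ['Dr.', 'Smith arrived.']
```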

Part-of-speech tagging was done automatically at Northern Arizona University using the Biber tagger, with no hand-validation. The annotations include both Biber's part-of-speech tags and lemmas. The Biber tagger has an average accuracy similar to most taggers (95% or higher).

Due to some inconsistencies in the way clitics (e.g., "don't", "let's") are treated by the Biber tagger (some are treated as two separate words, some as one word with extra information), a few words in the corpus are not annotated for part of speech because they were skipped during post-processing. Words that have not been tagged are not included in the word counts for the corpus.

Gold standard sub-corpus

The ANC Project has obtained funds from the U.S. National Science Foundation to hand-validate both the structural markup and the part-of-speech annotation in a 10 million word subset of the ANC in order to create a "gold standard" corpus. The gold standard corpus will be balanced along the same lines as the entire 100 million+ words of the ANC, and will therefore not coincide exactly with the 10 million words in the first release. The remaining portion of the corpus will be corrected to the extent possible using automated means and whatever level of hand-validation can be accomplished under our budget.

Funky characters in the data

Note on the OUP and Berlitz files: the original QuarkXPress files contained characters encoded with numeric codes that may or may not correspond to a character defined in an International Organization for Standardization (ISO) standard. Where possible, these values have been converted to the corresponding entity; for example, 0x00E9 becomes &eacute;. When a numeric value did not correspond to any known character defined in the ISO standard, it is encoded as a numeric entity; e.g., the value 0x1A is encoded as &#x001A;.
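The conversion rule amounts to a lookup with a numeric fallback. A minimal sketch, in which the KNOWN table is an illustrative subset rather than the mapping actually used:

```python
# Minimal sketch of the entity-conversion rule described above.
# KNOWN is an illustrative subset, not the project's actual table.
KNOWN = {0x00E9: "eacute", 0x00E8: "egrave", 0x00FC: "uuml"}

def to_entity(code):
    if code in KNOWN:                     # recognized ISO character
        return "&%s;" % KNOWN[code]
    return "&#x%04X;" % code              # fallback: numeric entity

to_entity(0x00E9)  # '&eacute;'
to_entity(0x1A)    # '&#x001A;'
```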



Copyright 2003 American National Corpus Project. All rights reserved.