The ANC has been created with the intention of adhering to the extent possible to existing and emerging standards and "best practices" for markup and the representation of language resources and their annotations. These include W3C markup and data representation standards and the recommendations of the International Standards Organization (ISO) sub-committee for language resources (ISO TC37 SC4). While this ensures that the ANC is at the state-of-the-art and (we hope) will be processable by a wide variety of available tools and web-based applications, it also means that we are dealing with recommendations that have yet to be finalized (and therefore might change) and relying on processing capabilities that are not yet widely implemented. For this reason, a number of encoding choices have been made with an eye toward enabling immediate use of the ANC, while at the same time providing for adaptation to standards and practices that emerge in the future:
The ANC Second Release data has been validated using the Xerces 2 Java Parser. However, it is possible that a few invalid files escaped notice.
All of the markup and annotation in the first release of the ANC was produced entirely automatically from data in a variety of formats:
For gross logical structure (down to the level of paragraph), text parts such as title, author, section head, footnotes, quotations, lists, etc. may or may not be marked as such, depending upon whether or not this information was differentiated in some systematic way in the original format. In general, our algorithm assumed that the presence of a carriage return signalled the beginning of a paragraph unless otherwise indicated, and therefore elements unrecognizable as any other type of element are typically encloded in <p> tags.
All sentence markup was automatically produced by the sentence splitter included in the Gate system, and there are occasional errors, usually due to the presence of unrecognized abbreviations in mid-sentence. The sentence splitter also puts punctuation appearing after the terminating period (e.g., closing quotation mark, closing parenthesis) outside the sentence boundary.
Part-of-speech tagging was done automatically at Northern Arizona University using the Biber tagger, with no hand-validation. The annotations include both Biber's part-of-speech tags and lemmas. The Biber tagger has an average accuracy similar to most taggers (95% or higher). Additional part of speech tags have also been generated with the Hepple part of speech tagger included in Gate.
Due to some inconsistencies in the way clitics (e.g., "don't", let's") are treated by the Biber tagger (some treated as two separate words, some as one with extra information), a few words in the corpus are not annotated for part of speech because they were skipped during post-processing. Words that have not been tagged have not been included in the word counts for the corpus.
The ANC Project has obtained funds from the U.S. National Science Foundation to hand-validate both the structural markup and part of speech annotation in a 10 million word subset of the ANC in order to create a "gold standard" corpus. The gold standard corpus will be balanced along the same lines as the entire 100 million+ words of the ANC, and will therefore not coincide exactly with the 20 million words in the second release. The remaining portion of the corpus will be corrected to the extent possible using automated means and whatever level of hand-validation can be accomplished under our budget.
Note on the OUP and Berlitz files: The original Quark Express files contained characters that were encoded with numeric codes that may or may not correspond to an International Standards Organization (ISO) hexidecimal character. Where possible these values have been converted to the corresponding entity, for example 0x00E9 becomes é. When the numeric values did not correspond to any known character defined in the ISO standard, the value is encoded as a numeric entity--e.g., the value 0x1A is encoded as .