|American National Corpus Project|
Conformance to standards and best practice
The ANC has been created with the intention of adhering, to the extent possible, to existing and emerging standards and "best practices" for markup and the representation of language resources and their annotations. These include W3C markup and data representation standards and the recommendations of the International Organization for Standardization (ISO) sub-committee for language resources (ISO TC37 SC4). While this ensures that the ANC reflects the state of the art and (we hope) will be processable by a wide variety of available tools and web-based applications, it also means that we are dealing with recommendations that have yet to be finalized (and therefore might change) and relying on processing capabilities that are not yet widely implemented. For this reason, a number of encoding choices have been made with an eye toward enabling immediate use of the ANC, while at the same time providing for adaptation to standards and practices that emerge in the future:
- The ANC First Release data is provided in both stand-off and merged formats. While stand-off annotation is widely accepted as the preferred format, few processors at present handle it. Note that the stand-off format allows for multiple part-of-speech annotations of the ANC data, several of which will be provided in the near future. A script to produce a merged form of the corpus based on any one of the additional annotations will be provided.
- Although XInclude is the obvious means by which to include header fragments, we have so far relied on XLink for this purpose because XInclude is not yet a final recommendation and is therefore not yet widely implemented in XML processing software. Later releases of the ANC may use XInclude.
- Schema locations in the ANC data are currently specified with an explicit path that assumes the schemas exist in /ANC (an ANC directory directly contained within the root directory), rather than with a relative URI. This was done to avoid problems arising from the fact that XML processors differ in what they consider a relative URI to be relative to: when validating a document against a schema s that includes or imports other schemas with a relative URI, some processors take that URI to be relative to the location of the XML document, while others take it to be relative to the location of schema s.
- At the time the XCES schemas were developed, there was inconsistent support for xsd:redefine among XML validators, and it was impossible to specify a set of schemas that all, or even most, XML validators would accept. Therefore, the current set of schemas uses the "cut and paste" method of inheritance; that is, when redefining a model group, a copy of the schema is made and the relevant groups are modified.
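To sketch the difference between the two inclusion mechanisms mentioned above, the fragments below show an XLink-style reference to a shared header fragment alongside the XInclude equivalent. The element names and file paths here are illustrative only, not the actual ANC markup; only the xlink: and xi: attributes and namespaces come from the respective W3C specifications.

```xml
<!-- Current approach: an XLink simple link to a shared header fragment
     (element name and path are hypothetical) -->
<header xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple"
        xlink:href="shared/corpusHeader.xml"/>

<!-- Possible future approach: XInclude, once widely supported -->
<header xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="shared/corpusHeader.xml"/>
</header>
```

The practical difference is that an XInclude processor physically merges the referenced content into the including document before validation, whereas an XLink reference is simply an attribute-level pointer that each application must choose to follow.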
The original goal when creating the schemas was to define a series of model groups, use these groups to define types, and then declare elements using these types. Schemas would then be modified and customized by redefining the model groups, which would change the types and, therefore, the element definitions. For example, the only real difference between the xcesDoc.xsd schema and the xcesMerged.xsd schema is that xcesMerged.xsd allows <tok> elements in the model group for sentence level content. Using xsd:redefine, the xcesMerged schema should only need to add a definition of the <tok> element and redefine the model group for sentence level content, rather than copying the entire xcesDoc schema.
It is expected that in the next version, the XCES schemas will use xsd:redefine.
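As an illustration of the intended approach, a redefine-based xcesMerged.xsd might look roughly like the sketch below. The group name "sentenceContent" and the <tok> type are assumptions made for this example, not the actual XCES declarations; the self-reference inside xsd:redefine is how XML Schema extends an existing group.

```xml
<!-- Hypothetical xcesMerged.xsd using xsd:redefine instead of
     copying xcesDoc.xsd wholesale -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:redefine schemaLocation="xcesDoc.xsd">
    <!-- Extend the base sentence-level content model to allow <tok>;
         the inner group ref stands for the original definition -->
    <xsd:group name="sentenceContent">
      <xsd:choice>
        <xsd:group ref="sentenceContent"/>
        <xsd:element ref="tok"/>
      </xsd:choice>
    </xsd:group>
  </xsd:redefine>

  <!-- New element, available only in the merged format -->
  <xsd:element name="tok" type="xsd:anyType"/>

</xsd:schema>
```

With this arrangement, a change to the base model group in xcesDoc.xsd propagates automatically to the merged schema, which is exactly what the cut-and-paste method forfeits.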
The ANC First Release data has been validated using the XSV schema validator. However, it is possible that a few invalid files escaped notice.
All of the markup and annotation in the first release of the ANC was produced entirely automatically from data in a variety of formats:
- Switchboard : ASCII texts
- CalHome : ASCII texts
- Charlotte : ASCII texts
- Berlitz : QuarkXPress
- Slate : XML generated automatically from HTML
- OUP texts: QuarkXPress
For gross logical structure (down to the level of paragraph), text parts such as title, author, section head, footnotes, quotations, lists, etc. may or may not be marked as such, depending upon whether or not this information was differentiated in some systematic way in the original format. In general, our algorithm assumed that the presence of a carriage return signalled the beginning of a paragraph unless otherwise indicated, and therefore elements unrecognizable as any other type of element are typically enclosed in <p> tags.
At this time, sentence markup is included in the primary texts (although this may be changed in the final version of the corpus). All sentence markup was automatically produced by the sentence splitter included in the Gate system, and there are occasional errors, usually due to the presence of abbreviations in mid-sentence. The sentence splitter also puts punctuation appearing after the terminating period (e.g., closing quotation mark, closing parenthesis) outside the sentence boundary.
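For example, a splitter that places trailing punctuation outside the sentence boundary produces markup along these lines (the bare <s> and <p> tags here are illustrative; the actual ANC elements may carry additional attributes):

```xml
<p><s>He said, "We are done.</s>" <s>Then he left.</s></p>
```

Note that the closing quotation mark after the first sentence falls between the two <s> elements rather than inside the first one.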
Part-of-speech tagging was done automatically at Northern Arizona University using the Biber tagger, with no hand-validation. The annotations include both Biber's part-of-speech tags and lemmas. The Biber tagger has an average accuracy similar to most taggers (95% or higher).
Due to some inconsistencies in the way clitics (e.g., "don't", "let's") are treated by the Biber tagger (some treated as two separate words, some as one with extra information), a few words in the corpus are not annotated for part of speech because they were skipped during post-processing. Words that have not been tagged have not been included in the word counts for the corpus.
Gold standard sub-corpus
The ANC Project has obtained funds from the U.S. National Science Foundation to hand-validate both the structural markup and part-of-speech annotation in a 10 million word subset of the ANC in order to create a "gold standard" corpus. The gold standard corpus will be balanced along the same lines as the entire 100 million+ words of the ANC, and will therefore not coincide exactly with the 10 million words in the first release. The remaining portion of the corpus will be corrected to the extent possible using automated means and whatever level of hand-validation can be accomplished under our budget.
Funky characters in the data
Note on the OUP and Berlitz files: The original QuarkXPress files contained characters encoded with numeric codes that may or may not correspond to an International Organization for Standardization (ISO) hexadecimal character. Where possible, these values have been converted to the corresponding named entity; for example, 0x00E9 becomes &eacute;. When a numeric value did not correspond to any known character defined in the ISO standard, it is encoded as a numeric character reference; e.g., the value 0x1A is encoded as &#x001A;.
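As a rough sketch of this conversion, the snippet below uses Python's HTML named-entity table as a stand-in for the ISO entity sets; the actual mapping used when preparing the ANC may differ.

```python
from html.entities import codepoint2name

def encode_char(code: int) -> str:
    """Map a numeric character code to a named entity where one is
    defined, falling back to a hexadecimal numeric character reference.
    The HTML entity table here is a stand-in for the ISO sets."""
    name = codepoint2name.get(code)
    if name is not None:
        return f"&{name};"
    # Unknown code: emit a numeric character reference instead.
    return f"&#x{code:04X};"

print(encode_char(0x00E9))  # &eacute;
print(encode_char(0x1A))    # &#x001A;
```

The fallback branch guarantees that every code point survives the conversion, even when no named entity exists for it.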