Conformance to standards and best practice
The ANC has been created with the intention of adhering, to the extent possible, to existing and emerging standards and “best practices” for markup and the representation of language resources and their annotations. These include W3C markup and data representation standards and the recommendations of the International Organization for Standardization (ISO) sub-committee for language resources (ISO TC37 SC4). While this keeps the ANC at the state of the art and (we hope) processable by a wide variety of available tools and web-based applications, it also means that we are relying on recommendations that have yet to be finalized (and therefore might change) and on processing capabilities that are not yet widely implemented. For this reason, a number of encoding choices have been made with an eye toward enabling immediate use of the ANC, while at the same time providing for adaptation to standards and practices that emerge in the future:
- The ANC Second Release data is provided with stand-off annotations. While stand-off annotation is widely accepted as the preferred format, few processors at present handle it. A program to produce a merged form of the corpus based on any of the additional annotations is provided.
- Although XInclude is the obvious means by which to include header fragments, we have so far relied on XLink for this purpose. Later releases of the ANC may use XInclude.
- The files produced by the merge tool will validate with the XCES schemas if the logical annotation set is used. The files produced by merging the other annotation sets may not validate. In particular, the tags used to annotate noun and verb chunks are not defined by the XCES.
- Currently, the parser that is used to merge the content and standoff annotations does not correctly handle overlapping hierarchies. However, a new parser will be released shortly that will allow you to output overlapping hierarchies as-is (for importing into applications that can handle overlapping hierarchies directly, say), to convert overlapping hierarchies into properly nested elements, or to use HORSE milestones.
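To make the merge step above concrete, here is a minimal sketch of merging stand-off annotations into inline XML. It assumes annotations are represented as (start, end, tag) character offsets into the base text and that they nest properly; this representation and the function name are illustrative assumptions, not the ANC merge tool's actual format or interface.

```python
# Sketch: merge character-offset stand-off annotations into inline XML.
# Assumes (start, end, tag) offsets that are properly nested; it does not
# handle the overlapping hierarchies discussed above.

def merge_standoff(text, annotations):
    """Insert inline tags for properly nested (start, end, tag) spans."""
    events = []
    for start, end, tag in annotations:
        events.append((start, 1, "<%s>" % tag))   # opening tag
        events.append((end, 0, "</%s>" % tag))    # closing tag
    # Sort by offset; at equal offsets, closes (0) precede opens (1),
    # so adjacent spans come out correctly nested.
    events.sort(key=lambda e: (e[0], e[1]))
    out, pos = [], 0
    for offset, _, markup in events:
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)

print(merge_standoff("The cat sat.", [(0, 12, "s"), (4, 7, "w")]))
# -> <s>The <w>cat</w> sat.</s>
```

A real merge tool must also decide what to do when two spans cross rather than nest, which is exactly the overlapping-hierarchies problem noted above.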
The ANC Second Release data has been validated using the Xerces 2 Java Parser. However, it is possible that a few invalid files escaped notice.
All of the markup and annotation in the first release of the ANC was produced entirely automatically from data in a variety of formats:
- Switchboard : ASCII texts
- CallHome : ASCII texts
- Charlotte : ASCII texts
- Micase : XML
- 911 Report : PDF
- Berlitz : Quark Express
- Biomed : XML
- Buffy : HTML
- Hargraves : Word file
- Eggan : Word file
- ICIC : ASCII text
- NY Times : SGML
- OUP : Quark Express
- PLOS : XML
- Slate : HTML
- Verbatim : SGML
- Web data : HTML
For gross logical structure (down to the level of paragraph), text parts such as title, author, section head, footnotes, quotations, lists, etc. may or may not be marked as such, depending on whether this information was differentiated in some systematic way in the original format. In general, our algorithm assumed that the presence of a carriage return signalled the beginning of a paragraph unless otherwise indicated, and therefore elements unrecognizable as any other type of element are typically enclosed in <p> tags.
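The carriage-return heuristic described above can be sketched as follows. This is a simplified illustration of the general idea, not the actual ANC conversion script, which handles many more cases.

```python
# Sketch of the paragraph heuristic: treat each line break as a paragraph
# boundary and wrap otherwise-unrecognized text in <p> tags.

def wrap_paragraphs(raw):
    """Wrap each non-empty line of `raw` in a <p> element."""
    paragraphs = [line.strip() for line in raw.split("\n") if line.strip()]
    return "\n".join("<p>%s</p>" % p for p in paragraphs)

sample = "First paragraph of the text.\nSecond paragraph.\n"
print(wrap_paragraphs(sample))
# -> <p>First paragraph of the text.</p>
#    <p>Second paragraph.</p>
```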
All sentence markup was automatically produced by the sentence splitter included in the Gate system, and there are occasional errors, usually due to the presence of unrecognized abbreviations in mid-sentence. The sentence splitter also puts punctuation appearing after the terminating period (e.g., closing quotation mark, closing parenthesis) outside the sentence boundary.
Part-of-speech tagging was done automatically at Northern Arizona University using the Biber tagger, with no hand-validation. The annotations include both Biber’s part-of-speech tags and lemmas. The Biber tagger has an average accuracy similar to most taggers (95% or higher). Additional part-of-speech tags have also been generated with the Hepple part-of-speech tagger included in Gate.
Due to some inconsistencies in the way clitics (e.g., “don’t”, “let’s”) are treated by the Biber tagger (some are treated as two separate words, others as one word with extra information), a few words in the corpus are not annotated for part of speech because they were skipped during post-processing. Words that have not been tagged are not included in the word counts for the corpus.
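The two clitic treatments mentioned above can be illustrated as follows. The tag names here are invented for illustration only; they are not Biber’s actual tagset, and the real tagger’s behavior is more complex.

```python
# Illustration of two ways a tagger might treat a clitic such as "don't":
# as two separate tokens, or as one token carrying extra information.
# Tag names (VB, NEG, VB+NEG, UNK) are invented for this sketch.

def split_clitic(word):
    """Treat an n't clitic as two separate tokens."""
    if word.endswith("n't"):
        return [(word[:-3], "VB"), ("n't", "NEG")]
    return [(word, "UNK")]

def keep_clitic(word):
    """Treat an n't clitic as a single token with combined information."""
    if word.endswith("n't"):
        return [(word, "VB+NEG")]
    return [(word, "UNK")]

print(split_clitic("don't"))  # two tokens: do + n't
print(keep_clitic("don't"))   # one token with a combined tag
```

A post-processor that expects one of these representations will silently skip words produced in the other, which is how the untagged words described above arise.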
Gold standard sub-corpus
The ANC Project has obtained funds from the U.S. National Science Foundation to hand-validate both the structural markup and part of speech annotation in a 10 million word subset of the ANC in order to create a “gold standard” corpus. The gold standard corpus will be balanced along the same lines as the entire 100 million+ words of the ANC, and will therefore not coincide exactly with the 20 million words in the second release. The remaining portion of the corpus will be corrected to the extent possible using automated means and whatever level of hand-validation can be accomplished under our budget.
Funky characters in the data
Note on the OUP and Berlitz files: the original Quark Express files contained characters encoded with numeric codes that may or may not correspond to an International Organization for Standardization (ISO) hexadecimal character. Where possible, these values have been converted to the corresponding entity; for example, 0x00E9 becomes é. When a numeric value did not correspond to any known character defined in the ISO standard, it is encoded as a numeric entity; e.g., the value 0x1A is encoded as &#x1A;.
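The conversion described above can be sketched as follows. The test for a “known ISO character” here (a printable character in the Latin-1 range) is a crude stand-in for the real character tables used in the conversion.

```python
# Sketch: convert a Quark numeric character code to a literal character
# when it corresponds to a defined character, otherwise fall back to a
# numeric character entity. The "known character" test is an assumption.

def code_to_entity(code):
    """Return a literal character for known codes, else a numeric entity."""
    ch = chr(code)
    if ch.isprintable() and code < 0x100:  # crude stand-in for "known ISO character"
        return ch
    return "&#x%X;" % code                 # e.g. 0x1A -> &#x1A;

print(code_to_entity(0x00E9))  # -> é
print(code_to_entity(0x1A))    # -> &#x1A;
```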