Document Encoding
The ANC Second Release uses an early (now obsolete) version of the ISO LAF/GrAF representation format. Each logical document in the ANC is conceptually a single XML document that conforms to the XCES xcesDoc.xsd schema. Physically, the primary data and its annotations are stored in separate files; the annotation documents form a directed graph referencing regions of primary data (and potentially, regions defined over other annotations as well). The nodes of the graph are virtual, located between the characters of the primary data. Edges defined over the nodes in the graph are labeled with feature structures containing annotation information associated with the data region defined by the edge.
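The graph model described above can be pictured with a small data structure. The following is a minimal Python sketch, not part of the ANC distribution: the `Edge` class, its field names, the `region` helper, and the sample sentence are all hypothetical, and it assumes node 0 sits before the first character so that node indices coincide with Python slice indices.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """An annotation edge between two virtual nodes of the primary data."""
    kind: str      # annotation type, e.g. "tok", "s", "NounChunk"
    start: int     # node before the first character of the region
    end: int       # node after the last character of the region
    features: dict = field(default_factory=dict)  # the feature structure


def region(primary: str, edge: Edge) -> str:
    """Return the stretch of primary data that the edge spans."""
    return primary[edge.start:edge.end]


# Hypothetical primary data with two annotation layers over the same nodes.
primary = "Dogs bark."
edges = [
    Edge("s", 0, 10),
    Edge("tok", 0, 4, {"base": "dog", "msd": "NNS"}),
    Edge("tok", 5, 9, {"base": "bark", "msd": "VBP"}),
]
for e in edges:
    print(e.kind, repr(region(primary, e)), e.features)
```

Because every layer refers to the same node set, edges from different annotation files can be combined or ignored without touching the primary data.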
Each logical document in the ANC Second Release consists of the following files:
filename.anc | An XCES header that specifies the location of the content and standoff annotation files. |
filename.txt | The primary data (content) of the document. |
filename-logical.xml | Standoff markup for the logical structure of the document. |
filename-s.xml | Standoff markup for sentence boundaries. |
filename-hepple.xml | Standoff markup for Hepple (Penn) part of speech tags. |
filename-biber.xml | Standoff markup for Biber part of speech tags. |
filename-np.xml | Standoff markup for noun chunks. |
filename-vp.xml | Standoff markup for verb chunks. |
Primary data is encoded in UTF-16; all other information, including the header and all annotations, is encoded in UTF-8.
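The encoding difference matters when reading the files programmatically. The sketch below shows one way to read a primary data file in Python; the function name `read_primary` and the sample file are hypothetical. It assumes that annotation offsets count every character as stored in the file, which is why newline translation is switched off. Python's `utf-16` codec honours a byte-order mark when one is present; if the files carry none, an explicit codec such as `utf-16-le` or `utf-16-be` may be needed, and the source does not say which applies.

```python
import os
import tempfile


def read_primary(path):
    """Read a UTF-16 primary data file without altering any characters."""
    # newline="" disables newline translation, so "\r\n" stays two characters
    # and character offsets into the returned string are not shifted.
    with open(path, encoding="utf-16", newline="") as handle:
        return handle.read()


# Demonstration on a throwaway file (not an actual ANC file).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as handle:
        handle.write("line one\r\nline two".encode("utf-16"))
    text = read_primary(path)

print(len(text), repr(text[10:14]))
```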
The representation format that separates primary data and annotations offers considerable flexibility for ANC use; in particular:
- users can create a single XML document containing the primary data with their choice of annotations in-line (often needed for use with existing tools); see below for more information
- the primary text can be used with no markup or annotations if desired (which is commonly the case for concordance generation, etc.)
- the user can choose to deal with a particular annotation set independent of the text (e.g., to generate statistics for POS taggers or parsers)
- the ANC can include annotations of many different types, or several versions of a single annotation type (e.g., multiple part of speech taggings) without encountering compatibility problems
- the ANC project can distribute annotations independent of the text via download from the ANC site; because the annotations contain links to the original data, any user who has obtained the ANC from the LDC can use the annotations with the corpus.
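As an illustration of working with one annotation set independently of the text, the following Python sketch counts part of speech tags in a standoff file without ever opening the primary data. It is only a sketch: the function name `tag_counts` is hypothetical, and it assumes the tag is stored in a feature named `msd` on `tok` structures in the XCES namespace, as in the Hepple sample shown later in this document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XCES = "{http://www.xces.org/schema/2003}"


def tag_counts(xml_text):
    """Count the values of the "msd" feature over all "tok" structures."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for struct in root.iter(XCES + "struct"):
        if struct.get("type") != "tok":
            continue
        for feat in struct.iter(XCES + "feat"):
            if feat.get("name") == "msd":
                counts[feat.get("value")] += 1
    return counts


# A fragment in the shape of a filename-hepple.xml file.
sample = """<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="tok" from="4" to="6">
    <feat name="base" value="in"/> <feat name="msd" value="IN"/>
  </struct>
  <struct type="tok" from="7" to="11">
    <feat name="msd" value="DT"/> <feat name="base" value="this"/>
  </struct>
  <struct type="tok" from="12" to="19">
    <feat name="base" value="chapter"/> <feat name="msd" value="NN"/>
  </struct>
</cesAna>"""

print(tag_counts(sample))
```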
Creating a single XML document containing text and annotations
The ANC stand-off format provides flexibility for the creators and users of the ANC, but in many cases users will want to use the corpus with annotations in-line. We provide the “ANC Merge Tool”, which enables users to easily generate a single XML document containing the primary data and any annotations of their choice from the various stand-off documents. The tool can be downloaded from the ANC tools page, which also provides a description of its use.
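To give a feel for what merging involves, here is a minimal Python sketch that places one layer of non-overlapping standoff spans in-line. It is not the ANC Merge Tool and does not reproduce its behavior or output format: the function `merge_inline` and its span tuples are hypothetical, nested or overlapping layers are not handled, and the sample text assumes four leading characters before "In", as the offsets used later in this document suggest.

```python
from xml.sax.saxutils import escape, quoteattr


def merge_inline(text, spans):
    """Wrap each (start, end, tag, feats) span of text in an XML element.

    Spans must not overlap; text outside any span is copied through escaped.
    """
    out = []
    pos = 0
    for start, end, tag, feats in sorted(spans, key=lambda s: s[0]):
        if start < pos:
            raise ValueError("overlapping spans are not supported")
        out.append(escape(text[pos:start]))
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in feats.items())
        out.append(f"<{tag}{attrs}>{escape(text[start:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


text = "    In this chapter, I take up dilemmas"
chunks = [(7, 19, "NounChunk", {}), (21, 22, "NounChunk", {}),
          (31, 39, "NounChunk", {})]
print(merge_inline(text, chunks))
```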
Standoff Annotations
The edge set(s) of an annotation graph are represented in one or more standoff annotation files. Each standoff annotation file includes a series of annotations consisting of one or more features, represented in XML with <struct> and <feat> tags respectively. Each <struct> specifies an edge (i.e., range of primary data) with from and to attributes that reference nodes in the node set of the primary data. For example, given the following text taken from the file written/non-fiction/OUP/Berk/ch7.txt:
In this chapter, I take up dilemmas that today’s parents face…
We have an assumed node between each character:
0                   1                   2                   3                   4                   5                   6
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
| | | | |I|n| |t|h|i|s| |c|h|a|p|t|e|r|,| |I| |t|a|k|e| |u|p| |d|i|l|e|m|m|a|s| |t|h|a|t| |t|o|d|a|y|'|s| |p|a|r|e|n|t|s| |f|a|c|e|
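A node diagram of this kind can be generated for any stretch of text, which is handy when checking offsets by eye. The helper below is a hypothetical Python sketch (the name `node_diagram` is not part of any ANC tool); it numbers nodes from 0, with node 0 placed before the first character, and is meant to be viewed in a monospaced font.

```python
def node_diagram(text):
    """Return three aligned rows: tens digits, units digits, and characters.

    Each node is drawn as a bar; the digit rows give the node's index.
    """
    nodes = range(len(text) + 1)
    # A tens digit is shown only above nodes 0, 10, 20, ...
    tens = " ".join(str(i // 10 % 10) if i % 10 == 0 else " " for i in nodes)
    units = " ".join(str(i % 10) for i in nodes)
    chars = "".join("|" + c for c in text) + "|"
    return tens.rstrip(), units, chars


for row in node_diagram("    In this"):
    print(row)
```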
Edges in the graph are then defined in the standoff annotation files:
ch7-logical.xml
<?xml version="1.0" encoding="UTF-8"?>
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="cesDoc" from="0" to="65865">
    <feat name="xmlns" value="http://www.xces.org/schema/2003"/>
    <feat name="version" value="1.0.4"/>
  </struct>
  <struct type="text" from="1" to="65864"/>
  <struct type="body" from="2" to="65863"/>
  <struct type="div" from="3" to="65862">
    <feat name="type" value="article"/>
    <feat name="xml:lang" value="en-US"/>
  </struct>
  <struct type="p" from="4" to="719">
    <feat name="id" value="p1"/>
  </struct>
  ...
</cesAna>
ch7-s.xml
<?xml version="1.0" encoding="UTF-8"?>
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="s" from="4" to="92">
    <feat name="id" value="p1s1"/>
  </struct>
  <struct type="s" from="93" to="200">
    <feat name="id" value="p1s2"/>
  </struct>
  <struct type="s" from="201" to="718">
    <feat name="id" value="p1s3"/>
  </struct>
  ...
</cesAna>
ch7-hepple.xml
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="tok" from="4" to="6">
    <feat name="base" value="in"/>
    <feat name="msd" value="IN"/>
  </struct>
  <struct type="tok" from="7" to="11">
    <feat name="msd" value="DT"/>
    <feat name="base" value="this"/>
    <feat name="affix" value=" "/>
  </struct>
  <struct type="tok" from="12" to="19">
    <feat name="base" value="chapter"/>
    <feat name="msd" value="NN"/>
  </struct>
  ...
</cesAna>
ch7-np.xml
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="NounChunk" from="7" to="19"/>
  <struct type="NounChunk" from="21" to="22"/>
  <struct type="NounChunk" from="31" to="39"/>
  ...
</cesAna>
etc.
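The samples above can be tied back to the primary data with a few lines of code. The following Python sketch parses a standoff fragment and resolves each edge to the text it spans. The helper `read_structs` is hypothetical, the fragment is an abbreviated copy of the ch7-hepple.xml sample, and the primary text assumes four leading characters before "In", as implied by the `from="4"` offset of the first token.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.xces.org/schema/2003"}


def read_structs(xml_text):
    """Return (type, from, to, features) for each <struct> in a cesAna document."""
    root = ET.fromstring(xml_text)
    structs = []
    for struct in root.findall("x:struct", NS):
        feats = {f.get("name"): f.get("value")
                 for f in struct.findall("x:feat", NS)}
        structs.append((struct.get("type"), int(struct.get("from")),
                        int(struct.get("to")), feats))
    return structs


standoff = """<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
  <struct type="tok" from="4" to="6">
    <feat name="base" value="in"/> <feat name="msd" value="IN"/>
  </struct>
  <struct type="tok" from="7" to="11">
    <feat name="msd" value="DT"/> <feat name="base" value="this"/>
  </struct>
  <struct type="tok" from="12" to="19">
    <feat name="base" value="chapter"/> <feat name="msd" value="NN"/>
  </struct>
</cesAna>"""

primary = "    In this chapter, I take up dilemmas"
for kind, start, end, feats in read_structs(standoff):
    # from/to are node indices, so they can be used directly as slice bounds.
    print(kind, repr(primary[start:end]), feats["msd"])
```

The same function works for the sentence, logical structure, and chunk files, since they all share the `<struct>`/`<feat>` layout.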