
Document Encoding

The ANC Second Release uses the proposed XCES Markup for Standoff Annotations. Each logical document in the ANC is conceptually a single XML document that conforms to the XCES xcesDoc.xsd schema. Physically, the primary data and its annotations are stored in multiple XML documents that form a directed graph referencing regions of primary data (and potentially, regions defined over other annotations as well). The nodes of the graph are virtual, located between each character in the primary data. Edges defined over the nodes in the graph are labeled with feature structures containing annotation information associated with the data region defined by the edge.

Each logical document in the ANC Second Release consists of the following files:

filename.anc          An XCES header that specifies the location of the content and standoff annotation files.
filename.txt          The primary data (content) of the document.
filename-logical.xml  Standoff markup for the logical structure of the document.
filename-s.xml        Standoff markup for sentence boundaries.
filename-hepple.xml   Standoff markup for Hepple (Penn) part of speech tags.
filename-biber.xml    Standoff markup for Biber part of speech tags.
filename-np.xml       Standoff markup for noun chunks.
filename-vp.xml       Standoff markup for verb chunks.

Primary data is encoded in UTF-16; all other information, including the header and all annotations, is encoded in UTF-8.
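Because the two kinds of file use different encodings, a reader has to open each with the right codec. The following is a minimal Python sketch, not part of any ANC tool; the function names are illustrative. It assumes the primary data file begins with a byte-order mark (without one, Python's "utf-16" codec falls back to the platform's native byte order), and it assumes that standoff offsets count characters of the file as stored, which is why newline translation is switched off.

```python
import os
import tempfile


def read_primary_data(path):
    """Read a primary data file (filename.txt), which is UTF-16 encoded."""
    # newline="" disables newline translation, so "\r\n" stays two characters
    # and character positions in the returned string are not shifted.
    with open(path, "r", encoding="utf-16", newline="") as handle:
        return handle.read()


def read_annotation_source(path):
    """Read a header or standoff annotation file, which is UTF-8 encoded."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return handle.read()


def demo():
    """Round-trip a small UTF-16 file to show that the text is preserved."""
    sample = "    In this chapter, I take up dilemmas\r\n"
    with tempfile.TemporaryDirectory() as folder:
        txt_path = os.path.join(folder, "sample.txt")
        # Writing with the "utf-16" codec emits a byte-order mark first.
        with open(txt_path, "w", encoding="utf-16", newline="") as handle:
            handle.write(sample)
        return sample, read_primary_data(txt_path)
```

Calling demo() returns the original string and the string read back from disk, which should be identical, including the carriage return.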

The representation format that separates primary data and annotations offers considerable flexibility for ANC use; in particular:

Creating a single XML document containing text and annotations

The ANC stand-off format provides flexibility for the creators and users of the ANC, but in many cases users will want to use the corpus with annotations in-line. We provide the "ANC Merge Tool", which enables users to easily generate a single XML document containing the primary data and any annotations of the user's choice from the various stand-off documents. The tool can be downloaded from the ANC tools page, which also provides a description of its use.
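To illustrate what merging involves, here is a minimal Python sketch that weaves standoff regions into the text as in-line tags. It is not the ANC Merge Tool and makes no claim about how that tool works. The function name and the tuple layout (type, from, to, features) are assumptions made for this example, and the sketch assumes that every region is non-empty and that regions nest properly (no crossing spans).

```python
from xml.sax.saxutils import escape, quoteattr


def merge_inline(text, edges):
    """Return text with each (type, start, end, feats) edge written in-line.

    Assumes non-empty spans that nest properly; crossing spans are not handled.
    """
    events = []
    for order, (tag, start, end, feats) in enumerate(edges):
        attrs = "".join(
            f" {name}={quoteattr(value)}" for name, value in feats.items()
        )
        # At one position, closing tags (kind 0) sort before opening tags (kind 1).
        # Among opening tags the longer span comes first (outermost element);
        # among closing tags the span that started later comes first (innermost).
        events.append((start, 1, -end, order, f"<{tag}{attrs}>"))
        events.append((end, 0, -start, -order, f"</{tag}>"))
    events.sort()

    pieces = []
    cursor = 0
    for position, _kind, _tie, _order, markup in events:
        # Escape &, < and > in the primary data so the result stays valid XML.
        pieces.append(escape(text[cursor:position]))
        pieces.append(markup)
        cursor = position
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


sample_text = "    In this chapter, I"
sample_edges = [
    ("s", 4, 22, {"id": "p1s1"}),
    ("tok", 4, 6, {"msd": "IN"}),
    ("NounChunk", 7, 19, {}),
    ("tok", 7, 11, {"msd": "DT"}),
    ("tok", 12, 19, {"msd": "NN"}),
]
merged = merge_inline(sample_text, sample_edges)
```

The sample regions are shortened for illustration (the sentence is cut off after "I"). In the merged string the two tokens "this" and "chapter" end up inside the NounChunk element, which in turn sits inside the sentence element.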

Standoff Annotations

The edge set(s) of an annotation graph are represented in one or more standoff annotation files. Each standoff annotation file includes a series of annotations consisting of one or more features, represented in XML with <struct> and <feat> tags respectively. Each <struct> specifies an edge (i.e., range of primary data) with from and to attributes that reference nodes in the node set of the primary data. For example, given the following text taken from the file written/non-fiction/OUP/Berk/ch7.txt:

    In this chapter, I take up dilemmas that today's parents face...

We have an assumed node between each character:

                    1                   2                   3                   4                   5                   6
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
| | | | |I|n| |t|h|i|s| |c|h|a|p|t|e|r|,| |I| |t|a|k|e| |u|p| |d|i|l|e|m|m|a|s| |t|h|a|t| |t|o|d|a|y|'|s| |p|a|r|e|n|t|s| |f|a|c|e| 
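Since node n sits immediately before character n, a region from node "from" to node "to" corresponds to an ordinary half-open string slice. The short Python sketch below illustrates this; the helper name is illustrative, and the sample string assumes that the four positions before "In" are spaces.

```python
# Sample primary data; the four leading spaces are an assumption of this sketch.
text = "    In this chapter, I take up dilemmas that today's parents face"


def region(primary_data, start_node, end_node):
    """Return the characters lying between two nodes of the primary data."""
    # Node n lies just before character n, so this is a half-open slice.
    return primary_data[start_node:end_node]


first_word = region(text, 4, 6)      # "In"
second_word = region(text, 7, 11)    # "this"
noun_phrase = region(text, 7, 19)    # "this chapter"
```

The length of a region is simply the difference between its two node numbers.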

Edges in the graph are then defined in the standoff annotation files:

ch7-logical.xml

<?xml version="1.0" encoding="UTF-8"?>
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
<struct type="cesDoc" from="0" to="65865">
<feat name="xmlns" value="http://www.xces.org/schema/2003"/>
<feat name="version" value="1.0.4"/>
</struct>
<struct type="text" from="1" to="65864"/>
<struct type="body" from="2" to="65863"/>
<struct type="div" from="3" to="65862">
<feat name="type" value="article"/>
<feat name="xml:lang" value="en-US"/>
</struct>
<struct type="p" from="4" to="719">
<feat name="id" value="p1"/>
</struct>
...
</cesAna>

ch7-s.xml

<?xml version="1.0" encoding="UTF-8"?>
<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
<struct type="s" from="4" to="92">
<feat name="id" value="p1s1"/>
</struct>
<struct type="s" from="93" to="200">
<feat name="id" value="p1s2"/>
</struct>
<struct type="s" from="201" to="718">
<feat name="id" value="p1s3"/>
</struct>
...
</cesAna>

ch7-hepple.xml

<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
<struct type="tok" from="4" to="6">
<feat name="base" value="in"/>
<feat name="msd" value="IN"/>
</struct>
<struct type="tok" from="7" to="11">
<feat name="msd" value="DT"/>
<feat name="base" value="this"/>
<feat name="affix" value=" "/>
</struct>
<struct type="tok" from="12" to="19">
<feat name="base" value="chapter"/>
<feat name="msd" value="NN"/>
</struct>
...
</cesAna>

ch7-np.xml

<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
<struct type="NounChunk" from="7" to="19"/>
<struct type="NounChunk" from="21" to="22"/>
<struct type="NounChunk" from="31" to="39"/>
...
</cesAna>
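Files of this shape can be read with any namespace-aware XML parser. The following Python sketch uses the standard library's ElementTree to turn each <struct> into a small dictionary and then resolves it against the primary data. The function names and the dictionary layout are assumptions made for this example, the embedded XML is the ch7-hepple.xml excerpt above with the elided part left out, and the sample text assumes four spaces before "In".

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <cesAna> root element of the standoff files.
XCES_NS = "{http://www.xces.org/schema/2003}"

HEPPLE_SAMPLE = """<cesAna xmlns="http://www.xces.org/schema/2003" version="1.0.4">
<struct type="tok" from="4" to="6">
<feat name="base" value="in"/>
<feat name="msd" value="IN"/>
</struct>
<struct type="tok" from="7" to="11">
<feat name="msd" value="DT"/>
<feat name="base" value="this"/>
<feat name="affix" value=" "/>
</struct>
<struct type="tok" from="12" to="19">
<feat name="base" value="chapter"/>
<feat name="msd" value="NN"/>
</struct>
</cesAna>"""

PRIMARY_SAMPLE = "    In this chapter, I take up dilemmas"


def parse_structs(xml_source):
    """Return one dict per <struct>: its type, node offsets, and features."""
    root = ET.fromstring(xml_source)
    edges = []
    for struct in root.iter(XCES_NS + "struct"):
        # Collect the <feat> children as a name -> value mapping.
        feats = {
            feat.get("name"): feat.get("value")
            for feat in struct.findall(XCES_NS + "feat")
        }
        edges.append(
            {
                "type": struct.get("type"),
                "from": int(struct.get("from")),
                "to": int(struct.get("to")),
                "feats": feats,
            }
        )
    return edges


def tagged_tokens(primary_data, edges):
    """Pair the text of each "tok" edge with its part of speech tag (msd)."""
    return [
        (primary_data[edge["from"]:edge["to"]], edge["feats"].get("msd"))
        for edge in edges
        if edge["type"] == "tok"
    ]


tokens = tagged_tokens(PRIMARY_SAMPLE, parse_structs(HEPPLE_SAMPLE))
```

For the excerpt, tokens pairs "In", "this", and "chapter" with the tags IN, DT, and NN. The same parse_structs function should work for the other standoff files, since they share the <struct> and <feat> layout.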