Encoding Conventions


ANC encoding conventions
 Header
 Data (stand-off version)
 Format for written data
 Format for spoken data
 Annotation
 Data (merged version)
Morpho-syntactic (part-of-speech) annotation

ANC Encoding Conventions

The ANC is encoded in XML and conforms to the XML Corpus Encoding Standard (XCES) schemas for primary data and annotations. The XCES schemas are included on the CD with the ANC First Release data. The XCES is compliant with emerging standards for representing data, including various W3C standards (e.g., XPointer for inter-document linking). We expect that in the near future the XCES and the ANC will be updated and augmented to accommodate additional emerging standards such as the Resource Description Framework (RDF) and the recommendations of the International Organization for Standardization (ISO) sub-committee for language resources (ISO TC37 SC4).

The texts in the corpus are marked up to the level of the paragraph and, within paragraphs, for sentence boundaries. Following XCES recommendations, a "stand-off" annotation strategy is used, meaning that annotations are contained in a separate XML document linked to the original. The associated annotation files identify word (token) boundaries and provide the morpho-syntactic description (part of speech) and lemma for each token in the corpus. Because few processors handle stand-off annotation at this time, a "merged" version of the corpus is also provided, in which each token is explicitly marked with <tok> tags and its part of speech and lemma are given as the values of the msd and base attributes, respectively.

All primary ANC documents currently contain sentence boundary markup, which was added to make some types of processing easier. However, we realize that sentence markup is a type of linguistic annotation that may vary depending on the particular linguistic theory and/or processing software applied to the data. We welcome user feedback to help us decide whether to include sentence markup in the primary data in the final release of the ANC.

Header

The header contains information about the provenance of the data, the creators and distributors, tag usage, and information concerning domain, subdomain, subject, audience, and medium. These categories follow the classification scheme used for the British National Corpus but contain some more specific information as well. The header also contains a link to the document containing the annotations for the data, as in the following example:

<annotations>
 <annotation type="content" ann.loc="HistoryDublin.xml" />
 <annotation type="part of speech" ann.loc="HistoryDublin-ana.xml" />
</annotations>
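
As an illustration of how these links might be read, here is a minimal Python sketch that maps each annotation type to the file named in its ann.loc attribute. It parses only the fragment shown above; the function name annotation_locations is hypothetical, and a real header file may place these elements in an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the header example above (no namespace assumed).
HEADER_FRAGMENT = """
<annotations>
 <annotation type="content" ann.loc="HistoryDublin.xml" />
 <annotation type="part of speech" ann.loc="HistoryDublin-ana.xml" />
</annotations>
"""


def annotation_locations(xml_text):
    """Map each annotation type to the file named in its ann.loc attribute."""
    root = ET.fromstring(xml_text)
    return {a.get("type"): a.get("ann.loc") for a in root.iter("annotation")}


links = annotation_locations(HEADER_FRAGMENT)
print(links["part of speech"])
```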

Header files for spoken data contain additional information concerning the speakers and the situation in which the dialogue occurred. This information is contained in the <profileDesc> element within the header; for example:

<profileDesc>
 <textClass>
  <subject>CLOTHING AND DRESS</subject>
  <audience>Adult</audience>
  <medium>Spoken</medium>
 </textClass>
 <particDesc>
  <person age="1956" id="spkr1020" role="caller" sex="F">
   Dialect : NORTH MIDLAND, Side : A, Education : 2,
   Partition : DN2</person>
  <person age="1962" id="spkr1044" role="callee" sex="F">
   Dialect : SOUTH MIDLAND,Side : B, Education : 1,
   Partition : UNC</person>
 </particDesc>
 <settingDesc>
  <setting who="spkr1020 spkr1044">
   <time>910304 1218</time>
   <activity>
    THE TOPIC IS CLOTHING. PLEASE FIND OUT HOW
    THE OTHER CALLER TYPICALLY DRESSES FOR WORK.
    HOW MUCH VARIATION IS THERE FROM DAY TO DAY?
    HOW MUCH VARIATION IS THERE FROM SEASON TO SEASON?
   </activity>
  </setting>
 </settingDesc>
</profileDesc>
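
The speaker details inside each <person> element are plain "Key : value" text rather than attributes. The following Python sketch shows one way they might be pulled apart and indexed by the Side code, which is the code the transcription uses to identify speakers. The function names are hypothetical, and the sketch assumes that no value contains a comma or a colon, which holds for the example above but is not guaranteed for the whole corpus.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the <profileDesc> example above.
PARTIC_DESC = """
<particDesc>
 <person age="1956" id="spkr1020" role="caller" sex="F">
  Dialect : NORTH MIDLAND, Side : A, Education : 2,
  Partition : DN2</person>
 <person age="1962" id="spkr1044" role="callee" sex="F">
  Dialect : SOUTH MIDLAND,Side : B, Education : 1,
  Partition : UNC</person>
</particDesc>
"""


def parse_person(person):
    """Return the <person> attributes plus its 'Key : value' text fields."""
    info = dict(person.attrib)
    for field in (person.text or "").split(","):
        key, sep, value = field.partition(":")
        if sep:  # skip fragments that contain no colon
            info[key.strip()] = " ".join(value.split())
    return info


def speakers_by_side(xml_text):
    """Index the speakers by their Side code (assumed unique per dialogue)."""
    root = ET.fromstring(xml_text)
    people = [parse_person(p) for p in root.iter("person")]
    return {p["Side"]: p for p in people if "Side" in p}


by_side = speakers_by_side(PARTIC_DESC)
print(by_side["B"]["id"], by_side["B"]["Dialect"])
```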

Data (stand-off version)

The data file contains the text or speech transcription, marked up to the level of the paragraph and, within paragraphs, for sentences. Individual words or strings may additionally be marked with information about the original typography; for example, a word that was italicized in the original is marked with <hi rend="ital">.

Format for written data

The following is a sample of the first few lines of a data file for written text:
<?xml version="1.0" encoding="utf-8"?>
<doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xmlns:xlink="http://www.w3.org/1999/xlink" 
 xmlns="http://www.xces.org/schema/2003" 
 version="1.0" 
 xsi:schemaLocation="http://www.xces.org/schema/2003
 /ANC/xcesDoc.xsd">
<xcesHeader xlink:href="HistoryDublin-header.xml"/>
<text>
 <body>
 <div type="chapter">
 
 <s id="p1s1">A Brief History</s>

 
 
 <s id="p2s1">Celtic Ireland</s>
 
 
 <s id="p3s1">
 Ireland has been inhabited since very 
 ancient times, but Irish history really 
 begins with the arrival of the Celts around 
 the 6th century b.c.</s>
 <s id="p3s2">
 Ireland&#8217;s first documented invasion.</s>
 <s id="p3s3">
 They brought with them iron weapons and chariots and 
 codes of custom and conduct that quickly became 
 dominant in the country.</s>

 <s id="p3s4">This is the period of myths and legends, 
 later romanticized by Irish writers, that 
 still exercise their power today.</s>
 
 ...

The <xcesHeader> element: a reference to the header file. This brings in the contents of the header file as part of the logical document containing the data.

The <s id="p1s1"> element: automatic processing marks the title as a paragraph containing a single sentence. This occurs when there is no consistent identification of titles and similar elements in the original encoding. The final release will correct this sort of inaccuracy to the extent feasible.

The boundary between sentences p3s1 and p3s2: an incorrect sentence boundary, caused by the abbreviation "b.c." in the middle of the sentence. To be corrected in later releases.

The entity &#8217; in sentence p3s2 represents the apostrophe. Most browsers will interpret the code and display the corresponding character.
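
To make the sentence markup concrete, the following Python sketch lists the sentences of a primary data file by id. It uses a cut-down sample built from the lines above; the function name sentences is hypothetical. Note that the XML parser resolves &#8217; to the corresponding character, and that collapsing whitespace, as done here for display, changes character positions, so this output should not be combined with character offsets from the annotation files.

```python
import xml.etree.ElementTree as ET

XCES_NS = "http://www.xces.org/schema/2003"

# Cut-down sample based on the written data file shown above.
SAMPLE = """<doc xmlns="http://www.xces.org/schema/2003" version="1.0">
<text>
 <body>
  <div type="chapter">
   <s id="p1s1">A Brief History</s>
   <s id="p3s2">
   Ireland&#8217;s first documented invasion.</s>
  </div>
 </body>
</text>
</doc>
"""


def sentences(xml_text):
    """Yield (id, text) for every <s> element, with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    for s in root.iter("{%s}s" % XCES_NS):
        yield s.get("id"), " ".join("".join(s.itertext()).split())


for sid, text in sentences(SAMPLE):
    print(sid, text)
```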

Format for spoken data

Spoken data is marked for turn (<turn>) and utterance (<u>). A turn signifies a change in speaker; utterances are (roughly) the same as sentences. Each <turn> element includes a who attribute identifying the speaker, using the Side code given in the header for the corresponding <person> element.
<?xml version="1.0" encoding="UTF-8"?>
<doc xsi:schemaLocation="http://www.xces.org/schema/2003 
 /ANC/xcesSpoken.xsd" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 version="1.0" 
 xmlns="http://www.xces.org/schema/2003" 
 xmlns:xlink="http://www.w3.org/1999/xlink">
 <xcesHeader xlink:href="sw2001-ms98-a-trans-header.xml"/>
 <body>
 	<turn who="B" id="b1">
 	 <u start="0.000000" end="2.655625" id="sw2001B-0001">
 okay hi
 	 </u>
 	</turn>
 	<turn who="A" id="a1">
 	 

 hi um yeah i'd like to talk about how you dress 
 for work and and um what do you normally what type 
 of outfit do you normally have to wear
 	</turn>
 	<turn who="B" id="b2">
 

 well i work in uh corporate control so we have to 
 dress kind of nice so i usually wear skirts and 
 sweaters in the winter time slacks i guess 
 and in the summer just dresses
 	</turn>
 	<turn who="A" id="a2">
 		

 um-hum
 	</turn>
 	<turn who="B" id="b3">
 		

 we can't even well we're not even really supposed 
 to wear jeans very often
 		
 so it really doesn't vary that much from season to 
 season since the office is kind of you know always 
 the same temperature
 	</turn>
 	<turn who="A" id="a3">
 		

 and is
 		
 right right is there is there um any is there a like 
 a code of dress where you work do they ask
 	</turn>	
	 ... 

Annotation file

The annotation file contains the part of speech and lemma for each word in the text. Annotations are linked to the primary data using a version of the W3C XPointer syntax to identify the string to which the annotation applies. Note that XPointer syntax is not yet stable; the standard has in fact been modified since we adopted its proposed syntax. Later versions of the ANC will adapt to further changes, and a script will be provided to convert earlier versions of the corpus.

The following sample from an annotation file applies to the first few lines of the written text above. Note that the annotation files for spoken data are in this same format, except that the "chunks" refer to utterances rather than sentences.
<?xml version="1.0" encoding="UTF-8"?> 
<ana xmlns="http://www.xces.org/schema/2003" version="1.0"

 xmlns:xlink="http://www.w3.org/1999/xlink"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.xces.org/schema/2003 /ANC/xcesAna.xsd">
 <chunklist xml:base="HistoryDublin.xml">
 <chunk type="sentence" xml:base="#p1s1">

 <tok xlink:href="xpointer(string-range('', 0, 1))">
 <msd>at++++</msd>
 <base>a</base>
 </tok>

 <tok xlink:href="xpointer(string-range('', 2, 7))">
 <msd>jj+atrb+++</msd>
 <base>brief</base>
 </tok>

 <tok xlink:href="xpointer(string-range('', 8, 15))">
 <msd>nn++++</msd>
 <base>history</base> 
 </tok>
 </chunk>

 <chunk type="sentence" xml:base="#p2s1">
 <tok xlink:href="xpointer(string-range('', 0, 6))">
 <msd>jj+atrb+++</msd>
 <base>celtic</base>

 </tok>
 <tok xlink:href="xpointer(string-range('', 7, 14))">
 <msd>np++++</msd>
 <base>ireland</base>

 </tok> 
 </chunk>
 <chunk type="sentence" xml:base="#p3s1">
 <tok xlink:href="xpointer(string-range('', 0, 7))">
 <msd>np++++</msd>

 <base>ireland</base>
 </tok>
 <tok xlink:href="xpointer(string-range('', 8, 11))">
 <msd>vbz+hvz+aux++</msd>

 <base>have</base>
 </tok>
 <tok xlink:href="xpointer(string-range('', 12, 16))">
 <msd>vprf+ben+aux+xvbnx+</msd>

 <base>be</base>
 </tok>
 <tok xlink:href="xpointer(string-range('', 17, 26))">
 <msd>vpsv++agls+xvbnx+</msd>

 <base>inhabit</base>
 </tok>
 ...

The <chunklist> element: its xml:base attribute gives the URI of the primary data file. All XPointer links are assumed to have this string as a prefix.

Each <chunk> element identifies the beginning of a string marked as a sentence in the original text, and provides the id reference to that string.

Each <tok> element identifies a string marked as a token (word) in the original text, and provides the character offsets of that string relative to the beginning of the sentence.

The <msd> element contains the morpho-syntactic description (part of speech plus other information) for the token.

The <base> element contains the base form (lemma) for the token.
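
The linking described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ANC's own tooling: the two numbers in string-range are treated as start and end character offsets within the sentence text, which is what the sample values suggest; how whitespace inside <s> is counted is not specified here, so the sketch uses a sentence with no leading whitespace. The function name resolve_tokens and the inline sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {"x": "http://www.xces.org/schema/2003"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
# Assumption: the two numbers are start and end offsets within the sentence.
STRING_RANGE = re.compile(r"string-range\('',\s*(\d+),\s*(\d+)\)")

PRIMARY = """<doc xmlns="http://www.xces.org/schema/2003" version="1.0">
<text><body><div type="chapter">
<s id="p1s1">A Brief History</s>
</div></body></text>
</doc>"""

ANNOTATION = """<ana xmlns="http://www.xces.org/schema/2003" version="1.0"
 xmlns:xlink="http://www.w3.org/1999/xlink">
 <chunklist xml:base="HistoryDublin.xml">
  <chunk type="sentence" xml:base="#p1s1">
   <tok xlink:href="xpointer(string-range('', 0, 1))">
    <msd>at++++</msd><base>a</base>
   </tok>
   <tok xlink:href="xpointer(string-range('', 2, 7))">
    <msd>jj+atrb+++</msd><base>brief</base>
   </tok>
   <tok xlink:href="xpointer(string-range('', 8, 15))">
    <msd>nn++++</msd><base>history</base>
   </tok>
  </chunk>
 </chunklist>
</ana>"""


def resolve_tokens(primary_xml, annotation_xml):
    """Return (token text, msd, base) for every <tok> in the annotation."""
    primary = ET.fromstring(primary_xml)
    by_id = {s.get("id"): "".join(s.itertext())
             for s in primary.iter("{%s}s" % NS["x"])}
    tokens = []
    for chunk in ET.fromstring(annotation_xml).iter("{%s}chunk" % NS["x"]):
        sentence = by_id[chunk.get(XML_BASE).lstrip("#")]
        for tok in chunk.findall("x:tok", NS):
            match = STRING_RANGE.search(tok.get(XLINK_HREF))
            start, end = int(match.group(1)), int(match.group(2))
            tokens.append((sentence[start:end],
                           tok.findtext("x:msd", namespaces=NS),
                           tok.findtext("x:base", namespaces=NS)))
    return tokens


for row in resolve_tokens(PRIMARY, ANNOTATION):
    print(row)
```

In practice the primary file would be located through the xml:base value on <chunklist> rather than passed in directly.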

Data file (merged version)

The merged data format includes the part-of-speech annotation within the data file itself. As in the stand-off version, there is a pointer to the corresponding header file.

The following gives the merged version of the text shown above.
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE doc SYSTEM "/ANC/ISOents.dtd"> 
<doc xmlns:xlink="http://www.w3.org/1999/xlink"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://www.xces.org/schema/2003"
 xsi:schemaLocation="http://www.xces.org/schema/2003 /ANC/xcesMerged.xsd"

 version="1.0">
<xcesHeader xlink:href="HistoryDublin-header.xml"></xcesHeader>
<text>
<body>
 <div type="chapter"> 
 
 <s id="p1s1">

 <tok msd="at++++" base="a">A</tok> 
 <tok msd="jj+atrb+++" base="brief">Brief</tok> 
 <tok msd="nn++++" base="history">History</tok></s> 
 
 
 <s id="p2s1">

 <tok msd="jj+atrb+++" base="celtic">Celtic</tok> 
 <tok msd="np++++" base="ireland">Ireland</tok></s> 
 

 
 <s id="p3s1">
 <tok msd="np++++" base="ireland">Ireland</tok> 
 <tok msd="vbz+hvz+aux++" base="have">has</tok> 
 <tok msd="vprf+ben+aux+xvbnx+" base="be">been</tok> 
 <tok msd="vpsv++agls+xvbnx+" base="inhabit">inhabited</tok> 
 <tok msd="in++++" base="since">since</tok> 
 <tok msd="ql+amp+++" base="very">very</tok>
 <tok msd="jj+atrb+++" base="ancient">ancient</tok> 
 <tok msd="nns++++" base="time">times</tok>, 
 <tok msd="cc+cls+++" base="but">but</tok> 
 <tok msd="jj+atrb+++" base="irish">Irish</tok> 
 <tok msd="nn++++" base="history">history</tok> 
 <tok msd="rb+emph+++" base="really">really</tok> 
 <tok msd="vbz++++" base="begin">begins</tok>
 <tok msd="in++++" base="with">with</tok> 
 <tok msd="ati++++" base="the">the</tok> 
 <tok msd="nn++++" base="arrival">arrival</tok> 
 <tok msd="in++++" base="of">of</tok> 
 <tok msd="ati++++" base="the">the</tok>

 <tok msd="np+++??+" base="celts">Celts</tok> 
 <tok msd="in++++" base="around">around</tok> 
 <tok msd="ati++++" base="the">the</tok> 
 <tok msd="cd++++" base="6th">6th</tok> 
 <tok msd="nn++++" base="century">century</tok> 
 <tok msd="rb++++" base="b.c">b.c</tok>.
 </s>

 ...
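
For readers who want to process the merged format directly, the following Python sketch collects each token with its sentence id, lemma, and msd value. It assumes that "+" separates the fields of the msd value, as the samples suggest; the meaning of each field is given in the tag descriptions, not here. The function name read_tokens and the cut-down sample are hypothetical. Python's standard parser does not load external DTDs, so files that rely on entities declared in ISOents.dtd may need a different parser.

```python
import xml.etree.ElementTree as ET

XCES = "{http://www.xces.org/schema/2003}"

# Cut-down sample based on the merged data file shown above.
MERGED = """<doc xmlns="http://www.xces.org/schema/2003" version="1.0">
<text><body><div type="chapter">
 <s id="p2s1">
  <tok msd="jj+atrb+++" base="celtic">Celtic</tok>
  <tok msd="np++++" base="ireland">Ireland</tok></s>
</div></body></text>
</doc>"""


def read_tokens(xml_text):
    """Return (sentence id, word, lemma, msd fields) for every <tok>."""
    root = ET.fromstring(xml_text)
    rows = []
    for s in root.iter(XCES + "s"):
        for tok in s.iter(XCES + "tok"):
            # Assumption: "+" is the field separator inside the msd value.
            fields = tok.get("msd").split("+")
            rows.append((s.get("id"), tok.text, tok.get("base"), fields))
    return rows


for row in read_tokens(MERGED):
    print(row)
```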

Morpho-syntactic (part-of-speech) annotation

The morpho-syntactic tags used in this version of the ANC are those of the Biber tagger (see Biber Tag Descriptions for a full description of the tagset). We will also provide alternative morpho-syntactic annotations produced by the Brill-like tagger in the GATE system (see the GATE tag descriptions), whose tags are very similar to the Penn Treebank tags, as well as annotations using the C5 and C7 versions of the CLAWS tagset used to tag the British National Corpus. These additional annotation files will be downloadable from this website, together with a script that creates a merged version (as described above) for these tagsets from the annotation files and the corpus (which must first be obtained from the LDC).
