American National Corpus Project
ANC file structure
File structure for the stand-off version
File structure for the merged version
Note on file naming conventions
Header structure
ANC File Structure
See also the README file included with the distribution via CD.
The ANC data is distributed in two directories, each containing one version of the corpus:
- standoff : contains the corpus in stand-off annotation form, where part-of-speech annotations are stored in documents separate from the primary data
- merged : contains the corpus in merged form, where part-of-speech annotation is included in the primary data
File structure for the stand-off version
Each sub-corpus is contained in a separate directory in the standoff directory:
- Switchboard : contains 30 sub-directories grouping data by "conversation id"
- Callhome
- Charlotte
- Berlitz
- NYTimes : contains 16 sub-directories grouping data by day of publication in July 2002 (e.g., 01 for July 1)
- Slate
Each of the CallHome, Charlotte, Berlitz, and Slate directories, and each of the sub-directories in the Switchboard and NYTimes directories, contains three files for each text in the corresponding sub-corpus. They are named as follows (where [name] is the name associated with the document):
- [name]-header.xml : the header file, which contains the XCES header for the text
- [name].xml : the file containing the data itself
- [name]-ana.xml : the annotation file
See Encoding Conventions for descriptions and examples of the contents of these files.
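The naming scheme above can be sketched in Python. This is a minimal illustration, not part of the ANC distribution: the function names and the document name "example" are hypothetical, and the sketch assumes that no document name itself ends in "-header" or "-ana".

```python
from pathlib import Path


def standoff_files(directory, name):
    """Return the three paths expected for one text in the stand-off version."""
    base = Path(directory)
    return {
        "header": base / f"{name}-header.xml",   # XCES header
        "data": base / f"{name}.xml",            # primary data
        "annotation": base / f"{name}-ana.xml",  # stand-off annotation
    }


def document_names(filenames):
    """Recover the distinct [name] values from a list of file names."""
    names = set()
    for filename in filenames:
        if not filename.endswith(".xml"):
            continue
        stem = filename[: -len(".xml")]
        # Strip the role suffix, if any, to get back to [name].
        for suffix in ("-header", "-ana"):
            if stem.endswith(suffix):
                stem = stem[: -len(suffix)]
                break
        names.add(stem)
    return sorted(names)


# "example" is a hypothetical document name used only for illustration.
paths = standoff_files("standoff/Slate", "example")
print(paths["annotation"].name)
```

Grouping by the recovered name makes it easy to confirm that each text in a directory has all three files.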
File structure for the merged version
The file structure for the merged format is the same as for the stand-off format, except that no annotation file is present. For each text, then, there are two files:
- [name]-header.xml : the header file, which contains the XCES header for the text
- [name].xml : the file containing the data and the part of speech and lemma annotations
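The difference between the two versions comes down to which files are expected for each text. The following Python sketch checks a set of file names against that expectation; the helper name `missing_files` and the document name "doc" are assumptions made for illustration and do not come from the ANC tools.

```python
# File-name suffixes expected for each text, per version of the corpus.
EXPECTED_SUFFIXES = {
    "standoff": ("-header.xml", ".xml", "-ana.xml"),
    "merged": ("-header.xml", ".xml"),
}


def missing_files(name, present, version):
    """List the expected files for `name` that are absent from `present`."""
    expected = [name + suffix for suffix in EXPECTED_SUFFIXES[version]]
    return [filename for filename in expected if filename not in present]


# "doc" is a hypothetical document name.
present = {"doc-header.xml", "doc.xml"}
print(missing_files("doc", present, "merged"))
print(missing_files("doc", present, "standoff"))
```

A text that is complete in the merged layout is reported as lacking its annotation file when checked against the stand-off layout.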
Note on file naming conventions
We have retained the file names of the data as received by the ANC for the Switchboard, CallHome, Charlotte, New York Times, and Slate data. Filenames for the Berlitz and OUP data were created by the ANC. Berlitz filenames reflect the section type (e.g., "HandR" for "Hotels and Restaurants") and the geographic region that is the subject of the document. OUP data file names consist of the author's name followed by the chapter number.
For the final release of the ANC, filenames will be normalized across all sub-corpora.
Header Structure
The ANC First Release includes a header for the entire corpus, as well as headers for each text in the corpus.
The individual text headers contain much redundant information; for this reason, the common portions of the header (the respStmt and publicationStmt) have been provided only once in separate files. Within the individual headers, XLink is used to point to the relevant files at the point where the information in them should appear. (Note: According to W3C standards, XInclude is the appropriate mechanism for this type of inclusion. However, very few XML processors currently support XInclude and we have therefore relied on XLink for this release.)
The following shows the first few lines of a typical header. The links to the common portions are the respStmtLink and publicationStmtLink elements.
<?xml version="1.0" encoding="utf-8"?>
<header xmlns="http://www.xces.org/schema/2003" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" creator="KBS" version="1.0" date.created="2003-09-12" xsi:schemaLocation="http://www.xces.org/schema/2003 /ANC/xcesHeader.xsd">
<fileDesc>
<titleStmt>
<title>A Stitch in Time : Chapter 1</title>
<respStmtLink xlink:href="/ANC/respStmt.xml" />
</titleStmt>
<publicationStmtLink xlink:href="/ANC/publicationStmt.xml" />
<sourceDesc> ...
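
As a rough illustration of how a program might locate the shared header files, the Python sketch below collects every xlink:href attribute in a header. It uses only the standard library. The sample string is an abridged, artificially closed fragment based on the lines above (the real header continues past sourceDesc), and the function name `linked_files` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute name for xlink:href in ElementTree's {namespace}local notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Abridged fragment for illustration only; not a complete ANC header.
SAMPLE_HEADER = """<header xmlns="http://www.xces.org/schema/2003"
        xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileDesc>
    <titleStmt>
      <title>A Stitch in Time : Chapter 1</title>
      <respStmtLink xlink:href="/ANC/respStmt.xml" />
    </titleStmt>
    <publicationStmtLink xlink:href="/ANC/publicationStmt.xml" />
  </fileDesc>
</header>"""


def linked_files(header_xml):
    """Return (element name, href) pairs for every element with an xlink:href."""
    root = ET.fromstring(header_xml)
    links = []
    for elem in root.iter():  # document order
        href = elem.get(XLINK_HREF)
        if href is not None:
            local_name = elem.tag.split("}", 1)[-1]  # drop the namespace prefix
            links.append((local_name, href))
    return links


for element_name, href in linked_files(SAMPLE_HEADER):
    print(element_name, href)
```

The sketch only reports the link targets; substituting the contents of the linked files into the header would be a separate step.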