ANC First Release
 American National Corpus Project

AMERICAN NATIONAL CORPUS FIRST RELEASE

ANC File Structure


THE DATA ENCODING CONVENTIONS | KNOWN BUGS | FIRST RELEASE | ANC HOME

ANC file structure
   File structure for the stand-off version
   File structure for the merged version
   Note on file naming conventions
   Header structure


ANC File Structure

See also the README file included with the distribution via CD.

The ANC data is contained in two directories, one for each version of the data: the stand-off version and the merged version.

File structure for the stand-off version

Each sub-corpus is contained in a separate directory within the standoff directory.

Each of the CallHome, Charlotte, Berlitz, and Slate directories, and each of the sub-directories in the Switchboard and NYTimes directories, contains three files for each text in the corresponding sub-corpus. They are named as follows (where [name] is the name associated with the document):

See Encoding Conventions for descriptions and examples of the contents of these files.

File structure for the merged version

The file structure for the merged format is the same as for the stand-off format, except that no annotation file is present. For each text, then, there are two files:

Note on file naming conventions

We have retained the file names of the data as received by the ANC for the Switchboard, CallHome, Charlotte, New York Times, and Slate data. Filenames for the Berlitz and OUP data were created by the ANC. Berlitz filenames reflect the section type (e.g., "HandR" for "Hotels and Restaurants") and the geographic region that is the subject of the document. OUP data file names consist of the author's name followed by the chapter number.

For the final release of the ANC, filenames will be normalized across all sub-corpora.

Header Structure

The ANC First Release includes a header for the entire corpus, as well as headers for each text in the corpus.

The individual text headers contain much redundant information; for this reason, the common portions of the header (the respStmt and publicationStmt) are provided only once, in separate files. Within the individual headers, XLINK is used to point to the relevant files at the point where the information in them should appear. (Note: According to W3C standards, XINCLUDE is the appropriate mechanism for this type of inclusion. However, very few XML processors currently support XINCLUDE, and we have therefore relied on XLINK for this release.)

The following shows the first few lines of a typical header. The links to the common portions are the respStmtLink and publicationStmtLink elements.

 

<?xml version="1.0" encoding="utf-8"?>
<header
    xmlns="http://www.xces.org/schema/2003"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    creator="KBS"
    version="1.0"
    date.created="2003-09-12"
    xsi:schemaLocation="http://www.xces.org/schema/2003 /ANC/xcesHeader.xsd">
  <fileDesc>
    <titleStmt>
      <title>A Stitch in Time : Chapter 1</title>
      <respStmtLink xlink:href="/ANC/respStmt.xml" />
    </titleStmt>
    <publicationStmtLink xlink:href="/ANC/publicationStmt.xml" />
    <sourceDesc> ...
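
Because most XML processors will not follow these XLINK references on their own, a consumer of the headers may need to expand them by hand. The following is a minimal Python sketch of one way to do that, not a description of any ANC tool. It assumes that each common file holds a single root element (for example respStmt) and that the link element should be replaced wholesale by that root. The function name resolve_links, the load callback, the shortened sample header, and the placeholder contents of the common files are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href in ElementTree's {namespace}local notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Shortened, completed version of the sample header (hypothetical: the real
# header continues with sourceDesc and further elements).
HEADER = """<header xmlns="http://www.xces.org/schema/2003"
        xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileDesc>
    <titleStmt>
      <title>A Stitch in Time : Chapter 1</title>
      <respStmtLink xlink:href="/ANC/respStmt.xml" />
    </titleStmt>
    <publicationStmtLink xlink:href="/ANC/publicationStmt.xml" />
  </fileDesc>
</header>"""

# Placeholder stand-ins for the common files (assumption: the real files
# have these root elements; their actual contents are not shown here).
COMMON = {
    "/ANC/respStmt.xml":
        '<respStmt xmlns="http://www.xces.org/schema/2003">placeholder</respStmt>',
    "/ANC/publicationStmt.xml":
        '<publicationStmt xmlns="http://www.xces.org/schema/2003">placeholder</publicationStmt>',
}


def resolve_links(root, load):
    """Replace each element that carries xlink:href with the linked root.

    `load` is any callable that maps an href string to the XML text of the
    linked file, e.g. a function that reads it from the distribution.
    """
    # Snapshot the elements first so the tree can be edited while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            href = child.get(XLINK_HREF)
            if href is not None:
                linked = ET.fromstring(load(href))
                linked.tail = child.tail  # keep surrounding whitespace
                parent[index] = linked
    return root


if __name__ == "__main__":
    expanded = resolve_links(ET.fromstring(HEADER), COMMON.__getitem__)
    # Print local element names in document order.
    print([el.tag.split("}")[1] for el in expanded.iter()])
```

Passing the loader as a callback keeps the sketch independent of where the common files actually live; a real loader would map the href to a path inside the distribution and read the file from there.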


Copyright 2003 American National Corpus Project. All rights reserved.