First release contents
 American National Corpus Project

AMERICAN NATIONAL CORPUS FIRST RELEASE

The Data



The table below summarizes the contents of the ANC First Release:

Text type  Text name              No. of texts  No. of words  Contributor
Spoken     CallHome               24            50,494        LDC
Spoken     Switchboard            2320          3,056,062     LDC
Spoken     Charlotte Narratives   95            117,832       Project MORE
TOTAL SPOKEN                                    3,224,388
Written    New York Times         4148          3,207,272     LDC
Written    Berlitz Travel Guides  101           514,021       Langenscheidt Publishers
Written    Slate Magazine         4694          4,338,498     Microsoft
Written    Various non-fiction    27            224,037       Oxford University Press
TOTAL WRITTEN                                   8,283,828
TOTAL CORPUS SIZE                               11,508,216
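
The subtotals and the corpus total follow directly from the per-text word counts. As a minimal sketch, the arithmetic can be checked as follows; the dictionary names are illustrative only and are not part of the release.

```python
# Word counts copied from the summary table (names are illustrative labels).
spoken = {
    "CallHome": 50_494,
    "Switchboard": 3_056_062,
    "Charlotte Narratives": 117_832,
}
written = {
    "New York Times": 3_207_272,
    "Berlitz Travel Guides": 514_021,
    "Slate Magazine": 4_338_498,
    "Various non-fiction": 224_037,
}

total_spoken = sum(spoken.values())
total_written = sum(written.values())
total_corpus = total_spoken + total_written

# Share of the corpus that is spoken data, as a fraction of all words.
spoken_share = total_spoken / total_corpus

print(f"TOTAL SPOKEN      {total_spoken:,}")
print(f"TOTAL WRITTEN     {total_written:,}")
print(f"TOTAL CORPUS SIZE {total_corpus:,}")
print(f"Spoken share      {spoken_share:.1%}")
```

The three totals printed match the TOTAL rows of the table; spoken data makes up roughly 28% of the words.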
 

Spoken Data
   CallHome
   Switchboard
   Charlotte Narratives
Written Data
   New York Times
   Berlitz Travel Guides
   Slate Magazine
   Various non-fiction

Spoken Data

CallHome

The CallHome component of the ANC First Release includes transcripts and documentation files for 24 unscripted telephone conversations between native speakers of English. The transcripts cover a contiguous 10-minute segment of each call and together comprise 50,494 words.

The 24 transcripts are a subset of the full CallHome corpus available from LDC. The transcripts are time-stamped by speaker turn for alignment with the speech signal included in the LDC CallHome corpus. Complete auditing information on the speakers represented in the transcripts is included in the header file associated with each transcript, as well as in the on-line documentation for the LDC full corpus. The LDC documentation also describes the transcription conventions and format of the CallHome corpus.

Each file in the ANC CallHome sub-corpus is named with the same identifier referenced in the LDC on-line documentation.


Switchboard

The Switchboard component of the ANC First Release includes the transcriptions of the LDC Switchboard corpus. It consists of 2320 spontaneous conversations averaging 6 minutes in length and comprising about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

NOTE: In the LDC Switchboard corpus, each "side" of a conversation is contained in a separate document. In the ANC version, the two sides of the conversation have been merged (based on timestamps) so that each document in the ANC Switchboard sub-corpus contains a complete conversation representing utterances by each side in turn.
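
The merge described in the note can be sketched as an interleaving of two time-ordered lists. This is only an illustration of the idea, not the procedure actually used for the ANC: the `Utterance` record, its field names, and the sample data are hypothetical, and each side is assumed to be already sorted by start time.

```python
import heapq
from typing import NamedTuple


class Utterance(NamedTuple):
    start: float   # start time in seconds (hypothetical field)
    speaker: str   # side of the conversation, e.g. "A" or "B"
    text: str


def merge_sides(side_a, side_b):
    """Interleave two per-side utterance lists by start time.

    Both inputs must already be sorted by start time; on a tie,
    the utterance from side_a comes first.
    """
    return list(heapq.merge(side_a, side_b, key=lambda u: u.start))


# Invented sample data, for illustration only.
side_a = [Utterance(0.0, "A", "Hello."), Utterance(4.2, "A", "Fine, thanks.")]
side_b = [Utterance(1.5, "B", "Hi, how are you?"), Utterance(6.0, "B", "Good to hear.")]

conversation = merge_sides(side_a, side_b)
for u in conversation:
    print(f"{u.start:6.1f}  {u.speaker}: {u.text}")
```

Because each side is already in time order, a linear merge is enough and no full re-sort of the combined list is needed.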

The Switchboard manual describes the entire corpus, including the audio files; the transcribed component is described in section 4. Speaker identification and demographic information for each speaker are provided in the header file for each text. The classifications are as follows:

Dialect: South Midland, Western, North Midland, Northern, Southern, NYC, Mixed, New England.

Age group: 20-29, 30-39, 40-49, 50-59, 60-69.

Gender: Male, Female.

Education: 0=less than high school, 1=less than college, 2=college, 3=more than college, 9=unknown.

The Switchboard manual provides information on the distribution of each category among the speakers in the corpus.
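
For readers who process the headers programmatically, the classifications above can be held in small lookup tables. The sketch below assumes the four values have already been extracted from a header; how they are encoded in the header file is not shown here, and the function name `describe_speaker` is hypothetical.

```python
# Value sets transcribed from the classification lists above.
DIALECTS = {
    "South Midland", "Western", "North Midland", "Northern",
    "Southern", "NYC", "Mixed", "New England",
}
AGE_GROUPS = {"20-29", "30-39", "40-49", "50-59", "60-69"}
GENDERS = {"Male", "Female"}
EDUCATION = {
    0: "less than high school",
    1: "less than college",
    2: "college",
    3: "more than college",
    9: "unknown",
}


def describe_speaker(dialect, age_group, gender, education_code):
    """Validate one speaker's values and return a readable record."""
    if dialect not in DIALECTS:
        raise ValueError(f"unknown dialect: {dialect!r}")
    if age_group not in AGE_GROUPS:
        raise ValueError(f"unknown age group: {age_group!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    if education_code not in EDUCATION:
        raise ValueError(f"unknown education code: {education_code!r}")
    return {
        "dialect": dialect,
        "age_group": age_group,
        "gender": gender,
        "education": EDUCATION[education_code],
    }


speaker = describe_speaker("Western", "30-39", "Female", 2)
print(speaker)
```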


Charlotte Narratives

The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. Information on speaker age and gender is included in the header for each transcript.


Written Data

New York Times

The New York Times component of the ANC First Release consists of over 4000 articles from the New York Times newswire, for each of the odd-numbered days in July, 2002. The articles for each given day are contained in a sub-directory named by the date (01, 03, 05, 07, 09, 11, etc.). This data has not been released previously, and is not a part of the New York Times data already available from LDC.
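
As a small illustration of the directory naming, the two-digit names for the odd-numbered days of July can be generated as below. The root directory name `nyt` is a placeholder, not the actual path in the release, and the sketch assumes every odd-numbered day from the 1st to the 31st has a sub-directory.

```python
from pathlib import PurePosixPath

# Placeholder root; the real location within the release may differ.
NYT_ROOT = PurePosixPath("nyt")

# July has 31 days, so the odd-numbered days run 1, 3, ..., 31.
day_dirs = [f"{day:02d}" for day in range(1, 32, 2)]
paths = [NYT_ROOT / name for name in day_dirs]

print(day_dirs)
print(len(day_dirs), "sub-directories")
```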

The <subject> element in the header associated with each text indicates the topic of the article (e.g., sports, business, entertainment); see the complete list of NY Times subject categories.


Berlitz Travel Guides

Several Berlitz Travel Guides written by and for Americans were contributed by Langenscheidt Publishers. The ANC First Release contains only a portion of the contributed Travel Guides; the remainder of the sub-corpus will be included in a later release.

The Berlitz sub-corpus is split into separate files by country/city and section.

Section                 Filename prefix  No. of Files  Countries/Cities
Hotels and Restaurants  HandR            14            HA HK IB IS IS JA JE LD LV LI LO MA MD ML
History                 History          22            DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP JE LD LV MA MD ML MC
Where to Go             WhereTo          21            DU ED EG FW FR GR HA HK IB IN IS IB IT JP JE LD LA MA MD ML MC
What to Do              WhatTo           21            DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP LD LV LA MA ML MC
Useful Expressions      UsefulExp        1             JP
Jungle                  Jungle           1             MC
Introduction            Intro            19            DU ED EG FW FR GR HK IB IN IS IB IT JA JP JE LD LV LA MA

Key to country and city names: DU=Dublin, ED=Edinburgh, EG=Egypt, FWI=FWI, FR=France, GR=Greece, HA=Hawaii, HK=HongKong, IB=Ibiza, IN=India, IS=Israel, IB=Istanbul, IT=Italy, JA=Jamaica, JP=Japan, JE=Jerusalem, LD=LakeDistrict, LV=LasVegas, LI=Lisbon, LA=LosAngeles, MA=Madeira, MD=Madrid, ML=Malaysia, MC=Mallorca

Although the sections cover largely the same countries and cities, the sets differ for three reasons: we did not receive the section for a given country or city; the section is not relevant to it (e.g., Jungle); or the section's data for that country or city consisted almost entirely of non-textual material (e.g., Hotels and Restaurants often contained mainly prices).
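
The file counts in the Berlitz table can be cross-checked against the code lists with a short script. This is a sketch over values transcribed from the table and key on this page, not a tool shipped with the corpus; the variable names are illustrative.

```python
# filename prefix -> (stated no. of files, codes), transcribed from the table.
SECTIONS = {
    "HandR": (14, "HA HK IB IS IS JA JE LD LV LI LO MA MD ML"),
    "History": (22, "DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP JE LD LV MA MD ML MC"),
    "WhereTo": (21, "DU ED EG FW FR GR HA HK IB IN IS IB IT JP JE LD LA MA MD ML MC"),
    "WhatTo": (21, "DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP LD LV LA MA ML MC"),
    "UsefulExp": (1, "JP"),
    "Jungle": (1, "MC"),
    "Intro": (19, "DU ED EG FW FR GR HK IB IN IS IB IT JA JP JE LD LV LA MA"),
}

# Codes that appear on the left-hand side of the key.
KEY_CODES = set(
    "DU ED EG FWI FR GR HA HK IB IN IS IT JA JP JE LD LV LI LA MA MD ML MC".split()
)

# Sections whose stated file count differs from the number of codes listed.
mismatches = {
    prefix: (stated, len(codes.split()))
    for prefix, (stated, codes) in SECTIONS.items()
    if stated != len(codes.split())
}

# Codes used in the table that have no entry in the key.
used_codes = {code for _, codes in SECTIONS.values() for code in codes.split()}
not_in_key = sorted(used_codes - KEY_CODES)

print(mismatches)
print(not_in_key)
```

With the values as listed, every stated file count equals the number of codes in its row. Two codes in the table, FW and LO, have no matching entry in the key, which lists FWI rather than FW.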


Slate Magazine

Slate Magazine is an on-line publication of short articles on topics of current interest, such as News and Politics, Arts, Business, Sports, Technology, Travel, and Food. The ANC Slate sub-corpus contains 4694 articles from the Slate archives, published between 1996 and 2000.


Various non-fiction (OUP)

The OUP sub-corpus of the ANC First Release contains about a quarter million words of non-fiction drawn from five Oxford University Press publications authored by Americans.

Author      Title                                                                       Domain                 Chapters
Abernathy   A Stitch in Time                                                            textile industry       1,2,3,6,7,8,9,14,15
Berk        Awakening Children's Minds: How Parents and Teachers Can Make a Difference  child development      1,3,4,7
Fletcher    Our Secret Constitution: How Lincoln Redefined American Democracy           American constitution  1,2,5,6,9,10
Kauffman    Investigations                                                              general biology        1,4,5,6,7,10
Rybczynski  The Look of Architecture                                                    architecture           1,2,3



Copyright 2003 American National Corpus Project. All rights reserved.