The Data

The table below summarizes the contents of the ANC Second Release:

Spoken

Corpora        Domain                  No. files    No. words
callhome       telephone                      24       52,532
charlotte      face to face                   93      198,295
micase         academic discourse             50      593,288
switchboard    telephone                   2,307    3,019,477
Spoken Totals                              2,474    3,863,592

Written

Corpora        Domain                  No. files    No. words
911 report     government, technical          17      281,093
berlitz        travel guides                 179    1,012,496
biomed         technical                     837    3,349,714
buffy          blog                          143    3,093,075
hargraves      fiction                       106      405,195
eggan          fiction                         1       61,746
icic           letters                       245       91,318
nytimes        newspaper                   4,148    3,625,687
oup            non-fiction                    45      330,524
plos           technical                     252      409,280
slate          journal                     4,531    4,238,808
verbatim       journal                        32      582,384
web data       government                    285    1,048,792
Written Totals                            10,821   18,530,112

Corpus Totals                             13,295   22,393,704
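
The totals in the table can be checked against the per-corpus rows. The following is a minimal sketch in Python; the dictionary and function names are illustrative only, and the figures are copied from the table above.

```python
# Per-corpus (files, words) counts copied from the summary table.
spoken = {
    "callhome": (24, 52_532),
    "charlotte": (93, 198_295),
    "micase": (50, 593_288),
    "switchboard": (2_307, 3_019_477),
}
written = {
    "911 report": (17, 281_093),
    "berlitz": (179, 1_012_496),
    "biomed": (837, 3_349_714),
    "buffy": (143, 3_093_075),
    "hargraves": (106, 405_195),
    "eggan": (1, 61_746),
    "icic": (245, 91_318),
    "nytimes": (4_148, 3_625_687),
    "oup": (45, 330_524),
    "plos": (252, 409_280),
    "slate": (4_531, 4_238_808),
    "verbatim": (32, 582_384),
    "web data": (285, 1_048_792),
}


def totals(rows):
    """Return (total files, total words) for a dict of (files, words) pairs."""
    files = sum(f for f, _ in rows.values())
    words = sum(w for _, w in rows.values())
    return files, words


spoken_totals = totals(spoken)
written_totals = totals(written)
corpus_totals = (
    spoken_totals[0] + written_totals[0],
    spoken_totals[1] + written_totals[1],
)

print("Spoken: ", spoken_totals)
print("Written:", written_totals)
print("Corpus: ", corpus_totals)
```

The sums reproduce the Spoken, Written, and Corpus totals listed in the table.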

Summary of ANC Annotations Supplied

Corpus       Header  Logical  Sentence  Hepple  Biber  Noun chunks  Verb chunks
911report    X       X        X         X       X      X            X
berlitz1     X       X        X         X       X      X            X
berlitz2     X       X        X         X       X      X            X
biomed       X       X        X         X       X      X            X
buffy        X       X        X         X              X            X
callhome     X       X        X         X       X      X            X
charlotte    X       X        X         X       X      X            X
eggan        X       X        X         X              X            X
hargraves    X       X        X         X       X      X            X
icic         X       X        X         X       X      X            X
micase       X       X        X         X       X      X            X
nytimes      X       X        X         X       X      X            X
oup          X       X        X         X       X      X            X
plos         X       X        X         X              X            X
slate        X       X        X         X       X      X            X
switchboard  X       X        X         X       X      X            X
verbatim     X       X        X         X       X      X            X
websites     X       X        X         X              X            X

Spoken Data
   CallHome
   Switchboard
   Charlotte Narratives
   Micase
Written Data
   911 Report
   Berlitz Travel Guides
   Buffy the Vampire Slayer
   Public Library of Science
   Biomed Central
   Fiction
   ICIC
   New York Times
   Various Non-fiction
   Verbatim
   Government Web Sites

Spoken Data

CallHome

The CallHome component of the ANC Second Release includes transcripts and documentation files for 24 unscripted telephone conversations between native speakers of American English. The transcripts cover a contiguous 10-minute segment of each call, comprising 50,494 words.

The 24 transcripts are a subset of the full CallHome corpus available from LDC. The transcripts are time-stamped by speaker turn for alignment with the speech signal included in the LDC CallHome corpus. Complete auditing information on the speakers represented in the transcripts is included in the header file associated with each transcript, as well as in the on-line documentation for the LDC full corpus. The LDC documentation also describes the transcription conventions and format of the CallHome corpus.

Each file in the ANC CallHome sub-corpus is named with the same identifier referenced in the LDC on-line documentation.


Switchboard

The Switchboard component of the ANC Second Release includes the transcriptions of the LDC Switchboard corpus. It consists of 2320 spontaneous conversations averaging 6 minutes in length and comprising about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

NOTE: In the LDC Switchboard corpus, each "side" of a conversation is contained in a separate document. In the ANC version, the two sides of the conversation have been merged (based on timestamps) so that each document in the ANC Switchboard sub-corpus contains a complete conversation representing utterances by each side in turn.
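
The merge described in the note can be illustrated with a short Python sketch. This is not the procedure actually used to build the ANC files; the `Turn` record, its field names, and the sample utterances are hypothetical, and the sketch assumes each side is already sorted by start time.

```python
import heapq
from typing import NamedTuple


class Turn(NamedTuple):
    """One utterance from one side of a call (hypothetical record layout)."""
    start: float  # start time in seconds from the beginning of the call
    speaker: str  # side label, e.g. "A" or "B"
    text: str


def merge_sides(side_a, side_b):
    """Interleave two time-sorted lists of turns into a single conversation."""
    return list(heapq.merge(side_a, side_b, key=lambda turn: turn.start))


# Made-up sample data: each side is stored separately, as in the LDC corpus.
side_a = [Turn(0.0, "A", "Hello."), Turn(4.2, "A", "Fine, thanks.")]
side_b = [Turn(1.5, "B", "Hi, how are you?"), Turn(6.0, "B", "Good to hear.")]

conversation = merge_sides(side_a, side_b)
for turn in conversation:
    print(f"{turn.start:6.1f}  {turn.speaker}: {turn.text}")
```

`heapq.merge` is stable, so when two turns share a start time the turn from the first argument comes first; handling of overlapping speech is outside the scope of this sketch.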

The Switchboard manual describes the entire corpus, including the audio files. The transcribed component is described in section 4. Speaker identification and demographic information for each speaker are provided in the header file for each text. The classifications are as follows:

Dialect: South Midland, Western, North Midland, Northern, Southern, NYC, Mixed, New England.

Age group: 20-29, 30-39, 40-49, 50-59, 60-69.

Gender: Male, Female.

Education: 0=less than high school, 1=less than college, 2=college, 3=more than college, 9=unknown.

The Switchboard manual provides information on the distribution of each category among the speakers in the corpus.
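
The numeric education codes listed above can be decoded with a simple lookup. The sketch below is illustrative only: the names `EDUCATION_CODES` and `describe_education` are hypothetical, and how the code is stored in the header is not specified here, so the function just takes an integer.

```python
# Education codes as listed in the Switchboard speaker classifications.
EDUCATION_CODES = {
    0: "less than high school",
    1: "less than college",
    2: "college",
    3: "more than college",
    9: "unknown",
}


def describe_education(code: int) -> str:
    """Return the description for an education code, or raise for an unlisted one."""
    try:
        return EDUCATION_CODES[code]
    except KeyError:
        raise ValueError(f"unlisted education code: {code}") from None


print(describe_education(2))
print(describe_education(9))
```

Raising on unlisted codes, rather than silently mapping them to "unknown", keeps the explicit code 9 distinct from malformed input.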


Charlotte Narratives

The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. Information on speaker age and gender is included in the header for each transcript.


Micase

The ANC Second Release contains 50 transcripts from the Michigan Corpus of Academic Spoken English. Information on speaker age, gender and role is included in the header for each transcript.


Written Data

911 Report

The ANC Second Release contains the full text of the report released on July 22, 2004 by The National Commission on Terrorist Attacks Upon the United States.


Berlitz Travel Guides

Several Berlitz Travel Guides written by and for Americans were contributed by Langenscheidt Publishers.

The Berlitz sub-corpus is split into separate files by country/city and section.

Files from the first release

Section                 Filename suffix  No. of Files  Countries/Cities
Hotels and Restaurants  HandR            14            HA HK IB IS IS JA JE LD LV LI LO MA MD ML
History                 History          22            DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP JE LD LV MA MD ML MC
Where to Go             WhereTo          21            DU ED EG FW FR GR HA HK IB IN IS IB IT JP JE LD LA MA MD ML MC
What to Do              WhatTo           21            DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP LD LV LA MA ML MC
Jungle                  Jungle           1             MC
Introduction            Intro            19            DU ED EG FW FR GR HK IB IN IS IB IT JA JP JE LD LV LA MA

Key to country and city names: DU=Dublin, ED=Edinburgh, EG=Egypt, FWI=FWI, FR=France, GR=Greece, HA=Hawaii, HK=HongKong, IB=Ibiza, IN=India, IS=Israel, IB=Istanbul, IT=Italy, JA=Jamaica, JP=Japan, JE=Jerusalem, LD=LakeDistrict, LV=LasVegas, LI=Lisbon, LA=LosAngeles, MA=Madeira, MD=Madrid, ML=Malaysia, MC=Mallorca

New files included in the second release.

Section       Filename suffix  No. files  Countries / Cities
History       History          25         AL AM AT BS BA BC BJ BE BU BM CF CA CI CC CN CO CB CR CU NP NO PL PT PR VA
Introduction  Intro            4          AL AM AT BS
What To Do    WhatToDo         24         AL AM AT BS BA BC BJ BE BM BU CF CI CC CN CO CB CR CU NP PA PL PT PR VA
Where To Go   WhereToGo        25         AL AM AT BS BA BC BJ BE BM BO BU CF CA CI CC CN CO CR CB CU NP PA PT PR VA

Key to country and city names: AL=Algarve, AM=Amsterdam, AT=Athens, BS=Bahamas, BA=Bali, BC=Barcelona, BJ=Beijing, BE=Berlin, BM=Bermuda, BO=Boston, BU=Budapest, CF=California, CA=Canada, CI=CanaryIslands, CC=Cancun, CN=China, CO=Costa del Sol, CB=Costa Blanca, CR=Crete, CU=Cuba, NP=Nepal, NO=New Orleans, PA=Paris, PL=Poland, PT=Portugal, PR=Puerto Rico, VA=Puerto Vallarta


Buffy The Vampire Slayer

The Buffy corpus contains slightly over 3 million words from the Buffistas.org web forums (blog), written between March 2003 and May 2004.


PLOS

The Public Library of Science is an on-line, public domain journal consisting of scientific and medical literature. The ANC Second Release includes articles written by American authors taken from PLoS Medicine (2004-2005) and PLoS Biology (2003-2005). In addition to technical articles, PLoS journals include editorials, commentaries, book reviews, and essays. The PLoS headers contain relatively extensive information about the documents, authors, and domain, which was reproduced from the full headers provided with the data.


Biomed

The ANC Second Release includes technical articles by American authors drawn from BioMed Central, which publishes open access, peer-reviewed biomedical research articles.


Fiction

Ferd Eggan
   The Story Continues: An online serial novel.
Orin Hargraves
   Dead Man's Effects: A novel set mainly in London's Docklands in the 1990s; includes some dialogue in British dialect.
   The Old Windrow Place: A contemporary novel of spiritual growth and reckoning with the past.
   Morocco Pentagraph: Five stories of varying length, set in Morocco.
   Mental Arithmatic

ICIC

The Indiana Center for Intercultural Communication Corpus of Philanthropic Fundraising Discourse consists of fundraising texts, including case statements, annual reports, grant proposals, and direct mail letters.


Slate Magazine

Slate Magazine is an on-line publication of short articles on topics of current interest, including News and Politics, Arts, Business, Sports, Technology, Travel, Food, etc. The ANC Slate sub-corpus contains 4694 articles from the Slate archives published between 1996 and 2000.


New York Times

The New York Times component of the ANC Second Release consists of over 4000 articles from the New York Times newswire, one set for each of the odd-numbered days in July 2002. The articles for each day are contained in a sub-directory named by the date (01, 03, 05, 07, 09, 11, etc.). This data has not been released previously and is not part of the New York Times data already available from LDC.
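
The sub-directory names follow directly from the naming scheme just described. A minimal Python sketch, assuming that every odd-numbered day of the month (01 through 31) has its own sub-directory; the variable name is illustrative:

```python
# Two-digit names for the odd-numbered days of a 31-day month (July).
# Assumption: every odd day from 01 to 31 has a sub-directory.
odd_day_dirs = [f"{day:02d}" for day in range(1, 32, 2)]

print(odd_day_dirs)
```

Under that assumption the list contains 16 names, starting with 01 and ending with 31.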

The <subject> element in the header associated with each text indicates the topic of the article (e.g., sports, business, entertainment); see the complete list of NY Times subject categories.


Various non-fiction (OUP)

The OUP sub-corpus of the ANC First Release contains a quarter million words of non-fiction drawn from five Oxford University Press publications authored by Americans.

Author      Title                                                                       Domain                 Chapters
Abernathy   A Stitch in Time                                                            textile industry       1,2,3,6,7,8,9,14,15
Berk        Awakening Children's Minds: How Parents and Teachers Can Make a Difference  child development      1,3,4,7
Fletcher    Our Secret Constitution: How Lincoln Redefined American Democracy           American constitution  1,2.5,6,9,10
Kauffman    Investigations                                                              general biology        1,4,5,6,7,10
Rybczinski  The Look of Architecture                                                    architecture           1,2,3
Castro      Chicano Folklore                                                            folklore               A, B, C, L, M, N, O, P, Q, R, V, W, Y, Z


Verbatim

Verbatim is a "magazine of language and linguistics for a person without a Ph.D", containing articles about linguistics and language use. The ANC Second Release contains 32 issues of Verbatim from 1990 to 1996.


Government Web Sites

Materials in this portion of the ANC Second Release were drawn from public domain government websites, and include reports, speeches, letters, press releases, etc. from the websites of the Environmental Protection Agency, the General Accounting Office, the Japan US Friendship Commission, the Legal Services Corporation, the National Center for Injury Prevention and Control, and the Postal Rate Commission.