Data
Spoken
Corpora | Domain | No. files | No. words |
---|---|---|---|
callhome | telephone | 24 | 52,532 |
charlotte | face to face | 93 | 198,295 |
micase | academic discourse | 50 | 593,288 |
switchboard | telephone | 2,307 | 3,019,477 |
Spoken Totals | | 2,474 | 3,863,592 |
Written
Corpora | Domain | No. files | No. words |
---|---|---|---|
911 report | government, technical | 17 | 281,093 |
berlitz | travel guides | 179 | 1,012,496 |
biomed | technical | 837 | 3,349,714 |
buffy | blog | 143 | 3,093,075 |
hargraves | fiction | 106 | 405,195 |
eggan | fiction | 1 | 61,746 |
icic | letters | 245 | 91,318 |
nytimes | newspaper | 4,148 | 3,625,687 |
oup | non-fiction | 45 | 330,524 |
plos | technical | 252 | 409,280 |
slate | journal | 4,531 | 4,238,808 |
verbatim | journal | 32 | 582,384 |
web data | government | 285 | 1,048,792 |
Written Totals | | 10,821 | 18,530,112 |
Corpus Totals | | 13,295 | 22,393,704 |
Annotations
Sub-corpus | Header | Logical | Sentence | Hepple | Biber | Noun chunks | Verb chunks |
---|---|---|---|---|---|---|---|
911report | X | X | X | X | X | X | X |
berlitz1 | X | X | X | X | X | X | X |
berlitz2 | X | X | X | X | X | X | X |
biomed | X | X | X | X | X | X | X |
buffy | X | X | X | X | X | X | |
callhome | X | X | X | X | X | X | X |
charlotte | X | X | X | X | X | X | X |
eggan | X | X | X | X | X | X | |
hargraves | X | X | X | X | X | X | X |
icic | X | X | X | X | X | X | X |
micase | X | X | X | X | X | X | X |
nytimes | X | X | X | X | X | X | X |
oup | X | X | X | X | X | X | X |
plos | X | X | X | X | X | X | |
slate | X | X | X | X | X | X | X |
switchboard | X | X | X | X | X | X | X |
verbatim | X | X | X | X | X | X | X |
websites | X | X | X | X | X | X | |
Detailed Data Description
Spoken Data
Callhome
The CallHome component of the ANC Second Release includes transcripts and documentation files for 24 unscripted telephone conversations between native speakers of American English. The transcripts cover a contiguous 10-minute segment of each call, comprising 50,494 words.
The 24 transcripts are a subset of the full CallHome corpus available from LDC. The transcripts are time-stamped by speaker turn for alignment with the speech signal included in the LDC CallHome corpus. Complete auditing information on the speakers represented in the transcripts is included in the header file associated with each transcript, as well as in the on-line documentation for the LDC full corpus. The LDC documentation also describes the transcription conventions and format of the CallHome corpus.
Each file in the ANC CallHome sub-corpus is named with the same identifier referenced in the LDC on-line documentation.
Switchboard
The Switchboard component of the ANC Second Release includes the transcriptions of the LDC Switchboard corpus. It consists of 2320 spontaneous conversations averaging 6 minutes in length and comprising about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.
NOTE: In the LDC Switchboard corpus, each “side” of a conversation is contained in a separate document. In the ANC version, the two sides of the conversation have been merged (based on timestamps) so that each document in the ANC Switchboard sub-corpus contains a complete conversation representing utterances by each side in turn.
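The timestamp-based merge described in the note can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the ANC files: the `Utterance` record, its field names, and the sample turns are hypothetical, and the sketch assumes each side is already sorted by start time.

```python
import heapq
from typing import NamedTuple


class Utterance(NamedTuple):
    """One time-stamped speaker turn (hypothetical record layout)."""
    start: float   # start time in seconds
    speaker: str   # side label, e.g. "A" or "B"
    text: str


def merge_sides(side_a, side_b):
    """Interleave two time-sorted utterance lists into one conversation."""
    # heapq.merge requires each input to be sorted by the same key.
    return list(heapq.merge(side_a, side_b, key=lambda u: u.start))


# Invented sample data, one list per conversation side.
side_a = [Utterance(0.0, "A", "Hello."), Utterance(4.2, "A", "Fine, thanks.")]
side_b = [Utterance(1.5, "B", "Hi, how are you?"), Utterance(6.0, "B", "Good.")]

for u in merge_sides(side_a, side_b):
    print(f"{u.start:5.1f} {u.speaker}: {u.text}")
```

Because both inputs are consumed lazily in timestamp order, the merged list alternates between sides wherever the turns interleave, which matches the "utterances by each side in turn" layout described above.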
The Switchboard manual describes the entire corpus, including the audio files. The transcribed component is described in section 4. Speaker identification and demographic information for each speaker are provided in the header file for each text. The classifications are as follows:
Dialect: South Midland, Western, North Midland, Northern, Southern, NYC, Mixed, New England.
Age group: 20-29, 30-39, 40-49, 50-59, 60-69.
Gender: Male, Female.
Education: 0=less than high school, 1=less than college, 2=college, 3=more than college, 9=unknown.
The Switchboard manual provides information on the distribution of each category among the speakers in the corpus.
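For readers processing the headers, the numeric education codes listed above can be turned into labels with a small lookup. This is only a sketch: the function name is hypothetical, and how the code is actually stored in the header (element name, string or integer) is not specified here, so the example simply accepts either form.

```python
# Education codes as listed in the classification above.
EDUCATION_LEVELS = {
    0: "less than high school",
    1: "less than college",
    2: "college",
    3: "more than college",
    9: "unknown",
}


def describe_education(code):
    """Return the label for an education code (hypothetical helper)."""
    # Accept the code as an int or as a numeric string read from a header.
    return EDUCATION_LEVELS.get(int(code), "unrecognized code")


print(describe_education(2))
print(describe_education("9"))
```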
Charlotte Narratives
The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. Information on speaker age and gender is included in the header for each transcript.
Micase
The ANC Second Release contains 50 transcripts from the Michigan Corpus of Academic Spoken English. Information on speaker age, gender and role is included in the header for each transcript.
Written Data
911 Report
The ANC Second Release contains the full text of the report released on July 22, 2004 by The National Commission on Terrorist Attacks Upon the United States.
Berlitz Travel Guides
Several Berlitz Travel Guides written by and for Americans were contributed by Langenscheidt Publishers.
The Berlitz sub-corpus is split into separate files by country/city and section.
Files from the first release
Section | Filename suffix | No. of Files | Countries/Cities |
---|---|---|---|
Hotels and Restaurants | HandR | 14 | HA HK IB IS IS JA JE LD LV LI LO MA MD ML |
History | History | 22 | DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP JE LD LV MA MD ML MC |
Where to Go | WhereTo | 21 | DU ED EG FW FR GR HA HK IB IN IS IB IT JP JE LD LA MA MD ML MC |
What to Do | WhatTo | 21 | DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP LD LV LA MA ML MC |
Jungle | Jungle | 1 | MC |
Introduction | Intro | 19 | DU ED EG FW FR GR HK IB IN IS IB IT JA JP JE LD LV LA MA |
Key to country and city names: DU=Dublin, ED=Edinburgh, EG=Egypt, FWI=FWI, FR=France, GR=Greece, HA=Hawaii, HK=HongKong, IB=Ibiza, IN=India, IS=Israel, IB=Istanbul, IT=Italy, JA=Jamaica, JP=Japan, JE=Jerusalem, LD=LakeDistrict, LV=LasVegas, LI=Lisbon, LA=LosAngeles, MA=Madeira, MD=Madrid, ML=Malaysia, MC=Mallorca
New files included in the second release.
Section | Filename suffix | No. files | Countries / Cities |
---|---|---|---|
History | History | 25 | AL AM AT BS BA BC BJ BE BU BM CF CA CI CC CN CO CB CR CU NP NO PL PT PR VA |
Introduction | Intro | 4 | AL AM AT BS |
What To Do | WhatToDo | 24 | AL AM AT BS BA BC BJ BE BM BU CF CI CC CN CO CB CR CU NP PA PL PT PR VA |
Where To Go | WhereToGo | 25 | AL AM AT BS BA BC BJ BE BM BO BU CF CA CI CC CN CO CR CB CU NP PA PT PR VA |
Key to country and city names: AL=Algarve, AM=Amsterdam, AT=Athens, BS=Bahamas, BA=Bali, BC=Barcelona, BJ=Beijing, BE=Berlin, BM=Bermuda, BO=Boston, BU=Budapest, CF=California, CA=Canada, CI=CanaryIslands, CC=Cancun, CN=China, CO=Costa del Sol, CB=Costa Blanca, CR=Crete, CU=Cuba, NP=Nepal, NO=New Orleans, PA=Paris, PL=Poland, PT=Portugal, PR=Puerto Rico, VA=Puerto Vallarta
Buffy The Vampire Slayer
The Buffy corpus contains slightly over 3 million words from the Buffistas.org web forums (blog), written between March 2003 and May 2004.
PLOS
The Public Library of Science is an on-line, public domain journal consisting of scientific and medical literature. The ANC Second Release includes articles written by American authors taken from PLoS Medicine (2004-2005) and PLoS Biology (2003-2005). In addition to technical articles, PLoS journals include editorials, commentaries, book reviews, and essays. The PLoS headers contain relatively extensive information about the documents, authors, and domain, which was reproduced from the full headers provided with the data.
Biomed
The ANC Second Release includes technical articles by American authors drawn from BioMed Central, which publishes open access, peer-reviewed biomedical research articles.
Fiction
Ferd Eggan
The Story Continues: An online serial novel.
Orin Hargraves
Dead Man’s Effects: A novel set mainly in London’s Docklands in the 1990s; includes some dialogue in British dialect.
The Old Windrow Place: A contemporary novel of spiritual growth and reckoning with the past.
Morocco Pentagraph: Five stories of varying length, set in Morocco.
Mental Arithmatic
ICIC
The Indiana Center for Intercultural Communication Corpus of Philanthropic Fundraising Discourse consists of fundraising texts, including case statements, annual reports, grant proposals, and direct mail letters.
Slate Magazine
Slate Magazine is an on-line publication including short articles on topics of current interest, including News and Politics, Arts, Business, Sports, Technology, Travel, Food, etc. The ANC Slate sub-corpus contains 4694 articles from the Slate archives published between 1996 and 2000.
New York Times
The New York Times component of the ANC Second Release consists of over 4000 articles from the New York Times newswire, for each of the odd-numbered days in July, 2002. The articles for each given day are contained in a sub-directory named by the date (01, 03, 05, 07, 09, 11, etc.). This data has not been released previously, and is not a part of the New York Times data already available from LDC.
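The per-day sub-directory names can be generated rather than typed out. The sketch below assumes that every odd-numbered day of July through the 31st has a directory and that names are zero-padded to two digits, as the examples (01, 03, 05, ...) suggest; the function name is hypothetical.

```python
def nyt_day_directories():
    """Two-digit names for the odd-numbered days of July (01, 03, ..., 31)."""
    # range(1, 32, 2) yields 1, 3, ..., 31; :02d pads single digits with a zero.
    return [f"{day:02d}" for day in range(1, 32, 2)]


print(nyt_day_directories())
```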
The <subject> element in the header associated with each text indicates the topic of the article (e.g., sports, business, entertainment); see the complete list of NY Times subject categories.
Various non-fiction (OUP)
The OUP sub-corpus of the ANC First Release contains a quarter million words of non-fiction drawn from five Oxford University Press publications authored by Americans.
Author | Title | Domain | Chapters |
---|---|---|---|
Abernathy | A Stitch in Time | textile industry | 1, 2, 3, 6, 7, 8, 9, 14, 15 |
Berk | Awakening Children’s Minds: How Parents and Teachers Can Make a Difference | child development | 1, 3, 4, 7 |
Fletcher | Our Secret Constitution: How Lincoln Redefined American Democracy | American constitution | 1, 2, 5, 6, 9, 10 |
Kauffman | Investigations | general biology | 1, 4, 5, 6, 7, 10 |
Rybczynski | The Look of Architecture | architecture | 1, 2, 3 |
Castro | Chicano Folklore | folklore | A, B, C, L, M, N, O, P, Q, R, V, W, Y, Z |
Verbatim
Verbatim is a “magazine of language and linguistics for a person without a Ph.D”, containing articles about linguistics and language use. The ANC Second Release contains 32 issues of Verbatim from 1990 to 1996.
Government Web Sites
Materials in this portion of the ANC Second Release were drawn from public domain government websites, and include reports, speeches, letters, press releases, etc. from the websites of the Environmental Protection Agency, the General Accounting Office, the Japan US Friendship Commission, the Legal Services Corporation, the National Center for Injury Prevention and Control, and the Postal Rate Commission.