Spoken
Name | Domain | No. files | No. words
charlotte | face to face | 93 | 198,295
switchboard | telephone | 2,307 | 3,019,477
Spoken Totals | | 2,410 | 3,217,772
Written
Name | Domain | No. files | No. words
911 report | government, technical | 17 | 281,093
berlitz | travel guides | 179 | 1,012,496
biomed | technical | 837 | 3,349,714
eggan | fiction | 1 | 61,746
icic | letters | 245 | 91,318
oup | non-fiction | 45 | 330,524
plos | technical | 252 | 409,280
slate | journal | 4,531 | 4,238,808
verbatim | journal | 32 | 582,384
web data | government | 285 | 1,048,792
Written Totals | | 6,424 | 11,406,155
Corpus Totals | | 8,832 | 14,623,927
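The word-count totals in the table above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the figures are copied from the "No. words" column, and the variable names are illustrative rather than part of any OANC tooling.

```python
# Word counts per sub-corpus, copied from the "No. words" column above.
spoken_words = {"charlotte": 198_295, "switchboard": 3_019_477}
written_words = {
    "911 report": 281_093,
    "berlitz": 1_012_496,
    "biomed": 3_349_714,
    "eggan": 61_746,
    "icic": 91_318,
    "oup": 330_524,
    "plos": 409_280,
    "slate": 4_238_808,
    "verbatim": 582_384,
    "web data": 1_048_792,
}

spoken_total = sum(spoken_words.values())
written_total = sum(written_words.values())
corpus_total = spoken_total + written_total

print(f"{spoken_total:,}")   # 3,217,772
print(f"{written_total:,}")  # 11,406,155
print(f"{corpus_total:,}")   # 14,623,927
```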
Spoken Data
Charlotte Narratives
The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina, and surrounding North Carolina communities. Information on speaker age and gender is included in the header for each transcript.
Switchboard
The Switchboard component includes the transcriptions of the LDC Switchboard corpus. It consists of 2320 spontaneous conversations averaging 6 minutes in length and comprising about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.
NOTE: In the LDC Switchboard corpus, each “side” of a conversation is contained in a separate document. In the ANC version, the two sides of the conversation have been merged (based on timestamps) so that each document in the ANC Switchboard sub-corpus contains a complete conversation representing utterances by each side in turn.
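The timestamp-based merge described in the note can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the ANC version: the `Utterance` record, its field names, and the sample utterances are all hypothetical, and the sketch assumes each side is already sorted by start time.

```python
import heapq
from typing import List, NamedTuple


class Utterance(NamedTuple):
    start: float   # start time in seconds (hypothetical field)
    speaker: str   # which side of the conversation, "A" or "B"
    text: str


def merge_sides(side_a: List[Utterance], side_b: List[Utterance]) -> List[Utterance]:
    """Interleave two time-sorted utterance lists into one conversation."""
    # heapq.merge assumes each input is already sorted by the same key.
    return list(heapq.merge(side_a, side_b, key=lambda u: u.start))


# Made-up sample data, one list per side.
side_a = [Utterance(0.0, "A", "Hello."), Utterance(4.2, "A", "Fine, thanks.")]
side_b = [Utterance(1.5, "B", "Hi, how are you?"), Utterance(6.0, "B", "Good.")]

conversation = merge_sides(side_a, side_b)
for u in conversation:
    print(f"{u.start:5.1f} {u.speaker}: {u.text}")
```

Because both inputs are already ordered, the merge is a single linear pass and the result alternates between sides wherever the timestamps do.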
The Switchboard manual describes the entire corpus, including the audio files. The transcribed component is described in section 4. Speaker identification and demographic information for each speaker are provided in the header file for each text. The classifications are as follows:
Dialect: South Midland, Western, North Midland, Northern, Southern, NYC, Mixed, New England.
Age group: 20-29, 30-39, 40-49, 50-59, 60-69.
Gender: Male, Female.
Education: 0=less than high school, 1=less than college, 2=college, 3=more than college, 9=unknown.
The Switchboard manual provides information on the distribution of each category among the speakers in the corpus.
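For readers processing the headers programmatically, the classifications above might be decoded along these lines. The education codes and age groups come from the list above; how the values are actually stored in a header is not specified here, so the function names and the assumption that education arrives as an integer and age as a number of years are illustrative only.

```python
from typing import Optional

# Education codes as listed above.
EDUCATION = {
    0: "less than high school",
    1: "less than college",
    2: "college",
    3: "more than college",
    9: "unknown",
}


def describe_education(code: int) -> str:
    """Return the label for an education code, or a marker for unlisted codes."""
    return EDUCATION.get(code, "unrecognized code")


def age_group(age: int) -> Optional[str]:
    """Map an age in years to one of the listed groups (20-29 ... 60-69)."""
    if 20 <= age <= 69:
        decade = (age // 10) * 10
        return f"{decade}-{decade + 9}"
    return None  # outside the listed groups


print(describe_education(2))  # college
print(age_group(34))          # 30-39
```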
Written Data
911 Report
The OANC contains the full text of the report released on July 22, 2004 by The National Commission on Terrorist Attacks Upon the United States.
Berlitz Travel Guides
Several Berlitz Travel Guides written by and for Americans were contributed by Langenscheidt Publishers.
The Berlitz sub-corpus is split into separate files by country/city and section.
Files from the first release
Section | Filename suffix | No. of Files | Countries/Cities
Hotels and Restaurants | HandR | 14 | HA HK IB IS IS JA JE LD LV LI LO MA MD ML
History | History | 47 | DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP JE LD LV MA MD ML MC AL AM AT BS BA BC BJ BE BU BM CF CA CI CC CN CO CB CR CU NP NO PL PT PR VA
Where to Go | WhereTo | 46 | DU ED EG FW FR GR HA HK IB IN IS IB IT JP JE LD LA MA MD ML MC AL AM AT BS BA BC BJ BE BM BO BU CF CA CI CC CN CO CR CB CU NP PA PT PR VA
What to Do | WhatTo | 46 | DU ED EG FW FR GR HA HK IB IN IS IB IT JA JP LD LV LA MA ML MC AL AM AT BS BA BC BJ BE BM BU CF CI CC CN CO CB CR CU NP PA PL PT PR VA
Jungle | Jungle | 1 | MC
Introduction | Intro | 23 | DU ED EG FW FR GR HK IB IN IS IB IT JA JP JE LD LV LA MA AL AM AT BS
Key to country and city names: DU=Dublin, ED=Edinburgh, EG=Egypt, FWI=FWI, FR=France, GR=Greece, HA=Hawaii, HK=HongKong, IB=Ibiza, IN=India, IS=Israel, IB=Istanbul, IT=Italy, JA=Jamaica, JP=Japan, JE=Jerusalem, LD=LakeDistrict, LV=LasVegas, LI=Lisbon, LA=LosAngeles, MA=Madeira, MD=Madrid, ML=Malaysia, MC=Mallorca, AL=Algarve, AM=Amsterdam, AT=Athens, BS=Bahamas, BA=Bali, BC=Barcelona, BJ=Beijing, BE=Berlin, BM=Bermuda, BO=Boston, BU=Budapest, CF=California, CA=Canada, CI=CanaryIslands, CC=Cancun, CN=China, CO=Costa del Sol, CB=Costa Blanca, CR=Crete, CU=Cuba, NP=Nepal, NO=New Orleans, PA=Paris, PL=Poland, PT=Portugal, PR=Puerto Rico, VA=Puerto Vallarta
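A key in this `CODE=Name` form is easy to turn into a lookup table for expanding the code lists. The sketch below uses only a small excerpt of the key and hypothetical helper names (`parse_key`, `expand`). Because the key as given assigns `IB` to both Ibiza and Istanbul, each code maps to a list of names rather than a single name, and codes missing from the key are shown as `?`.

```python
from collections import defaultdict
from typing import Dict, List

# Excerpt of the key above; the full key can be pasted in the same form.
KEY_EXCERPT = "HA=Hawaii, HK=HongKong, IB=Ibiza, IB=Istanbul, MC=Mallorca"


def parse_key(key: str) -> Dict[str, List[str]]:
    """Parse 'CODE=Name, CODE=Name' into a code -> list-of-names mapping."""
    names = defaultdict(list)
    for pair in key.split(","):
        code, _, name = pair.strip().partition("=")
        names[code].append(name)
    return dict(names)


def expand(codes: str, names: Dict[str, List[str]]) -> List[str]:
    """Expand a space-separated code list; ambiguous codes are joined with '/'."""
    return ["/".join(names.get(code, ["?"])) for code in codes.split()]


names = parse_key(KEY_EXCERPT)
print(expand("HA HK IB", names))  # ['Hawaii', 'HongKong', 'Ibiza/Istanbul']
print(expand("MC LO", names))     # ['Mallorca', '?']
```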
PLOS
The Public Library of Science is an on-line, public domain journal consisting of scientific and medical literature. The OANC includes articles written by American authors taken from PLoS Medicine (2004-2005) and PLoS Biology (2003-2005). In addition to technical articles, PLoS journals include editorials, commentaries, book reviews, and essays. The PLoS headers contain relatively extensive information about the documents, authors, and domain, which was reproduced from the full headers provided with the data.
Biomed
Technical articles by American authors drawn from BioMed Central, which publishes open access, peer-reviewed biomedical research articles.
Fiction
Ferd Eggan
The Story Continues, an online serial novel.
ICIC
The Indiana Center for Intercultural Communication Corpus of Philanthropic Fundraising Discourse consists of fundraising texts, including case statements, annual reports, grant proposals, and direct mail letters.
Slate Magazine
Slate Magazine is an on-line publication including short articles on topics of current interest, including News and Politics, Arts, Business, Sports, Technology, Travel, Food, etc. The ANC Slate sub-corpus contains 4694 articles from the Slate archives published between 1996 and 2000.
Various non-fiction (OUP)
The OUP sub-corpus contains a quarter million words of non-fiction drawn from five Oxford University Press publications authored by Americans.
Author | Title | Domain | Chapters
Abernathy | A Stitch in Time | textile industry | 1,2,3,6,7,8,9,14,15
Berk | Awakening Children’s Minds: How Parents and Teachers Can Make a Difference | child development | 1,3,4,7
Fletcher | Our Secret Constitution: How Lincoln Redefined American Democracy | American constitution | 1,2.5,6,9,10
Kauffman | Investigations | general biology | 1,4,5,6,7,10
Rybczynski | The Look of Architecture | architecture | 1,2,3
Castro | Chicano Folklore | folklore | A, B, C, L, M, N, O, P, Q, R, V, W, Y, Z
Verbatim
Verbatim is a “magazine of language and linguistics for a person without a Ph.D”, containing articles about linguistics and language use. The ANC Second Release contains 32 issues of Verbatim from 1990 to 1996.