- Wall Street Journal, September 12, 2008
- Chicago Tribune, March 25, 2004
- New York Times Magazine, August 18, 2002
ANC in the Wall Street Journal
Making Every Word Count:
Computers and the Web Complicate Vital Research on Frequently Used Language
By Carl Bialik
Wall Street Journal
September 12, 2008
If you’re like me, you’ve wasted time taking online quizzes like the one my friend challenged me to take: Name the 100 most frequently used English words in five minutes. (I got 45.)
You could waste all the time you’d like, as Top 100 word lists abound. Word-frequency rankings are part — albeit just a sliver — of the vast output from studies of language corpora, or large collections of written and sometimes spoken text. Researchers parse such data to help make sense of our ever-evolving language.
But the results of these rankings differ widely. Taking a snapshot of English in all its diverse incarnations is devilishly tricky and expensive. Computers and the Internet can make research simpler. But they also add to the challenge because they can distort language patterns.
Tension between size, cost and representativeness runs through all corpus research, raising questions about its quantitative findings. Transcripts of university lectures and television programs are favored sources for spoken language, but they can differ markedly from private chatter. And speech, in general, diverges from writing. “People don’t say ‘yes’ anymore in interviews,” Alison Duguid, a linguist at the University of Siena, Italy, offers by way of example. “They say ‘absolutely.’”
English can look very different when viewed through different prisms. “The” is the universal ranking champion, but “be” might place second or 22nd, depending on whether all conjugations, such as “is” and “was,” get counted. “I” was the most commonly used word in 11,700 10-minute conversations recorded in 2002 and 2003. It appeared 984,359 times, according to David Graff, the lead programmer analyst for the Linguistic Data Consortium at the University of Pennsylvania, which maintains corpora. “You” was runner-up, appearing 702,941 times.
In a collection of newspaper articles from the same time period, “I” ranked 30th and “you” ranked 43rd. “Yeah,” “um,” “uh” and “uh-huh” also made the Top 100 in conversations, but not in newspapers.
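The ranking swings described above, where “be” can place second or 22nd, come down to whether the counter folds conjugations into a single headword. The sketch below illustrates the idea in Python; the lemma set and sample sentence are invented for illustration, and real corpus tools use full morphological lemmatizers rather than a hand-written table.

```python
from collections import Counter

# Illustrative lemma set: fold conjugations of "be" into one headword.
# A real lemmatizer would cover every verb, noun plural, etc.
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def rank(tokens, fold_be=False):
    """Return words ordered by descending frequency."""
    counts = Counter(
        "be" if (fold_be and t in BE_FORMS) else t for t in tokens
    )
    return [word for word, _ in counts.most_common()]

tokens = "the cat is on the mat and the dog was in the house".split()
print(rank(tokens)[:3])                # inflected forms counted separately
print(rank(tokens, fold_be=True)[:3])  # "is" and "was" merged under "be"
```

With the forms counted separately, “is” and “was” each appear once; folded together, “be” jumps to second place behind “the,” which is exactly the kind of methodological choice that makes published Top 100 lists disagree.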
The proper construction of corpora matters to a lot of people. Dictionary publishers use corpora to determine the most-common definitions for versatile words. Literature researchers need them to compare the work of a given author with the norms for language. Linguists use them to track the introduction of new words (“Facebook”) and the diminution of older ones (“britches”).
Microsoft uses corpora to help correct misspellings in its Word software. It has licensed over one trillion words of English text in each of the past two years, and bolsters its collection with emails exchanged on its Hotmail program, with identifying details removed, according to a spokeswoman. “Text corpora is the lifeblood of most of our development and testing processes,” says Mike Calcagno, general manager of the Microsoft group that manages Word.
Computers have spawned a burst of activity in the field. But even computers don’t suffice for the daunting task of word collecting and counting. Brown University’s one-million-word corpus was considered adequate in the 1960s. Today, the 100-million-word British National Corpus is considered small — and dated — because it preceded the Internet and other sources of new language.
It’s easy to build bigger collections using the Web, but that gives short shrift to genres that don’t often make it online, notably fiction. It also ignores spoken words, which are underrepresented in corpora because they are so much harder and more expensive to collect.
Without enough spoken-language data, subtleties may not emerge. “The word ‘rife’ only occurs in negative contexts,” says Anne O’Keeffe, a linguist at Mary Immaculate College, the University of Limerick, Ireland. “We are never rife with money,” despite that affliction’s appeal.
In assembling the British National Corpus, it cost the same to collect 10 million spoken words as to collect 50 million in written text, says Lou Burnard. He worked in the early 1990s on building the corpus, which included the recorded conversations of 200 Britons. “It would be great to do another BNC, but we don’t have the funding,” he adds.
The intended American counterpart to the BNC has similar problems. The American National Corpus, an array of text including the 9/11 Commission Report and Berlitz travel guides, contains a mere 22 million words.
This newspaper is remembered fondly by linguists for donating a large chunk of its archives in the late 1980s and early 1990s for corpus research. The Wall Street Journal’s oeuvre was an imperfect representation of English, however. For one thing, the financial sense of “stock” predominated over meanings tied to livestock and soup.
“It is really crucial that you have a corpus that is well-balanced,” says Princeton University linguist Christiane Fellbaum.
In the years since, the Web has eclipsed the Journal as the go-to repository of words. It is now the primary source for Oxford University Press’s corpus for the Oxford English Dictionary, which once relied on the BNC. John Mansfield, who works on developing Web sources for Oxford, agrees that the Web is short on fiction and conversational English. Otherwise, he says, “you’ve just got an incredible diversity of every kind of text.”
Nancy Ide, chairwoman of the computer-science department at Vassar College, who manages the American National Corpus, points out a major failing of Web-based corpora: Without copyright permission for all of this text, researchers can’t share and analyze it fully. Also, it’s difficult to isolate American English from British English and other variants online.
Oxford has devised precise, if arbitrary, targets for Web categories. Blogs get about the same share as law, science, business and medicine combined. “Blog” itself, incidentally, merits nary a mention in corpora assembled a decade ago.
Potentially skewed results for corpora have caused any number of headaches. Even guides for English teachers often don’t reflect changes in the language. “The Reading Teacher’s Book of Lists” tracks frequently used words based on a corpus from the early 1970s, says Edward Fry, a co-author of the book and a retired educational psychologist at Rutgers University. “Computer,” for one, is not on the list.
ANC in the Chicago Tribune
Linguists hunt and study words in their natural habitat
By Nathan Bierma
Special to the Tribune
March 25, 2004
Sometimes language lovers sound as if they’re on a safari. They talk about observing words in their natural habitat and studying their behavior in herds.
With the first release of the American National Corpus, an annotated body of over 10 million words, linguists can hunt like never before.
“Up until now, linguists were kind of like Victorian bug hunters,” says Erin McKean, the Chicago-based senior editor of U.S. dictionaries for Oxford University Press and board member of the American National Corpus. “We’d go out with our nets and we’d catch some butterflies and we’d chloroform them and pin them to cards and put them in a drawer.”
“But now, when people are really studying an ecosystem — and English is like an ecosystem — what they do is, they take a representative square area and report everything that’s there: every bug, every plant, every leaf,” she said. “And now with the corpus, we can do that for English.”
If the dictionary is like the drawer with bugs on cards, the corpus is the jungle. The ANC collects blocks of text from newspapers, books and conversations so words and phrases can be viewed in their natural habitat — that is, in an American English context.
Readers can search the collection by word, phrase, part of speech or type of source and find their quarry used in a sentence or paragraph.
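The search-and-view workflow described above is essentially what corpus linguists call a concordance, or KWIC (Key Word In Context) display. A minimal sketch of the idea in Python, using an invented sample sentence rather than the ANC’s actual query interface:

```python
def kwic(tokens, target, window=3):
    """Key Word In Context: list each hit with its surrounding words."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [f"[{tok}]"] + right))
    return hits

sample = "the corpus shows the word in the context it appears in".split()
for line in kwic(sample, "the", window=2):
    print(line)
```

A real corpus search would also filter on part-of-speech tags and source metadata, which the annotated ANC releases carry alongside the raw text.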
For students learning English as a second language, a corpus — Latin for body — can help teach idioms and tendencies in a way dictionaries cannot, as ANC users around the world have already discovered.
“I hear from language teacher trainers in Egypt, Germany, Japan and Sweden who are really excited to have these data available to them, so they can go in and look at aspects of conversation,” said Randi Reppen, English professor at Northern Arizona University and Project Manager for the ANC.
The ANC could also be used by advertising copywriters in search of resonant slogans, or by computer programmers to make automated customer service hotlines sound more natural, McKean said.
The ANC’s initial release last October, available on CD-ROM for $75 at www.americannationalcorpus.org, contains 11.5 million words. About one-fourth of the collection is made up of spoken English, including transcribed phone conversations from volunteers who were given phone cards in exchange for being recorded.
The rest of the corpus is written text contributed by The New York Times, the online magazine Slate, Langenscheidt travel guides and books from Oxford University Press on architecture and Abraham Lincoln.
“We want writers to want to be part of the American National Corpus,” McKean said. “We’re hoping to have an ANC logo that authors can have their publishers put on their books, as a way of saying, ‘My work is influencing the study of the English language.’”
By the end of 2005, the ANC, which last year received a grant from the National Science Foundation, hopes to release 100 million words — 90 million written, 10 million spoken — evenly balanced among sources as diverse as town meetings, medical journals and novels.
“It’s hard to take one area and say, ‘This is English,’” Reppen said. “By having different types of writing and speaking situations, the corpus gives a better picture for language researchers, teachers and learners.”
Until now, such seekers of untamed English have relied on other corpora such as the British National Corpus, a collection of 100 million words of British English released 10 years ago. But in the last 10 years, new technology has made formatting samples of text faster and cheaper.
“We’re lucky that we’re doing it today,” McKean said. “This is something that would have been insane to do in the 1950s and was barely possible in the 1980s when the British National Corpus [started].”
Meanwhile, demand for corpora has grown in the field of computational linguistics, which uses computer programs to analyze the structure of language.
“The motivation for the ANC came from the fact that many computational linguists were using the BNC to gather statistics about syntactic patterns, [when in fact] British English and American English are not alike in several ways,” said Nancy Ide, professor of computer science at Vassar College and Technical Director of the ANC.
Another new wrinkle in corpus linguistics is the Internet. The ANC plans to add e-mails, message boards and Web sites to its collection. McKean has already gotten permission from her message board of fellow “Buffy the Vampire Slayer” fans to use their posts for the ANC.
Copyright (c) 2004, Chicago Tribune
ANC in the New York Times
The American National Corpus (ANC) project figured prominently in the August 18, 2002 On Language column in the New York Times Magazine. The article, “Corpus Linguistics,” written by John Rosenthal, gives a good overview of how linguists can use corpora to describe current usage. Several quotations from the ANC Project Manager, Randi Reppen, are included:
Reppen is the project manager for the American National Corpus, a huge undertaking sponsored by a consortium of publishers, software companies and academics, including Pearson, Microsoft, Sony and the Universities of California, Colorado and Pennsylvania, among many others. When it is completed, the corpus will contain more than 100 million words, chosen from a broad selection of contemporary written and spoken texts — everything from books, magazines and newspapers to face-to-face conversations in drugstores and Laundromats that have been recorded and transcribed by researchers. Based on a similar corpus of British English created in 1994, the American Corpus will provide a definitive portrait of how the English language is used in the United States today.
The first installment of 10 million words is scheduled for release this fall and will be available to anybody with Internet access. Say, for example, you’re writing advertising copy, and you want to know whether most people still use “I couldn’t care less” or opt instead for the easier (but nonsensical) “I could care less.” You’ll simply hop on the Web, enter the phrase “could care less” and count the occurrences in the corpus. Then you’ll do the same for “couldn’t care less” and compare the number of hits. “You could choose to limit your search to spoken language or to newspapers or even to academic writing,” Reppen says.
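The comparison Reppen describes is, at bottom, two phrase-frequency counts. A minimal sketch in Python; the sample text here is invented, and a real query would run against the corpus files or the web interface rather than a string in memory:

```python
import re

def phrase_count(text, phrase):
    """Count case-insensitive occurrences of an exact phrase."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

sample = ("I couldn't care less about rankings. Honestly, "
          "I could care less... wait, I couldn't care less.")

print(phrase_count(sample, "couldn't care less"))  # 2
print(phrase_count(sample, "could care less"))     # 1
```

Note that the negated phrase never matches the shorter one, since “couldn’t” is followed by “t care,” not “ care,” so the two counts stay independent.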
The article incorrectly indicates that the ANC will be “available to anybody with Internet access”. However, while the ANC will indeed be web-accessible, access to the corpus for development of commercial products (dictionaries and other reference publications, language-aware software, etc.) is restricted to ANC Consortium members until the year 2007. Commercial users who are not members of the ANC Consortium can gain access before 2007 by joining the consortium at any time. For the purposes of academic research and education, the ANC will be broadly available from the University of Pennsylvania’s Linguistic Data Consortium for a nominal fee covering part of the costs of distribution.