The Data
There are two versions of the frequency data files, one sorted by lemma and the other sorted by frequency count. The files are available as zip archives or UTF-8 text files.
Written
Spoken
Written & Spoken
File Format
The frequency files consist of four columns separated by TAB characters. The four columns are:
- Word – the word as it appears in the text.
- Lemma – the word’s lemma.
- POS – the Penn part of speech tag for the word.
- Count – the number of occurrences in the second release.
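As a minimal sketch of reading this format, the Python below splits each line on TAB characters and converts the count to an integer. The names `FrequencyRow` and `read_frequency_rows` are hypothetical, the sample rows are made up for illustration, and the code assumes the files have no header line.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class FrequencyRow(NamedTuple):
    """One line of a frequency file (hypothetical helper type)."""
    word: str   # the word as it appears in the text
    lemma: str  # the word's lemma
    pos: str    # Penn part of speech tag
    count: int  # number of occurrences


def read_frequency_rows(lines: Iterable[str]) -> Iterator[FrequencyRow]:
    """Yield a FrequencyRow for every well-formed TAB-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        word, lemma, pos, count = fields
        yield FrequencyRow(word, lemma, pos, int(count))


# Made-up sample rows, for illustration only.
sample = io.StringIO("dogs\tdog\tNNS\t12\nran\trun\tVBD\t7\n")
rows = list(read_frequency_rows(sample))
print(rows[0].lemma, rows[0].count)
```

For a real file, the same function can be given a handle opened with `encoding="utf-8"`, since the text files are UTF-8.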
Token Counts
Frequency counts are also available for word types, that is, the surface form of the word as it appears in the text without considering part of speech or lemma. Each file contains three columns:
- Token – the word as it appears in the text.
- Count – the number of times the token appears.
- Ratio – the frequency ratio for the word.
There are 239,208 unique tokens in the second release and 22,164,985 tokens in total for an overall Type Token Ratio of 0.010792.
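The Type Token Ratio is the number of unique tokens divided by the total number of tokens. The short Python check below reproduces the figure above from the two counts; the helper `type_token_ratio` and its sample input are illustrative assumptions, not part of the release.

```python
def type_token_ratio(counts):
    """Unique types divided by total token occurrences (hypothetical helper)."""
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Figures quoted for the second release.
unique_tokens = 239_208
total_tokens = 22_164_985
release_ttr = unique_tokens / total_tokens
print(f"{release_ttr:.6f}")  # 0.010792

# Made-up token counts, for illustration only: 3 types, 10 tokens.
toy_ttr = type_token_ratio({"the": 5, "cat": 3, "sat": 2})
```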
Methodology
The frequency information includes counts for every token that the part of speech tagger assigned a tag to. Therefore, tokens such as the possessive ’s are each counted as a “word”. The frequency counts were generated by reading the standoff annotation files for the Penn part of speech tags to obtain the lemma, the part of speech, and the start and end offsets of each word in the text. The surface form of the word (its type) was then extracted from the content at those offsets and stored in a triple { type, lemma, part of speech }. The number of occurrences of each unique triple gives the frequency count.
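The counting step can be sketched roughly as follows. This is not the code used to build the release: the `Annotation` tuple is a hypothetical stand-in for one parsed standoff annotation, the real annotation file format is not modeled, and the sample text and tags are made up.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for one parsed standoff annotation."""
    start: int  # start offset of the word in the text
    end: int    # end offset (exclusive)
    lemma: str
    pos: str    # Penn part of speech tag


def count_triples(text: str, annotations: Iterable[Annotation]) -> Counter:
    """Count occurrences of each (type, lemma, part of speech) triple."""
    counts = Counter()
    for ann in annotations:
        surface = text[ann.start:ann.end]  # the word as it appears in the text
        counts[(surface, ann.lemma, ann.pos)] += 1
    return counts


# Made-up example, for illustration only.
text = "dogs chase dogs"
annotations = [
    Annotation(0, 4, "dog", "NNS"),
    Annotation(5, 10, "chase", "VBP"),
    Annotation(11, 15, "dog", "NNS"),
]
triple_counts = count_triples(text, annotations)
```

Because the surface form is part of the key, two occurrences with the same lemma but different spellings or tags are counted as separate entries.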
Known Problems
The accuracy of the frequency counts depends on the accuracy of the tokenization. We note the following issues:
- mdash – Several documents use a pair of hyphens (--) to represent the mdash. When there is no whitespace on either side of the mdash, the tokenizer mistakenly classifies the entire string as a hyphenated word. To account for this, when a token of the form word1--word2 was encountered, two triples were created: { word1, word1, UNC } and { word2, word2, UNC } (where UNC is the part of speech tag for unclassified).
- Numbers tagged as nouns are included in the frequency counts (for example, “727” as in “Boeing 727”).
- Sequences of characters that are not “words” are counted. For example, many scientific papers include strings of gene sequences of the form (a|c|g|t)*. Similarly, spoken and informal written texts (blogs, etc.) contain strings representing vocal sounds, for example: aaaaahhh, aaarrrgghhhhh, etc.
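The mdash workaround described in the first item above might look something like the Python sketch below. The function name and the regular expression are assumptions made for illustration; the pattern actually used to detect these tokens is not documented here.

```python
import re

# Assumed pattern: two runs of word characters joined by a pair of hyphens.
DOUBLE_HYPHEN = re.compile(r"^(\w+)--(\w+)$")


def split_double_hyphen(token):
    """Return two { word, word, UNC } triples for word1--word2, else None."""
    match = DOUBLE_HYPHEN.match(token)
    if match is None:
        return None  # ordinary token, including true hyphenated words
    # Each half is its own type and lemma, tagged UNC (unclassified).
    return [(word, word, "UNC") for word in match.groups()]


split_result = split_double_hyphen("word1--word2")
hyphenated_result = split_double_hyphen("well-known")
```

A genuinely hyphenated word such as "well-known" contains only a single hyphen, so it does not match and is left as one token.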