MASC Sentence Corpus | Open American National Corpus

A focus of the MASC project is to provide corpus evidence to support an effort to harmonize sense distinctions in WordNet and FrameNet. To support this effort, portions of the MASC data, together with data from the Open American National Corpus (OANC), have been annotated for WordNet senses and FrameNet frames. The WordNet and FrameNet annotated sentences are provided as a stand-alone “sentence corpus”, with the WordNet and FrameNet annotations represented in standoff files. Each sentence in this corpus is also linked to its occurrence in the original text in MASC or the OANC, so that the context and other annotations associated with the sentence may be retrieved if desired.

The MASC Sentence corpus is downloadable from the MASC Downloads page in GrAF (for compatibility with other MASC annotations) and in a tab-separated format.

Methodology

Word selection

The WordNet and FrameNet teams selected 114 common polysemous words to study in detail. The tagging was performed in a series of rounds, with approximately 10 words per round. Tagging for all rounds except round 1 was performed using SATANiC, a tagging interface developed within the MASC project.

Rounds 1-3, 5, and 6-10 include 10 words each; Rounds 4 and 11 include 13 and 11, respectively. The set of words was chosen to balance part of speech and to represent relatively polysemous words in rounds 1 and 2 (average number of senses per word = 9.5), less polysemous words in rounds 3, 4, and 6-10 (average number of senses per word approximately 5), and a mix of very polysemous words and words with few senses in round 5 (average number of senses per word = 6.4). Round 11 focused more on adjectives, in order to provide data for improving the WordNet adjective inventory.

Tagging

As a first step, 50 sentences containing instances of each word (restricted to a given part-of-speech) were selected from the MASC subset of the OANC, drawn equally from each of the genre-specific portions of the corpus. These instances were annotated by 4-6 taggers, depending on the round, using the WordNet 3.0 inventory. The annotations served two purposes: the taggers and the WordNet team reviewed the WordNet sense inventory to determine whether it needed revision, and the multiply annotated data supported an in-depth study of inter-annotator agreement.
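The idea of drawing the 50 sentences equally from each genre can be sketched as a round-robin selection over genre buckets. This is only an illustration, not the project's actual selection procedure: the function name, the (genre, sentence) pair layout, and the genre labels in the usage example are all assumptions.

```python
import random
from collections import defaultdict


def sample_equally_by_genre(instances, total=50, seed=0):
    """Pick up to `total` instances, spread as evenly as possible over genres.

    `instances` is a list of (genre, sentence) pairs. Both the pair layout
    and the genre labels are assumptions made for this sketch.
    """
    rng = random.Random(seed)

    # Group the candidate sentences by genre and shuffle each group.
    by_genre = defaultdict(list)
    for genre, sentence in instances:
        by_genre[genre].append(sentence)
    genres = sorted(by_genre)
    for genre in genres:
        rng.shuffle(by_genre[genre])

    picked = []
    # Round-robin over the genres, so that no genre contributes more than
    # one sentence beyond any other genre that still has candidates left.
    while len(picked) < total and any(by_genre[g] for g in genres):
        for genre in genres:
            if by_genre[genre] and len(picked) < total:
                picked.append((genre, by_genre[genre].pop()))
    return picked


if __name__ == "__main__":
    # Hypothetical genre labels and sentence identifiers.
    candidates = [(g, f"{g}-{i}") for g in ("blog", "email", "news") for i in range(40)]
    for genre, sentence in sample_equally_by_genre(candidates, total=6):
        print(genre, sentence)
```

If a genre runs out of candidate sentences, the remaining genres simply keep contributing until the total is reached or no candidates are left.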

The revised inventory, which has since been released as WordNet 3.1, was then used to annotate 1000 occurrences of each word in its sentence context. Because of its small size, MASC typically contains fewer than 1000 occurrences of a given word; the remaining sentences were therefore drawn from the 15 million words of the OANC. In cases where 1000 occurrences did not exist in MASC and the OANC combined, fewer occurrences were tagged. The full set of occurrences for each word was tagged by at least one tagger. A 100-sentence subset of the 1000 sentences for each word was annotated by at least two taggers, to serve as a cross-validation sample for inter-annotator agreement statistics.
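The sampling logic described above (MASC occurrences first, topped up from the OANC, with a 100-sentence doubly annotated subset) can be sketched as follows. The text does not say how the OANC sentences or the 100-sentence subset were chosen, so the random selection here is an assumption, as are the function name and the use of plain string identifiers for sentences.

```python
import random


def build_tagging_sample(masc_ids, oanc_ids, target=1000, overlap=100, seed=0):
    """Return (sentences to tag, subset to be tagged by two or more taggers).

    Assumptions of this sketch: sentences are plain identifiers, and both the
    OANC top-up and the overlap subset are chosen at random.
    """
    rng = random.Random(seed)

    # Use every MASC occurrence first (de-duplicated, original order kept).
    masc = list(dict.fromkeys(masc_ids))
    chosen = masc if len(masc) <= target else rng.sample(masc, target)

    # Top up from the rest of the OANC, skipping sentences already in MASC,
    # since MASC is itself a subset of the OANC.
    shortfall = target - len(chosen)
    if shortfall > 0:
        masc_set = set(masc)
        pool = [s for s in dict.fromkeys(oanc_ids) if s not in masc_set]
        chosen = chosen + rng.sample(pool, min(shortfall, len(pool)))

    # If the corpora hold fewer than `target` occurrences, `chosen` is simply
    # shorter, matching the "fewer occurrences were tagged" case.
    double_tagged = rng.sample(chosen, min(overlap, len(chosen)))
    return chosen, double_tagged
```

With 300 MASC occurrences and 2000 further OANC occurrences, this yields 1000 sentences, of which the first 300 come from MASC and a random 100 form the cross-validation subset.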

Inter-annotator agreement studies

We have performed extensive inter-annotator agreement studies on the MASC sense annotations, reported in Passonneau et al. (2009) and Passonneau et al. (2010). In addition to inter-annotator agreement statistics, the release documents the words annotated in each round, the sense labels for each word, the sentences for each word, and the annotator or annotators for each sense assignment to each word in context. For the multiply annotated data in rounds 2-4, we include raw tables for each word in the form expected by Ron Artstein’s calculate_alpha.pl Perl script, so that the agreement numbers can be regenerated. In round 5, we used a script that applies a distance metric that is not yet available for release.
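For readers unfamiliar with alpha-style agreement coefficients, the following is a minimal sketch of Krippendorff's alpha with a nominal (match or mismatch) distance. It assumes that this is the kind of coefficient at issue; it is not a port of calculate_alpha.pl, it does not read the raw-table format shipped with the release, and it does not implement the unreleased round 5 distance metric. The function name and the input layout (one list of sense labels per sentence) are assumptions of the sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with nominal distance.

    `units` holds one list of labels per annotated item, containing only the
    labels actually assigned (missing annotations are simply left out).
    """
    # Coincidence matrix: each ordered pair of labels given to the same item
    # by two different annotators counts 1 / (m - 1), m = labels on that item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

Each inner list would hold the sense labels that the taggers assigned to one sentence; for example, `krippendorff_alpha_nominal([["s1", "s1"], ["s2", "s2"]])` describes two sentences on which two taggers fully agree.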

Round 1 was a small pilot study, involving only 50 occurrences of 10 words tagged by two annotators, intended to test the WordNet sense inventory revision process and gather information about the task before commencing the full study. No inter-annotator agreement was computed for this data.

Annotation guidelines

Christiane Fellbaum prepared the annotation guidelines based on her previous word sense annotation projects. The most recent version is in the file tagging.guidelines.v3.doc, which is included in the “doc” directory with the distribution.

Annotation tool

The Sense Annotation Tool for the American National Corpus (SATANiC) was developed by Keith Suderman during the course of the MASC word sense annotation project and has been updated several times. A screenshot of the tool appears in Passonneau et al. (2009). The current version displays the WordNet sense glosses for each word, plus four additional options:

Glob
No sense is appropriate
Wrong part of speech
Not enough context is available

Glob is used to identify collocations, as defined in the annotation guidelines.

FrameNet annotations

The FrameNet team has annotated the 100-sentence cross-validation sample for each word with FrameNet frames and frame elements, providing direct comparisons of WordNet and FrameNet sense assignments in attested sentences. Note that several MASC texts have been fully annotated for FrameNet frames and frame elements, in addition to the WordNet-tagged sentences.
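One simple way to view such a comparison is a cross-tabulation of WordNet senses against FrameNet frames over the sentences for a single word. The sketch below assumes the annotations have already been read into (WordNet sense, FrameNet frame) pairs, one pair per sentence; the function name and the placeholder labels are hypothetical and do not reflect the release's file format.

```python
from collections import Counter, defaultdict


def cross_tabulate(pairs):
    """Count how often each WordNet sense co-occurs with each FrameNet frame.

    `pairs` is an iterable of (wordnet_sense, framenet_frame) tuples, one per
    annotated sentence; this layout is an assumption of the sketch.
    """
    table = defaultdict(Counter)
    for wn_sense, fn_frame in pairs:
        table[wn_sense][fn_frame] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {sense: dict(frames) for sense, frames in table.items()}


if __name__ == "__main__":
    # Placeholder labels, not real WordNet senses or FrameNet frames.
    sample = [
        ("sense_1", "Frame_A"),
        ("sense_1", "Frame_A"),
        ("sense_1", "Frame_B"),
        ("sense_2", "Frame_B"),
    ]
    for sense, frames in cross_tabulate(sample).items():
        print(sense, frames)
```

A sense whose sentences fall almost entirely under one frame suggests that the two inventories draw the same distinction for that word, while a sense spread across several frames points to a difference in granularity.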

Publications

The following publications describe the corpus, IAA studies, and comparisons of WordNet and FrameNet based on the MASC Sentence Corpus:

References to the corpus should cite:

Passonneau, R., Baker, C., Fellbaum, C., Ide, N. (2012). The MASC Word Sense Sentence Corpus. Proceedings of the Eighth Language Resources and Evaluation Conference, Istanbul.

Other publications:

Passonneau, R., Salleb-Aouissi, A., Ide, N. (2009). Making Sense of Word Sense Variation. Semantic Evaluations: Recent Achievements and Future Directions. NAACL-HLT 2009 Workshop, Boulder, Colorado, USA.
Passonneau, R., Salleb-Aouissi, A., Bhardwaj, V., Ide, N. (2010). Word Sense Annotation of Polysemous Words by Multiple Annotators. Proceedings of the Seventh Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.
Bhardwaj, V., Passonneau, R., Salleb-Aouissi, A., Ide, N. (2010). Anveshan: A Framework for Analysis of Multiple Annotators’ Labeling Behavior. Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV), held in conjunction with the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.
de Melo, G., Baker, C.F., Ide, N., Passonneau, R., Fellbaum, C. (2012). Empirical Comparisons of MASC Word Sense Annotations. Proceedings of the Eighth Language Resources and Evaluation Conference, Istanbul.