WordNet - FrameNet Annotations

 

A focus of the MASC project is to provide corpus evidence to support an effort to harmonize sense distinctions in WordNet and FrameNet. Therefore, portions of the MASC data have been annotated for WordNet senses and FrameNet frames.

Methodology

Word selection

The WordNet and FrameNet teams selected ~100 common polysemous words to study in detail. The tagging was performed in a series of rounds, with approximately 10 words per round. Tagging for all rounds except round 1 was performed using SATANiC, a tagging interface developed within the MASC project.

Rounds 1-3 and 5 each covered 10 words; round 4 covered 13. The words were chosen to balance part of speech and to vary in polysemy: relatively polysemous words for rounds 1 and 2 (average number of senses per word = 9.5), less polysemous words for round 3 (average = 4.8) and round 4 (average = 5.0), and a mix of very polysemous words and words with few senses for round 5 (average = 6.4).

Tagging

As a first step, 50 sentences containing instances of each word (restricted to a given part of speech) were selected from the MASC subset of the OANC, drawn equally from each of the genre-specific portions of the corpus. These instances were annotated by 4-6 taggers, depending on the round, using the WordNet 3.0 inventory. The taggers and the WordNet team then reviewed the WordNet sense inventory to determine whether it needed revision and to carry out an in-depth study of inter-annotator agreement.
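The equal draw across genres described above can be sketched as a round-robin sample. This is only an illustration under stated assumptions: the (genre, sentence) pair layout, the function name, and the fixed seed are all hypothetical, and the procedure actually used by the project is not documented here.

```python
import random
from collections import defaultdict


def sample_equally_by_genre(instances, total=50, seed=0):
    """Pick up to `total` instances, spread as evenly as possible across genres.

    `instances` is a list of (genre, sentence) pairs. This layout is a
    hypothetical representation, not the MASC file format.
    """
    rng = random.Random(seed)

    # Group the candidate sentences by genre and shuffle each group.
    by_genre = defaultdict(list)
    for genre, sentence in instances:
        by_genre[genre].append(sentence)
    for sentences in by_genre.values():
        rng.shuffle(sentences)

    # Round-robin over genres: each genre contributes one sentence per pass.
    # Genres that run out of sentences simply drop out of later passes.
    genres = sorted(by_genre)
    picked = []
    while len(picked) < total and any(by_genre[g] for g in genres):
        for g in genres:
            if by_genre[g] and len(picked) < total:
                picked.append((g, by_genre[g].pop()))
    return picked
```

With five genres that each hold at least 10 candidate sentences, a call with total=50 returns 10 sentences per genre. If some genres are too small, the remaining genres make up the difference where they can.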

The revised inventory, which will be released as WordNet 3.1, was then used to annotate 1000 occurrences of each word in its sentence context. Because of its small size, MASC typically contains fewer than 1000 occurrences of a given word; the remaining sentences were therefore drawn from the 15 million words of the OANC. In cases where 1000 occurrences did not exist in the MASC and OANC combined, fewer occurrences were tagged. The full set of occurrences for each word was tagged by at least one tagger. A 100-sentence subset of the 1000 sentences for each word was annotated by at least two taggers, to serve as a cross-validation sample for inter-annotator agreement statistics.

Inter-annotator agreement studies

We have performed extensive inter-annotator agreement studies on the MASC sense annotations, reported in Passonneau et al., 2009 and Passonneau et al., 2010. In addition to inter-annotator agreement statistics, the release documents the words annotated in each round, the sense labels for each word, the sentences for each word, and the annotator or annotators for each sense assignment to each word in context. For the multiply annotated data in rounds 2-4, we include raw tables for each word in the form expected by Ron Artstein's calculate_alpha.pl Perl script, so that the agreement numbers can be regenerated. In round 5, we used a script that applies a distance metric that is not yet available for release.
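For readers who want to see what an agreement coefficient of this kind measures, the following is a minimal sketch of Krippendorff's alpha with a nominal (match or mismatch) distance. It assumes, from the script name only, that alpha is the coefficient reported for rounds 2-4. It is not a port of calculate_alpha.pl, it does not read the released raw tables, and the input layout (one list of labels per sentence, with None for a missing annotation) is a hypothetical choice made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list per annotated item; each inner list contains the
    labels assigned by the different annotators, with None where an
    annotator gave no label. (Hypothetical layout for this sketch.)
    """
    # Coincidence matrix: every ordered pair of labels within an item counts
    # 1 / (m - 1), where m is the number of labels that item received.
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two labels carry no information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item labelled by two annotators")

    # Observed and expected disagreement (nominal distance: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

Alpha equals 1 when annotators agree on every item and drops toward 0 as agreement approaches what chance alone would produce. Other distance metrics, such as the one used in round 5, would replace the match-or-mismatch test in the observed and expected sums.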

Round 1 was a small pilot study, involving only 50 occurrences of each of 10 words tagged by two annotators, carried out to test the WordNet sense inventory revision process and to gather information about the task before commencing the full study. No inter-annotator agreement was computed for this data.

Annotation Guidelines

Christiane Fellbaum prepared the annotation guidelines based on her previous word sense annotation projects. The most recent version is in the file tagging.guidelines.v3.doc, which is included in the "doc" directory of the distribution.

Annotation tool

The Sense Annotation Tool for the American National Corpus (SATANiC) was developed by Keith Suderman during the course of the MASC word sense annotation project and has been updated several times. A screenshot of the tool appears in Passonneau et al., 2009. The current version displays the WordNet sense glosses for each word, plus four additional options:

  1. Glob

  2. No sense is appropriate

  3. Wrong part of speech

  4. Not enough context is available

Glob is used to identify collocations, as defined in the annotation guidelines.

FrameNet annotations

The FrameNet team is annotating the 100-sentence cross-validation sample for each word with FrameNet frames and frame elements, providing direct comparisons of WordNet and FrameNet sense assignments in attested sentences. Note that several MASC texts have been fully annotated for FrameNet frames and frame elements, in addition to the WordNet-tagged sentences.

Distribution

For convenience, the WordNet and FrameNet annotated sentences are provided as a stand-alone “sentence corpus”, with the WordNet and FrameNet annotations represented in standoff files. Each sentence in this corpus is also linked to its occurrence in the original text in MASC or the OANC, so that the context and other annotations associated with the sentence may be retrieved if desired. The WordNet-FrameNet sentence corpus can be downloaded here.
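To illustrate the standoff idea in general terms, the sketch below keeps the sentence text untouched and stores each annotation separately as a pair of character offsets plus a label. The class name, field names, layer names, and label values are all hypothetical placeholders; the actual file format of the sentence corpus is not described here and should be taken from the distribution itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    start: int   # inclusive character offset into the sentence text
    end: int     # exclusive character offset into the sentence text
    layer: str   # hypothetical layer name, e.g. "wordnet" or "framenet"
    label: str   # hypothetical placeholder for a sense label or frame name


def annotated_spans(text, annotations):
    """Resolve standoff annotations against `text`.

    Returns (covered_text, layer, label) triples ordered by position,
    then by layer name. The text itself is never modified.
    """
    resolved = []
    for ann in sorted(annotations, key=lambda a: (a.start, a.end, a.layer)):
        if not (0 <= ann.start < ann.end <= len(text)):
            raise ValueError(f"offsets out of range: {ann}")
        resolved.append((text[ann.start:ann.end], ann.layer, ann.label))
    return resolved


# Usage with made-up labels: two layers annotate the same word independently.
sentence = "The judge was fair to both sides."
start = sentence.index("fair")
layers = [
    StandoffAnnotation(start, start + 4, "wordnet", "SENSE_A"),
    StandoffAnnotation(start, start + 4, "framenet", "FRAME_X"),
]
for covered, layer, label in annotated_spans(sentence, layers):
    print(covered, layer, label)
```

Because the annotations only point into the text, the WordNet and FrameNet layers can be added, revised, or compared without touching the sentence or each other.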