Each MASC text is in a separate primary data document, with an associated header file and segmentation file. All annotations are contained in separate annotation files whose names indicate the contents. The following lists the files and describes their contents.
For an overview of the LAF/GrAF corpus structure and headers, see Ide, N., Suderman, K. (forthcoming). The Linguistic Annotation Framework: A Standard for Annotation Interchange and Merging. Language Resources and Evaluation.
MASC Corpus Header
resource-header.xml : The MASC corpus header included in the distribution contains information concerning the provenance, compilation procedures, organization, and content of the corpus. The following sections of the header are referred to by text and annotation files:
- Category class declaration (element name: ClassDecl) : provides the category labels used to identify major and minor genre in text document headers.
- File structure (element name: fileStruct) : provides information about all files, their naming conventions, and dependencies among annotation files.
- Annotation spaces (element name: annotationSpaces) : identifies the annotation sets used in the annotation files and their naming conventions, as well as the conceptual annotation “layer” to which each set belongs.
- Annotation (element name: annotationDecl) : identifies an annotation type included in each annotation set, indicates the method by which it was produced, and provides a brief description and a link to the documentation and to the project responsible for creating the annotation.
- Groups (element name: groups) : identifies different annotation layers or other selected portions in the corpus and their naming conventions.
- Media (element name: media) : identifies the formats of files included in the corpus and annotations and indicates the suffix for files conforming to that format.
- Anchors (element name: anchorType) : defines the mechanisms for pointing into primary data, depending on media type.
MASC Text (document) Headers
<file-name>.hdr contains the header for the document, which provides information about the provenance of the text and specifies the medium (UTF-8 text for all MASC documents), genre, sub-genre, domain, and subject information (element name: textClass), as well as the location of the primary text file and all associated annotation files, for human or machine consumption.
MASC Annotation Headers
The header for each annotation file is included in the annotation file itself, rather than being provided as a separate XML document. Annotation headers indicate the location of all related documents and dependencies upon other annotation files. This dependency information is used by some ANC tools to ensure that all required documents are loaded into GATE, UIMA, etc. Annotation headers also provide information about the types of anchors used in the annotation (character offsets for text) and the annotation sets used.
All MASC annotations are in GrAF format and linked to regions defined over the primary data or to other annotations.
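The dependency information described above can be used to decide the order in which files are loaded. The following Python sketch illustrates the idea with a topological sort. The dependency map is an illustrative assumption written by hand from the descriptions on this page; it is not read from real annotation headers, and the function name load_order is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative dependency map (an assumption, not parsed from real headers):
# each annotation file suffix maps to the suffixes it points into.
DEPENDS_ON = {
    "seg": set(),
    "penn": {"seg"},
    "nc": {"penn"},
    "vc": {"penn"},
    "ptb-tok": {"seg"},
    "ptb": {"ptb-tok"},
}


def load_order(wanted, depends_on=DEPENDS_ON):
    """Return every suffix needed for `wanted`, with dependencies first."""
    # Collect the requested suffixes plus everything they depend on.
    needed = set()
    stack = list(wanted)
    while stack:
        suffix = stack.pop()
        if suffix not in needed:
            needed.add(suffix)
            stack.extend(depends_on[suffix])
    # Sort so that each file appears after the files it points into.
    sorter = TopologicalSorter({s: depends_on[s] for s in needed})
    return list(sorter.static_order())


print(load_order(["nc"]))
```

A tool that reads the real dependencies from each annotation header could build the same kind of map and load the files in the resulting order.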
PRIMARY DATA FILES
Primary Data file
<file-name>.txt : The primary data in UTF-8 character encoding.
<file-name>-seg.xml : All MASC documents are associated with a base segmentation file that contains the minimal set of “regions” defined over the primary data. The regions represent the finest granularity required to define the different tokenizations over the primary data.
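As a rough illustration of how regions relate to the primary data, the Python sketch below reads region elements and slices the text they cover. It assumes that each region is an element named region with an xml:id attribute and an anchors attribute holding two space-separated character offsets; the sample XML, its placeholder namespace, and the function name read_regions are made up for this example and should be checked against an actual segmentation file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the standard XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_regions(seg_xml, text):
    """Yield (region_id, start, end, covered_text) for each region element."""
    root = ET.fromstring(seg_xml)
    for elem in root.iter():
        # Compare local names so the sketch does not depend on a namespace URI.
        if elem.tag.rsplit("}", 1)[-1] != "region":
            continue
        # Assumption: anchors holds "start end" as character offsets.
        start, end = (int(x) for x in elem.get("anchors").split())
        yield elem.get(XML_ID), start, end, text[start:end]


# Hypothetical sample data; the namespace below is a placeholder.
sample_text = "Hello world"
sample_seg = """<graph xmlns="urn:example:graf">
  <region xml:id="seg-r0" anchors="0 5"/>
  <region xml:id="seg-r1" anchors="6 11"/>
</graph>"""

regions = list(read_regions(sample_seg, sample_text))
for region in regions:
    print(region)
```

When working with real files, the primary data should be decoded as UTF-8 before slicing, so that the offsets index characters rather than bytes.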
ANNOTATION FILES ASSOCIATED WITH ALL MASC TEXTS
Token / part-of-speech
Two different tokenization and part-of-speech files are included for each text in MASC I:
<file-name>-penn.xml : tokens automatically produced by GATE’s ANNIE tokenizer, manually corrected, with lemma and part-of-speech annotation using the Penn tagset. **These tags have not been fully hand-corrected**.
<file-name>-s.xml : Sentence regions defined over the primary data, produced automatically by the GATE ANNIE sentence splitter and manually validated.
Note that the Penn Treebank syntactic annotation files also contain sentence boundaries.
<file-name>-nc.xml : noun chunks produced automatically by an enhanced version of GATE’s noun phrase chunker and manually corrected. See the noun chunk validation guidelines. Noun chunks point to tokens in the associated “penn-tok” file for that text.
<file-name>-vc.xml : verb chunks produced automatically by an enhanced version of GATE’s verb group chunker and manually corrected. See the verb chunk validation guidelines. Verb chunks point to tokens in the associated “penn-tok” file for that text.
<file-name>-ne.xml : Named entity annotations for person, location, organization, and date produced by an in-house named entity recognizer implemented in GATE using the JAPE language. See the named entity validation guidelines. Named entity annotations point to tokens in the associated “ptb-tok” file for that text.
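The naming conventions listed so far make it straightforward to check whether a document has its full set of core files. The Python sketch below builds the expected file names from a document base name using the suffixes given on this page. It assumes that all files for a document sit in the same directory, which may not match the layout of the distribution; the function names and the base name "sample" are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the file descriptions above: primary data, header,
# base segmentation, and the annotation files associated with all texts.
CORE_SUFFIXES = [
    ".txt",
    ".hdr",
    "-seg.xml",
    "-penn.xml",
    "-s.xml",
    "-nc.xml",
    "-vc.xml",
    "-ne.xml",
]


def expected_files(base_name, directory="."):
    """Return the paths of the core files expected for one document."""
    return [Path(directory) / f"{base_name}{suffix}" for suffix in CORE_SUFFIXES]


def missing_files(base_name, directory="."):
    """Return the expected core files that are not present on disk."""
    return [p for p in expected_files(base_name, directory) if not p.exists()]


for path in expected_files("sample"):
    print(path.name)
```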
ANNOTATION FILES ASSOCIATED WITH SOME TEXTS
Penn Treebank syntax
<file-name>-ptb.xml : Syntactic annotation produced by the Penn Treebank project in their internal bracketed format and transduced to GrAF format for inclusion in MASC. Bottom-most nodes in the phrase structure trees point to tokens in the associated “ptb-tok” file for that text.
For texts including Penn Treebank annotations, a second tokenization file is provided:
<file-name>-ptb-tok.xml : tokenization and part-of-speech annotation provided by the Penn Treebank project, manually validated by that project.
FrameNet frame elements
<file-name>-fn.xml : Semantic role annotation produced by the FrameNet project in their internal XML format and transduced to GrAF format for inclusion in MASC. Annotations point to tokens in the associated “fn-tok” file for that text, which include part-of-speech annotations using the Penn tagset.
For texts including FrameNet annotations, a third tokenization file is provided:
<file-name>-penn-fn-tok.xml : tokenization and part-of-speech annotation using Penn tags provided by the FrameNet project, manually validated by that project.
<file-name>-mpqa.xml : Opinion annotation produced by the Pittsburgh Opinion Annotation project using GATE. Annotations point to tokens in the associated “penn” file for that text (see above).
<file-name>-cb.xml : Annotation for committed beliefs produced for the LU Corpus by researchers at Carnegie Mellon University using GATE.
<file-name>-event.xml : Annotation for events produced for the LU Corpus by researchers at Carnegie Mellon University using GATE.
ANNOTATIONS FOR ALL TEXTS THAT ARE NOT YET AVAILABLE
<file-name>-coRef.xml : coreference annotation produced automatically by an enhanced version of GATE’s pronominal and nominal coreferencers and manually corrected. See the coreference validation guidelines.
<file-name>-clause.xml : annotation of clause boundaries produced automatically by software developed at University “A.I. Cuza” in Iasi, Romania and manually corrected, plus nucleus/satellite annotation for all clauses.