MASC Structure

Each MASC text is in a separate primary data document, with an associated header file and segmentation file. All annotations are contained in separate annotation files whose names indicate the contents. The following lists the files and describes their contents.


HEADERS

MASC Corpus Header

MASC_1.0.2-corpus-header.xml : The MASC corpus header included in the distribution contains information concerning the provenance, compilation procedures, organization, and content of the corpus. The following sections of the header are referred to by text and annotation files:

  1. Category class declaration (element name: ClassDecl) : provides the category labels used to identify major and minor genre in text document headers.

  2. File structure (element name: fileStruct) : provides information about all files, their naming conventions, and dependencies among annotation files.

  3. Annotation sets (element name: annotationSets) : identifies the annotation sets used in the annotation files and their naming conventions, as well as the conceptual annotation “layer” to which each set belongs.

  4. Annotation (element name: annotation) : identifies an annotation type included in each annotation set, indicates the method by which it was produced, and provides a brief description and a link to the documentation and to the project responsible for creating the annotation.

  5. Layers (element name: layers) : identifies the different annotation layers in the corpus and their naming conventions.

  6. Media (element name: media) : identifies the formats of files included in the corpus and annotations and indicates the suffix for files conforming to that format.
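As a rough illustration of how a tool might locate these sections, the Python sketch below counts elements by local name in a small stand-in header. The sample XML, its root element, and the helper names are hypothetical; only the six element names come from the list above. The real header's structure and namespaces should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a corpus header. The real
# file has far more content and may declare an XML namespace.
SAMPLE_HEADER = """\
<header>
  <ClassDecl/>
  <fileStruct/>
  <annotationSets/>
  <annotation/>
  <annotation/>
  <layers/>
  <media/>
</header>
"""

# Element names taken from the list of header sections above.
SECTIONS = ["ClassDecl", "fileStruct", "annotationSets",
            "annotation", "layers", "media"]


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_sections(xml_text):
    """Return how many elements of each listed section name occur."""
    root = ET.fromstring(xml_text)
    counts = {name: 0 for name in SECTIONS}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in counts:
            counts[name] += 1
    return counts


section_counts = count_sections(SAMPLE_HEADER)
print(section_counts)
```

Matching on the local name keeps the sketch usable whether or not the header declares a namespace.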

MASC Text (document) Headers

<file-name>.anc contains the header for the document <file-name>. It provides information about the provenance of the text and specifies the medium (UTF-8 text for all MASC documents), genre, sub-genre, domain, and subject (element name: textClass). It also gives the location of the primary text file and all associated annotation files, for human or machine consumption.

MASC Annotation Headers

The header for each annotation file is included in the annotation file itself, rather than being provided as a separate XML document. Annotation headers indicate the location of all related documents and any dependencies upon other annotation files. Some ANC tools use this dependency information to ensure that all required documents are loaded into GATE, UIMA, etc. Annotation headers also provide information about the types of anchors used in the annotation (character offsets for text) and the annotation sets used.

All MASC annotations are in GrAF format and linked to regions defined over the primary data or to other annotations.
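To make the idea of dependency-driven loading concrete, here is a minimal Python sketch that orders files so that each one is loaded after the files it points to. It is not the ANC tools' implementation. The dependency table is transcribed by hand from the "point to tokens in" statements given for the individual annotation files in this document, rather than read from real annotation headers. The entries making token files depend on the segmentation file are an assumption, and `load_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Keys are file suffixes (without ".xml"); values are the files they need.
DEPENDS_ON = {
    "nc": ["ptb-tok"],
    "vc": ["ptb-tok"],
    "ne": ["ptb-tok"],
    "ptb": ["ptb-tok"],
    "fn": ["fn-tok"],
    "mpqa": ["penn"],
    # Assumption: each tokenization is anchored to the base segmentation.
    "ptb-tok": ["seg"],
    "fn-tok": ["seg"],
    "penn": ["seg"],
}


def load_order(wanted):
    """Return `wanted` plus everything it requires, dependencies first."""
    needed = {}
    stack = [wanted]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed[name] = DEPENDS_ON.get(name, [])
            stack.extend(needed[name])
    # TopologicalSorter takes a mapping of node -> predecessors and
    # yields every predecessor before the nodes that depend on it.
    return list(TopologicalSorter(needed).static_order())


print(load_order("ne"))
```

A real loader would presumably read the dependencies from each annotation header instead of a hard-coded table.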


PRIMARY DATA FILES

Primary Data file

<file-name>.txt : The primary data in UTF-8 character encoding.


Segmentation file

<file-name>-seg.xml : All MASC documents are associated with a base segmentation file that contains the minimal set of “regions” defined over the primary data. The regions represent the finest granularity required to define the different tokenizations over the primary data.
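The following Python sketch shows one way regions could be resolved against the primary data, assuming each region carries a pair of character offsets. The sample XML, the `id` and `anchors` attribute names, and the helper `region_texts` are assumptions made for illustration; consult an actual segmentation file for the exact markup.

```python
import xml.etree.ElementTree as ET

# Stand-in primary data. "\u00e9" is one character but two UTF-8 bytes,
# so character offsets and byte offsets differ after it.
PRIMARY_TEXT = "Caf\u00e9 prices rose."

# Hypothetical segmentation sample: each region gives "start end"
# character offsets into the primary text (end is exclusive).
SAMPLE_SEG = """\
<graph>
  <region id="r0" anchors="0 4"/>
  <region id="r1" anchors="5 11"/>
  <region id="r2" anchors="12 16"/>
  <region id="r3" anchors="16 17"/>
</graph>
"""


def region_texts(seg_xml, text):
    """Map each region id to the substring of `text` it covers."""
    root = ET.fromstring(seg_xml)
    spans = {}
    for elem in root.iter():
        # Compare local names so a namespaced file would also match.
        if elem.tag.rsplit("}", 1)[-1] != "region":
            continue
        start, end = (int(n) for n in elem.get("anchors").split())
        spans[elem.get("id")] = text[start:end]
    return spans


for region_id, covered in region_texts(SAMPLE_SEG, PRIMARY_TEXT).items():
    print(region_id, repr(covered))
```

When working with real files, the primary data should be read as decoded text (for example `open(path, encoding="utf-8")`) so that slicing uses character offsets rather than byte offsets.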



ANNOTATION FILES ASSOCIATED WITH ALL MASC I TEXTS


Token / part-of-speech

Two different tokenization and part-of-speech files are included for each text in MASC I:

<file-name>-penn.xml : tokens automatically produced by GATE’s ANNIE tokenizer, manually corrected, with lemma and part-of-speech annotation using the Penn tagset. **These tags have not been fully hand-corrected**.

<file-name>-ptb-tok.xml : tokenization and part-of-speech annotation provided by the Penn Treebank project, manually validated by that project.


Sentence boundary

<file-name>-s.xml : Sentence regions defined over the primary data, produced automatically by the GATE ANNIE sentence splitter and manually validated.

** Note that the Penn Treebank syntactic annotation files also contain sentence boundaries.


Shallow parse

<file-name>-nc.xml : noun chunks produced automatically by an enhanced version of GATE’s noun phrase chunker and manually corrected. See the noun chunk validation guidelines. Noun chunks point to tokens in the associated “ptb-tok” file for that text.

<file-name>-vc.xml : verb chunks produced automatically by an enhanced version of GATE’s verb group chunker and manually corrected. See the verb chunk validation guidelines. Verb chunks point to tokens in the associated “ptb-tok” file for that text.


Named entities

<file-name>-ne.xml : Named entity annotations for person, location, organization, and date produced by an in-house named entity recognizer implemented in GATE using the JAPE language. See the named entity validation guidelines. Named entity annotations point to tokens in the associated “ptb-tok” file for that text.


Penn Treebank syntax

<file-name>-ptb.xml : Syntactic annotation produced by the Penn Treebank project in their internal bracketed format and transduced to GrAF format for inclusion in MASC. Bottom-most nodes in the phrase structure trees point to tokens in the associated “ptb-tok” file for that text. See the notes on Penn Treebank annotation conversion (coming soon).


ANNOTATION FILES ASSOCIATED WITH SOME TEXTS


FrameNet frame elements

<file-name>-fn.xml : Semantic role annotation produced by the FrameNet project in their internal XML format and transduced to GrAF format for inclusion in MASC. Annotations point to tokens in the associated “fn-tok” file for that text, which includes part-of-speech annotations using the Penn tagset.

For texts including FrameNet annotations, a third tokenization file is provided:

<file-name>-fn-tok.xml : tokenization and part-of-speech annotation using Penn tags provided by the FrameNet project, manually validated by that project.


MPQA opinion

<file-name>-mpqa.xml : Opinion annotation produced by the Pittsburgh Opinion Annotation project using GATE. Annotations point to tokens in the associated “penn” file for that text (see above).


Committed belief

<file-name>-cb.xml : Annotation for committed beliefs produced for the LU Corpus by researchers at Carnegie Mellon University using GATE.


Event

<file-name>-event.xml : Annotation for events produced for the LU Corpus by researchers at Carnegie Mellon University using GATE.
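Taken together, the naming conventions in this document can be summarized in a small lookup table. The Python sketch below builds the expected file names for one document and, optionally, checks which of them exist in a directory. The suffixes are copied from this document; the labels, function names, and the base name `example-doc` are hypothetical.

```python
from pathlib import Path

# Suffixes as listed in this document; the labels are informal.
SUFFIXES = {
    "primary text": ".txt",
    "document header": ".anc",
    "segmentation": "-seg.xml",
    "Penn tokens (ANNIE)": "-penn.xml",
    "Penn Treebank tokens": "-ptb-tok.xml",
    "sentences": "-s.xml",
    "noun chunks": "-nc.xml",
    "verb chunks": "-vc.xml",
    "named entities": "-ne.xml",
    "Penn Treebank syntax": "-ptb.xml",
    # The remaining files accompany only some texts.
    "FrameNet frame elements": "-fn.xml",
    "FrameNet tokens": "-fn-tok.xml",
    "MPQA opinion": "-mpqa.xml",
    "committed belief": "-cb.xml",
    "event": "-event.xml",
}


def expected_files(base):
    """Return label -> file name for a document with base name `base`."""
    return {label: base + suffix for label, suffix in SUFFIXES.items()}


def present_files(directory, base):
    """Return only those expected files that exist in `directory`."""
    folder = Path(directory)
    return {label: name
            for label, name in expected_files(base).items()
            if (folder / name).is_file()}


for label, name in expected_files("example-doc").items():
    print(f"{label:26} {name}")
```

Because several annotation types exist only for some texts, a check such as `present_files` is a safer starting point than assuming every suffix is available.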