At present, MASC includes seventeen different types of linguistic annotation (* = in production; ** = currently available in original format only):
|Annotation type|No. words|
|---|---|
|POS (Penn Treebank)|506768|
|Named Entities (person, org, loc, date)|506768|
|Penn Treebank syntax|506768|
|Clause boundaries, nucleus/satellite distinctions, discourse markers|*506768|
|FrameNet frames/frame elements|39160|
All MASC annotations, whether contributed or produced in-house, are transduced to the Graph Annotation Framework (GrAF) defined by ISO TC37 SC4’s Linguistic Annotation Framework (LAF). GrAF is an XML serialization of the LAF abstract model of annotations, which consists of a directed graph decorated with feature structures providing the annotation content. GrAF’s primary role is to serve as a “pivot” format for transducing among annotations represented in different formats.
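As a rough illustration of the abstract model described above (a directed graph whose nodes carry feature structures), the following Python sketch may help. It is not the GrAF XML schema or any official API: the class names, the `build_example` helper, the node ids, and the feature names (`pos`, `base`) are all hypothetical, and feature structures are simplified to flat name/value dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """An annotation label plus a (simplified, flat) feature structure."""
    label: str
    features: dict = field(default_factory=dict)


@dataclass
class Node:
    """A graph node decorated with zero or more annotations."""
    node_id: str
    annotations: list = field(default_factory=list)


@dataclass
class Graph:
    """A minimal directed graph: nodes by id, edges as (source, target) pairs."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, *annotations):
        self.nodes[node_id] = Node(node_id, list(annotations))
        return self.nodes[node_id]

    def add_edge(self, source, target):
        # Refuse dangling edges so the graph stays consistent.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source, target))

    def children(self, node_id):
        # Targets of all edges leaving node_id, in insertion order.
        return [target for source, target in self.edges if source == node_id]


def build_example():
    """Build a tiny hypothetical graph: one phrase node dominating two tokens."""
    graph = Graph()
    graph.add_node("np1", Annotation("NP"))
    graph.add_node("t1", Annotation("tok", {"pos": "DT", "base": "the"}))
    graph.add_node("t2", Annotation("tok", {"pos": "NN", "base": "corpus"}))
    graph.add_edge("np1", "t1")
    graph.add_edge("np1", "t2")
    return graph
```

Because every source format is mapped into the same graph shape, a converter only needs to read from or write to this one structure, which is the sense in which such a representation can act as a pivot.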
The layering of annotations over MASC texts dictates the use of a stand-off annotation representation format, in which each annotation is contained in a separate document linked to the primary data. Each text in the corpus is provided in UTF-8 character encoding in a separate file, which includes no annotation or markup of any kind.
Each file is associated with a set of GrAF standoff files, one for each annotation type, containing the annotations for that text.
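The stand-off arrangement can be sketched in a few lines of Python. This is only an illustration of the principle, not the actual MASC file layout: the sample sentence, the layer structures, the field names (`id`, `start`, `end`, `ref`, `pos`), and the `resolve` helper are all assumptions, and offsets are taken to be character offsets into the decoded text.

```python
# Stands in for the contents of a plain UTF-8 text file with no markup.
primary_text = "MASC is open."

# One stand-off layer: token regions given as character offsets into the text.
token_layer = [
    {"id": "t0", "start": 0, "end": 4},
    {"id": "t1", "start": 5, "end": 7},
    {"id": "t2", "start": 8, "end": 12},
    {"id": "t3", "start": 12, "end": 13},
]

# A second, separate layer that refers to tokens by id rather than to the text.
pos_layer = [
    {"ref": "t0", "pos": "NNP"},
    {"ref": "t1", "pos": "VBZ"},
    {"ref": "t2", "pos": "JJ"},
    {"ref": "t3", "pos": "."},
]


def resolve(text, tokens, pos_tags):
    """Join the layers: return (surface string, POS tag) pairs in token order."""
    tag_by_token = {entry["ref"]: entry["pos"] for entry in pos_tags}
    return [
        (text[tok["start"]:tok["end"]], tag_by_token.get(tok["id"]))
        for tok in tokens
    ]
```

Calling `resolve(primary_text, token_layer, pos_layer)` recovers the tagged tokens without the primary text ever being modified, so further layers can be added or replaced independently.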
In addition, each text is associated with a header document that provides metadata together with machine-processable information about the associated annotations and the inter-relations among the annotation layers. See Corpus structure and format for a detailed description.
Contributed annotations are also distributed in their original format, if different from GrAF.