ANC Tool

The ANC Tool is a Java program that transduces OANC and MASC texts and their GrAF standoff annotations into several different formats suitable for use with other systems and tools. ANC2Go provides the same functionality as the ANC Tool as a web service; the ANC Tool is intended for those who wish to run the transduction processes on their own machines.

The ANC GrAF Tool can be downloaded in a number of formats.

Note: Generating the Mac OS X and Windows executables from the Java jar files is still experimental. If the native executables appear to hang or quit unexpectedly, see the instructions below for running the jar file from the command line with increased memory settings.

Output formats

The following output formats are supported.

  • Inline XML
    Converts texts with standoff GrAF annotations into inline XML encoded according to the XCES.
  • Word with part of speech
    Converts texts with token annotations in GrAF to words with part of speech tags in a format compatible with programs such as MonoConc/MonoConc Pro. Each token is written as the word, then the separator, then the POS tag; the separator is specified by the user. For example, if an underscore is specified as the separator, the result is word_NN.
  • Word with part of speech (WordSmith)
    Converts texts with token annotations in GrAF to word with part of speech tag using the input format for the WordSmith Concordancer.
  • NLTK POS tagged corpus
    Converts texts with token annotations in GrAF to the format required for input to NLTK using the NLTK TaggedCorpusReader.
  • CoNLL
    Converts GrAF annotations into the CoNLL IOB format.
  • UIMA CAS
    Converts GrAF annotations into a UIMA CAS document that can be loaded directly into UIMA. See the UIMA Tools page for more information.
NOTE: To import and export GrAF annotations in GATE, use the GATE plugins available from the GATE Tools page.
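
For readers unfamiliar with the IOB scheme named in the CoNLL entry above, the following Python sketch shows the general idea: each token receives a tag marking it as the beginning of a chunk (B-), inside a chunk (I-), or outside any chunk (O). It uses the variant in which every chunk starts with a B- tag. The function name to_iob, the token-index spans, and the chunk labels are assumptions made for illustration; which variant and which columns the ANC Tool writes are not stated here.

```python
def to_iob(tokens, chunks):
    """Pair each token with an IOB tag.

    chunks is a list of (start, end, label) token-index spans, end exclusive.
    Hypothetical helper for illustration; not part of the ANC Tool.
    """
    tags = ["O"] * len(tokens)           # tokens outside every chunk stay "O"
    for start, end, label in chunks:
        tags[start] = f"B-{label}"       # first token of the chunk
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"   # remaining tokens of the chunk
    return list(zip(tokens, tags))


tokens = ["The", "quick", "fox", "jumped", "over", "it"]
chunks = [(0, 3, "NP"), (3, 4, "VP")]
for word, tag in to_iob(tokens, chunks):
    print(word, tag)
```

In CoNLL-style files, each token typically appears on its own line with its tags in whitespace-separated columns, and sentences are separated by blank lines.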

 

Installation

Download the ANC Tool and unzip the file to any convenient location.

Running

The ANC Tool is an executable Java application that can be started on most operating systems by double-clicking the jar file. However, it is recommended that the jar file be started from the command line with the following command:

    java -Xmx500M -jar ANCTool-x.y.z-jar.jar

Here x.y.z is the version number (e.g., 1.2.6). The -Xmx500M option sets the maximum amount of heap memory that Java will make available to the ANC Tool to 500 MB. If the ANC Tool appears to hang or quits abruptly, run the jar file from the command line with a larger value (for example, -Xmx1024M).
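
If the tool is launched from a script, the command can be assembled with the memory limit as a parameter. The Python sketch below is one way to do this; the function name build_command and its parameters are assumptions made for illustration, and only the java options shown in the command above are used.

```python
import shlex


def build_command(version, heap_mb=500):
    """Return the argument list for starting the ANC Tool jar.

    Hypothetical helper: version fills in x.y.z in the jar file name,
    heap_mb is the value passed to the JVM's -Xmx option in megabytes.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    return ["java", f"-Xmx{heap_mb}M", "-jar", f"ANCTool-{version}-jar.jar"]


command = build_command("1.2.6", heap_mb=1024)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the directory that contains the jar file.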

The first time you run the ANC Tool, you will be asked to select the ANC home directory. This is the root directory of your ANC installation and should include (at least) the ANC’s data directory.

Using

The XML Tab

  1. Select an input directory containing the ANC files to process. The program will recurse through all directories rooted at the input directory and process all the ANC files found.
  2. Select an output directory. The XML files that are created will be placed in the output directory. If the Copy directory structure check box is selected, the directory structure of the input directory will be mirrored and directories will be created as needed. Otherwise, all files will be created in the output directory itself. The default is to copy the directory structure. It is possible to select the input directory as the output directory, but it is highly recommended that the input and output directories be separate directories.
  3. Select the annotations to include.
  4. Click the Process button.
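
The two output layouts described in step 2 can be sketched in Python. This only illustrates the difference between mirroring the directory structure and writing everything into one directory; it is not the ANC Tool's own code. The function name plan_outputs, the .txt input suffix, and the .xml output suffix are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def plan_outputs(input_dir, output_dir, copy_structure=True, suffix=".txt"):
    """Map each input file to the path where its output would be written.

    Hypothetical helper: the suffixes are assumptions, not ANC Tool behavior.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    plan = {}
    # rglob recurses through every directory rooted at input_dir.
    for source in sorted(input_dir.rglob(f"*{suffix}")):
        relative = source.relative_to(input_dir)
        if copy_structure:
            target = output_dir / relative       # mirror the subdirectories
        else:
            target = output_dir / relative.name  # flat: file name only
        plan[source] = target.with_suffix(".xml")
    return plan


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    source_root = Path(tmp) / "input"
    (source_root / "written" / "letters").mkdir(parents=True)
    (source_root / "written" / "letters" / "doc1.txt").write_text("sample")
    for layout in (True, False):
        plan = plan_outputs(source_root, Path(tmp) / "output", layout)
        for target in plan.values():
            print(layout, target.relative_to(tmp))
```

In the flat layout of this sketch, two input files with the same name in different subdirectories map to the same target path. How the ANC Tool itself handles that case is not stated here.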

The MonoConc Tab

  1. Select the input and output directories as above.
  2. Select the part of speech tags to include. The part of speech tags are the only annotations that can be included when generating text files.
  3. Select a separator character. This is the character that will be used to separate a word from its part of speech tag. The default character is the underscore. It is possible to use more than one character as the separator.
  4. Click the Process button.
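
To show how the separator from step 3 shapes the output, and how such output can be read back, here is a short Python sketch. The function names are hypothetical, and the sketch assumes that tokens are separated by whitespace and that tags never contain the separator; neither assumption is confirmed by this page.

```python
def join_tagged(pairs, separator="_"):
    """Write (word, tag) pairs as word<separator>tag tokens on one line."""
    return " ".join(f"{word}{separator}{tag}" for word, tag in pairs)


def split_tagged(line, separator="_"):
    """Recover (word, tag) pairs from a line written by join_tagged.

    Splitting on the last occurrence of the separator keeps words that
    themselves contain the separator intact.
    """
    return [tuple(token.rsplit(separator, 1)) for token in line.split()]


pairs = [("The", "DT"), ("well_known", "JJ"), ("corpus", "NN")]
for separator in ("_", "::"):          # single- and multi-character separators
    line = join_tagged(pairs, separator)
    print(line)
    assert split_tagged(line, separator) == pairs
```

A separator that rarely occurs in ordinary text makes the output easier to parse unambiguously.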

The WordSmith Tab

  1. Select the input and output directories as above.
  2. Select the part of speech tags to include. The part of speech tags are the only annotations that can be included when generating text files.
  3. Click the Process button.