STAC

Educe is a library for working with a variety of discourse corpora. This tutorial shows what working with the STAC corpus in educe looks like.

We’ll be working with a tiny fragment of the corpus included with educe. You may find it useful to symlink your larger copy from the STAC distribution and modify this tutorial accordingly.

Installation

git clone https://github.com/irit-melodi/educe.git
cd educe
pip install -r requirements.txt

Note: these instructions assume you are running within a virtual environment. If you are not, and you get permission-denied errors, replace pip with sudo pip.

Tutorial in browser (optional)

You can follow this tutorial either from the command line with your favourite text editor, or as an interactive web page via IPython:

pip install ipython
cd tutorials
ipython notebook

# some helper functions for the tutorial below

def text_snippet(text):
    "short text fragment"
    if len(text) < 43:
        return text
    else:
        return "{0}...{1}".format(text[:20], text[-20:])

def highlight(astring, color=1):
    "coloured text"
    return("\x1b[3{color}m{str}\x1b[0m".format(color=color, str=astring))

Reading corpus files (STAC)

Typically, the first thing we want to do when working in educe is to read the corpus in. This can be a bit slow, but as we will see later on, we can speed things up if we know what we’re looking for.

from __future__ import print_function
import educe.stac

# relative to the educe docs directory
data_dir = '../data'
corpus_dir = '{dd}/stac-sample'.format(dd=data_dir)

# read everything from our sample
reader = educe.stac.Reader(corpus_dir)
corpus = reader.slurp(verbose=True)

# print a text fragment from the first ten files we read
for key in list(corpus.keys())[:10]:
    doc = corpus[key]
    print("[{0}] {1}".format(key, doc.text()[:50]))
Slurping corpus dir [99/100]
[s1-league2-game1 [05] unannotated None]  199 : sabercat : anyone any clay? 200 : IG : nope
[s1-league2-game1 [13] units hjoseph]  521 : sabercat : skinnylinny 522 : sabercat : som
[s1-league2-game1 [10] units hjoseph]  393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [11] discourse hjoseph]  450 : skinnylinny : Argh 451 : skinnylinny : How
[s1-league2-game1 [10] unannotated None]  393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [02] units lpetersen]  75 : sabercat : anyone has any wood? 76 : skinnyl
[s1-league2-game1 [14] units SILVER]  577 : sabercat : skinny 578 : sabercat : I need 2
[s1-league2-game3 [03] discourse lpetersen]  151 : amycharl : got wood anyone? 152 : sabercat
[s1-league2-game1 [10] discourse hjoseph]  393 : skinnylinny : Shall we extend? 394 : saberc
[s1-league2-game1 [12] units SILVER]  496 : sabercat : yes! 497 : sabercat : :D 498 : s
Slurping corpus dir [100/100 done]

Faster reading

If you know that you only want to work with a subset of the corpus files, you can pre-filter the corpus before reading the files.

It helps to know here that an educe corpus is a mapping from file id keys to Documents. The FileId tells us what makes a Document distinct from another:

  • document (eg. s1-league2-game1): in STAC, the game that was played (here, season 1, league 2, game 1)
  • subdocument (eg. 05): a mostly arbitrary subdivision of the documents motivated by technical constraints (overly large documents would cause our annotation tool to crash)
  • stage (eg. units, discourse, parsed): the kinds of annotations available in the document
  • annotator (eg. hjoseph): the main annotator for a document (gold standard documents have the distinguished annotators, BRONZE, SILVER, or GOLD)

NB: unfortunately we have overloaded the word “document” here. When talking about file ids, “document” refers to a whole game. But when talking about actual annotation objects, an educe Document corresponds to one specific combination of document, subdocument, stage, and annotator.

import re

# nb: you can import this function from educe.stac.corpus
def is_metal(fileid):
    "is this a gold standard(ish) annotation file?"
    anno = fileid.annotator or ""
    return anno.lower() in ["bronze", "silver", "gold"]

# pick out gold-standard documents
subset = reader.filter(reader.files(),
                       lambda k: is_metal(k) and int(k.subdoc) < 4)
corpus_subset = reader.slurp(subset, verbose=True)
for key in corpus_subset:
    doc = corpus_subset[key]
    print("{0}: {1}".format(key, doc.text()[:50]))
Slurping corpus dir [11/12]
s1-league2-game1 [01] units SILVER:  1 : sabercat : btw, are we playing without the ot
s1-league2-game1 [01] discourse SILVER:  1 : sabercat : btw, are we playing without the ot
s1-league2-game1 [02] discourse SILVER:  75 : sabercat : anyone has any wood? 76 : skinnyl
s1-league2-game3 [01] discourse BRONZE:  1 : amycharl : i made it! 2 : amycharl : did the
s1-league2-game1 [03] discourse SILVER:  109 : sabercat : well done! 110 : IG : More clay!
s1-league2-game3 [02] units BRONZE:  73 : sabercat : skinny, got some ore? 74 : skinny
s1-league2-game3 [01] units BRONZE:  1 : amycharl : i made it! 2 : amycharl : did the
s1-league2-game1 [02] units SILVER:  75 : sabercat : anyone has any wood? 76 : skinnyl
s1-league2-game3 [02] discourse BRONZE:  73 : sabercat : skinny, got some ore? 74 : skinny
s1-league2-game1 [03] units SILVER:  109 : sabercat : well done! 110 : IG : More clay!
s1-league2-game3 [03] discourse BRONZE:  151 : amycharl : got wood anyone? 152 : sabercat
s1-league2-game3 [03] units BRONZE:  151 : amycharl : got wood anyone? 152 : sabercat
Slurping corpus dir [12/12 done]

from educe.corpus import FileId

# pick out an example document to work with; creating FileIds by hand
# is not something we would typically do (normally we would just iterate
# through a corpus), but it's useful for illustration
ex_key = FileId(doc='s1-league2-game3',
                subdoc='03',
                stage='units',
                annotator='BRONZE')
ex_doc = corpus[ex_key]
print(ex_key)
s1-league2-game3 [03] units BRONZE

Standing off

Most annotations in the STAC corpus are educe standoff annotations. In educe terms, this means that they (perhaps indirectly) extend the educe.annotation.Standoff class and provide a text_span() function. Much of our reasoning around annotations essentially consists of checking that their text spans overlap or enclose each other.
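The overlap and enclosure checks mentioned above can be sketched without educe at all. The Span class and span_text helper below are hypothetical stand-ins written for this illustration, not educe's own classes; the sample string and offsets are modelled on the example document used later in this tutorial, and the half-open interval convention is an assumption.

```python
from collections import namedtuple


# Hypothetical stand-in for a text span; NOT educe's own class.
class Span(namedtuple("Span", ["start", "end"])):
    "Half-open character interval [start, end) over the raw text"

    def encloses(self, other):
        "True if `other` lies entirely within this span"
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other):
        "True if the two spans share at least one character"
        return max(self.start, other.start) < min(self.end, other.end)


def span_text(raw_text, span):
    "Standoff lookup: the annotation holds offsets, the text lives elsewhere"
    return raw_text[span.start:span.end]


# made-up raw text, modelled on the sample document
raw = " 151 : amycharl : got wood anyone?"
turn = Span(1, 34)
resource = Span(22, 26)

print(span_text(raw, resource))         # wood
print(turn.encloses(resource))          # True
print(resource.overlaps(Span(25, 40)))  # True
print(resource.overlaps(Span(26, 40)))  # False (they only touch)
```

The point of the sketch is that a standoff annotation never stores its text, so questions such as "which resource is mentioned in this turn?" reduce to comparing pairs of offsets.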

As for the text spans, these refer to the raw text saved in files with an .ac extension (eg. s1-league1-game3.ac). In the Glozz annotation tool, these .ac text files form a pair with their .aa xml counterparts. Multiple annotation files can point to the same text file.

There are also some annotations that come from 3rd party tools, which we will uncover later.

Documents and EDUs

A document is a sort of giant annotation that contains three other kinds of annotation:

  • units - annotations that directly cover a span of text (EDUs, Resources, but also turns, dialogues)
  • relations - annotations that point from one annotation to another
  • schemas - annotations that point to a set of annotations
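To make the three kinds of annotation concrete, here is a minimal sketch using plain named tuples. Unit, Relation, Schema and the helper functions are hypothetical stand-ins for illustration only (educe's real classes carry more information), and the ids, labels and spans are made-up data.

```python
from collections import namedtuple

# Hypothetical stand-ins, for illustration only (NOT educe classes)
Unit = namedtuple("Unit", ["id", "type", "span"])        # covers text directly
Relation = namedtuple("Relation", ["id", "type", "source", "target"])
Schema = namedtuple("Schema", ["id", "type", "members"])  # a set of ids

# made-up data
units = [
    Unit("u1", "Offer", (0, 10)),
    Unit("u2", "Accept", (11, 20)),
    Unit("u3", "Other", (21, 25)),
]
relations = [Relation("r1", "Question answer pair", "u1", "u2")]
schemas = [Schema("s1", "CDU", frozenset(["u2", "u3"]))]

by_id = {u.id: u for u in units}


def relation_endpoints(rel):
    "a relation points from one annotation to another"
    return by_id[rel.source], by_id[rel.target]


def schema_span(schema):
    "a schema has no span of its own; derive one from its members"
    spans = [by_id[m].span for m in schema.members]
    return (min(s[0] for s in spans), max(s[1] for s in spans))


source, target = relation_endpoints(relations[0])
print(source.type, "->", target.type)
print(schema_span(schemas[0]))
```

Only units carry offsets directly; relations and schemas reach the text indirectly through the annotations they point to.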

To start things off, we’ll focus on one type of unit-level annotation, the Elementary Discourse Unit (EDU).

def preview_unit(doc, anno):
    "the default str(anno) can be a bit overwhelming"
    preview = "{span: <11} {id: <20} [{type: <12}] {text}"
    text = doc.text(anno.text_span())
    return preview.format(id=anno.local_id(),
                          type=anno.type,
                          span=anno.text_span(),
                          text=text_snippet(text))

print("Example units")
print("-------------")
seen = set()
for anno in ex_doc.units:
    if anno.type not in seen:
        seen.add(anno.type)
        print(preview_unit(ex_doc, anno))

print()
print("First few EDUs")
print("--------------")
for anno in list(filter(educe.stac.is_edu, ex_doc.units))[:4]:
    print(preview_unit(ex_doc, anno))
Example units
-------------
(1,34)      stac_1368693094      [paragraph   ] 151 : amycharl : got wood anyone?
(52,66)     stac_1368693099      [Accept      ] yep, for what?
(117,123)   stac_1368693105      [Refusal     ] no way
(189,191)   stac_1368693114      [Other       ] :)
(209,210)   stac_1368693117      [Counteroffer] ?
(659,668)   stac_1368693162      [Offer       ] how much?
(22,26)     asoubeille_1374939590843 [Resource    ] wood
(35,66)     stac_1368693098      [Turn        ] 152 : sabercat : yep, for what?
(0,266)     stac_1368693124      [Dialogue    ]  151 : amycharl : go...cat : yep, thank you

First few EDUs
--------------
(52,66)     stac_1368693099      [Accept      ] yep, for what?
(117,123)   stac_1368693105      [Refusal     ] no way
(163,171)   stac_1368693111      [Accept      ] could be
(189,191)   stac_1368693114      [Other       ] :)

TODO

Everything below this point should be considered to be in a scratch/broken state. It needs to be ported over from its RST/DT considerations to STAC.

To do:

  • standing off (ac/aa) - shared aa
  • layers (units/discourse)
  • working with relations and schemas
  • grabbing resources etc (example of working with unit level annotation)
  • synchronising layers (grabbing the dialogue act and relations at the same time)
  • external annotations (postags, parse trees)
  • working with hypergraphs (implementing _repr_png_() would be pretty sweet)

Tree searching

The same span enclosure logic can be used to search parse trees for particular constituents, such as verb phrases. Alternatively, you can use the topdown method provided by educe trees. This returns just the largest constituent for which some predicate is true. It optionally accepts an additional argument to cut off the search when it is clearly out of bounds.
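The search strategy described above can be sketched as a small recursive function. This is a standalone illustration under stated assumptions, not educe's implementation: the Tree class, the free function topdown, its prunable parameter and the toy tree are all hypothetical, and the sketch returns a list because several maximal constituents may match.

```python
class Tree(object):
    "Hypothetical minimal constituency tree; NOT educe's tree class"

    def __init__(self, label, span, children=()):
        self.label = label
        self.span = span  # (start, end) character offsets
        self.children = list(children)


def topdown(tree, pred, prunable=None):
    """Return the largest subtrees for which `pred` is true.

    If `prunable` is given and true of a subtree, that whole
    branch is skipped without being searched.
    """
    if pred(tree):
        return [tree]  # stop here: anything below would be smaller
    if prunable is not None and prunable(tree):
        return []
    found = []
    for child in tree.children:
        found.extend(topdown(child, pred, prunable))
    return found


# made-up tree: a VP that itself contains a smaller VP
tree = Tree("S", (0, 20), [
    Tree("NP", (0, 3)),
    Tree("VP", (4, 20), [
        Tree("V", (4, 9)),
        Tree("NP", (10, 14)),
        Tree("VP", (15, 20)),
    ]),
])

# only the outer VP is returned, not the one nested inside it
print([t.span for t in topdown(tree, lambda t: t.label == "VP")])

# look for NPs inside a target span, skipping branches outside it
target = (10, 14)


def np_in_target(t):
    return (t.label == "NP" and
            target[0] <= t.span[0] and t.span[1] <= target[1])


def out_of_bounds(t):
    return t.span[1] <= target[0] or target[1] <= t.span[0]


print([t.span for t in topdown(tree, np_in_target, out_of_bounds)])
```

Checking the predicate before descending is what guarantees that only the largest matches come back; the pruning test merely saves work on branches that cannot contain a match.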

Conclusion

In this tutorial, we’ve explored a few basic educe concepts, which we hope will enable you to extract some data from your discourse corpora, namely:

  • reading corpus data (and pre-filtering)
  • standoff annotations
  • searching by span enclosure and overlap
  • working with trees
  • combining annotations from different sources

The concepts above should transfer to whatever discourse corpus you are working with (that educe supports, or that you are prepared to supply a reader for).

Work in progress

This tutorial is very much a work in progress (last update: 2014-09-19). Educe is a bit of a moving target, so let me know if you run into any trouble!

See also

stac-util

Some of the things you may want to do with the STAC corpus may already exist in the stac-util command line tool. stac-util is meant to be a sort of Swiss Army Knife, providing tools for editing the corpus. The query tools are more likely to be of interest:

  • text: display text and edu/dialogue segmentation in a friendly way
  • graph: draw discourse graphs with graphviz (arrows for relations, boxes for CDUs, etc)
  • filter-graph: visualise instances of relations (eg. Question answer pair)
  • count: generate statistics about the corpus

See stac-util --help for more details.

External tool support

Educe has some support for reading data from outside the discourse corpus proper. For example, if you run the Stanford CoreNLP parser on the raw text, you can read its output back into educe-style ConstituencyTree and DependencyTree annotations. See educe.external for details.

If you have a part-of-speech tagger that you would like to use, the educe.external.postag module may be useful for representing the annotations that come out of it.

You can also add support for your own tools by creating annotations that extend Standoff, directly or otherwise.