educe.pdtb package¶
Conventions specific to the Penn Discourse Treebank (PDTB) project
Submodules¶
educe.pdtb.corpus module¶
PDTB Corpus management (re-exported by educe.pdtb)
-
class
educe.pdtb.corpus.Reader(corpusdir)¶ Bases:
educe.corpus.Reader
See educe.corpus.Reader for details.
-
files(doc_glob=None)¶ Parameters: doc_glob (str, optional) – Glob expression for document (folder) names; if None, the wildcard ‘*/*’ is used for folder names and file basenames.
-
slurp_subcorpus(cfiles, verbose=False)¶ See educe.rst_dt.parse for a description of RSTTree
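A minimal sketch of how the two methods above might be combined, assuming that the value returned by files is what slurp_subcorpus expects as cfiles (the parameter name suggests this, but it is not stated here). The names load_pdtb and reader_cls are hypothetical and not part of educe; reader_cls exists only so that the sketch can be tried with a stand-in reader when no PDTB installation is at hand.

```python
def load_pdtb(corpus_dir, doc_glob=None, reader_cls=None):
    """Load the PDTB documents under corpus_dir that match doc_glob.

    load_pdtb and reader_cls are hypothetical names used for illustration.
    """
    if reader_cls is None:
        # Deferred import: educe is only needed when no stand-in is supplied.
        from educe.pdtb.corpus import Reader as reader_cls
    reader = reader_cls(corpus_dir)
    # doc_glob=None leaves the choice of wildcard to the reader.
    cfiles = reader.files(doc_glob=doc_glob)
    # Assumption: the result of files() can be passed on unchanged.
    return reader.slurp_subcorpus(cfiles, verbose=False)
```

With educe installed, a call such as load_pdtb(corpus_dir, doc_glob='00/*') would presumably restrict loading to a single folder, provided the glob is interpreted relative to the corpus directory.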
-
-
educe.pdtb.corpus.id_to_path(k)¶ Given a fleshed out FileId (none of the fields are None), return a filepath for it following Penn Discourse Treebank conventions.
You will likely want to add your own filename extension to this path.
-
educe.pdtb.corpus.mk_key(doc)¶ Return a corpus key for a given document name.
educe.pdtb.parse module¶
Standalone parser for PDTB files.
The function parse takes a single .pdtb file and returns a list of Relation, with the following subtypes:
| Relation | selection | features | sup? |
|---|---|---|---|
| ExplicitRelation | Selection | attr, 1 connhead | Y |
| ImplicitRelation | InferenceSite | attr, 2 conn | Y |
| AltLexRelation | Selection | attr, 2 semclass | Y |
| EntityRelation | InferenceSite | none | N |
| NoRelation | InferenceSite | none | N |
These relation subtypes are stitched together (and inherit members) from two or three components:
- arguments: always arg1 and arg2; in some cases (the sup? column above), the arguments can carry supplementary information
- selection: either a Selection or an InferenceSite
- features: present for some subtypes only (see e.g. ExplicitRelationFeatures)
The simplest way to get to grips with this may be to try the parse function on some sample relations and print the resulting objects.
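The stitching described above can be pictured with plain Python multiple inheritance. The classes below are simplified stand-ins that reuse the names and constructor signatures listed in this module; they are not educe's implementation. In particular, the PdtbItem base class is left out, and treating args as an (arg1, arg2) pair is an assumption.

```python
class Selection:
    """Stand-in for the selection component (span, gorn, text)."""
    def __init__(self, span, gorn, text):
        self.span = span
        self.gorn = gorn
        self.text = text


class ExplicitRelationFeatures:
    """Stand-in for the features component of an explicit relation."""
    def __init__(self, attribution, connhead):
        self.attribution = attribution
        self.connhead = connhead


class Relation:
    """Stand-in for the arguments component."""
    def __init__(self, args):
        # Assumption: args is an (arg1, arg2) pair.
        self.arg1, self.arg2 = args


class ExplicitRelation(Selection, ExplicitRelationFeatures, Relation):
    """A relation that inherits the members of all three components."""
    def __init__(self, selection, features, args):
        # Copy the members of each component onto the relation itself.
        Selection.__init__(self, selection.span, selection.gorn, selection.text)
        ExplicitRelationFeatures.__init__(
            self, features.attribution, features.connhead)
        Relation.__init__(self, args)


rel = ExplicitRelation(
    Selection(span=[(9, 12)], gorn=[(0, 1)], text="but"),
    ExplicitRelationFeatures(attribution=None, connhead="but"),
    ("first argument", "second argument"),
)
# Members of every component are reachable directly on the relation.
print(rel.text, rel.connhead, rel.arg1)
```

The practical consequence of this design is that a relation can be passed wherever its selection or features are expected, at the cost of a flat namespace shared by all components.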
-
class
educe.pdtb.parse.AltLexRelation(selection, features, args)¶ Bases:
educe.pdtb.parse.Selection,educe.pdtb.parse.AltLexRelationFeatures,educe.pdtb.parse.Relation
-
class
educe.pdtb.parse.AltLexRelationFeatures(attribution, semclass1, semclass2)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.Arg(selection, attribution=None, sup=None)¶ Bases:
educe.pdtb.parse.Selection
-
class
educe.pdtb.parse.Attribution(source, type, polarity, determinacy, selection=None)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.Connective(text, semclass1, semclass2=None)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.EntityRelation(infsite, args)¶ Bases:
educe.pdtb.parse.InferenceSite,educe.pdtb.parse.Relation
-
class
educe.pdtb.parse.ExplicitRelation(selection, features, args)¶ Bases:
educe.pdtb.parse.Selection,educe.pdtb.parse.ExplicitRelationFeatures,educe.pdtb.parse.Relation
-
class
educe.pdtb.parse.ExplicitRelationFeatures(attribution, connhead)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.GornAddress(parts)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.ImplicitRelation(infsite, features, args)¶ Bases:
educe.pdtb.parse.InferenceSite,educe.pdtb.parse.ImplicitRelationFeatures,educe.pdtb.parse.Relation
-
class
educe.pdtb.parse.ImplicitRelationFeatures(attribution, connective1, connective2=None)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.InferenceSite(strpos, sentnum)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.NoRelation(infsite, args)¶ Bases:
educe.pdtb.parse.InferenceSite,educe.pdtb.parse.Relation
-
class
educe.pdtb.parse.PdtbItem¶ Bases:
object
-
class
educe.pdtb.parse.Relation(args)¶ Bases:
educe.pdtb.parse.PdtbItem
-
arg1¶ TODO – TODO
-
arg2¶ TODO – TODO
-
-
class
educe.pdtb.parse.Selection(span, gorn, text)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.SemClass(klass)¶ Bases:
educe.pdtb.parse.PdtbItem
-
class
educe.pdtb.parse.Sup(selection)¶ Bases:
educe.pdtb.parse.Selection
-
educe.pdtb.parse.parse(path)¶ Retrieve the list of relations found in a single .pdtb file.
Parameters: path (str) – Path to the .pdtb file (?)
Returns: relations – List of relations found.
Return type: list of Relation
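As a first look at a file, it can help to count how many relations of each subtype parse returns. The sketch below is only an illustration: relation_type_counts and its parse_fn parameter are hypothetical names, and parse_fn is there so that the function can be tried with a stand-in parser when educe or the PDTB data is unavailable.

```python
from collections import Counter


def relation_type_counts(path, parse_fn=None):
    """Count the relations in one .pdtb file by class name.

    relation_type_counts and parse_fn are hypothetical names.
    """
    if parse_fn is None:
        # Deferred import: educe is only needed when no stand-in is supplied.
        from educe.pdtb.parse import parse as parse_fn
    relations = parse_fn(path)
    # The class name tells us which subtype each relation is,
    # e.g. ExplicitRelation or NoRelation.
    return Counter(type(rel).__name__ for rel in relations)
```

Printing the individual relations afterwards, as suggested in the module introduction, is a natural next step.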
-
educe.pdtb.parse.parse_relation(s)¶ Parse a single relation or raise a ParseException.
-
educe.pdtb.parse.split_relations(s)¶
educe.pdtb.pdtbx module¶
PDTB in an ad hoc (educe-grown) XML format: unfortunately not a standard, but a small homegrown language using XML syntax. I’ll call it pdtbx. There is no reason it can’t be used outside of educe.
Informal DTD:
- SpanList is attribute spanList in PDTB string convention
- GornAddressList is attribute gornList in PDTB string convention
- SemClass is attribute semclass1 (and optional attribute semclass2) in PDTB string convention
- text in <text> elements with usual XML escaping conventions
- args in <arg> elements in order (arg1 before arg2)
- implicitRelations can have multiple connectives
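To make the informal DTD a little more concrete, here is a rough sketch that builds a single <arg> element with the standard library. It assumes the usual PDTB string conventions (spans written as start..end and joined with semicolons; Gorn addresses written as comma-separated integers and joined with semicolons). Placing both attributes directly on <arg> is a guess made for illustration, and the helper names are hypothetical; none of this is taken from educe's own writer.

```python
import xml.etree.ElementTree as ET


def span_list_str(spans):
    """Format (start, end) pairs in PDTB string convention."""
    return ";".join("{}..{}".format(start, end) for start, end in spans)


def gorn_list_str(addresses):
    """Format Gorn addresses (tuples of ints) in PDTB string convention."""
    return ";".join(",".join(str(part) for part in parts)
                    for parts in addresses)


def arg_element(spans, addresses, text):
    """Build a hypothetical <arg> element holding a <text> child."""
    arg = ET.Element("arg",
                     spanList=span_list_str(spans),
                     gornList=gorn_list_str(addresses))
    # ElementTree applies the usual XML escaping when serialising text.
    ET.SubElement(arg, "text").text = text
    return arg


element = arg_element([(9, 35), (40, 52)], [(0, 1), (0, 2, 1)],
                      "R&D costs rose")
print(ET.tostring(element, encoding="unicode"))
```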
-
educe.pdtb.pdtbx.Relation_xml(itm)¶
-
educe.pdtb.pdtbx.Relations_xml(itms)¶
-
educe.pdtb.pdtbx.read_Relation(node)¶
-
educe.pdtb.pdtbx.read_Relations(node)¶
-
educe.pdtb.pdtbx.read_pdtbx_file(filename)¶
-
educe.pdtb.pdtbx.write_pdtbx_file(filename, relations)¶
educe.pdtb.ptb module¶
Alignment with the Penn Treebank
-
educe.pdtb.ptb.parse_trees(corpus, k, ptb)¶ Given a PDTB document and an NLTK PTB reader, return the PTB trees.
Note that a future version of this function will try to educify the trees as well, but for now things will be fairly rudimentary.
-
educe.pdtb.ptb.reader(corpus_dir)¶ An instantiated NLTK BracketedParseCorpusReader for the PTB section relevant to the PDTB corpus.
Note that the path you give to this will probably end with something like parsed/mrg/wsj
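The two functions above are meant to be used together: reader builds the NLTK corpus reader that parse_trees takes as its ptb argument. A minimal sketch, assuming corpus is a mapping from corpus keys to PDTB documents (for instance as loaded by educe.pdtb.corpus.Reader) and k is one of its keys. The names ptb_trees_for and ptb_module are hypothetical; ptb_module only allows a stand-in to be supplied where educe and the treebank are not installed.

```python
def ptb_trees_for(corpus, key, ptb_dir, ptb_module=None):
    """Return the PTB trees aligned with the PDTB document at key.

    ptb_trees_for and ptb_module are hypothetical names.
    """
    if ptb_module is None:
        # Deferred import: educe is only needed when no stand-in is supplied.
        from educe.pdtb import ptb as ptb_module
    # ptb_dir is expected to end with something like parsed/mrg/wsj.
    ptb_reader = ptb_module.reader(ptb_dir)
    return ptb_module.parse_trees(corpus, key, ptb_reader)
```

When many documents are processed, building the reader once and reusing it across calls to parse_trees would avoid repeated setup.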