educe.learning package
Submodules
educe.learning.csv module
educe.learning.edu_input_format module
This module implements a dumper for the EDU input format.
See https://github.com/irit-melodi/attelo/blob/master/doc/input.rst
-
educe.learning.edu_input_format.dump_all(X_gen, y_gen, f, class_mapping, docs, instance_generator)
Dump a whole dataset: features (in svmlight format) and EDU pairs.
Parameters:
- X_gen (iterable of int arrays) – Feature vectors.
- y_gen (iterable of int) – Ground truth labels.
- f (str) – Output features file path.
- class_mapping (dict(str, int)) – Mapping from label to int.
- docs (list of DocumentPlus) – Documents.
- instance_generator (function from doc to iterable of pairs) – TODO
-
educe.learning.edu_input_format.dump_edu_input_file(docs, f)
Dump a dataset in the EDU input format.
Each document must have:
- edus: sequence of EDU objects
- grouping: string (some sort of document id)
- edu2sent: int -> int or string or None (EDU number to sentence number)
The EDUs must provide:
- identifier(): string
- text(): string
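The requirements above can be illustrated with minimal stand-in objects. This is only a sketch: `FakeEdu` and `FakeDoc` are hypothetical names that are not part of educe, and representing `edu2sent` as a dict is an assumption (the description only says it maps an EDU number to a sentence number).

```python
class FakeEdu:
    """Hypothetical stand-in for an EDU object (not an educe class)."""

    def __init__(self, ident, text):
        self._ident = ident
        self._text = text

    def identifier(self):
        # The dumper is documented to expect a string identifier.
        return self._ident

    def text(self):
        # The dumper is documented to expect the EDU text as a string.
        return self._text


class FakeDoc:
    """Hypothetical stand-in for a document (not an educe class)."""

    def __init__(self, grouping, edus, edu2sent):
        self.grouping = grouping  # some sort of document id
        self.edus = edus          # sequence of EDU objects
        self.edu2sent = edu2sent  # assumed here: dict from EDU number to sentence number


doc = FakeDoc(
    grouping="doc_01",
    edus=[FakeEdu("doc_01_e1", "Hello ,"), FakeEdu("doc_01_e2", "world .")],
    edu2sent={0: 0, 1: 0},
)
```

Any object exposing the same attributes and methods should satisfy the documented interface; the concrete classes used in educe may carry more information.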
-
educe.learning.edu_input_format.dump_pairings_file(epairs, f)
Dump the EDU pairings.
-
educe.learning.edu_input_format.labels_comment(class_mapping)
Return a string listing class labels in the format that attelo expects.
-
educe.learning.edu_input_format.load_labels(f)
Read the label set into a dictionary mapping labels to indices.
educe.learning.keygroup_vectorizer module
This module provides ways to transform lists of PairKeys to sparse vectors.
-
class educe.learning.keygroup_vectorizer.KeyGroupVectorizer
Bases: object
Transforms lists of KeyGroups to sparse vectors.
-
vocabulary_
dict(str, int) – Vocabulary mapping.
-
fit_transform(vectors)
Learn the vocabulary dictionary and return instances.
-
transform(vectors)
Transform documents to EDU pair feature matrix.
Extract features out of documents using the vocabulary fitted with fit.
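The general idea behind a fitted vocabulary can be sketched as follows. This is not the educe implementation: `TinyVectorizer` is a hypothetical class, and the input shape (each instance as a list of `(feature_name, value)` pairs) is an assumption made for the example.

```python
class TinyVectorizer:
    """Hypothetical illustration of vocabulary fitting (not educe code)."""

    def __init__(self):
        self.vocabulary_ = {}  # feature name -> column index

    def fit_transform(self, instances):
        """Learn the vocabulary and return sparse rows of (index, value)."""
        rows = []
        for feats in instances:
            row = []
            for name, value in feats:
                # Assign the next free index the first time a name is seen.
                idx = self.vocabulary_.setdefault(name, len(self.vocabulary_))
                row.append((idx, value))
            rows.append(row)
        return rows

    def transform(self, instances):
        """Map instances to sparse rows, dropping unseen feature names."""
        return [
            [(self.vocabulary_[name], value)
             for name, value in feats
             if name in self.vocabulary_]
            for feats in instances
        ]


vec = TinyVectorizer()
train_rows = vec.fit_transform([[("pos=NN", 1), ("len", 2.0)], [("len", 3.0)]])
test_rows = vec.transform([[("unseen", 1), ("len", 5)]])
```

In this sketch, feature names that were not seen during fitting are silently ignored at transform time, which is a common convention but an assumption here.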
-
educe.learning.keys module
Feature extraction keys.
A key is essentially a feature name, its type, and some help text.
We also provide a notion of groups that allow us to organise keys into sections.
-
class educe.learning.keys.Key(substance, name, description)
Bases: object
Feature name plus a bit of metadata.
-
classmethod basket(name, description)
A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter is a good fit for collecting these).
-
classmethod continuous(name, description)
A key for fields that take a range of values (e.g. numbers).
-
classmethod discrete(name, description)
A key for fields that have a finite set of possible values.
-
substance = None
See Substance.
-
class educe.learning.keys.KeyGroup(description, keys)
Bases: dict
A set of related features.
Note that a KeyGroup can be used as a dictionary, but instead of using Keys as values, you use the key names.
-
DEBUG = True
-
NAME_WIDTH = 35
-
one_hot_values_gen(suffix='')
Get a one-hot encoded version of this KeyGroup as a generator.
suffix is added to the feature name.
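To make the three kinds of key concrete, here is a rough sketch of how discrete, continuous and basket values could be one-hot encoded. `one_hot` is a hypothetical helper, not the educe method, and both the `name=value` naming scheme and the exact placement of the suffix are assumptions.

```python
from collections import Counter


def one_hot(name, kind, value, suffix=""):
    """Yield (feature_name, number) pairs for a single feature.

    Hypothetical helper; kind is one of "discrete", "continuous", "basket".
    """
    if value is None:
        return  # nothing to emit for a missing value
    if kind == "discrete":
        # One indicator feature per (name, value) combination.
        yield ("{}{}={}".format(name, suffix, value), 1)
    elif kind == "continuous":
        # The number itself is the feature value.
        yield (name + suffix, value)
    elif kind == "basket":
        # One feature per item in the multiset, weighted by its count.
        for item, count in sorted(value.items()):
            yield ("{}{}={}".format(name, suffix, item), count)
    else:
        raise ValueError("unknown kind: {}".format(kind))


words = Counter(["a", "b", "a"])
encoded = list(one_hot("word", "basket", words))
```

A `collections.Counter` is used for the basket because it is already a dictionary from string to int, as the description of baskets recommends.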
-
class educe.learning.keys.MagicKey(substance, function)
Bases: educe.learning.keys.Key
Somewhat fancier variant of Key that is built from a function. The goal of the magic key is to reduce the amount of boilerplate needed to define keys.
-
classmethod basket_fn(function)
A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter is a good fit for collecting these).
-
classmethod continuous_fn(function)
A key for fields that take a range of values (e.g. numbers).
-
classmethod discrete_fn(function)
A key for fields that have a finite set of possible values.
-
class educe.learning.keys.MergedKeyGroup(description, groups)
Bases: educe.learning.keys.KeyGroup
A key group that is formed by fusing several key groups into one.
Note that for now all the keys in a merged group are lumped into the same object.
The help text tries to preserve the internal breakdown into the subgroups, however. It comes with a “level 1” section header, e.g.
=======================================================
big block of features
=======================================================
-
class educe.learning.keys.Substance
Bases: object
The kind of the variable represented by this key.
- continuous
- discrete
- string (for meta vars; you probably want discrete instead)
If we ever reach a point where we’re happy to switch to Python 3 wholesale, we should subclass Enum
-
BASKET = 4
-
CONTINUOUS = 1
-
DISCRETE = 2
-
STRING = 3
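The note about subclassing Enum could look roughly like the sketch below. `SubstanceEnum` is a hypothetical name chosen so that it is not mistaken for the existing class; only the four constant values are taken from the listing above, and the choice of `IntEnum` over `Enum` is an assumption made so that comparisons with plain integers keep working.

```python
from enum import IntEnum


class SubstanceEnum(IntEnum):
    """Hypothetical Enum-based counterpart of Substance (not educe code)."""

    CONTINUOUS = 1
    DISCRETE = 2
    STRING = 3
    BASKET = 4


# Members compare equal to their integer values and can be looked up by value.
kind = SubstanceEnum(2)
```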
educe.learning.svmlight_format module
This module implements a dumper for the svmlight format.
See sklearn.datasets.svmlight_format.
-
educe.learning.svmlight_format.dump_svmlight_file(X_gen, y_gen, f, zero_based=True, comment=None, query_id=None)
Dump the dataset in svmlight file format.
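For readers unfamiliar with the svmlight format, each line holds a label followed by sparse `index:value` pairs. The sketch below formats a single line using only the standard library; `svmlight_line` is a hypothetical helper, not the educe or scikit-learn function, and representing a row as a list of `(index, value)` pairs with zero-based indices is an assumption.

```python
def svmlight_line(label, row, zero_based=True):
    """Format one instance as an svmlight line (hypothetical helper).

    row is a list of (index, value) pairs whose indices start at 0.
    With zero_based=False the written indices are shifted to start at 1.
    """
    offset = 0 if zero_based else 1
    feats = " ".join(
        "{}:{}".format(idx + offset, val) for idx, val in sorted(row)
    )
    return "{} {}".format(label, feats) if feats else str(label)


line = svmlight_line(1, [(3, 1), (0, 0.5)])
```

Indices are sorted before writing because svmlight readers generally expect them in increasing order.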
educe.learning.util module
Common helper functions for feature extraction.
-
educe.learning.util.space_join(str1, str2)
Join two strings with a space.
-
educe.learning.util.tuple_feature(combine)
(a -> a -> b) -> ((current, cache, edu) -> a) -> ((current, cache, edu, edu) -> b)
Combine the result of a single-EDU feature function to make a pair feature.
-
educe.learning.util.underscore(str1, str2)
Join two strings with an underscore.
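The type signature of tuple_feature describes a decorator factory: it takes a combining function and lifts a single-EDU feature function to a function over EDU pairs. The sketch below is written from that signature alone, so the actual educe source may differ; the `_sketch` names and `first_word` are hypothetical.

```python
def underscore_sketch(str1, str2):
    """Join two strings with an underscore (re-implementation sketch)."""
    return "{}_{}".format(str1, str2)


def tuple_feature_sketch(combine):
    """Lift a single-EDU feature function to an EDU-pair feature function."""
    def decorator(single_fn):
        def pair_fn(current, cache, edu1, edu2):
            # Apply the single-EDU function to each EDU, then combine.
            return combine(
                single_fn(current, cache, edu1),
                single_fn(current, cache, edu2),
            )
        return pair_fn
    return decorator


@tuple_feature_sketch(underscore_sketch)
def first_word(current, cache, edu):
    # Hypothetical single-EDU feature; here an EDU is just a string.
    return edu.split()[0]


pair_value = first_word(None, None, "the cat", "a dog")
```

After decoration, `first_word` takes two EDUs instead of one, which matches the `(current, cache, edu, edu) -> b` part of the signature.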
educe.learning.vocabulary_format module
This module implements a loader and dumper for vocabularies.
-
educe.learning.vocabulary_format.dump_vocabulary(vocabulary, f)
Dump the vocabulary as a tab-separated file.
-
educe.learning.vocabulary_format.load_vocabulary(f)
Read a vocabulary file into a dictionary mapping feature names to indices.
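A tab-separated vocabulary round trip can be sketched as follows. `dump_vocab` and `load_vocab` are hypothetical helpers operating on open text streams; the column order (feature name, then index) and the absence of any index offset are assumptions, so files written by educe itself may not match this layout exactly.

```python
import io


def dump_vocab(vocabulary, stream):
    """Write one 'name<TAB>index' line per feature, ordered by index."""
    for name, idx in sorted(vocabulary.items(), key=lambda kv: kv[1]):
        stream.write("{}\t{}\n".format(name, idx))


def load_vocab(stream):
    """Read 'name<TAB>index' lines back into a dict of name -> index."""
    vocab = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so names containing tabs still parse.
        name, idx = line.rsplit("\t", 1)
        vocab[name] = int(idx)
    return vocab


buf = io.StringIO()
dump_vocab({"len": 1, "pos=NN": 0}, buf)
restored = load_vocab(io.StringIO(buf.getvalue()))
```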