Substitution mining

Mine substitutions with various mining models.

This module defines several classes and mixins to mine substitutions in the MemeTracker dataset with a series of different models.

Time, Source, Past and Durl together define how a substitution Model behaves. Interval is a utility class used internally in Model. The ClusterMinerMixin mixin builds on this definition of a substitution model to provide ClusterMinerMixin.substitutions() which iterates over all valid substitutions in a Cluster. Finally, mine_substitutions_with_model() brings ClusterMinerMixin and SubstitutionValidatorMixin (which checks for spam substitutions) together to mine for all substitutions in the dataset for a given Model.

class brainscopypaste.mine.ClusterMinerMixin[source]

Bases: object

Mixin for Clusters that provides substitution mining functionality.

This mixin defines the substitutions() method (based on the private _substitutions() method) that iterates through all valid substitutions for a given Model.

classmethod _substitutions(source, durl, model)[source]

Iterate through all substitutions from source to durl considered valid by model.

This method yields all the substitutions between source and durl when model allows for multiple substitutions.

Parameters:

source : Quote

Source for the substitutions.

durl : Url

Destination url for the substitutions.

model : Model

Model that validates the substitutions between source and durl.

substitutions(model)[source]

Iterate through all substitutions in this cluster considered valid by model.

Multiple occurrences of a sentence at the same url (url “frequency”) are ignored, so as not to artificially inflate results.

Parameters:

model : Model

Model for which to mine substitutions in this cluster.

Yields:

substitution : Substitution

All the substitutions in this cluster considered valid by model. When model allows for multiple substitutions between a quote and a destination url, each substitution is yielded individually. Any substitution yielded is attached to this cluster, so if you use this in a session_scope() substitutions will be saved automatically unless you explicitly rollback the session.

class brainscopypaste.mine.Durl[source]

Bases: enum.Enum

Type of quotes accepted as substitution destinations.

_member_type_

alias of object

all = <Durl.all: 1>

All quotes are potential destinations for substitutions.

exclude_past = <Durl.exclude_past: 2>

Excluded past rule: only quotes that do not appear in what Time and Past define as “the past” can be the destination of a substitution.
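
As an illustration only, the two destination rules can be sketched as a membership test. This is not the package's implementation: the function name `destination_allowed`, its `exclude_past` flag, and the plain strings standing in for quotes are all assumptions made for this example.

```python
def destination_allowed(destination, past_quotes, exclude_past):
    """Return True if `destination` may be a substitution destination.

    Hypothetical helper: `past_quotes` stands in for the quotes that occur
    in what the model considers to be the past.
    """
    if not exclude_past:
        # Counterpart of Durl.all: every quote is a potential destination.
        return True
    # Counterpart of Durl.exclude_past: reject quotes already seen in the past.
    return destination not in set(past_quotes)
```

With `exclude_past=True`, a destination that already appears in `past_quotes` is rejected; with `exclude_past=False`, it is always accepted.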

class brainscopypaste.mine.Interval(start, end)[source]

Bases: object

Time interval defined by start and end datetimes.

Parameters:

start : datetime.datetime

The interval’s start (or left) bound.

end : datetime.datetime

The interval’s end (or right) bound.

Raises:

Exception

If start is strictly after end in time.

Examples

Test if a datetime is in an interval:

>>> from datetime import datetime
>>> itv = Interval(datetime(2016, 7, 5, 12, 15, 5),
...                datetime(2016, 7, 9, 13, 30, 0))
>>> datetime(2016, 7, 8) in itv
True
>>> datetime(2016, 8, 1) in itv
False

_Interval__key()

Unique identifier for this interval, used to compute e.g. equality between two Interval instances.
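
A key-based approach to equality might look like the following sketch. It is not the package's code: the class name `SketchInterval`, the `_key()` helper, the use of `ValueError` (the documentation above only says `Exception`), and the half-open treatment of the bounds are assumptions made for this example.

```python
from datetime import datetime


class SketchInterval:
    """Hypothetical interval whose equality and hash derive from one key."""

    def __init__(self, start, end):
        if start > end:
            # Assumption: ValueError; the documented class raises Exception.
            raise ValueError("start must not be after end")
        self.start = start
        self.end = end

    def _key(self):
        # The tuple of bounds identifies the interval uniquely.
        return (self.start, self.end)

    def __eq__(self, other):
        if not isinstance(other, SketchInterval):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hashing the same key keeps equal intervals interchangeable
        # in sets and as dictionary keys.
        return hash(self._key())

    def __contains__(self, moment):
        # Assumption: start is included, end is excluded.
        return self.start <= moment < self.end
```

Deriving both `__eq__` and `__hash__` from a single key keeps the two consistent, which matters when instances are used as cache keys.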

class brainscopypaste.mine.Model(time, source, past, durl, max_distance)[source]

Bases: object

Substitution mining model.

A mining model is defined by one value for each of Time, Source, Past and Durl, plus a maximum Hamming distance between the source string (or substring) and the destination string. This class represents such a model. It defines two utility methods used in ClusterMinerMixin (find_start() and past_surls()), and a validate() method which determines whether a given substitution conforms to the model. Other methods, prefixed with an underscore, are helpers for the methods cited above.

Parameters:

time : Time

Type of time defining how occurrence bins of the model are positioned.

source : Source

Type of quotes that the model accepts as substitution sources.

past : Past

How far back the model looks for substitution sources.

durl : Durl

Type of quotes that the model accepts as substitution destinations.

max_distance : int

Maximum number of substitutions between a source string (or substring) and a destination string that the model will detect.

Raises:

Exception

If max_distance is more than half of MT_FILTER_MIN_TOKENS.

_Model__key()

Unique identifier for this model, used to compute e.g. equality between two Model instances.

_distance_start(source, durl)[source]

Get a (distance, start) tuple indicating the minimal distance between source and durl, and the position of source's substring that achieves that minimum.

This is in fact an alias for what the model considers to be valid transformations and how to define them, but provides proper encapsulation of concerns.
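
The idea of a minimal distance over substrings can be sketched as a sliding-window Hamming distance. This is only an illustration under stated assumptions: the function names `hamming` and `distance_start`, the use of plain token lists, and the tie-breaking rule (earliest position wins) are not taken from the package.

```python
def hamming(tokens1, tokens2):
    """Count the positions at which two equal-length token lists differ."""
    if len(tokens1) != len(tokens2):
        raise ValueError("token lists must have the same length")
    return sum(t1 != t2 for t1, t2 in zip(tokens1, tokens2))


def distance_start(source_tokens, destination_tokens):
    """Return (distance, start) for the best-matching window of the source.

    Hypothetical helper: slides a window the size of the destination over
    the source and keeps the window with the smallest Hamming distance.
    """
    width = len(destination_tokens)
    if width > len(source_tokens):
        raise ValueError("destination is longer than source")
    candidates = (
        (hamming(source_tokens[start:start + width], destination_tokens), start)
        for start in range(len(source_tokens) - width + 1)
    )
    # Tuples compare by distance first, then by start, so ties go to the
    # earliest window (an assumption of this sketch).
    return min(candidates)
```

For example, comparing the source tokens of "the quick brown fox jumps" with the destination tokens of "quick red fox" gives a distance of 1 at start position 1.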

_ok(*args, **kwargs)[source]

Dummy method used when a validation should always pass.

_past(cluster, durl)[source]

Get an Interval representing what this model considers to be the past before durl.

See Time and Past to understand what this interval looks like. This method is memoized() for performance.

_validate_base(source, durl)[source]

Check that source has at least one occurrence in what this model considers to be the past before durl.

_validate_distance(source, durl)[source]

Check that source and durl differ by no more than self.max_distance.

_validate_durl(source, durl)[source]

Check that durl is an acceptable substitution destination occurrence for this model.

This method proxies to the proper validation method, depending on the value of self.durl.

_validate_durl_exclude_past(source, durl)[source]

Check that durl verifies the excluded past rule.

_validate_source(source, durl)[source]

Check that source is an acceptable substitution source for this model.

This method proxies to the proper validation method, depending on the value of self.source.

_validate_source_majority(source, durl)[source]

Check that source verifies the majority rule.

bin_span = datetime.timedelta(1)

Span of occurrence bins the model makes.

drop_caches()[source]

Drop the caches of all memoized() methods of the class.
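
The interplay between memoized methods and a cache-dropping method can be sketched with the standard library. Note that this swaps in `functools.lru_cache` in place of the package's own memoized() decorator, whose internals are not described here; `SketchModel` and its `calls` counter are hypothetical names used only for this example.

```python
from functools import lru_cache


class SketchModel:
    """Hypothetical stand-in for a model with memoized methods."""

    def __init__(self):
        # Counts how often the body of validate() actually runs.
        self.calls = 0

    @lru_cache(maxsize=None)
    def validate(self, source, destination):
        # Placeholder for an expensive check; results are cached per
        # (self, source, destination) combination.
        self.calls += 1
        return source != destination

    @classmethod
    def drop_caches(cls):
        # Clear the cache shared by all instances of the class.
        cls.validate.cache_clear()
```

Calling `validate()` twice with the same arguments runs the body once; after `drop_caches()`, the next call runs it again.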

find_start(source, durl)[source]

Get the position of the substring of source that achieves minimal distance to durl.

past_surls(cluster, durl)[source]

Get the list of all Urls that are in what this model considers to be the past before durl.

This method is memoized() for performance.

validate(source, durl)[source]

Test if potential substitutions from source quote to durl destination url are valid for this model.

This method is memoized() for performance.

Parameters:

source : Quote

Candidate source quote for substitutions; the substitutions can be from a substring of source.string.

durl : Url

Candidate destination url for the substitutions.

Returns:

bool

True if the proposed source and destination url are considered valid by this model, False otherwise.

class brainscopypaste.mine.Past[source]

Bases: enum.Enum

How far back in the past a substitution can find its source.

_member_type_

alias of object

all = <Past.all: 1>

The past is everything: substitution sources can be in any bin preceding the destination occurrence. The resulting interval can end at the midnight before the destination occurrence when using Time.discrete.

last_bin = <Past.last_bin: 2>

The past is the last bin: substitution sources must be in the bin preceding the destination occurrence. That bin can end at the midnight before the destination occurrence when using Time.discrete.

class brainscopypaste.mine.Source[source]

Bases: enum.Enum

Type of quotes accepted as substitution sources.

_member_type_

alias of object

all = <Source.all: 1>

All quotes are potential sources for substitutions.

majority = <Source.majority: 2>

Majority rule: only quotes that are the most frequent in the considered past bin can be the source of substitutions (note that several quotes in a single bin can have the same maximal frequency).
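
The majority rule, including the case of ties, can be sketched as follows. This is an illustration rather than the package's implementation: `majority_quotes` is a hypothetical name, and plain strings stand in for the quote occurrences of one bin.

```python
from collections import Counter


def majority_quotes(bin_occurrences):
    """Return the set of quotes with the highest frequency in a bin.

    Hypothetical helper: `bin_occurrences` lists one entry per occurrence
    of a quote in the considered past bin.
    """
    counts = Counter(bin_occurrences)
    if not counts:
        # An empty bin has no majority quote.
        return set()
    top = max(counts.values())
    # Several quotes can share the maximal frequency; keep all of them.
    return {quote for quote, count in counts.items() if count == top}
```

With occurrences `["a", "a", "b", "b", "c"]`, both `"a"` and `"b"` qualify as sources, since they share the maximal frequency.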

class brainscopypaste.mine.SubstitutionValidatorMixin[source]

Bases: object

Mixin for Substitution that adds validation functionality.

A non-negligible part of the substitutions found by ClusterMinerMixin are spam or changes we’re not interested in: minor spelling changes, abbreviations, changes of articles, symptoms of a deleted word that appear as substitutions, etc. This class defines the validate() method, which tests for all these cases and returns whether or not the substitution is worth keeping.

validate()[source]

Check whether or not this substitution is worth keeping.

class brainscopypaste.mine.Time[source]

Bases: enum.Enum

Type of time that determines the positioning of occurrence bins.

_member_type_

alias of object

continuous = <Time.continuous: 1>

Continuous time: bins are sliding, end at the destination occurrence, and start Model.bin_span before that.

discrete = <Time.discrete: 2>

Discrete time: bins are aligned at midnight, end at or before the destination occurrence, and start Model.bin_span before that.
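
The difference between the two kinds of time can be sketched by computing the bounds of the bin that precedes a destination occurrence. The function `last_bin` and the constant `BIN_SPAN` are hypothetical names for this example; the one-day span mirrors the documented value of Model.bin_span, and the sketch covers only the last bin, not the whole past.

```python
from datetime import datetime, timedelta

# Assumption: mirrors the documented Model.bin_span of one day.
BIN_SPAN = timedelta(days=1)


def last_bin(destination_time, discrete):
    """Return (start, end) of the bin preceding `destination_time`."""
    if discrete:
        # Discrete time: the bin ends at the midnight at or before
        # the destination occurrence.
        end = destination_time.replace(hour=0, minute=0, second=0,
                                       microsecond=0)
    else:
        # Continuous time: the bin slides and ends at the occurrence itself.
        end = destination_time
    return end - BIN_SPAN, end
```

For a destination occurrence on 2016-07-09 at 13:30, the discrete bin runs from midnight on 2016-07-08 to midnight on 2016-07-09, while the continuous bin runs from 13:30 on 2016-07-08 to 13:30 on 2016-07-09.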

brainscopypaste.mine._get_wordnet_words()[source]

Get the set of all words known by WordNet.

This is the set of all lemma names for all synonym sets in WordNet.

brainscopypaste.mine.mine_substitutions_with_model(model, limit=None)[source]

Mine all substitutions in the MemeTracker dataset conforming to model.

Iterates through the whole MemeTracker dataset to find all substitutions that are considered valid by model, and saves the results to the database. The MemeTracker dataset must have been loaded and filtered previously, or an exception will be raised (see Usage or cli for more about that). Mined substitutions are saved each time the function moves to a new cluster, and progress is printed to stdout. The number of substitutions seen and the number of substitutions kept (i.e. validated by SubstitutionValidatorMixin.validate()) are also printed to stdout.

Parameters:

model : Model

The substitution model to use for mining.

limit : int, optional

If not None, mining stops after limit clusters have been examined. The default, None, means no limit.

Raises:

Exception

If no filtered clusters are found in the database, or if the database already contains substitutions mined with model.