Filtering

Filter clusters and quotes to clean the MemeTracker dataset.

This module defines the ClusterFilterMixin mixin which adds filtering capabilities to Cluster, and the filter_clusters() function which uses that mixin to filter the whole MemeTracker dataset. A few other utility functions are also defined.

exception brainscopypaste.filter.AlreadyFiltered

Bases: Exception

Exception raised when trying to filter a dataset that has already been filtered.

class brainscopypaste.filter.ClusterFilterMixin

Bases: object

Mixin for Clusters adding the filter() method used in filter_clusters().

filter()

Filter this Cluster and its child Quotes to see if they’re worth keeping.

First, iterate through all the child Quotes of the cluster, checking whether each one is worth keeping. A Quote is discarded if it has no urls, has fewer than MT_FILTER_MIN_TOKENS tokens, spans more than MT_FILTER_MAX_DAYS days, or is not in English. Any Quote that has none of those problems is kept.

If no Quotes are left after this filtering, or if the Cluster made of the remaining Quotes still spans more than MT_FILTER_MAX_DAYS days, the cluster and all its quotes are discarded and None is returned. Otherwise, a new Cluster is created with cluster.filtered = True and cluster.id = original_cluster.id + filter_cluster_offset(). That new cluster points to copies of all the kept Quotes, with quote.filtered = True and quote.id = original_quote.id + filter_quote_offset(). All those models (the new cluster and the new quotes) must be saved to the database afterwards; the method does not do it for you. One way is to run this method inside a session_scope().

Returns:

cluster : Cluster or None

The filtered cluster pointing to filtered quotes, or None if it is to be discarded.

Raises:

AlreadyFiltered

If this cluster is already filtered (i.e. filtered is True).
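The filtering rules described for filter() can be illustrated with a minimal, self-contained sketch. It is not the project's implementation: the Quote and Cluster dataclasses, the url_timestamps and language fields, the helper names, and the two threshold values are all assumptions made for illustration, and the language check is reduced to a stored field instead of real language detection.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timedelta
from typing import List

# Hypothetical values; the real thresholds live in the project's settings.
MT_FILTER_MIN_TOKENS = 5
MT_FILTER_MAX_DAYS = 80


class AlreadyFiltered(Exception):
    """Raised when filtering something that is already filtered."""


@dataclass
class Quote:
    # Hypothetical stand-in for the real Quote model.
    id: int
    tokens: List[str]
    url_timestamps: List[datetime]
    language: str = "en"  # stand-in for real language detection
    filtered: bool = False


@dataclass
class Cluster:
    # Hypothetical stand-in for the real Cluster model.
    id: int
    quotes: List[Quote] = field(default_factory=list)
    filtered: bool = False


def span_days(timestamps):
    """Number of days between the earliest and the latest timestamp."""
    return (max(timestamps) - min(timestamps)) / timedelta(days=1)


def keep_quote(quote):
    """Apply the four per-quote criteria: urls, length, span, language."""
    if not quote.url_timestamps:
        return False
    if len(quote.tokens) < MT_FILTER_MIN_TOKENS:
        return False
    if span_days(quote.url_timestamps) > MT_FILTER_MAX_DAYS:
        return False
    return quote.language == "en"


def filter_cluster(cluster, cluster_offset, quote_offset):
    """Return a filtered copy of `cluster`, or None if it is discarded."""
    if cluster.filtered:
        raise AlreadyFiltered("cluster {} is already filtered".format(cluster.id))
    kept = [q for q in cluster.quotes if keep_quote(q)]
    if not kept:
        return None
    # The cluster as a whole must also fit within the maximum span.
    all_timestamps = [t for q in kept for t in q.url_timestamps]
    if span_days(all_timestamps) > MT_FILTER_MAX_DAYS:
        return None
    # Copies get offset ids and are marked as filtered; originals are untouched.
    new_quotes = [replace(q, id=q.id + quote_offset, filtered=True) for q in kept]
    return Cluster(id=cluster.id + cluster_offset, quotes=new_quotes, filtered=True)
```

In this sketch nothing is persisted, which mirrors the note that saving the new models is left to the caller.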

brainscopypaste.filter._top_id(id)

Get the smallest power of ten three orders of magnitude greater than id.

Used to compute filter_cluster_offset() and filter_quote_offset().
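One plausible reading of this description is sketched below: take the order of magnitude of id and return the power of ten three orders above it. The function name top_id_sketch and the handling of non-positive input are assumptions; the actual _top_id() may differ in edge cases.

```python
def top_id_sketch(id):
    """Power of ten three orders of magnitude above the magnitude of `id`."""
    if id < 1:
        raise ValueError("id must be a positive integer")
    # Number of digits minus one equals floor(log10(id)), without floats.
    magnitude = len(str(id)) - 1
    return 10 ** (magnitude + 3)
```

Under this reading, an id of 1234 (magnitude 3) gives 10 ** 6, so adding the result to any existing id of similar size yields a value well clear of the original id range.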

brainscopypaste.filter.filter_cluster_offset()

Get the offset to add to filtered Cluster ids.

A filtered Cluster’s id will be its original Cluster’s id plus this offset. The function is memoized() since it is called so often.
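Why memoization helps here can be shown with a small sketch. It assumes the offset is derived from the largest existing Cluster id, which this page does not state explicitly, and it uses functools.lru_cache from the standard library as a stand-in for the project's own memoized() helper. The max_cluster_id() function and its return value are hypothetical placeholders for a database query.

```python
from functools import lru_cache

# Counts how often the (hypothetical) expensive lookup actually runs.
CALLS = {"count": 0}


def max_cluster_id():
    """Hypothetical stand-in for a database query; the value is made up."""
    CALLS["count"] += 1
    return 310544


@lru_cache(maxsize=None)
def cluster_offset_sketch():
    """Compute the offset once; later calls return the cached value."""
    largest = max_cluster_id()
    # Power of ten three orders of magnitude above the largest id (assumed rule).
    return 10 ** (len(str(largest)) - 1 + 3)
```

Calling cluster_offset_sketch() once per filtered cluster then costs a single lookup in total, since every call after the first is served from the cache.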

brainscopypaste.filter.filter_clusters(limit=None)

Filter the whole MemeTracker dataset by copying all valid Clusters and Quotes and setting their filtered attributes to True.

Iterate through all the MemeTracker Clusters, and filter each of them to see if it’s worth keeping. If a Cluster is to be kept, the function creates a copy of it and all of its kept Quotes, marking them as filtered. Progress of this operation is printed to stdout.

Once the operation finishes, a VACUUM and an ANALYZE operation are run on the database so that it can reclaim storage and refresh the statistics used by its query planner.

Parameters:

limit : int, optional

If not None, stop filtering after limit clusters have been seen (useful for testing purposes).

Raises:

AlreadyFiltered

If there are already some filtered Clusters or Quotes stored in the database (indicating another filtering operation has already been completed, or started and aborted).
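The control flow described for filter_clusters() can be sketched over an in-memory list instead of a database. Everything here is illustrative: FakeCluster, its worth_keeping flag, the fixed offset of 1000, and filter_all_sketch() are hypothetical names, and progress printing as well as the final VACUUM and ANALYZE steps are left out.

```python
class AlreadyFiltered(Exception):
    """Raised when the dataset already contains filtered items."""


class FakeCluster:
    """Hypothetical in-memory stand-in for a Cluster with a filter() method."""

    def __init__(self, id, worth_keeping, filtered=False):
        self.id = id
        self.worth_keeping = worth_keeping
        self.filtered = filtered

    def filter(self):
        if self.filtered:
            raise AlreadyFiltered("cluster {} is already filtered".format(self.id))
        if not self.worth_keeping:
            return None
        # Made-up fixed offset, only for this sketch.
        return FakeCluster(self.id + 1000, True, filtered=True)


def filter_all_sketch(clusters, limit=None):
    """Filter each cluster, stopping after `limit` clusters have been seen."""
    clusters = list(clusters)
    # Refuse to run if a previous filtering pass left filtered items behind.
    if any(c.filtered for c in clusters):
        raise AlreadyFiltered("dataset already contains filtered clusters")
    kept = []
    for seen, cluster in enumerate(clusters):
        if limit is not None and seen >= limit:
            break
        result = cluster.filter()
        if result is not None:
            kept.append(result)
    return kept
```

Note that limit counts clusters seen, not clusters kept, which matches the parameter description above.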

brainscopypaste.filter.filter_quote_offset()

Get the offset to add to filtered Quote ids.

A filtered Quote’s id will be its original Quote’s id plus this offset. The function is memoized() since it is called so often.