Filtering
Filter clusters and quotes to clean the MemeTracker dataset.
This module defines the ClusterFilterMixin mixin, which adds filtering capabilities to Cluster, and the filter_clusters() function, which uses that mixin to filter the whole MemeTracker dataset. A few other utility functions are also defined.
exception brainscopypaste.filter.AlreadyFiltered
Bases: Exception
Exception raised when trying to filter a dataset that has already been filtered.
class brainscopypaste.filter.ClusterFilterMixin
Bases: object
Mixin for Clusters adding the filter() method used in filter_clusters().
filter()
Filter this Cluster and its children Quotes to see if they’re worth keeping.

First, iterate through all the children Quotes of the cluster, checking whether each one is worth keeping. A Quote is discarded if it has no urls, has fewer than MT_FILTER_MIN_TOKENS tokens, spans longer than MT_FILTER_MAX_DAYS, or is not in English. Any Quote that has none of those problems is kept.

If no Quotes are left after this filtering, or if the Cluster made of the remaining Quotes still spans longer than MT_FILTER_MAX_DAYS, the cluster and all its quotes are discarded and None is returned. Otherwise, a new Cluster is created with cluster.filtered = True and cluster.id = original_cluster.id + filter_cluster_offset(). That new cluster points to copies of all the kept Quotes, with quote.filtered = True and quote.id = original_quote.id + filter_quote_offset(). All those models (new cluster and new quotes) should later be saved to the database (the method does not do it for you), e.g. by running this method inside a session_scope().

Returns: cluster : Cluster or None
The filtered cluster pointing to filtered quotes, or None if it is to be discarded.
Raises: AlreadyFiltered
If this cluster is already filtered (i.e. filtered is True).
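The filtering rules described above can be sketched with plain Python objects. This is a minimal illustration, not the package's implementation: the Quote and Cluster dataclasses, the `language` field, the `keep_quote` and `filter_cluster` helpers, and all constant values are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical values; the real settings and offsets live in the package.
MT_FILTER_MIN_TOKENS = 5
MT_FILTER_MAX_DAYS = 80
CLUSTER_OFFSET = 1_000_000
QUOTE_OFFSET = 1_000_000


class AlreadyFiltered(Exception):
    """Raised when filtering a cluster that is already filtered."""


@dataclass
class Quote:
    id: int
    tokens: List[str]
    url_times: List[datetime]
    language: str = "en"  # assumption: language is already known
    filtered: bool = False


@dataclass
class Cluster:
    id: int
    quotes: List[Quote] = field(default_factory=list)
    filtered: bool = False


def keep_quote(quote: Quote, max_span: timedelta) -> bool:
    """Apply the four per-quote rules: urls, length, time span, language."""
    if not quote.url_times:
        return False
    if len(quote.tokens) < MT_FILTER_MIN_TOKENS:
        return False
    if max(quote.url_times) - min(quote.url_times) > max_span:
        return False
    return quote.language == "en"


def filter_cluster(cluster: Cluster) -> Optional[Cluster]:
    """Return a filtered copy of the cluster, or None if it is discarded."""
    if cluster.filtered:
        raise AlreadyFiltered(f"cluster {cluster.id} is already filtered")
    max_span = timedelta(days=MT_FILTER_MAX_DAYS)
    kept = [q for q in cluster.quotes if keep_quote(q, max_span)]
    if not kept:
        return None
    # The cluster built from the kept quotes must also fit in the time window.
    times = [t for q in kept for t in q.url_times]
    if max(times) - min(times) > max_span:
        return None
    copies = [replace(q, id=q.id + QUOTE_OFFSET, filtered=True) for q in kept]
    return Cluster(id=cluster.id + CLUSTER_OFFSET, quotes=copies, filtered=True)
```

As in the documented method, nothing here is persisted: the caller is responsible for saving the returned objects.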
brainscopypaste.filter._top_id(id)
Get the smallest power of ten three orders of magnitude greater than id.
Used to compute filter_cluster_offset() and filter_quote_offset().
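The wording above leaves some room for interpretation. One plausible reading is "ten raised to the order of magnitude of id, plus three", which the sketch below implements with integer arithmetic. The function name `top_id` and this exact formula are assumptions made for illustration; they are not confirmed by this page.

```python
def top_id(id: int) -> int:
    """Return 10 ** (order of magnitude of id + 3).

    Assumed reading of the docstring; the package's own helper may differ.
    """
    if id < 1:
        raise ValueError("id must be a positive integer")
    # Number of digits minus one is floor(log10(id)), without float rounding.
    order = len(str(id)) - 1
    return 10 ** (order + 3)


print(top_id(1234))  # 1000000
print(top_id(7))     # 1000
```

Under this reading the result is always far larger than id, so adding it to any existing id of that magnitude or smaller yields a value that cannot collide with the original ids.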
brainscopypaste.filter.filter_cluster_offset()
Get the offset to add to filtered Cluster ids.
A filtered Cluster's id will be its original Cluster's id plus this offset. The function is memoized() since it is called so often.
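The effect of memoizing an offset function can be sketched as follows. This example substitutes the standard library's `functools.lru_cache` for the package's own memoized() decorator, and it assumes the offset is derived from the largest existing cluster id; the `CLUSTER_IDS` list, the `cluster_offset` name, and the formula are all hypothetical.

```python
from functools import lru_cache

# Hypothetical ids standing in for a database query.
CLUSTER_IDS = [12, 340, 7056]

# Counts how many times the function body actually runs.
calls = {"count": 0}


@lru_cache(maxsize=None)
def cluster_offset() -> int:
    """Compute the offset once, then serve it from the cache."""
    calls["count"] += 1
    largest = max(CLUSTER_IDS)
    # Assumed formula: a power of ten three orders of magnitude above the id.
    return 10 ** (len(str(largest)) - 1 + 3)


# Every filtered id reuses the same cached offset.
filtered_ids = [cluster_id + cluster_offset() for cluster_id in CLUSTER_IDS]
print(filtered_ids)    # [1000012, 1000340, 1007056]
print(calls["count"])  # 1
```

Caching matters here because the offset is needed once per copied cluster, while the value itself only depends on the state of the unfiltered dataset.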
brainscopypaste.filter.filter_clusters(limit=None)
Filter the whole MemeTracker dataset by copying all valid Clusters and Quotes and setting their filtered attributes to True.

Iterate through all the MemeTracker Clusters, and filter each of them to see if it’s worth keeping. If a Cluster is to be kept, the function creates a copy of it and all of its kept Quotes, marking them as filtered. Progress of this operation is printed to stdout.

Once the operation finishes, a VACUUM and an ANALYZE operation are run on the database so that it recomputes its optimisations.
Parameters: limit : int, optional
If not None, stop filtering after limit clusters have been seen (useful for testing purposes).
Raises: AlreadyFiltered