Filtering
Filter clusters and quotes to clean the MemeTracker dataset.
This module defines the ClusterFilterMixin mixin which adds filtering
capabilities to Cluster, and the filter_clusters() function
which uses that mixin to filter the whole MemeTracker dataset. A few other
utility functions are also defined.
-
exception brainscopypaste.filter.AlreadyFiltered
Bases: Exception
Exception raised when trying to filter a dataset that has already been filtered.
-
class brainscopypaste.filter.ClusterFilterMixin
Bases: object
Mixin for Clusters adding the filter() method used in filter_clusters().

filter()
Filter this Cluster and its children Quotes to see if they're worth keeping.

First, iterate through all the children Quotes of the cluster, checking whether each one is worth keeping. A Quote is discarded if it has no urls, has fewer than MT_FILTER_MIN_TOKENS tokens, spans longer than MT_FILTER_MAX_DAYS days, or is not in English. Any Quote that has none of those problems is kept.

If after this filtering there are no Quotes left, or the Cluster made of the remaining Quotes still spans longer than MT_FILTER_MAX_DAYS days, the cluster and all its quotes are discarded and None is returned. If not, a new Cluster is created with cluster.filtered = True and cluster.id = original_cluster.id + filter_cluster_offset(). That new cluster points to copies of all the kept Quotes, with quote.filtered = True and quote.id = original_quote.id + filter_quote_offset(). All those models (new cluster and new quotes) should later be saved to the database (the method does not do it for you), e.g. by running this method inside a session_scope().

Returns: cluster : Cluster or None
The filtered cluster pointing to filtered quotes, or None if it is to be discarded.
Raises: AlreadyFiltered
If this cluster is already filtered (i.e. filtered is True).
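
The filtering rules described for filter() can be sketched in plain Python. This is a minimal illustration, not the package's implementation: SketchQuote, SketchCluster, AlreadyFilteredSketch, keep_quote and filter_sketch are hypothetical names, the constants are placeholder values standing in for MT_FILTER_MIN_TOKENS, MT_FILTER_MAX_DAYS and the two offset functions, and language detection is reduced to a stored language code.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List, Optional

# Placeholder values (assumptions); the real settings come from the package.
MIN_TOKENS = 5          # stands in for MT_FILTER_MIN_TOKENS
MAX_DAYS = 80           # stands in for MT_FILTER_MAX_DAYS
CLUSTER_OFFSET = 1000   # stands in for filter_cluster_offset()
QUOTE_OFFSET = 1000     # stands in for filter_quote_offset()


class AlreadyFilteredSketch(Exception):
    """Hypothetical counterpart of AlreadyFiltered."""


@dataclass
class SketchQuote:
    id: int
    tokens: List[str]
    url_times: List[datetime]   # one timestamp per url occurrence
    language: str = "en"
    filtered: bool = False


@dataclass
class SketchCluster:
    id: int
    quotes: List[SketchQuote]
    filtered: bool = False


def span(times: List[datetime]) -> timedelta:
    """Time between the first and the last occurrence."""
    return max(times) - min(times)


def keep_quote(quote: SketchQuote) -> bool:
    """Apply the four per-quote rules: urls, length, span, language."""
    if not quote.url_times:
        return False
    if len(quote.tokens) < MIN_TOKENS:
        return False
    if span(quote.url_times) > timedelta(days=MAX_DAYS):
        return False
    return quote.language == "en"


def filter_sketch(cluster: SketchCluster) -> Optional[SketchCluster]:
    """Return a filtered copy of the cluster, or None if it is discarded."""
    if cluster.filtered:
        raise AlreadyFilteredSketch(cluster.id)
    kept = [q for q in cluster.quotes if keep_quote(q)]
    if not kept:
        return None
    # The cluster made of the remaining quotes must also respect the span limit.
    all_times = [t for q in kept for t in q.url_times]
    if span(all_times) > timedelta(days=MAX_DAYS):
        return None
    # Copy the kept quotes with offset ids and the filtered flag set.
    copies = [replace(q, id=q.id + QUOTE_OFFSET, filtered=True) for q in kept]
    return SketchCluster(id=cluster.id + CLUSTER_OFFSET, quotes=copies,
                         filtered=True)
```

As in the description above, the sketch only builds new objects; persisting them is left to the caller.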
-
-
brainscopypaste.filter._top_id(id)
Get the smallest power of ten three orders of magnitude greater than id.
Used to compute filter_cluster_offset() and filter_quote_offset().
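
One possible reading of "three orders of magnitude greater" is: take the order of magnitude of id (its number of digits minus one), add three, and raise ten to that power. The sketch below implements that reading with integer arithmetic; top_id_sketch is a hypothetical name, and the actual function may compute the value differently.

```python
def top_id_sketch(id: int) -> int:
    """Power of ten three orders of magnitude above a positive integer id.

    Assumption: the order of magnitude of id is its digit count minus one.
    """
    if id <= 0:
        raise ValueError("id must be a positive integer")
    magnitude = len(str(id)) - 1
    return 10 ** (magnitude + 3)


print(top_id_sketch(123))   # 100000
print(top_id_sketch(1000))  # 1000000
```

Under this reading the result is always larger than id by at least a factor of one hundred, so adding it to any existing id cannot collide with another existing id below the original maximum.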
-
brainscopypaste.filter.filter_cluster_offset()
Get the offset to add to filtered Cluster ids.
A filtered Cluster's id will be its original Cluster's id plus this offset. The function is memoized() since it is called so often.
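
To illustrate why memoizing an offset function pays off, the sketch below caches the result of a costly lookup so that repeated calls reuse it. It swaps in functools.lru_cache from the standard library rather than the package's own memoized() helper. The names fake_max_cluster_id and offset_sketch are hypothetical, and deriving the offset from the largest existing id is an assumption made for the example.

```python
from functools import lru_cache

lookups = {"count": 0}


def fake_max_cluster_id() -> int:
    """Hypothetical stand-in for a database query returning the largest id."""
    lookups["count"] += 1
    return 4321


@lru_cache(maxsize=None)
def offset_sketch() -> int:
    """Compute the offset once; later calls return the cached value."""
    max_id = fake_max_cluster_id()
    # Assumed rule: power of ten three orders of magnitude above max_id.
    return 10 ** (len(str(max_id)) - 1 + 3)


for _ in range(3):
    offset_sketch()
print(lookups["count"])  # 1: the lookup ran only on the first call
```

A cached offset stays fixed for the life of the process, which suits a value that must not change while a filtering run is in progress.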
-
brainscopypaste.filter.filter_clusters(limit=None)
Filter the whole MemeTracker dataset by copying all valid Clusters and Quotes and setting their filtered attributes to True.

Iterate through all the MemeTracker Clusters, and filter each of them to see if it's worth keeping. If a Cluster is to be kept, the function creates a copy of it and all of its kept Quotes, marking them as filtered. Progress of this operation is printed to stdout.

Once the operation finishes, a VACUUM and an ANALYZE operation are run on the database so that it recomputes its optimisations.
Parameters: limit : int, optional
If not None, stop filtering after limit clusters have been seen (useful for testing purposes).
Raises: AlreadyFiltered