Database models

Database models and related utilities.

This module defines the database structure underlying storage for the analysis. This consists in models that get turned into PostgreSQL tables by SQLAlchemy, along with a few utility classes, functions and exceptions around them.

Cluster and Quote represent respectively an individual cluster or quote from the MemeTracker data set. Url represents a quote occurrence, and those are stored as attributes of Quotes (as opposed to in their own table). Substitution represents an individual substitution mined with a given substitution Model.

Each model (except Url, which doesn’t have its own table) inherits the BaseMixin, which defines the table name, id field, and provides a common clone() method.

On top of that, models define a few computed properties (using the utils.cache decorator) which provide useful information that doesn’t need to be stored directly in the database (storing that in the database would make first access faster, but introduces more possibilities of inconsistent data if updates don’t align well). Cluster and Substitution also inherit functionality from the mine, filter and features modules, which you can inspect for more details.

Finally, this module defines save_by_copy(), a useful function to efficiently import clusters and quotes in bulk into the database.

class brainscopypaste.db.ArrayOfEnum(item_type, as_tuple=False, dimensions=None, zero_indexes=False)[source]

Bases: sqlalchemy.dialects.postgresql.base.ARRAY

ARRAY of ENUMs column type, which is not directly supported by DBAPIs.

This workaround is provided by SQLAchemy’s documentation.

class brainscopypaste.db.BaseMixin[source]

Bases: object

Common mixin for all models defining a table name, an id field, and a clone() method.

clone(**fields)[source]

Clone a model instance, excluding the original id and optionally setting some fields to values provided as arguments.

Give the fields to override as keyword arguments, their values will be set on the cloned instance. Any field that is not a known table column is ignored.

id = Column(None, Integer(), table=None, primary_key=True, nullable=False)

Primary key for the table.

class brainscopypaste.db.Cluster(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, brainscopypaste.db.BaseMixin, brainscopypaste.filter.ClusterFilterMixin, brainscopypaste.mine.ClusterMinerMixin

Represent a MemeTracker cluster of quotes in the database.

Attributes below are defined as class attributes or cached methods, but they appear as instance attributes when you have an actual cluster instance. For instance, if cluster is a Cluster instance, cluster.size will give you that instance’s size.

filtered

Boolean indicating whether this cluster is part of the filtered (and kept) set of clusters or not.

format_copy()[source]

Create a string representing the cluster in a cursor.copy_from() or _copy() call.

format_copy_columns = ('id', 'sid', 'filtered', 'source')

Tuple of column names that are used by format_copy().

frequency

Complete number of occurrences of all the quotes in the cluster (i.e. counting url frequencies).

Look at size_urls for a count that ignores url frequencies.

quotes

List of Quotes in this cluster (this is a dynamic relationship on which you can run queries).

sid

Id of the cluster that originated this instance, i.e. the id as it appears in the MemeTracker data set.

size

Number of quotes in the cluster.

size_urls

Number of urls of all the quotes in the cluster (i.e. not counting url frequencies)

Look at frequency for a count that takes url frequencies into account.

source

Source data set from which this cluster originated. Currently this is always memetracker.

span

Span of the cluster (as a timedelta), from first to last occurrence.

Raises:

ValueError

If no urls are defined on any quotes of the cluster.

urls

Unordered list of Urls of all the quotes in the cluster.

class brainscopypaste.db.ModelType(*args, **kwargs)[source]

Bases: sqlalchemy.sql.type_api.TypeDecorator

Database type representing a substitution Model, used in the definition of Substitution.

impl

alias of String

process_bind_param(value, dialect)[source]

Convert a Model to its database representation.

process_result_value(value, dialect)[source]

Create a Model instance from its database representation.

class brainscopypaste.db.Quote(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, brainscopypaste.db.BaseMixin

Represent a MemeTracker quote in the database.

Attributes below are defined as class attributes or cached methods, but they appear as instance attributes when you have an actual quote instance. For instance, if quote is a Quote instance, quote.size will give you that instance’s size.

Note that children Urls are stored directly inside this model through lists of url attributes, where a given url is defined by items at the same index in the various lists. This is an internal detail, and you should use the urls attribute to directly get a list of Url objects.

add_url(url)[source]

Add a Url to the quote.

The change is not automatically saved. If you want to persist this to the database, you should do it inside a session and commit afterwards (e.g. using session_scope()).

Parameters:

url : Url

The url to add to the quote.

Raises:

SealedException

If the urls attribute has already been accessed; since that attribute is cached, adding a url to the quote would invalidate the value.

add_urls(urls)[source]

Add a list of Urls to the quote.

As for add_url(), the changes are not automatically saved. If you want to persist this to the database, you should do it inside a session and commit afterwards (e.g. using session_scope()).

Parameters:

urls : list of Urls

The urls to add to the quote.

Raises:

SealedException

If the urls attribute has already been accessed; since that attribute is cached, adding urls to the quote would invalidate the value.

cluster

Parent Cluster.

cluster_id

Parent cluster id.

filtered

Boolean indicating whether this quote is part of the filtered (and kept) set of quotes or not.

format_copy()[source]

Create a string representing the quote and all its children urls in a cursor.copy_from() or _copy() call.

format_copy_columns = ('id', 'cluster_id', 'sid', 'filtered', 'string', 'url_timestamps', 'url_frequencies', 'url_url_types', 'url_urls')

Tuple of column names that are used by format_copy().

frequency

Complete number of occurrences of the quote (i.e. counting url frequencies).

Look at size for a count that ignores url frequencies.

lemmas

List of the lemmas in the quote’s string.

Raises:

ValueError

If the quote’s string is None.

sid

Id of the quote that originated this instance, i.e. the id as it appears in the MemeTracker data set.

size

Number of urls in the quote.

Look at frequency for a count that takes url frequencies into account.

span

Span of the quote (as a timedelta), from first to last occurrence.

Raises:

ValueError

If no urls are defined on the quote.

string

Text of the quote.

substitutions_destination

List of Substitutions for which this quote is the destination (this is a dynamic relationship on which you can run queries).

substitutions_source

List of Substitutions for which this quote is the source (this is a dynamic relationship on which you can run queries).

tags

List of TreeTagger POS tags of the tokens in the quote’s string.

Raises:

ValueError

If the quote’s string is None.

tokens

List of the tokens in the quote’s string.

Raises:

ValueError

If the quote’s string is None.

url_frequencies

List of ints representing the frequencies of children urls (i.e. how many times the quote string appears at each url).

url_timestamps

List of datetimes representing the times at which children urls appear.

url_url_types

List of url_types representing the types of the children urls.

url_urls

List of strs representing the URIs of the children urls.

urls

Unordered list of Urls of the quote; use this to access urls of the quote, instead of the url_* attributes.

exception brainscopypaste.db.SealedException[source]

Bases: Exception

Exception raised when trying to edit a model on which cached methods have already been accessed.

class brainscopypaste.db.Substitution(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, brainscopypaste.db.BaseMixin, brainscopypaste.mine.SubstitutionValidatorMixin, brainscopypaste.features.SubstitutionFeaturesMixin

Represent a substitution in the database from one Quote to another.

A substitution is the replacement of a word from one quote (or a substring of that quote) in another quote. It is defined by a source quote, an occurrence of a destination quote, the position of a substring in the source quote string, the position of the replaced word in that substring, and the substitution model that detected the substitution in the data set.

Attributes below are defined as class attributes or cached methods, but they appear as instance attributes when you have an actual substitution instance. For instance, if substitution is a Substitution instance, substitution.tags will give you that instance’s tags.

destination

Destination Quote for the substitution.

destination_id

Id of the destination quote for the substitution.

lemmas

Tuple of lemmas of the replaced and replacing words.

model

Substitution detection Model that detected this substitution.

occurrence

Index of the destination Url in the destination quote.

position

Position of the replaced word in the substring of the source quote (which is also the position in the destination quote).

source

Source Quote for the substitution.

source_id

Id of the source quote for the substitution.

start

Index of the beginning of the substring in the source quote.

tags

Tuple of TreeTagger POS tags of the replaced and replacing words.

tokens

Tuple of the replaced and replacing words (the tokens here are the exact replaced and replacing words).

class brainscopypaste.db.Url(timestamp, frequency, url_type, url, quote=None)[source]

Bases: object

Represent a MemeTracker url in a Quote in the database.

The url occurrence is defined below as a cached method, but it appears as an instance attribute when you have an actual url instance. For instance, if url is a Url instance, url.occurrence will give you that url’s occurrence.

Note that Urls are stored directly inside Quote instances, and don’t have a dedicated database table.

Attributes

quote (Quote) Parent quote.
timestamp (datetime) Time at which the url occurred.
frequency (int) Number of times the quote string appears at this url.
url_type (url_type) Type of this url.
url (str) URI of this url.
_Url__key()

Unique identifier for this url, used to compute e.g. equality between two Url instances.

occurrence

Index of the url in the list of urls of the parent Quote.

Raises:

ValueError

If the url’s quote attribute is None.

brainscopypaste.db._copy(string, table, columns)[source]

Execute a PostgreSQL COPY command.

COPY is one of the fastest methods to import data in bulk into PostgreSQL. This function executes this operation through the raw psycopg2 cursor object.

Parameters:

string : file-like object

Contents of the data to import into the database, formatted for the COPY command (see PostgreSQL’s documentation for more details). Can be an io.StringIO if you don’t want to use a real file in the filesystem.

table : str

Name of the table into which the data is imported.

columns : list of str

List of the column names encoded in the string parameter. When string is produced using Quote.format_copy() or Cluster.format_copy() you can use the corresponding Quote.format_copy_columns or Cluster.format_copy_columns for this parameter.

brainscopypaste.db.save_by_copy(clusters, quotes)[source]

Import a list of clusters and a list of quotes into the database.

This function uses PostgreSQL’s COPY command to bulk import clusters and quotes, and prints its progress to stdout.

Parameters:

clusters : list of Clusters

List of clusters to import in the database.

quotes : list of Quotes

List of quotes to import in the database. Any clusters they reference should be in the clusters parameter.

brainscopypaste.db.url_type = Enum('B', 'M', name='url_type', metadata=MetaData(bind=None))

sqlalchemy.types.Enum of possible types of Urls from the MemeTracker data set.