Data loading

Load data from various datasets.

This module defines functions and classes to load and parse dataset files. load_fa_features() loads Free Association features (using FAFeatureLoader) and load_mt_frequency_and_tokens() loads MemeTracker features. Both save their computed features to pickle files for later use in analyses. MemeTrackerParser parses and loads the whole MemeTracker dataset into the database and is used by cli.
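A hedged sketch of the typical workflow this module supports (the file path and line count are placeholders; MemeTracker parsing is normally driven through cli):

    from brainscopypaste.load import (MemeTrackerParser, load_fa_features,
                                      load_mt_frequency_and_tokens)

    # Parse the MemeTracker file into the database (normally done via cli);
    # the path and line count are placeholders for this illustration.
    MemeTrackerParser('/path/to/memetracker-clusters.txt', line_count=1000).parse()

    # Compute and pickle the features used by later analyses
    # (the dataset must also have been filtered beforehand; see cli).
    load_fa_features()
    load_mt_frequency_and_tokens()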

class brainscopypaste.load.FAFeatureLoader[source]

Bases: brainscopypaste.load.Parser

Loader for the Free Association dataset and features.

This class defines a method to load the FA norms (_norms()), utility methods to compute the different variants of graphs that can represent the norms (_norms_graph(), _inverse_norms_graph(), and _undirected_norms_graph()) or to help feature computation (_remove_zeros()), and public methods that compute features on the FA data (degree(), pagerank(), betweenness(), and clustering()). Use a single class instance to compute all FA features.
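A hedged usage sketch, assuming the no-argument constructor implied above:

    from brainscopypaste.load import FAFeatureLoader

    # A single instance is enough to compute all four FA features.
    loader = FAFeatureLoader()
    degree = loader.degree()
    pagerank = loader.pagerank()
    betweenness = loader.betweenness()
    clustering = loader.clustering()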

_inverse_norms_graph

Get the Free Association directed graph with inverted weights.

This graph is useful for computing e.g. betweenness(), where link strength should be considered an inverse cost (i.e. a stronger link is easier to cross, instead of harder).

This property is memoized() for performance of the class.

Returns:

networkx.DiGraph()

The FA inversely weighted directed graph.
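A minimal sketch of the inversion described here, on a toy graph, assuming each weight is replaced by its reciprocal (the exact transformation is not specified in this documentation):

    import networkx as nx

    # Toy directed graph standing in for the FA norms graph (invented weights).
    norms_graph = nx.DiGraph()
    norms_graph.add_edge('dog', 'cat', weight=0.5)
    norms_graph.add_edge('dog', 'bone', weight=0.1)

    # Invert each weight so that stronger links become cheaper to traverse.
    inverse_graph = nx.DiGraph()
    for cue, word, weight in norms_graph.edges(data='weight'):
        inverse_graph.add_edge(cue, word, weight=1 / weight)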

_norms

Parse the Free Association Appendix A files into self.norms.

After loading, self.norms is a dict containing, for each (lowercased) cue, a list of tuples. Each tuple represents a word referenced by the cue, and is in format (word, ref, weight): word is the referenced word; ref is a boolean indicating if word has been normed or not; weight is the strength of the referencing.

This property is memoized() for performance of the class.
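For illustration, self.norms could then look like the following (all cues, words and weights are invented):

    norms = {
        'dog': [('cat', True, 0.57), ('bone', False, 0.12)],
        'cat': [('dog', True, 0.60)],
    }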

_norms_graph

Get the Free Association weighted directed graph.

This property is memoized() for performance of the class.

Returns:

networkx.DiGraph()

The FA weighted directed graph.
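A minimal sketch, assuming one weighted directed edge per cue-word reference, of how such a graph could be built from self.norms (the actual implementation may differ):

    import networkx as nx

    # Hypothetical norms dict, as sketched under _norms above.
    norms = {
        'dog': [('cat', True, 0.57), ('bone', False, 0.12)],
        'cat': [('dog', True, 0.60)],
    }

    # One weighted, directed edge per cue -> word reference.
    norms_graph = nx.DiGraph()
    for cue, references in norms.items():
        for word, ref, weight in references:
            norms_graph.add_edge(cue, word, weight=weight)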

classmethod _remove_zeros(feature)[source]

Remove from the feature dict all key-value pairs whose value is zero.

Modifies the provided feature dict, and does not return anything.

Parameters:

feature : dict

Any association of key-value pairs where values are numbers. Usually a dict of words to feature values.
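For illustration (the words and values are invented):

    from brainscopypaste.load import FAFeatureLoader

    feature = {'dog': 0.12, 'cat': 0.0, 'bone': 0.07}
    FAFeatureLoader._remove_zeros(feature)
    # feature is now {'dog': 0.12, 'bone': 0.07}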

_undirected_norms_graph

Get the Free Association weighted undirected graph.

When a pair of words is connected in both directions, the undirected link between the two words receives the sum of the two directed link weights. This is used to compute e.g. clustering(), which is defined on the undirected (but weighted) FA graph.

This property is memoized() for performance of the class.

Returns:

networkx.Graph()

The FA weighted undirected graph.
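A minimal sketch of the weight-summing conversion described above, on a toy directed graph:

    import networkx as nx

    # Toy directed graph with one pair of words connected in both directions.
    norms_graph = nx.DiGraph()
    norms_graph.add_edge('dog', 'cat', weight=0.5)
    norms_graph.add_edge('cat', 'dog', weight=0.25)

    # Sum the two directed weights into a single undirected link.
    undirected_graph = nx.Graph()
    for u, v, weight in norms_graph.edges(data='weight'):
        if undirected_graph.has_edge(u, v):
            undirected_graph[u][v]['weight'] += weight
        else:
            undirected_graph.add_edge(u, v, weight=weight)

    # undirected_graph['dog']['cat']['weight'] is now 0.75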

betweenness()[source]

Compute betweenness centrality for words coded by Free Association.

Returns:

betweenness : dict

The association of each word to its betweenness centrality. FA link weights are considered as inverse cost in the computation (i.e. a stronger link is easier to cross). Words with betweenness zero are removed from the dict.
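One plausible way to realise the computation described, using networkx on a toy stand-in for the inversely weighted graph (not necessarily the exact implementation):

    import networkx as nx

    # Toy inversely weighted graph standing in for _inverse_norms_graph.
    inverse_graph = nx.DiGraph()
    inverse_graph.add_edge('dog', 'cat', weight=1 / 0.5)
    inverse_graph.add_edge('cat', 'bone', weight=1 / 0.1)

    # networkx reads 'weight' as a traversal cost, hence the inverted weights.
    betweenness = nx.betweenness_centrality(inverse_graph, weight='weight')
    betweenness = {word: value for word, value in betweenness.items() if value != 0}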

clustering()[source]

Compute clustering coefficient for words coded by Free Association.

Returns:

clustering : dict

The association of each word to its clustering coefficient. FA link weights are taken into account in the computation, but direction of links is ignored (if words are connected in both directions, the link weights are added together). Words with clustering coefficient zero are removed from the dict.
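A sketch of the computation described, shown here with networkx's weighted clustering() on a toy stand-in for the undirected graph:

    import networkx as nx

    # Toy undirected weighted graph standing in for _undirected_norms_graph.
    undirected_graph = nx.Graph()
    undirected_graph.add_edge('dog', 'cat', weight=0.75)
    undirected_graph.add_edge('cat', 'bone', weight=0.1)
    undirected_graph.add_edge('dog', 'bone', weight=0.2)

    # Weighted clustering coefficients; direction of the original links is ignored.
    clustering = nx.clustering(undirected_graph, weight='weight')
    clustering = {word: value for word, value in clustering.items() if value != 0}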

degree()[source]

Compute in-degree centrality for words coded by Free Association.

Returns:

degree : dict

The association of each word to its in-degree. Each incoming link counts as 1 (i.e. link weights are ignored). Words with zero incoming links are removed from the dict.
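A sketch of the idea with networkx, shown here with the normalised in_degree_centrality() (the actual implementation may use raw in-degree counts instead):

    import networkx as nx

    # Toy directed graph standing in for the FA norms graph.
    norms_graph = nx.DiGraph()
    norms_graph.add_edge('dog', 'cat', weight=0.5)
    norms_graph.add_edge('bone', 'cat', weight=0.1)

    # Each incoming link counts as 1, regardless of its weight.
    degree = nx.in_degree_centrality(norms_graph)
    degree = {word: value for word, value in degree.items() if value != 0}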

header_size = 4

Size (in lines) of the header in files to be parsed.

pagerank()[source]

Compute pagerank centrality for words coded by Free Association.

Returns:

pagerank : dict

The association of each word to its pagerank. FA link weights are taken into account in the computation. Words with pagerank zero are removed from the dict.
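A sketch of the computation with networkx's weighted pagerank(), on a toy stand-in for the norms graph (not necessarily the exact implementation):

    import networkx as nx

    # Toy directed weighted graph standing in for the FA norms graph.
    norms_graph = nx.DiGraph()
    norms_graph.add_edge('dog', 'cat', weight=0.5)
    norms_graph.add_edge('cat', 'dog', weight=0.25)
    norms_graph.add_edge('dog', 'bone', weight=0.1)

    # Weighted pagerank; zero values (if any) would then be removed.
    pagerank = nx.pagerank(norms_graph, weight='weight')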

class brainscopypaste.load.MemeTrackerParser(filename, line_count, limit=None)[source]

Bases: brainscopypaste.load.Parser

Parse the MemeTracker dataset into the database.

After initialisation, the parse() method does all the work. Internally, it relies on the utility methods _parse(), _parse_cluster_block() and _parse_line() (for the actual parsing), _handle_cluster(), _handle_quote() and _handle_url() (for handling the parsed data), and _check() (for consistency checking).

Parameters:

filename : str

Path to the MemeTracker dataset file to parse.

line_count : int

Number of lines in filename, used to show a progress bar. It should be computed beforehand with e.g. wc -l <filename>, so that Python doesn't need to read the complete file twice.

limit : int, optional

If not None, parsing stops once limit clusters have been read (defaults to None). Useful for testing purposes.
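A hedged usage sketch; the dataset path below is a placeholder, and the line count is computed externally as suggested above:

    import subprocess

    from brainscopypaste.load import MemeTrackerParser

    filename = '/path/to/memetracker-clusters.txt'  # placeholder path

    # Count lines beforehand so the parser can show a progress bar.
    wc_output = subprocess.check_output(['wc', '-l', filename])
    line_count = int(wc_output.decode().split()[0])

    # Parse only the first 10 clusters, e.g. for a quick test run.
    parser = MemeTrackerParser(filename, line_count, limit=10)
    parser.parse()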

_check()[source]

Check the consistency of the database with self._checks.

The original MemeTracker dataset specifies the number of quotes and frequency for each cluster, and the number of urls and frequency for each quote. This information is saved in self._checks during parsing. This method iterates through the whole database of saved Clusters and Quotes to check that their counts correspond to what the MemeTracker dataset says (as stored in self._checks).

Raises:

ValueError

If any count in the database differs from its specification in self._checks.

_handle_cluster(fields)[source]

Handle a list of cluster fields to create a new Cluster.

The newly created Cluster is appended to self._objects['clusters'], and corresponding fields are created in self._checks.

Parameters:

fields : list of str

List of fields defining the new cluster, as returned by _parse_line().

_handle_quote(fields)[source]

Handle a list of quote fields to create a new Quote.

The newly created Quote is appended to self._objects['quotes'], and corresponding fields are created in self._checks.

Parameters:

fields : list of str

List of fields defining the new quote, as returned by _parse_line().

_handle_url(fields)[source]

Handle a list of url fields to create a new Url.

The newly created Url is stored on self._quote which holds the currently parsed quote.

Parameters:

fields : list of str

List of fields defining the new url, as returned by _parse_line().

_parse()[source]

Do the actual MemeTracker file parsing.

Initialises the parsing tracking variables, then delegates each new cluster block to _parse_cluster_block(). Parsed clusters and quotes are stored as Clusters and Quotes in self._objects (to be saved later in parse()). Frequency and url counts for clusters and quotes are saved in self._checks for later checking in parse().

_parse_cluster_block()[source]

Parse a block of lines representing a cluster in the source MemeTracker file.

The Cluster itself is first created from self._cluster_line with _handle_cluster(), then each following line is delegated to _handle_quote() or _handle_url() until exhaustion of this cluster block. During the parsing of this cluster, self._cluster holds the current cluster being filled and self._quote the current quote (both are cleaned up when the method finishes). At the end of this block, the method increments self._clusters_read and sets self._cluster_line to the line defining the next cluster, or None if the end of file or self.limit was reached.

Raises:

ValueError

If self._cluster_line is not a line defining a new cluster.

classmethod _parse_line(line)[source]

Parse line to determine if it’s a cluster-, quote- or url-line, or anything else.

Parameters:

line : str

A line from the MemeTracker dataset to parse.

Returns:

tipe : str in {'cluster', 'quote', 'url'} or None

The type of object that line defines; None if unknown or empty line.

fields : list of str

List of the tab-separated fields in line.

header_size = 6

Size (in lines) of the header in the MemeTracker file to be parsed.

parse()[source]

Parse the whole MemeTracker file, save, optimise the database, and check for consistency.

Parse the MemeTracker file with _parse() to create Cluster and Quote database entries corresponding to the dataset. The parsed data is then persisted to database in one step (with save_by_copy()). The database is then VACUUMed and ANALYZEd (with execute_raw()) to force it to recompute its optimisations. Finally, the consistency of the database is checked (with _check()) against number of quotes and frequency in each cluster of the original file, and against number of urls and frequency in each quote of the original file. Progress is printed to stdout.

Note that if self.limit is not None, parsing will stop after self.limit clusters have been read.

Once the parsing is finished, self.parsed is set to True.

Raises:

ValueError

If this instance has already parsed a file.

class brainscopypaste.load.Parser[source]

Bases: object

Mixin for file parsers providing the _skip_header() method.

Used by FAFeatureLoader and MemeTrackerParser.

_skip_header()[source]

Skip self.header_size lines in the file self._file.
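A minimal sketch of what the mixin provides, assuming subclasses open their data file into self._file and override header_size (the default value shown here and the exact implementation are assumptions):

    class Parser:
        # Number of header lines to skip; subclasses override this
        # (4 for the Free Association files, 6 for the MemeTracker file).
        header_size = 0

        def _skip_header(self):
            # Consume the header lines of the already-open file handle.
            for _ in range(self.header_size):
                self._file.readline()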

brainscopypaste.load.load_fa_features()[source]

Load the Free Association dataset and save all its computed features to pickle files.

FA degree, pagerank, betweenness, and clustering are computed using the FAFeatureLoader class, and saved respectively to DEGREE, PAGERANK, BETWEENNESS and CLUSTERING. Progress is printed to stdout.

brainscopypaste.load.load_mt_frequency_and_tokens()[source]

Compute MemeTracker frequency codings and the list of available tokens.

Iterate through the whole MemeTracker dataset loaded in the database to count word frequencies and build the list of encountered tokens. Frequency codings are then saved to FREQUENCY, and the list of tokens is saved to TOKENS. The MemeTracker dataset must have been loaded and filtered beforehand, or an exception will be raised (see Usage or cli for more about that). Progress is printed to stdout.