Features

Features for words in substitutions.

This module defines the SubstitutionFeaturesMixin, which augments Substitutions with convenience methods giving access to feature values and related computed values (e.g. sentence-relative feature values and values for composite features).

A few other utility functions that load data for the features are also defined.

class brainscopypaste.features.SubstitutionFeaturesMixin[source]

Bases: object

Mixin for Substitutions adding feature-related functionality.

Methods in this class fall into 3 categories:

  • Raw feature methods: they are memoized() class methods of the form cls._feature_name(cls, word=None). Calling them with a word returns either the feature value of that word, or np.nan if the word is not encoded. Calling them with word=None returns the set of words encoded by that feature (which is used to compute e.g. averages over the pool of words encoded by that feature). Their docstring (which you will see below if you’re reading this in a web browser) is the short name used to identify e.g. the feature’s column in analyses in notebooks. These methods are used internally by the class, to provide the next category of methods.
  • Useful feature methods that can be used in analyses: features(), feature_average(), source_destination_features(), components(), and component_average(). These methods use the raw feature methods (previous category) and the utility methods (next category) to compute feature or composite values (possibly relative to the sentence) on the source or destination words or sentences.
  • Private utility methods: _component(), _source_destination_components(), _average(), _static_average(), _strict_synonyms(), _substitution_features(), and _transformed_feature(). These methods are used by the previous category of methods.

Read the source of the first category (raw features) to know how exactly an individual feature is computed. Read the docstrings (and source) of the second category (useful methods for analyses) to learn how to use this class in analyses. Read the docstrings (and source) of the third category (private utility methods) to learn how the whole class assembles its different parts together.
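
For instance, the calling convention of the raw feature methods (first category above) looks as follows. This is only a sketch, and assumes the underlying feature data files are available, as in the examples further down:

>>> mixin = SubstitutionFeaturesMixin()
>>> value = mixin._aoa('dog')   # the word's age of acquisition, or np.nan if 'dog' is not coded
>>> coded_words = mixin._aoa()  # calling with word=None gives the set of words coded by the feature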

classmethod _aoa(word=None)[source]

age of acquisition

_average(func, source_synonyms)[source]

Compute the average value of func over the words it codes, or over the synonyms of this substitution’s source word.

If source_synonyms is True, the method computes the average feature of the synonyms of the source word of this substitution. Otherwise, it computes the average over all words coded by func.

The method is memoized() since it is called so often.

Parameters:

func : function

The function to average. Calling func() must return the pool of words that the function codes. Calling func(word) must return the value for word.

source_synonyms : bool

If True, compute the average func of the synonyms of the source word in this substitution. If False, compute the average over all coded words.

Returns:

float

Average func value.
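
Examples

A minimal sketch of both modes, assuming substitution is a Substitution instance loaded elsewhere (not constructed here) and the 'aoa' feature data is available:

>>> aoa = substitution._transformed_feature('aoa')
>>> pool_average = substitution._average(aoa, source_synonyms=False)     # average over all words coded by 'aoa'
>>> synonyms_average = substitution._average(aoa, source_synonyms=True)  # average over synonyms of the source word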

classmethod _betweenness(word=None)[source]

betweenness

classmethod _clustering(word=None)[source]

clustering

classmethod _component(n, pca, feature_names)[source]

Get a function computing the n-th component of pca using feature_names.

The method is memoized() since it is called so often.

Parameters:

n : int

Index of the component in pca that is to be computed.

pca : sklearn.decomposition.PCA

PCA instance that was computed using the features listed in feature_names.

feature_names : tuple of str

Tuple of feature names used in the computation of pca.

Returns:

component : function

The component function, with signature component(word=None). Call component() to get the set of words encoded by that component (which is the set of words encoded by all features in feature_names). Call component(word) to get the component value of word (or np.nan if word is not coded by that component).

Examples

Get the first component of “dog” in a PCA with very few words, using features aoa, frequency, and letters_count:

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> mixin = SubstitutionFeaturesMixin()
>>> feature_names = ('aoa', 'frequency', 'letters_count')
>>> features = list(map(mixin._transformed_feature,
...                     feature_names))
>>> values = np.array([[f(w) for f in features]
...                    for w in ['bird', 'cat', 'human']])
>>> pca = PCA(n_components=2)
>>> pca = pca.fit(values)
>>> mixin._component(0, pca, feature_names)('dog')
-0.14284518091970733

classmethod _degree(word=None)[source]

degree

classmethod _frequency(word=None)[source]

frequency

classmethod _letters_count(word=None)[source]

#letters

classmethod _orthographic_density(word=None)[source]

orthographic nd

classmethod _pagerank(word=None)[source]

pagerank

classmethod _phonemes_count(word=None)[source]

#phonemes

classmethod _phonological_density(word=None)[source]

phonological nd

_source_destination_components(n, pca, feature_names)[source]

Compute the n-th component of pca for all words in source and destination sentences of this substitution.

The method is memoized() since it is called so often.

Parameters:

n : int

Index of the component in pca that is to be computed.

pca : sklearn.decomposition.PCA

PCA instance that was computed using the features listed in feature_names.

feature_names : tuple of str

Tuple of feature names used in the computation of pca.

Returns:

source_components : array of float

Array of component values for each word in the source sentence of this substitution. Non-coded words appear as np.nan.

destination_components : array of float

Array of component values for each word in the destination sentence of this substitution. Non-coded words appear as np.nan.

static _static_average(func)[source]

Static version of _average(), without the source_synonyms argument.

The method is memoized() since it is called so often.

classmethod _strict_synonyms(word)[source]

Get the set of synonyms of word through WordNet, excluding word itself; empty if nothing is found.
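
Examples

A short illustration (requires the WordNet data used by this module):

>>> mixin = SubstitutionFeaturesMixin()
>>> synonyms = mixin._strict_synonyms('frog')
>>> 'frog' in synonyms   # the word itself is never included
False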

_substitution_features(name)[source]

Compute feature name for source and destination words of this substitution.

Feature values are transformed as explained in _transformed_feature().

The method is memoized() since it is called so often.

Parameters:

name : str

Name of the feature for which to compute source and destination values.

Returns:

tuple of float

Feature values of the source and destination words of this substitution.

classmethod _syllables_count(word=None)[source]

#syllables

classmethod _synonyms_count(word=None)[source]

#synonyms

classmethod _transformed_feature(name)[source]

Get a function computing feature name, transformed as defined by __features__.

Some features have a very skewed distribution (e.g. exponential, where a few words are valued orders of magnitude more than the vast majority of words), so we use their log-transformed values in the analysis to make them comparable to more regular features. The __features__ attribute (which appears in the source code but not in the web version of these docs) defines which transformation (if any) applies to each feature. Given a feature name, this method generates a function that proxies calls to the raw feature method and transforms the value if necessary.

This method is memoized() for speed, since other methods call it all the time.

Parameters:

name : str

Name of the feature for which to create a function, without preceding underscore; for instance, call cls._transformed_feature('aoa') to get a function that uses the _aoa() class method.

Returns:

feature : function

The feature function, with signature feature(word=None). Call feature() to get the set of words encoded by that feature. Call feature(word) to get the transformed feature value of word (or np.nan if word is not coded by that feature).

Examples

Get the transformed frequency value of “dog”:

>>> import numpy as np
>>> mixin = SubstitutionFeaturesMixin()
>>> logfrequency = mixin._transformed_feature('frequency')
>>> logfrequency('dog') == np.log(mixin._frequency('dog'))
True

component_average(n, pca, feature_names, source_synonyms=False, sentence_relative=None)[source]

Compute the average, over all coded words or synonyms of this substitution’s source word, of the n-th component of pca using feature_names, possibly sentence-relative.

If source_synonyms is True, the method computes the average component of the synonyms of the source word of this substitution. Otherwise, it computes the average over all words coded by the component.

If sentence_relative is not None, it indicates a NumPy function used to aggregate word components in the source sentence of this substitution; this method then returns the component average minus that aggregate value. For instance, if sentence_relative='median', this method returns the average component minus the median component value in the source sentence (words valued at np.nan are ignored).

The method is memoized() since it is called so often.

Parameters:

n : int

Index of the component in pca that is to be computed.

pca : sklearn.decomposition.PCA

PCA instance that was computed using the features listed in feature_names.

feature_names : tuple of str

Tuple of feature names used in the computation of pca.

source_synonyms : bool, optional

If True, compute the average component of the synonyms of the source word in this substitution. If False (default), compute the average over all coded words.

sentence_relative : str, optional

If not None (the default is None), return the average component relative to the component values of the source sentence of this substitution, aggregated by this function; must be a name such that np.nan<sentence_relative> is an existing NumPy function (e.g. 'median' for np.nanmedian).

Returns:

float

Average component, of all coded words or of synonyms of the substitution’s source word (depending on source_synonyms), relative to an aggregated source sentence value if sentence_relative specifies it.
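
Examples

A sketch reusing pca and feature_names from the _component() example above, and assuming substitution is a Substitution instance loaded elsewhere:

>>> average = substitution.component_average(0, pca, feature_names)
>>> relative_average = substitution.component_average(
...     0, pca, feature_names, source_synonyms=True, sentence_relative='median')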

components(n, pca, feature_names, sentence_relative=None)[source]

Compute the n-th components of pca for source and destination words of this substitution, possibly sentence-relative.

If sentence_relative is not None, it indicates a NumPy function used to aggregate word components in the source and destination sentences of this substitution; this method then returns the source/destination word component values minus the corresponding aggregate value. For instance, if sentence_relative='median', this method returns the source word component minus the median of the source sentence, and the destination word component minus the median of the destination sentence (words valued at np.nan are ignored).

The method is memoized() since it is called so often.

Parameters:

n : int

Index of the component in pca that is to be computed.

pca : sklearn.decomposition.PCA

PCA instance that was computed using the features listed in feature_names.

feature_names : tuple of str

Tuple of feature names used in the computation of pca.

sentence_relative : str, optional

If not None (the default is None), return components relative to the values of their corresponding sentence, aggregated by this function; must be a name such that np.nan<sentence_relative> is an existing NumPy function (e.g. 'median' for np.nanmedian).

Returns:

tuple of float

Components (possibly sentence-relative) of the source and destination words of this substitution.
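
Examples

A sketch with the same assumptions as above (pca, feature_names and substitution defined elsewhere):

>>> source_component, destination_component = substitution.components(0, pca, feature_names)
>>> rel_source, rel_destination = substitution.components(
...     0, pca, feature_names, sentence_relative='median')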

feature_average(name, source_synonyms=False, sentence_relative=None)[source]

Compute the average of feature name over all coded words or over synonyms of this substitution’s source word, possibly sentence-relative.

If source_synonyms is True, the method computes the average feature of the synonyms of the source word of this substitution. Otherwise, it computes the average over all words coded by the feature.

If sentence_relative is not None, it indicates a NumPy function used to aggregate word features in the source sentence of this substitution; this method then returns the feature average minus that aggregate value. For instance, if sentence_relative='median', this method returns the average feature minus the median feature value in the source sentence (words valued at np.nan are ignored).

The method is memoized() since it is called so often.

Parameters:

name : str

Name of the feature for which to compute an average.

source_synonyms : bool, optional

If True, compute the average feature of the synonyms of the source word in this substitution. If False (default), compute the average over all coded words.

sentence_relative : str, optional

If not None (the default is None), return the average feature relative to the feature values of the source sentence of this substitution, aggregated by this function; must be a name such that np.nan<sentence_relative> is an existing NumPy function (e.g. 'median' for np.nanmedian).

Returns:

float

Average feature, of all coded words or of synonyms of the substitution’s source word (depending on source_synonyms), relative to an aggregated source sentence value if sentence_relative specifies it.
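
Examples

A sketch assuming substitution is a Substitution instance loaded elsewhere (not constructed here):

>>> average_aoa = substitution.feature_average('aoa')
>>> relative_average_aoa = substitution.feature_average(
...     'aoa', source_synonyms=True, sentence_relative='median')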

features(name, sentence_relative=None)[source]

Compute feature name for source and destination words of this substitution, possibly sentence-relative.

Feature values are transformed as explained in _transformed_feature().

If sentence_relative is not None, it indicates a NumPy function used to aggregate word features in the source and destination sentences of this substitution; this method then returns the source/destination word feature values minus the corresponding aggregate value. For instance, if sentence_relative='median', this method returns the source word feature minus the median of the source sentence, and the destination word feature minus the median of the destination sentence (words valued at np.nan are ignored).

The method is memoized() since it is called so often.

Parameters:

name : str

Name of the feature for which to compute source and destination values.

sentence_relative : str, optional

If not None (the default is None), return features relative to the values of their corresponding sentence, aggregated by this function; must be a name such that np.nan<sentence_relative> is an existing NumPy function (e.g. 'median' for np.nanmedian).

Returns:

tuple of float

Feature values (possibly sentence-relative) of the source and destination words of this substitution.
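
Examples

A sketch assuming substitution is a Substitution instance loaded elsewhere (not constructed here):

>>> source_frequency, destination_frequency = substitution.features('frequency')
>>> rel_source, rel_destination = substitution.features('frequency',
...                                                     sentence_relative='median')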

source_destination_features(name, sentence_relative=None)[source]

Compute the feature values for all words in source and destination sentences of this substitution, possibly sentence-relative.

Feature values are transformed as explained in _transformed_feature().

If sentence_relative is not None, it indicates a NumPy function used to aggregate word features in the source and destination sentences of this substitution; this method then returns the source/destination feature values minus the corresponding aggregate value. For instance, if sentence_relative='median', this method returns the source sentence feature values minus the median of that same sentence, and the destination sentence feature values minus the median of that same sentence (words valued at np.nan are ignored).

The method is memoized() since it is called so often.

Parameters:

name : str

Name of the feature for which to compute source and destination values.

sentence_relative : str, optional

If not None (the default is None), return features relative to the values of their corresponding sentence, aggregated by this function; must be a name such that np.nan<sentence_relative> is an existing NumPy function (e.g. 'median' for np.nanmedian).

Returns:

source_features : array of float

Array of feature values (possibly sentence-relative) for each word in the source sentence of this substitution. Non-coded words appear as np.nan.

destination_features : array of float

Array of feature values (possibly sentence-relative) for each word in the destination sentence of this substitution. Non-coded words appear as np.nan.
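
Examples

A sketch assuming substitution is a Substitution instance loaded elsewhere; each returned array holds one value per word of the corresponding sentence, with np.nan for non-coded words:

>>> source_values, destination_values = substitution.source_destination_features(
...     'aoa', sentence_relative='median')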

brainscopypaste.features._get_aoa()[source]

Get the Age-of-Acquisition data as a dict.

The method is memoized() since it is called so often.

Returns:

dict

Association of words to their average age of acquisition. NA values in the originating data set are ignored.
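
Examples

A sketch assuming the Age-of-Acquisition data file is configured as expected by the package:

>>> from brainscopypaste.features import _get_aoa
>>> aoa = _get_aoa()
>>> value = aoa.get('time')   # average age of acquisition, or None if 'time' is not in the data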

brainscopypaste.features._get_clearpond()[source]

Get CLEARPOND neighbourhood density data as a dict.

The method is memoized() since it is called so often.

Returns:

dict

Dict with two keys: orthographic and phonological. orthographic contains a dict associating words to their orthographic neighbourhood density (CLEARPOND’s OTAN column). phonological contains a dict associating words to their phonological neighbourhood density (CLEARPOND’s PTAN column).
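
Examples

A sketch assuming the CLEARPOND data file is configured as expected by the package:

>>> from brainscopypaste.features import _get_clearpond
>>> clearpond = _get_clearpond()
>>> orthographic_nd = clearpond['orthographic'].get('dog')  # OTAN density, or None if absent
>>> phonological_nd = clearpond['phonological'].get('dog')  # PTAN density, or None if absent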

brainscopypaste.features._get_pronunciations()[source]

Get the CMU pronunciation data as a dict.

The method is memoized() since it is called so often.

Returns:

dict

Association of words to their list of possible pronunciations.
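
Examples

A sketch assuming the CMU pronunciation data is available to the package:

>>> from brainscopypaste.features import _get_pronunciations
>>> pronunciations = _get_pronunciations()
>>> variants = pronunciations.get('tomato')  # list of possible pronunciations, or None if absent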