similarities.termsim – Term similarity queries

This module provides classes that deal with term similarities.

class gensim.similarities.termsim.SparseTermSimilarityMatrix(source, dictionary=None, tfidf=None, symmetric=True, dominant=False, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Builds a sparse term similarity matrix using a term similarity index.

Examples

>>> from gensim.test.utils import common_texts as corpus, datapath
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
>>> from gensim.similarities.index import AnnoyIndexer
>>>
>>> model_corpus_file = datapath('lee_background.cor')
>>> model = Word2Vec(corpus_file=model_corpus_file, vector_size=20, min_count=1)  # train word-vectors
>>>
>>> dictionary = Dictionary(corpus)
>>> tfidf = TfidfModel(dictionary=dictionary)
>>> words = [word for word, count in dictionary.most_common()]
>>> word_vectors = model.wv.vectors_for_all(words, allow_inference=False)  # produce vectors for words in corpus
>>>
>>> indexer = AnnoyIndexer(word_vectors, num_trees=2)  # use Annoy for faster word similarity lookups
>>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors, kwargs={'indexer': indexer})
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)  # compute word similarities
>>>
>>> tfidf_corpus = tfidf[[dictionary.doc2bow(document) for document in common_texts]]
>>> docsim_index = SoftCosineSimilarity(tfidf_corpus, similarity_matrix, num_best=10)  # index tfidf_corpus
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)]  # find the ten closest documents from tfidf_corpus

Check out the Gallery for more examples.

Parameters
  • source (TermSimilarityIndex or scipy.sparse.spmatrix) – The source of the term similarity. Either a term similarity index that will be used for building the term similarity matrix, or an existing sparse term similarity matrix that will be encapsulated and stored in the matrix attribute. When a matrix is specified as the source, any other parameters will be ignored.

  • dictionary (Dictionary or None, optional) – A dictionary that specifies a mapping between terms and the indices of rows and columns of the resulting term similarity matrix. The dictionary may only be None when source is a scipy.sparse.spmatrix.

  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be build in a decreasing order of importance of terms, or in the order of term identifiers if None.

  • symmetric (bool, optional) – Whether the symmetry of the term similarity matrix will be enforced. Symmetry is a necessary precondition for positive definiteness, which is necessary if you later wish to derive a unique change-of-basis matrix from the term similarity matrix using Cholesky factorization. Setting symmetric to False will significantly reduce memory usage during matrix construction.

  • dominant (bool, optional) – Whether the strict column diagonal dominance of the term similarity matrix will be enforced. Strict diagonal dominance and symmetry are sufficient preconditions for positive definiteness, which is necessary if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.

  • nonzero_limit (int or None, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix. If None, then no limit will be imposed.

  • dtype (numpy.dtype, optional) – The data type of the sparse term similarity matrix.

matrix

The encapsulated sparse term similarity matrix.

Type

scipy.sparse.csc_matrix

Raises

ValueError – If dictionary is empty.

See also

SoftCosineSimilarity

A document similarity index using the soft cosine similarity over the term similarity matrix.

LevenshteinSimilarityIndex

A term similarity index that computes Levenshtein similarities between terms.

WordEmbeddingSimilarityIndex

A term similarity index that computes cosine similarities between word embeddings.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

inner_product(X, Y, normalized=(False, False))

Get the inner product(s) between real vectors / corpora X and Y.

Return the inner product(s) between real vectors / corpora vec1 and vec2 expressed in a non-orthogonal normalized basis, where the dot product between the basis vectors is given by the sparse term similarity matrix.

Parameters
  • vec1 (list of (int, float) or iterable of list of (int, float)) – A query vector / corpus in the sparse bag-of-words format.

  • vec2 (list of (int, float) or iterable of list of (int, float)) – A document vector / corpus in the sparse bag-of-words format.

  • normalized (tuple of {True, False, 'maintain'}, optional) – First/second value specifies whether the query/document vectors in the inner product will be L2-normalized (True; corresponds to the soft cosine measure), maintain their L2-norm during change of basis (‘maintain’; corresponds to query expansion with partial membership), or kept as-is (False; corresponds to query expansion; default).

Returns

The inner product(s) between X and Y.

Return type

self.matrix.dtype, scipy.sparse.csr_matrix, or numpy.matrix

References

The soft cosine measure was perhaps first described by [sidorovetal14]. Further notes on the efficient implementation of the soft cosine measure are described by [novotny18].

sidorovetal14

Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.

novotny18

Vít Novotný, “Implementation Notes for the Soft Cosine Measure”, 2018, http://dx.doi.org/10.1145/3269206.3269317.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevent memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.TermSimilarityIndex

Base class = common interface for retrieving the most similar terms for a given term.

See also

SparseTermSimilarityMatrix

A sparse term similarity matrix built using a term similarity index.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(term, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • term (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevent memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.UniformTermSimilarityIndex(dictionary, term_similarity=0.5)

Retrieves most similar terms for a given term under the hypothesis that the similarities between distinct terms are uniform.

Parameters
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.

  • term_similarity (float, optional) – The uniform similarity between distinct terms.

See also

SparseTermSimilarityMatrix

A sparse term similarity matrix built using a term similarity index.

Notes

This class is mainly intended for testing SparseTermSimilarityMatrix and other classes that depend on the TermSimilarityIndex.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • term (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevent memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.WordEmbeddingSimilarityIndex(keyedvectors, threshold=0.0, exponent=2.0, kwargs=None)

Computes cosine similarities between word embeddings and retrieves most similar terms for a given term.

Notes

By fitting the word embeddings to a vocabulary that you will be using, you can eliminate all out-of-vocabulary (OOV) words that you would otherwise receive from the most_similar method. In subword models such as fastText, this procedure will also infer word-vectors for words from your vocabulary that previously had no word-vector.

>>> from gensim.test.utils import common_texts, datapath
>>> from gensim.corpora import Dictionary
>>> from gensim.models import FastText
>>> from gensim.models.word2vec import LineSentence
>>> from gensim.similarities import WordEmbeddingSimilarityIndex
>>>
>>> model = FastText(common_texts, vector_size=20, min_count=1)  # train word-vectors on a corpus
>>> different_corpus = LineSentence(datapath('lee_background.cor'))
>>> dictionary = Dictionary(different_corpus)  # construct a vocabulary on a different corpus
>>> words = [word for word, count in dictionary.most_common()]
>>> word_vectors = model.wv.vectors_for_all(words)  # remove OOV word-vectors and infer word-vectors for new words
>>> assert len(dictionary) == len(word_vectors)  # all words from our vocabulary received their word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors)
Parameters
  • keyedvectors (KeyedVectors) – The word embeddings.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • kwargs (dict or None) – A dict with keyword arguments that will be passed to the most_similar() method when retrieving the word embeddings closest to a given word embedding.

See also

LevenshteinSimilarityIndex

Retrieve most similar terms for a given term using the Levenshtein distance.

SparseTermSimilarityMatrix

Build a term similarity matrix and compute the Soft Cosine Measure.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • term (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevent memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.