similarities.termsim – Term similarity queries

This module provides classes that deal with term similarities.

class gensim.similarities.termsim.SparseTermSimilarityMatrix(source, dictionary=None, tfidf=None, symmetric=True, positive_definite=False, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Builds a sparse term similarity matrix using a term similarity index.

Notes

Building a DOK matrix and then converting it to a CSC matrix carries a significant memory overhead. Future work should switch to building arrays of rows, columns, and non-zero elements, and pass these arrays directly to the CSC matrix constructor without copying.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)]  # calculate similarity of query to each doc from bow_corpus

Check out Tutorial Notebook for more examples.

Parameters

source (TermSimilarityIndex or scipy.sparse.spmatrix) – The source of the term similarity. Either a term similarity index that will be used for building the term similarity matrix, or an existing sparse term similarity matrix that will be encapsulated and stored in the matrix attribute.
dictionary (Dictionary or None, optional) – A dictionary that specifies a mapping between terms and the indices of rows and columns of the resulting term similarity matrix. The dictionary may only be None when source is a scipy.sparse.spmatrix.
tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.
symmetric (bool, optional) – Whether the symmetry of the term similarity matrix will be enforced. This parameter only has an effect when the matrix is built from a TermSimilarityIndex. Symmetry is a necessary precondition for positive definiteness, which is in turn required if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.
positive_definite (bool, optional) – Whether the positive definiteness of the term similarity matrix will be enforced through strict column diagonal dominance. Positive definiteness is a necessary precondition if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.
nonzero_limit (int or None, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix. If None, then no limit will be imposed.
dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.
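
The link between strict column diagonal dominance and Cholesky factorization mentioned for positive_definite can be illustrated with a small pure-Python sketch. It does not use gensim, and the helper names is_strictly_column_diagonally_dominant and cholesky are hypothetical. It relies on the standard result that a symmetric, strictly diagonally dominant matrix with a positive diagonal is positive definite.

```python
import math


def is_strictly_column_diagonally_dominant(matrix):
    """Return True if |a_jj| > sum of |a_ij| over i != j for every column j."""
    size = len(matrix)
    for j in range(size):
        off_diagonal = sum(abs(matrix[i][j]) for i in range(size) if i != j)
        if abs(matrix[j][j]) <= off_diagonal:
            return False
    return True


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix, or raise ValueError."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


# Toy symmetric term similarity matrices (illustrative values only).
dominant = [[1.0, 0.4, 0.3],
            [0.4, 1.0, 0.2],
            [0.3, 0.2, 1.0]]
not_dominant = [[1.0, 0.9, 0.9],
                [0.9, 1.0, 0.0],
                [0.9, 0.0, 1.0]]

print(is_strictly_column_diagonally_dominant(dominant))
print(is_strictly_column_diagonally_dominant(not_dominant))
change_of_basis = cholesky(dominant)  # succeeds for the dominant matrix
```

The second matrix is symmetric but has a negative eigenvalue, so cholesky(not_dominant) raises ValueError in this sketch. Dominance is sufficient for positive definiteness, not necessary, so enforcing it may discard more similarities than strictly required.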

matrix

The encapsulated sparse term similarity matrix.

Type: scipy.sparse.csc_matrix

inner_product(X, Y, normalized=False)

Get the inner product(s) between real vectors / corpora X and Y.

Return the inner product(s) between real vectors / corpora X and Y expressed in a non-orthogonal normalized basis, where the dot product between the basis vectors is given by the sparse term similarity matrix.

Parameters

X (list of (int, float) or iterable of list of (int, float)) – A query vector / corpus in the sparse bag-of-words format.
Y (list of (int, float) or iterable of list of (int, float)) – A document vector / corpus in the sparse bag-of-words format.
normalized (bool, optional) – Whether the inner product should be L2-normalized. The normalized inner product corresponds to the Soft Cosine Measure (SCM). SCM is a number in the interval [-1.0, 1.0], where higher is more similar.

Returns

The inner product(s) between X and Y.

Return type

self.matrix.dtype, scipy.sparse.csr_matrix, or numpy.matrix

References

The soft cosine measure was perhaps first described by [sidorovetal14].

sidorovetal14: Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.
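
As a rough illustration of what inner_product computes, the following pure-Python sketch evaluates x^T S y for two sparse bag-of-words vectors and then L2-normalizes the result to obtain the Soft Cosine Measure. It is not the gensim implementation: the dict-based similarity table, the symmetric lookup, the default diagonal of 1.0, and the function names are all assumptions made for the example.

```python
import math


def inner_product(x, y, similarity):
    """Compute x^T S y for sparse bag-of-words vectors.

    x, y: lists of (term_id, weight) pairs.
    similarity: dict mapping (term_id, term_id) to a similarity. Missing
    off-diagonal pairs count as 0.0 and the diagonal defaults to 1.0
    (both are assumptions of this sketch).
    """
    total = 0.0
    for i, x_weight in x:
        for j, y_weight in y:
            if i == j:
                s = similarity.get((i, j), 1.0)
            else:
                # Look the pair up in either order, i.e. treat S as symmetric.
                s = similarity.get((i, j), similarity.get((j, i), 0.0))
            total += x_weight * s * y_weight
    return total


def soft_cosine(x, y, similarity):
    """L2-normalized inner product, i.e. the Soft Cosine Measure."""
    norm_x = math.sqrt(inner_product(x, x, similarity))
    norm_y = math.sqrt(inner_product(y, y, similarity))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return inner_product(x, y, similarity) / (norm_x * norm_y)


similarity = {(0, 1): 0.5}   # terms 0 and 1 are half-similar
query = [(0, 1.0)]           # contains only term 0
document = [(1, 1.0)]        # contains only term 1

plain = inner_product(query, document, similarity)
scm = soft_cosine(query, document, similarity)
print(plain, scm)
```

With an empty similarity table the two vectors share no terms and the measure falls back to the ordinary cosine similarity of 0.0. The off-diagonal entry is what lets documents with related but different terms score above zero.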

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save(): Save object to file.

Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters

fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes (the default, 10485760, is 10 MiB).
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.

See also

load(): Load object from file.

class gensim.similarities.termsim.TermSimilarityIndex

Retrieves most similar terms for a given term.

See also

SparseTermSimilarityMatrix: Build a term similarity matrix and compute the Soft Cosine Measure.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save(): Save object to file.

Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(term, topn=10)

Get most similar terms for a given term.

Return most similar terms for a given term along with the similarities.

Parameters

term (str) – The term for which the topn most similar terms are retrieved.
topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)
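
The most_similar contract described above can be sketched with a plain lookup table. The class below is a hypothetical stand-in that does not subclass anything from gensim; the name TableTermSimilarityIndex, the table layout, and the choice to return results sorted by descending similarity are assumptions made for the example.

```python
class TableTermSimilarityIndex:
    """Hypothetical index that follows the most_similar(term, topn) contract."""

    def __init__(self, table):
        # table maps a term to a dict of {other_term: similarity}.
        self.table = table

    def most_similar(self, term, topn=10):
        neighbours = self.table.get(term, {})
        # Rank candidates from most to least similar.
        ranked = sorted(neighbours.items(), key=lambda pair: pair[1], reverse=True)
        # Only terms distinct from the queried term are returned.
        distinct = [(other, sim) for other, sim in ranked if other != term]
        return distinct[:topn]


table = {"graph": {"graph": 1.0, "minors": 0.6, "trees": 0.8}}
index = TableTermSimilarityIndex(table)
print(index.most_similar("graph", topn=1))  # [('trees', 0.8)]
```

A term that is missing from the table simply yields an empty list in this sketch, which keeps the return type an iterable of (str, float) pairs in every case.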

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters

fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes (the default, 10485760, is 10 MiB).
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.

See also

load(): Load object from file.

class gensim.similarities.termsim.UniformTermSimilarityIndex(dictionary, term_similarity=0.5)

Retrieves most similar terms for a given term under the hypothesis that the similarities between distinct terms are uniform.

Parameters

dictionary (Dictionary) – A dictionary that specifies the considered terms.
term_similarity (float, optional) – The uniform similarity between distinct terms.

See also

SparseTermSimilarityMatrix: Build a term similarity matrix and compute the Soft Cosine Measure.

Notes

This class is mainly intended for testing SparseTermSimilarityMatrix and other classes that depend on the TermSimilarityIndex.
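
The uniform-similarity hypothesis can be pictured with a few lines of plain Python. This is only a sketch of the idea, not the class's actual code: the function uniform_most_similar is hypothetical, a list of terms stands in for the dictionary, and which terms are kept when more than topn are available (here, the first ones in list order) is an assumption.

```python
from itertools import islice


def uniform_most_similar(terms, term, term_similarity=0.5, topn=10):
    """Pair every other known term with the same fixed similarity."""
    candidates = ((other, term_similarity) for other in terms if other != term)
    # Keep at most topn candidates, in the order the terms are listed.
    return list(islice(candidates, topn))


terms = ["human", "interface", "computer"]
print(uniform_most_similar(terms, "interface", topn=2))
# [('human', 0.5), ('computer', 0.5)]
```

Because every pair of distinct terms receives the same score, the output is easy to predict by hand, which is what makes such an index convenient for tests.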

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save(): Save object to file.

Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return most similar terms for a given term along with the similarities.

Parameters

t1 (str) – The term for which the topn most similar terms are retrieved.
topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.

Returns

Most similar terms along with their similarities to t1. Only terms distinct from t1 must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters

fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes (the default, 10485760, is 10 MiB).
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.

See also

load(): Load object from file.
