similarities.termsim – Term similarity queries
This module provides classes that deal with term similarities.
gensim.similarities.termsim.SparseTermSimilarityMatrix(source, dictionary=None, tfidf=None, symmetric=True, positive_definite=False, nonzero_limit=100, dtype=<class 'numpy.float32'>)
Builds a sparse term similarity matrix using a term similarity index.
Notes
Building a DOK matrix and then converting it to a CSC matrix carries a significant memory overhead. Future work should switch to building arrays of rows, columns, and non-zero elements, and passing these arrays directly to the CSC matrix constructor without copying.
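The array-based construction suggested in the note can be sketched without any library. The function name build_csc_arrays and its input format (one {row_index: value} dict per column) are assumptions made for this illustration; they are not part of gensim. The three lists it returns follow the usual compressed sparse column layout (data, indices, indptr).

```python
def build_csc_arrays(columns, num_rows):
    """Build CSC arrays from per-column {row_index: value} mappings.

    `columns` holds one dict per matrix column (hypothetical input format).
    Returns (data, indices, indptr) for a num_rows x len(columns) matrix.
    """
    data, indices, indptr = [], [], [0]
    for column in columns:
        # Row indices are stored in increasing order within each column.
        for row, value in sorted(column.items()):
            if not 0 <= row < num_rows:
                raise ValueError("row index %d out of range" % row)
            if value != 0:
                data.append(value)
                indices.append(row)
        # indptr[j + 1] marks where column j ends in data / indices.
        indptr.append(len(data))
    return data, indices, indptr


# A 3 x 3 term similarity matrix with ones on the diagonal.
columns = [{0: 1.0, 1: 0.5}, {0: 0.5, 1: 1.0}, {2: 1.0}]
data, indices, indptr = build_csc_arrays(columns, num_rows=3)
```

With SciPy installed, arrays in this layout can be passed as scipy.sparse.csc_matrix((data, indices, indptr), shape=(3, 3)), so no intermediate DOK matrix is needed.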
Examples
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split() # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)] # calculate similarity of query to each doc from bow_corpus
Check out Tutorial Notebook for more examples.
Parameters
source (TermSimilarityIndex or scipy.sparse.spmatrix) – The source of the term similarity. Either a term similarity index that will be used for building the term similarity matrix, or an existing sparse term similarity matrix that will be encapsulated and stored in the matrix attribute.
dictionary (Dictionary or None, optional) – A dictionary that specifies a mapping between terms and the indices of rows and columns of the resulting term similarity matrix. The dictionary may only be None when source is a scipy.sparse.spmatrix.
tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.
symmetric (bool, optional) – Whether the symmetry of the term similarity matrix will be enforced. This parameter only has an effect when source is a scipy.sparse.spmatrix. Symmetry is required for positive definiteness, which is a necessary precondition if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.
positive_definite (bool, optional) – Whether the positive definiteness of the term similarity matrix will be enforced through strict column diagonal dominance. Positive definiteness is a necessary precondition if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.
nonzero_limit (int or None, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix. If None, then no limit will be imposed.
dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.
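The positive_definite parameter refers to strict column diagonal dominance: in every column, the absolute value of the diagonal element exceeds the sum of the absolute values of the other elements. A symmetric matrix with a positive diagonal that has this property is positive definite. The check below is a minimal sketch on a dense list-of-lists matrix; the function name is hypothetical, and it only tests the property rather than showing how gensim enforces it.

```python
def is_strictly_column_diagonally_dominant(matrix):
    """Return True if |a[j][j]| > sum(|a[i][j]| for i != j) in every column j."""
    size = len(matrix)
    for j in range(size):
        off_diagonal = sum(abs(matrix[i][j]) for i in range(size) if i != j)
        if abs(matrix[j][j]) <= off_diagonal:
            return False
    return True


# Off-diagonal column sums are 0.4, 0.7 and 0.3, all below the diagonal 1.0.
dominant = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.3],
    [0.0, 0.3, 1.0],
]

# The first column has off-diagonal sum 0.6 + 0.5 = 1.1, above the diagonal 1.0.
not_dominant = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]
```

Limiting the number of non-zero off-diagonal elements per column, as nonzero_limit does, makes it easier for a column to stay below this bound.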
matrix
The encapsulated sparse term similarity matrix.
scipy.sparse.csc_matrix
inner_product(X, Y, normalized=False)
Get the inner product(s) between real vectors / corpora X and Y.
Return the inner product(s) between real vectors / corpora X and Y expressed in a non-orthogonal normalized basis, where the dot product between the basis vectors is given by the sparse term similarity matrix.
Parameters
X (list of (int, float) or iterable of list of (int, float)) – A query vector / corpus in the sparse bag-of-words format.
Y (list of (int, float) or iterable of list of (int, float)) – A document vector / corpus in the sparse bag-of-words format.
normalized (bool, optional) – Whether the inner product should be L2-normalized. The normalized inner product corresponds to the Soft Cosine Measure (SCM). SCM is a number in the interval [-1.0, 1.0], where higher is more similar.
Returns
The inner product(s) between X and Y.
Return type
self.matrix.dtype, scipy.sparse.csr_matrix, or numpy.matrix
References
The soft cosine measure was perhaps first described by [sidorovetal14].
[sidorovetal14] Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.
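For two single bag-of-words vectors, the inner product in the basis described above is the sum of x_i * s_ij * y_j over all term pairs, and the normalized variant divides it by the two corresponding norms. The following is a minimal pure-Python sketch of that arithmetic. The function names and the dict of (i, j) pairs standing in for the sparse matrix are assumptions made for illustration, not the gensim implementation.

```python
from math import sqrt


def sparse_inner_product(x, y, similarity):
    """Compute sum of x_i * s_ij * y_j for sparse bag-of-words vectors.

    x, y: lists of (term_id, weight) pairs.
    similarity: dict mapping (i, j) to s_ij; missing pairs count as 0
    (a hypothetical stand-in for the sparse term similarity matrix).
    """
    return sum(wx * similarity.get((i, j), 0.0) * wy for i, wx in x for j, wy in y)


def soft_cosine(x, y, similarity):
    """Normalized inner product; assumes x^T S x and y^T S y are non-negative."""
    norm = sqrt(sparse_inner_product(x, x, similarity)) * sqrt(
        sparse_inner_product(y, y, similarity)
    )
    return sparse_inner_product(x, y, similarity) / norm if norm else 0.0


# Two distinct terms with similarity 0.5 between them.
similarity = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.5, (1, 0): 0.5}
x = [(0, 1.0)]  # document containing only term 0
y = [(1, 1.0)]  # document containing only term 1
score = soft_cosine(x, y, similarity)
```

Under the ordinary cosine measure these two documents would have similarity 0 because they share no terms; the off-diagonal entries of the similarity matrix are what make the score non-zero here.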
load(fname, mmap=None)
Load an object previously saved using save() from a file.
Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Returns
Object loaded from fname.
Return type
object
Raises
AttributeError – When called on an object instance instead of class (this is a class method).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Arrays smaller than this limit (in bytes) are not stored separately.
ignore (frozenset of str, optional) – Attributes that shouldn't be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
gensim.similarities.termsim.TermSimilarityIndex
Retrieves most similar terms for a given term.
See also
SparseTermSimilarityMatrix
Build a term similarity matrix and compute the Soft Cosine Measure.
load(fname, mmap=None)
Load an object previously saved using save() from a file.
Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Returns
Object loaded from fname.
Return type
object
Raises
AttributeError – When called on an object instance instead of class (this is a class method).
most_similar(term, topn=10)
Get most similar terms for a given term.
Return most similar terms for a given term along with the similarities.
Parameters
term (str) – The term for which we are retrieving topn most similar terms.
topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.
Returns
Most similar terms along with their similarities to term. Only terms distinct from term must be returned.
Return type
iterable of (str, float)
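The contract described for most_similar (at most topn pairs of (str, float), never including the query term itself) can be illustrated with a standalone class. This is a sketch under assumptions: LookupTermSimilarityIndex and its nested-dict data layout are hypothetical, and the class deliberately does not derive from the gensim base class so that it runs without gensim installed.

```python
class LookupTermSimilarityIndex:
    """Standalone stand-in that mimics the most_similar(term, topn) contract."""

    def __init__(self, similarities):
        # similarities: {term: {other_term: similarity}} (hypothetical layout)
        self.similarities = similarities

    def most_similar(self, term, topn=10):
        neighbours = self.similarities.get(term, {})
        # Rank candidates from most to least similar.
        ranked = sorted(neighbours.items(), key=lambda item: item[1], reverse=True)
        # Drop the query term itself and keep at most topn pairs.
        return [(other, sim) for other, sim in ranked if other != term][:topn]


index = LookupTermSimilarityIndex(
    {"graph": {"graph": 1.0, "trees": 0.7, "minors": 0.6, "computer": 0.1}}
)
result = index.most_similar("graph", topn=2)
```

Here the entry for "graph" itself is filtered out even though the lookup table contains it, and an unknown term yields an empty result.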
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Arrays smaller than this limit (in bytes) are not stored separately.
ignore (frozenset of str, optional) – Attributes that shouldn't be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
gensim.similarities.termsim.UniformTermSimilarityIndex(dictionary, term_similarity=0.5)
Retrieves most similar terms for a given term under the hypothesis that the similarities between distinct terms are uniform.
Parameters
dictionary (Dictionary) – A dictionary that specifies the considered terms.
term_similarity (float, optional) – The uniform similarity between distinct terms.
See also
SparseTermSimilarityMatrix
Build a term similarity matrix and compute the Soft Cosine Measure.
Notes
This class is mainly intended for testing SparseTermSimilarityMatrix and other classes that depend on the TermSimilarityIndex.
load(fname, mmap=None)
Load an object previously saved using save() from a file.
Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Returns
Object loaded from fname.
Return type
object
Raises
AttributeError – When called on an object instance instead of class (this is a class method).
most_similar(t1, topn=10)
Get most similar terms for a given term.
Return most similar terms for a given term along with the similarities.
Parameters
t1 (str) – The term for which we are retrieving topn most similar terms.
topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.
Returns
Most similar terms along with their similarities to t1. Only terms distinct from t1 must be returned.
Return type
iterable of (str, float)
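Under the uniform hypothesis, every term other than the query receives the same similarity. The generator below is a rough sketch of that behaviour. It uses a plain {id: term} dict in place of a gensim Dictionary, and both the function name and the choice to iterate in order of term identifier are assumptions made for this illustration.

```python
from itertools import islice


def uniform_most_similar(id2term, term, term_similarity=0.5, topn=10):
    """Yield up to topn (other_term, term_similarity) pairs, skipping `term`."""
    # Visit terms in order of their identifiers (an assumed ordering).
    others = (other for _, other in sorted(id2term.items()) if other != term)
    for other in islice(others, topn):
        yield other, term_similarity


id2term = {0: "computer", 1: "graph", 2: "trees"}
pairs = list(uniform_most_similar(id2term, "graph"))
```

Because the similarity is constant, such an index carries no linguistic information, which fits its stated purpose of testing classes that depend on a term similarity index.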
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Arrays smaller than this limit (in bytes) are not stored separately.
ignore (frozenset of str, optional) – Attributes that shouldn't be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.