similarities.docsim – Document similarity queries¶
Compute similarities across a collection of documents in the Vector Space Model.
The main class is Similarity, which builds an index for a given set of documents.
Once the index is built, you can perform efficient queries like “How similar is this query document to each document in the index?”. The result is a vector of numbers as large as the initial set of documents, that is, one float for each indexed document. Alternatively, you can request only the top-N most similar index documents to the query.
How It Works¶
The Similarity class splits the index into several smaller sub-indexes (“shards”), which are disk-based. If your entire index fits in memory (~one million documents per 1GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler but do not scale as well: they keep the entire index in RAM, with no sharding. They also do not support adding new documents to the index dynamically.
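For instance, a minimal in-memory sketch using the toy corpus shipped with gensim.test.utils (the query here is an arbitrary bag-of-words vector):
>>> from gensim.similarities import MatrixSimilarity
>>> from gensim.test.utils import common_corpus, common_dictionary
>>>
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))  # whole index held in RAM
>>> similarities = index[[(1, 2), (6, 1)]]  # one similarity float per indexed document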
Once a Similarity index has been initialized, you can query for document similarity simply by:
>>> from gensim.similarities import Similarity
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> query = [(1, 2), (6, 1), (7, 2)]
>>>
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
>>> similarities = index[query] # get similarities between the query and all index documents
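To get only the top-N matches mentioned above, pass num_best when building the index; queries then return (document_index, similarity) 2-tuples instead of a full vector. A minimal sketch, continuing the example above:
>>> index_top = Similarity(get_tmpfile("index_top"), common_corpus, num_features=len(common_dictionary), num_best=3)
>>> top_hits = index_top[query]  # at most 3 (document_index, similarity) 2-tuples, most similar first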
If you have more query documents, you can submit them all at once, in a batch:
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> batch_of_documents = common_corpus[:] # only as example
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
>>>
>>> # the batch is simply an iterable of documents, aka gensim corpus:
>>> for similarities in index[batch_of_documents]:
... pass
The benefit of this batch (aka “chunked”) querying is much better performance.
To see the speed-up on your machine, run python -m gensim.test.simspeed.
There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries = the indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
>>>
>>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
... pass
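If the resulting all-vs-all matrix fits in RAM, you can also collect it into a single array. A minimal sketch continuing the example above (it assumes num_best is not set, so each iteration yields one full similarity vector):
>>> import numpy as np
>>>
>>> pairwise = np.vstack(list(index))  # shape: (number of indexed documents, number of indexed documents)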
- class gensim.similarities.docsim.MatrixSimilarity(corpus, num_best=None, dtype=<class 'numpy.float32'>, num_features=None, chunksize=256, corpus_len=None)¶
Compute cosine similarity against a corpus of documents by storing the index matrix in memory.
Unless the entire matrix fits into main memory, use Similarity instead.
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> query = [(1, 2), (5, 4)]
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> sims = index[query]
- Parameters
corpus (iterable of list of (int, number)) – Corpus in streamed Gensim bag-of-words format.
num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index. A short sketch follows this parameter list.
num_features (int) – Size of the dictionary (number of features).
corpus_len (int, optional) – Number of documents in corpus. If not specified, will scan the corpus to determine the matrix size.
chunksize (int, optional) – Size of query chunks. Used internally when the query is an entire corpus.
dtype (numpy.dtype, optional) – Datatype to store the internal matrix in.
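With num_best set, queries against the index return only the top matches as (document_index, similarity) 2-tuples instead of a full vector; a minimal sketch along the lines of the example above:
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary), num_best=2)
>>> top_hits = index[[(1, 2), (5, 4)]]  # at most 2 (document_index, similarity) 2-tuples, most similar first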
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
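A minimal sketch of recording a custom event; the event name and the note key below are arbitrary, JSON-serializable examples:
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> index.add_lifecycle_event("created", note="toy index for a demo")  # logged and appended to index.lifecycle_events
>>> index.lifecycle_events[-1]["event"]  # 'created'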
- get_similarities(query)¶
Get similarity between query and this index.
Warning
Do not use this function directly; use the self[query] syntax (__getitem__) instead.
- Parameters
query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
- Returns
Similarity matrix.
- Return type
numpy.ndarray
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
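A minimal save/load round trip for an in-memory index, reusing the toy corpus from the examples above:
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> fname = get_tmpfile("matrix_index")
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> index.save(fname)
>>> loaded = MatrixSimilarity.load(fname)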
- class gensim.similarities.docsim.Shard(fname, index)¶
A proxy that represents a single shard instance within a Similarity index.
Basically just wraps MatrixSimilarity, SparseMatrixSimilarity, etc., so that it mmaps from disk on request (query).
- Parameters
fname (str) – Path to top-level directory (file) to traverse for corpus documents.
index (
SimilarityABC
) – Index object.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- fullname()¶
Get full path to shard file.
- Returns
Path to shard instance.
- Return type
str
- get_document_id(pos)¶
Get index vector at position pos.
- Parameters
pos (int) – Vector position.
- Returns
Index vector. Type depends on underlying index.
- Return type
{scipy.sparse.csr_matrix, numpy.ndarray}
Notes
The vector is of the same type as the underlying index (i.e., dense for MatrixSimilarity and scipy.sparse for SparseMatrixSimilarity).
- get_index()¶
Load & get index.
- Returns
Index instance.
- Return type
SimilarityABC
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- class gensim.similarities.docsim.Similarity(output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')¶
Compute cosine similarity of a dynamic query against a corpus of documents (‘the index’).
The index supports adding new documents dynamically.
Notes
Scalability is achieved by sharding the index into smaller pieces, each of which fits into core memory. The shards themselves are simply stored as files on disk and mmap’ed back as needed.
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.mm'))
>>> index_temp = get_tmpfile("index")
>>> index = Similarity(index_temp, corpus, num_features=400)  # create index
>>>
>>> query = next(iter(corpus))
>>> result = index[query]  # search similar to `query` in index
>>>
>>> for sims in index[corpus]:  # if you have more query documents, you can submit them all at once, in a batch
... pass
>>>
>>> # There is also a special syntax for when you need similarity of documents in the index
>>> # to the index itself (i.e. queries=indexed documents themselves). This special syntax
>>> # uses the faster, batch queries internally and **is ideal for all-vs-all pairwise similarities**:
>>> for similarities in index:  # yield similarities of the 1st indexed document, then 2nd...
... pass
See also
MatrixSimilarity
Index similarity (dense with cosine distance).
SparseMatrixSimilarity
Index similarity (sparse with cosine distance).
WmdSimilarity
Index similarity (with word-mover distance).
- Parameters
output_prefix (str) – Prefix for shard filename. If None, a random filename in temp will be used.
corpus (iterable of list of (int, number)) – Corpus in streamed Gensim bag-of-words format.
num_features (int) – Size of the dictionary (number of features).
num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index.
chunksize (int, optional) – Size of query chunks. Used internally when the query is an entire corpus.
shardsize (int, optional) – Maximum shard size, in documents. Choose a value so that a shardsize x chunksize matrix of floats fits comfortably into your RAM.
norm ({'l1', 'l2'}, optional) – Normalization to use.
Notes
Documents are split (internally, transparently) into shards of shardsize documents each, and each shard converted to a matrix, for faster BLAS calls. Each shard is stored to disk under output_prefix.shard_number.
If you don’t specify an output prefix, a random filename in temp will be used.
If your entire index fits in memory (~1 million documents per 1GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler but do not scale as well (they keep the entire index in RAM, no sharding). They also do not support adding new documents dynamically.
- add_documents(corpus)¶
Extend the index with new documents.
- Parameters
corpus (iterable of list of (int, number)) – Corpus in BoW format.
Notes
Internally, documents are buffered and then spilled to disk when there are self.shardsize of them (or when a query is issued).
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.mm'))
>>> index_temp = get_tmpfile("index")
>>> index = Similarity(index_temp, corpus, num_features=400)  # create index
>>>
>>> one_more_corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index.add_documents(one_more_corpus)  # add more documents to the index
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- check_moved()¶
Update shard locations, for the case where the server prefix location has changed on the filesystem.
- close_shard()¶
Force the latest shard to close (be converted to a matrix and stored to disk). Do nothing if no new documents were added since the last call.
Notes
The shard is closed even if it is not full yet (its size is smaller than self.shardsize). If documents are added later via add_documents(), this incomplete shard will be loaded again and completed.
- destroy()¶
Delete all files under self.output_prefix. The index is not usable anymore after calling this method.
- get_similarities(doc)¶
Get similarities of the given document or corpus against this index.
- Parameters
doc ({list of (int, number), iterable of list of (int, number)}) – Document in the sparse Gensim bag-of-words format, or a streamed corpus of such documents.
- iter_chunks(chunksize=None)¶
Iteratively yield the index as chunks of document vectors, each of size <= chunksize.
- Parameters
chunksize (int, optional) – Size of chunk; if None, self.chunksize will be used.
- Yields
numpy.ndarray or scipy.sparse.csr_matrix – Chunks of the index as 2D arrays. The arrays are either dense or sparse, depending on whether the shard was storing dense or sparse vectors.
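A minimal sketch of walking the index in chunks, reusing the toy corpus from gensim.test.utils:
>>> from gensim.similarities import Similarity
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index = Similarity(get_tmpfile("index"), common_corpus, num_features=len(common_dictionary))
>>> for chunk in index.iter_chunks(chunksize=5):
...     print(chunk.shape)  # each chunk holds at most 5 document vectors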
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- query_shards(query)¶
Apply shard[query] to each shard in self.shards. Used internally.
- Parameters
query ({iterable of list of (int, number), list of (int, number)}) – Document in BoW format or corpus of documents.
- Returns
Query results.
- Return type
(None, list of individual shard query results)
- reopen_shard()¶
Reopen an incomplete shard.
- save(fname=None, *args, **kwargs)¶
Save the index object via pickling under fname. See also load().
- Parameters
fname (str, optional) – Path under which to save the index; if not provided, it will be saved to self.output_prefix.
*args (object) – Arguments, see gensim.utils.SaveLoad.save().
**kwargs (object) – Keyword arguments, see gensim.utils.SaveLoad.save().
Notes
Will call close_shard() internally to spill any unfinished shards to disk first.
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> temp_fname = get_tmpfile("index")
>>> output_fname = get_tmpfile("saved_index")
>>>
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity(temp_fname, corpus, num_features=400)
>>>
>>> index.save(output_fname)
>>> loaded_index = index.load(output_fname)
- shardid2filename(shardid)¶
Get shard file by shardid.
- Parameters
shardid (int) – Shard index.
- Returns
Path to shard file.
- Return type
str
- similarity_by_id(docpos)¶
Get similarity of a document specified by its index position docpos.
- Parameters
docpos (int) – Document position in the index.
- Returns
Similarities of the given document against this index.
- Return type
numpy.ndarray or scipy.sparse.csr_matrix
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity('temp', corpus, num_features=400)
>>> similarities = index.similarity_by_id(1)
- vector_by_id(docpos)¶
Get the indexed vector corresponding to the document at position docpos.
- Parameters
docpos (int) – Document position.
- Returns
Indexed vector.
- Return type
scipy.sparse.csr_matrix
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim.similarities import Similarity
>>>
>>> # Create index:
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity('temp', corpus, num_features=400)
>>> vector = index.vector_by_id(1)
- class gensim.similarities.docsim.SoftCosineSimilarity(corpus, similarity_matrix, num_best=None, chunksize=256, normalized=None, normalize_queries=True, normalize_documents=True)¶
Compute soft cosine similarity against a corpus of documents by storing the index matrix in memory.
Examples
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>> from gensim.similarities import WordEmbeddingSimilarityIndex
>>>
>>> model = Word2Vec(common_texts, vector_size=20, min_count=1)  # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)]  # calculate similarity of query to each doc from bow_corpus
Check out the Gallery for more examples.
- Parameters
corpus (iterable of list of (int, float)) – A list of documents in the BoW format.
similarity_matrix (gensim.similarities.SparseTermSimilarityMatrix) – A term similarity matrix.
num_best (int, optional) – The number of results to retrieve for a query; if None, return similarities with all elements from corpus.
chunksize (int, optional) – Size of one corpus chunk.
normalized (tuple of {True, False, 'maintain', None}, optional) – A deprecated alias for (normalize_queries, normalize_documents). If None, use normalize_queries and normalize_documents. Default is None.
normalize_queries ({True, False, 'maintain'}, optional) – Whether the query vectors in the inner product will be L2-normalized (True; corresponds to the soft cosine similarity measure; default), maintain their L2-norm during the change of basis (‘maintain’; corresponds to query expansion with partial membership), or be kept as-is (False; corresponds to query expansion).
normalize_documents ({True, False, 'maintain'}, optional) – Whether the document vectors in the inner product will be L2-normalized (True; corresponds to the soft cosine similarity measure; default), maintain their L2-norm during the change of basis (‘maintain’; corresponds to query expansion with partial membership), or be kept as-is (False; corresponds to query expansion).
See also
SparseTermSimilarityMatrix
A sparse term similarity matrix built using a term similarity index.
LevenshteinSimilarityIndex
A term similarity index that computes Levenshtein similarities between terms.
WordEmbeddingSimilarityIndex
A term similarity index that computes cosine similarities between word embeddings.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- get_similarities(query)¶
Get similarity between query and this index.
Warning
Do not use this function directly; use the self[query] syntax instead.
- Parameters
query ({list of (int, number), iterable of list of (int, number)}) – Document or collection of documents.
- Returns
Similarity matrix.
- Return type
numpy.ndarray
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- class gensim.similarities.docsim.SparseMatrixSimilarity(corpus, num_features=None, num_terms=None, num_docs=None, num_nnz=None, num_best=None, chunksize=500, dtype=<class 'numpy.float32'>, maintain_sparsity=False, normalize_queries=True, normalize_documents=True)¶
Compute cosine similarity against a corpus of documents by storing the index matrix in memory.
Examples
Here is how you would index and query a corpus of documents in the bag-of-words format using the cosine similarity:
>>> from gensim.corpora import Dictionary
>>> from gensim.similarities import SparseMatrixSimilarity
>>> from gensim.test.utils import common_texts as corpus
>>>
>>> dictionary = Dictionary(corpus)  # fit dictionary
>>> bow_corpus = [dictionary.doc2bow(line) for line in corpus]  # convert corpus to BoW format
>>> index = SparseMatrixSimilarity(bow_corpus, num_docs=len(corpus), num_terms=len(dictionary))
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> bow_query = dictionary.doc2bow(query)
>>> similarities = index[bow_query]  # calculate similarity of query to each doc from bow_corpus
Here is how you would index and query a corpus of documents using the Okapi BM25 scoring function:
>>> from gensim.corpora import Dictionary
>>> from gensim.models import TfidfModel, OkapiBM25Model
>>> from gensim.similarities import SparseMatrixSimilarity
>>> from gensim.test.utils import common_texts as corpus
>>>
>>> dictionary = Dictionary(corpus)  # fit dictionary
>>> query_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # enforce binary weights
>>> document_model = OkapiBM25Model(dictionary=dictionary)  # fit bm25 model
>>>
>>> bow_corpus = [dictionary.doc2bow(line) for line in corpus]  # convert corpus to BoW format
>>> bm25_corpus = document_model[bow_corpus]
>>> index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
...                                normalize_queries=False, normalize_documents=False)
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> bow_query = dictionary.doc2bow(query)
>>> bm25_query = query_model[bow_query]
>>> similarities = index[bm25_query]  # calculate similarity of query to each doc from bow_corpus
Notes
Use this if your input corpus contains sparse vectors (such as TF-IDF documents) and fits into RAM.
The matrix is internally stored as a scipy.sparse.csr_matrix. Unless the entire matrix fits into main memory, use Similarity instead.
Takes an optional maintain_sparsity argument; setting this to True causes get_similarities() to return a sparse matrix instead of a dense representation, if possible.
See also
Similarity
Index similarity (wrapper for other inheritors of SimilarityABC).
MatrixSimilarity
Index similarity (dense with cosine distance).
- Parameters
corpus (iterable of list of (int, float)) – A list of documents in the BoW format.
num_features (int, optional) – Size of the dictionary. Must be either specified, or present in corpus.num_terms.
num_terms (int, optional) – Alias for num_features, you can use either.
num_docs (int, optional) – Number of documents in corpus. Will be calculated if not provided.
num_nnz (int, optional) – Number of non-zero elements in corpus. Will be calculated if not provided.
num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index.
chunksize (int, optional) – Size of query chunks. Used internally when the query is an entire corpus.
dtype (numpy.dtype, optional) – Data type of the internal matrix.
maintain_sparsity (bool, optional) – Return sparse arrays from get_similarities()?
normalize_queries (bool, optional) – If queries are in the bag-of-words (int, float) format, as opposed to sparse or dense 2D arrays, they will be L2-normalized. Default is True.
normalize_documents (bool, optional) – If corpus is in the bag-of-words (int, float) format, as opposed to sparse or dense 2D arrays, it will be L2-normalized. Default is True.
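For instance, a minimal sketch of the maintain_sparsity behaviour, reusing the toy corpus from the examples above (with maintain_sparsity=True, the result should come back as a scipy.sparse matrix):
>>> from scipy.sparse import issparse
>>> from gensim.corpora import Dictionary
>>> from gensim.similarities import SparseMatrixSimilarity
>>> from gensim.test.utils import common_texts
>>>
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(text) for text in common_texts]
>>> index = SparseMatrixSimilarity(bow_corpus, num_terms=len(dictionary), maintain_sparsity=True)
>>> issparse(index[bow_corpus[0]])  # expected: True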
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- get_similarities(query)¶
Get similarity between query and this index.
Warning
Do not use this function directly; use the self[query] syntax instead.
- Parameters
query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
- Returns
numpy.ndarray – Similarity matrix (if maintain_sparsity=False), OR scipy.sparse.csc – otherwise.
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- class gensim.similarities.docsim.WmdSimilarity(corpus, kv_model, num_best=None, chunksize=256)¶
Compute negative WMD similarity against a corpus of documents.
Check out the Gallery for more examples.
Example
>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import WmdSimilarity
>>>
>>> model = Word2Vec(common_texts, vector_size=20, min_count=1)  # train word-vectors
>>>
>>> index = WmdSimilarity(common_texts, model.wv)
>>> # Make query.
>>> query = ['trees']
>>> sims = index[query]
- Parameters
corpus (iterable of list of str) – A list of documents, each of which is a list of tokens.
kv_model (KeyedVectors) – A set of KeyedVectors.
num_best (int, optional) – Number of results to retrieve.
chunksize (int, optional) – Size of chunk.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- get_similarities(query)¶
Get similarity between query and this index.
Warning
Do not use this function directly; use the self[query] syntax instead.
- Parameters
query ({list of str, iterable of list of str}) – Document or collection of documents.
- Returns
Similarity matrix.
- Return type
numpy.ndarray
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- gensim.similarities.docsim.query_shard(args)¶
Helper for requesting a query from a shard; same as shard[query].
- Parameters
args ((list of (int, number), SimilarityABC)) – Query and shard instances.
- Returns
Similarities of the query against documents indexed in this shard.
- Return type
numpy.ndarray or scipy.sparse.csr_matrix