matutils
– Math utils¶Math helper functions.
gensim.matutils.
Dense2Corpus
(dense, documents_columns=True)¶Bases: object
Treat dense numpy array as a streamed Gensim corpus in the bag-of-words format.
Notes
No data copy is made (changes to the underlying matrix imply changes in the streamed corpus).
See also
corpus2dense()
Convert Gensim corpus to dense matrix.
Sparse2Corpus
Convert sparse matrix to Gensim corpus format.
dense (numpy.ndarray) – Corpus in dense format.
documents_columns (bool, optional) – Documents in dense represented as columns, as opposed to rows?
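To make the column-versus-row convention concrete, here is a minimal pure-Python sketch of streaming a dense matrix as bag-of-words documents. It does not call Gensim: `stream_dense_as_bow` is a hypothetical helper, the matrix is a plain nested list rather than a numpy array, and the `eps` threshold is an assumption borrowed from full2sparse().

```python
def stream_dense_as_bow(dense, documents_columns=True, eps=1e-9):
    """Yield one BoW document per column (or per row) of a nested-list matrix."""
    if documents_columns:
        # Transpose so that each document becomes one sequence of term weights.
        docs = zip(*dense)
    else:
        docs = iter(dense)
    for doc in docs:
        # Keep only the entries that are not (near) zero.
        yield [(term_id, float(w)) for term_id, w in enumerate(doc) if abs(w) >= eps]


# Two terms (rows) by three documents (columns).
dense = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.5],
]
corpus = list(stream_dense_as_bow(dense))
by_rows = list(stream_dense_as_bow(dense, documents_columns=False))
```

With documents_columns=True the example yields three documents; with documents_columns=False the same matrix is read as two documents over three terms.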
gensim.matutils.
MmWriter
(fname)¶Bases: object
Store a corpus in Matrix Market format, using MmCorpus.
Notes
The output is written one document at a time, not the whole matrix at once (unlike e.g. scipy.io.mmwrite). This allows you to write corpora which are larger than the available RAM.
The output file is created in a single pass through the input corpus, so that the input can be a once-only stream (generator).
To achieve this, a fake MM header is written first, corpus statistics are collected during the pass (shape of the matrix, number of non-zeroes), followed by a seek back to the beginning of the file, rewriting the fake header with the final values.
fname (str) – Path to output file.
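The placeholder-header technique described in the notes can be sketched without Gensim as follows. This is an illustration under stated assumptions, not the MmWriter implementation: `write_mm_single_pass` and `STATS_WIDTH` are hypothetical names, the reserved width of 50 bytes is arbitrary, and documents are written as matrix rows.

```python
import io

HEADER_LINE = b"%%MatrixMarket matrix coordinate real general\n"
STATS_WIDTH = 50  # bytes reserved for the statistics line (an assumption)


def write_mm_single_pass(fout, corpus):
    """Write `corpus` to the seekable binary stream `fout` in one pass."""
    fout.write(HEADER_LINE)
    stats_pos = fout.tell()
    # Placeholder for "num_docs num_terms num_nnz", overwritten at the end.
    fout.write(b" " * STATS_WIDTH + b"\n")

    num_docs, num_terms, num_nnz = 0, 0, 0
    for docno, doc in enumerate(corpus):
        for term_id, weight in sorted(doc):
            # Matrix Market coordinates are 1-based.
            entry = "%i %i %s\n" % (docno + 1, term_id + 1, weight)
            fout.write(entry.encode("ascii"))
            num_terms = max(num_terms, term_id + 1)
            num_nnz += 1
        num_docs += 1

    # Seek back and replace the placeholder with the real statistics.
    stats = ("%i %i %i" % (num_docs, num_terms, num_nnz)).encode("ascii")
    if len(stats) > STATS_WIDTH:
        raise ValueError("statistics line does not fit into the reserved space")
    fout.seek(stats_pos)
    fout.write(stats.ljust(STATS_WIDTH))
    fout.seek(0, io.SEEK_END)
    return num_docs, num_terms, num_nnz


def one_shot_stream():
    # A generator: it can be consumed only once.
    yield [(0, 1.0), (2, 0.5)]
    yield []
    yield [(1, 2.0)]


buf = io.BytesIO()
stats = write_mm_single_pass(buf, one_shot_stream())
lines = buf.getvalue().decode("ascii").splitlines()
```

Because the statistics line has a fixed width, overwriting it never shifts the entries that follow, which is what makes the single pass over a once-only stream possible.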
HEADER_LINE
= b'%%MatrixMarket matrix coordinate real general\n'¶close
()¶Close self.fout file.
fake_headers
(num_docs, num_terms, num_nnz)¶Write “fake” headers to file, to be rewritten once we’ve scanned the entire corpus.
num_docs (int) – Number of documents in corpus.
num_terms (int) – Number of terms in corpus.
num_nnz (int) – Number of non-zero elements in corpus.
write_corpus
(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False)¶Save the corpus to disk in Matrix Market format.
fname (str) – Filename of the resulting file.
corpus (iterable of list of (int, number)) – Corpus in streamed bag-of-words format.
progress_cnt (int, optional) – Print progress for every progress_cnt number of documents.
index (bool, optional) – Return offsets?
num_terms (int, optional) – Number of terms in the corpus. If provided, the corpus.num_terms attribute (if any) will be ignored.
metadata (bool, optional) – Generate a metadata file?
offsets – List of offsets (if index=True) or nothing.
{list of int, None}
Notes
Documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.
See also
gensim.corpora.mmcorpus.MmCorpus.save_corpus()
Save corpus to disk.
write_headers
(num_docs, num_terms, num_nnz)¶Write headers to file.
num_docs (int) – Number of documents in corpus.
num_terms (int) – Number of terms in corpus.
num_nnz (int) – Number of non-zero elements in corpus.
write_vector
(docno, vector)¶Write a single sparse vector to the file.
docno (int) – Number of document.
vector (list of (int, number)) – Document in BoW format.
Max word index in vector and len of vector. If vector is empty, return (-1, 0).
(int, int)
gensim.matutils.
Scipy2Corpus
(vecs)¶Bases: object
Convert a sequence of dense/sparse vectors into a streamed Gensim corpus object.
See also
corpus2csc()
Convert corpus in Gensim format to scipy.sparse.csc matrix.
vecs (iterable of {numpy.ndarray, scipy.sparse}) – Input vectors.
gensim.matutils.
Sparse2Corpus
(sparse, documents_columns=True)¶Bases: object
Convert a matrix in scipy.sparse format into a streaming Gensim corpus.
See also
corpus2csc()
Convert corpus in Gensim format to scipy.sparse.csc matrix.
Dense2Corpus
Convert dense matrix to Gensim corpus format.
sparse (scipy.sparse) – Corpus in scipy sparse format.
documents_columns (bool, optional) – Are documents stored as columns of sparse, as opposed to rows?
gensim.matutils.
any2sparse
(vec, eps=1e-09)¶Convert a numpy.ndarray or scipy.sparse vector into the Gensim bag-of-words format.
vec ({numpy.ndarray, scipy.sparse}) – Input vector
eps (float, optional) – Threshold value; coordinates with absolute value below eps are omitted from the result.
Vector in BoW format.
list of (int, float)
gensim.matutils.
argsort
(x, topn=None, reverse=False)¶Efficiently calculate indices of the topn smallest elements in array x.
x (array_like) – Array to get the smallest element indices from.
topn (int, optional) – Number of indices of the smallest (greatest) elements to be returned. If not given, indices of all elements will be returned in ascending (descending) order.
reverse (bool, optional) – Return the topn greatest elements in descending order, instead of smallest elements in ascending order?
Array of topn indices that sort the array in the requested order.
numpy.ndarray
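The point of topn is that selecting a few extreme elements does not require sorting the whole array. The sketch below shows the idea with a heap-based selection from the standard library; `topn_indices` is a hypothetical stand-in that works on plain lists and makes no claim about how Gensim implements argsort().

```python
import heapq


def topn_indices(x, topn=None, reverse=False):
    """Indices of the `topn` smallest (or greatest, if `reverse`) elements of `x`."""
    indices = range(len(x))
    if topn is None:
        # No limit requested: fall back to a full sort of the indices.
        return sorted(indices, key=x.__getitem__, reverse=reverse)
    if topn <= 0:
        return []
    # Partial selection: only `topn` candidates are tracked at any time.
    select = heapq.nlargest if reverse else heapq.nsmallest
    return select(topn, indices, key=x.__getitem__)


x = [0.4, 0.1, 0.9, 0.3]
smallest_two = topn_indices(x, topn=2)
greatest_two = topn_indices(x, topn=2, reverse=True)
all_sorted = topn_indices(x)
```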
gensim.matutils.
blas
(name, ndarray)¶Helper for getting the appropriate BLAS function, using scipy.linalg.get_blas_funcs().
name (str) – Name(s) of BLAS functions, without the type prefix.
ndarray (numpy.ndarray) – Arrays can be given to determine optimal prefix of BLAS routines.
BLAS function for the needed operation on the given data type.
object
gensim.matutils.
corpus2csc
(corpus, num_terms=None, dtype=<class 'numpy.float64'>, num_docs=None, num_nnz=None, printprogress=0)¶Convert a streamed corpus in bag-of-words format into a sparse matrix scipy.sparse.csc_matrix, with documents as columns.
Notes
If the number of terms, documents and non-zero elements is known, you can pass them here as parameters and a (much) more memory efficient code path will be taken.
corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
num_terms (int, optional) – Number of terms in corpus. If provided, the corpus.num_terms attribute (if any) will be ignored.
dtype (data-type, optional) – Data type of output CSC matrix.
num_docs (int, optional) – Number of documents in corpus. If provided, the corpus.num_docs attribute (if any) will be ignored.
num_nnz (int, optional) – Number of non-zero elements in corpus. If provided, the corpus.num_nnz attribute (if any) will be ignored.
printprogress (int, optional) – Log a progress message at INFO level once every printprogress documents. 0 to turn off progress logging.
corpus converted into a sparse CSC matrix.
scipy.sparse.csc_matrix
See also
Sparse2Corpus
Convert sparse format to Gensim corpus format.
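For readers unfamiliar with the CSC layout, the following sketch builds the three arrays of a compressed sparse column matrix from a streamed corpus, with one column per document. It is a pure-Python illustration only: `corpus_to_csc_arrays` is a hypothetical helper, and it takes the simple growing-list path rather than the preallocated one that known sizes would allow.

```python
def corpus_to_csc_arrays(corpus, num_terms=None):
    """Return (data, indices, indptr, shape) describing a CSC matrix of `corpus`."""
    data, indices, indptr = [], [], [0]
    max_term = -1
    for doc in corpus:
        # Row indices within a column are kept sorted.
        for term_id, weight in sorted(doc):
            indices.append(term_id)
            data.append(float(weight))
            max_term = max(max_term, term_id)
        # Each document closes one column, even if it is empty.
        indptr.append(len(data))
    if num_terms is None:
        num_terms = max_term + 1
    shape = (num_terms, len(indptr) - 1)
    return data, indices, indptr, shape


corpus = [[(0, 1.0), (2, 2.0)], [], [(1, 3.0)]]
data, indices, indptr, shape = corpus_to_csc_arrays(corpus)
```

The resulting arrays have the layout accepted by scipy.sparse.csc_matrix((data, indices, indptr), shape=shape). When num_docs and num_nnz are known in advance, the arrays can be allocated once at their final size instead of grown, which is the reason the notes recommend passing them.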
gensim.matutils.
corpus2dense
(corpus, num_terms, num_docs=None, dtype=<class 'numpy.float32'>)¶Convert corpus into a dense numpy 2D array, with documents as columns.
corpus (iterable of iterable of (int, number)) – Input corpus in the Gensim bag-of-words format.
num_terms (int) – Number of terms in the dictionary. X-axis of the resulting matrix.
num_docs (int, optional) – Number of documents in the corpus. If provided, a slightly more memory-efficient code path is taken. Y-axis of the resulting matrix.
dtype (data-type, optional) – Data type of the output matrix.
Dense 2D array that represents the corpus.
numpy.ndarray
See also
Dense2Corpus
Convert dense matrix to Gensim corpus format.
gensim.matutils.
cossim
(vec1, vec2)¶Get cosine similarity between two sparse vectors.
Cosine similarity is a number in the range [-1.0, 1.0]; higher means more similar.
vec1 (list of (int, float)) – Vector in BoW format.
vec2 (list of (int, float)) – Vector in BoW format.
Cosine similarity between vec1 and vec2.
float
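As a worked illustration of the measure, here is a pure-Python sketch of cosine similarity over two BoW vectors. `cosine_similarity_bow` is a hypothetical name, and returning 0.0 for empty or all-zero vectors is an assumption of this sketch.

```python
import math


def cosine_similarity_bow(vec1, vec2):
    """Cosine similarity of two BoW vectors given as lists of (id, weight)."""
    d1, d2 = dict(vec1), dict(vec2)
    if not d1 or not d2:
        return 0.0  # assumption: an empty vector is similar to nothing
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: same convention for all-zero vectors
    # Iterate over the shorter vector; only shared ids contribute to the dot product.
    if len(d2) < len(d1):
        d1, d2 = d2, d1
    dot = sum(w * d2.get(term_id, 0.0) for term_id, w in d1.items())
    return dot / (norm1 * norm2)


a = [(0, 1.0), (1, 1.0)]
b = [(1, 1.0), (2, 1.0)]
sim = cosine_similarity_bow(a, b)
```

The two example vectors share one of their two terms, so the similarity comes out at roughly 0.5.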
gensim.matutils.
dense2vec
(vec, eps=1e-09)¶Convert a dense numpy array into the Gensim bag-of-words format.
vec (numpy.ndarray) – Dense input vector.
eps (float) – Feature weight threshold value. Features with abs(weight) < eps are considered sparse and won’t be included in the BOW result.
BoW format of vec, with near-zero values omitted (sparse vector).
list of (int, float)
See also
sparse2full()
Convert a document in Gensim bag-of-words format into a dense numpy array.
gensim.matutils.
full2sparse
(vec, eps=1e-09)¶Convert a dense numpy array into the Gensim bag-of-words format.
vec (numpy.ndarray) – Dense input vector.
eps (float) – Feature weight threshold value. Features with abs(weight) < eps are considered sparse and won’t be included in the BOW result.
BoW format of vec, with near-zero values omitted (sparse vector).
list of (int, float)
See also
sparse2full()
Convert a document in Gensim bag-of-words format into a dense numpy array.
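The eps rule above amounts to a simple filter. A minimal sketch on a plain Python list, with `dense_to_bow` as a hypothetical name:

```python
def dense_to_bow(vec, eps=1e-9):
    """Keep (index, value) pairs; drop entries with abs(value) < eps."""
    return [(i, float(v)) for i, v in enumerate(vec) if abs(v) >= eps]


bow = dense_to_bow([0.0, 0.5, 1e-12, -2.0])
```

Note that the negative entry survives because the threshold is applied to the absolute value.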
gensim.matutils.
full2sparse_clipped
(vec, topn, eps=1e-09)¶Like full2sparse(), but only return the topn elements of the greatest magnitude (abs).
This is more efficient than sorting a vector and then taking the greatest values, especially where len(vec) >> topn.
vec (numpy.ndarray) – Input dense vector
topn (int) – Number of greatest (abs) elements to include in the result.
eps (float) – Threshold value; coordinates of vec with absolute value below eps are omitted from the result.
Clipped vector in BoW format.
list of (int, float)
See also
full2sparse()
Convert dense array to gensim bag-of-words format.
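The efficiency argument (selecting topn elements rather than sorting everything) can be sketched with a bounded heap. `dense_to_bow_clipped` is a hypothetical helper on plain lists; returning the pairs in descending order of magnitude is a choice of this sketch.

```python
import heapq


def dense_to_bow_clipped(vec, topn, eps=1e-9):
    """Return up to `topn` (index, value) pairs with the greatest absolute value."""
    if topn <= 0:
        return []
    candidates = ((i, float(v)) for i, v in enumerate(vec) if abs(v) >= eps)
    # nlargest tracks only `topn` items instead of sorting the whole vector.
    return heapq.nlargest(topn, candidates, key=lambda item: abs(item[1]))


clipped = dense_to_bow_clipped([0.1, -3.0, 0.0, 2.0, 0.5], topn=2)
```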
gensim.matutils.
hellinger
(vec1, vec2)¶Calculate Hellinger distance between two probability distributions.
vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
Hellinger distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).
float
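For reference, the Hellinger distance of two discrete distributions p and q is sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)). The sketch below evaluates that formula on dense Python lists; `hellinger_distance` is a hypothetical name and the sparse input formats listed above are not handled.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two dense, equally long probability vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions sit at the lower end of the range and distributions with no shared support at the upper end.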
gensim.matutils.
isbow
(vec)¶Checks if a vector is in the sparse Gensim bag-of-words format.
vec (object) – Object to check.
Is vec in BoW format.
bool
gensim.matutils.
ismatrix
(m)¶Check whether m is a 2D numpy.ndarray or scipy.sparse matrix.
m (object) – Object to check.
Is m a 2D numpy.ndarray or scipy.sparse matrix.
bool
gensim.matutils.
jaccard
(vec1, vec2)¶Calculate Jaccard distance between two vectors.
vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
Jaccard distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).
float
gensim.matutils.
jaccard_distance
(set1, set2)¶Calculate Jaccard distance between two sets.
set1 (set) – Input set.
set2 (set) – Input set.
Jaccard distance between set1 and set2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).
float
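The set form of the Jaccard distance is 1 - |A ∩ B| / |A ∪ B|. A minimal sketch, with `jaccard_distance_sets` as a hypothetical name and the handling of two empty sets labeled as an assumption:

```python
def jaccard_distance_sets(set1, set2):
    """1 - |intersection| / |union| for two sets."""
    union = set1 | set2
    if not union:
        # Assumption of this sketch: two empty sets count as maximally distant.
        return 1.0
    return 1.0 - len(set1 & set2) / len(union)


dist = jaccard_distance_sets({1, 2, 3}, {2, 3, 4})
```

Here two of the four distinct elements are shared, giving a distance of 0.5.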
gensim.matutils.
jensen_shannon
(vec1, vec2, num_features=None)¶Calculate Jensen-Shannon distance between two probability distributions using scipy.stats.entropy.
vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
num_features (int, optional) – Number of features in the vectors.
Jensen-Shannon distance between vec1 and vec2.
float
Notes
This is a symmetric and finite “version” of gensim.matutils.kullback_leibler().
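To show what "symmetric and finite" means in practice, the sketch below computes both quantities on dense Python lists using the textbook definitions with the natural logarithm. The function names are hypothetical, and the sketch does not claim to reproduce Gensim's exact return values or its handling of sparse inputs.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of dense distributions, natural logarithm."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # 0 * log(0 / b) is taken as 0
        if b == 0.0:
            return math.inf  # p has mass where q has none
        total += a * math.log(a / b)
    return total


def jensen_shannon_divergence(p, q):
    """Average KL divergence of p and q from their midpoint distribution."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


p = [1.0, 0.0]
q = [0.5, 0.5]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
js_pq = jensen_shannon_divergence(p, q)
js_qp = jensen_shannon_divergence(q, p)
```

In this example the Kullback-Leibler value depends on the argument order and is infinite in one direction, while the Jensen-Shannon value is the same both ways and stays bounded.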
gensim.matutils.
kullback_leibler
(vec1, vec2, num_features=None)¶Calculate Kullback-Leibler distance between two probability distributions using scipy.stats.entropy.
vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
num_features (int, optional) – Number of features in the vectors.
Kullback-Leibler distance between vec1 and vec2. Value in range [0, +∞) where values closer to 0 mean less distance (higher similarity).
float
gensim.matutils.
pad
(mat, padrow, padcol)¶Add additional rows/columns to mat. The new rows/columns will be initialized with zeros.
mat (numpy.ndarray) – Input 2D matrix
padrow (int) – Number of additional rows
padcol (int) – Number of additional columns
Matrix with needed padding.
numpy.matrixlib.defmatrix.matrix
gensim.matutils.
qr_destroy
(la)¶Get QR decomposition of la[0].
la (list of numpy.ndarray) – Run QR decomposition on the first elements of la. Must not be empty.
Matrices Q and R.
(numpy.ndarray, numpy.ndarray)
Notes
Using this function is less memory intense than calling scipy.linalg.qr(la[0]), because the memory used in la[0] is reclaimed earlier. This makes a difference when decomposing very large arrays, where every memory copy counts.
Warning
Content of la as well as la[0] gets destroyed in the process. Again, for memory-efficiency reasons.
gensim.matutils.
ret_log_normalize_vec
(vec, axis=1)¶
gensim.matutils.
ret_normalized_vec
(vec, length)¶Normalize a vector in L2 (Euclidean unit norm).
vec (list of (int, number)) – Input vector in BoW format.
length (float) – Length of vector
L2-normalized vector in BoW format.
list of (int, number)
gensim.matutils.
scipy2scipy_clipped
(matrix, topn, eps=1e-09)¶Get the ‘topn’ elements of the greatest magnitude (absolute value) from a scipy.sparse vector or matrix.
matrix (scipy.sparse) – Input vector or matrix (1D or 2D sparse array).
topn (int) – Number of greatest elements, in absolute value, to return.
eps (float) – Ignored.
Clipped matrix.
scipy.sparse.csr.csr_matrix
gensim.matutils.
scipy2sparse
(vec, eps=1e-09)¶Convert a scipy.sparse vector into the Gensim bag-of-words format.
vec (scipy.sparse) – Sparse vector.
eps (float, optional) – Threshold value; coordinates with absolute value below eps are omitted from the result.
Vector in Gensim bag-of-words format.
list of (int, float)
gensim.matutils.
softcossim
(vec1, vec2, similarity_matrix)¶Get Soft Cosine Measure between two vectors given a term similarity matrix.
Return Soft Cosine Measure between two sparse vectors given a sparse term similarity matrix in the scipy.sparse.csc_matrix format. The similarity is a number in the range [-1.0, 1.0]; higher is more similar.
Notes
Soft Cosine Measure was perhaps first defined by Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”.
vec1 (list of (int, float)) – A query vector in the BoW format.
vec2 (list of (int, float)) – A document vector in the BoW format.
similarity_matrix ({scipy.sparse.csc_matrix, scipy.sparse.csr_matrix}) – A term similarity matrix. If the matrix is a scipy.sparse.csr_matrix, it is going to be transposed. If you rely on the fact that there is at most a constant number of non-zero elements in a single column, it is your responsibility to ensure that the matrix is symmetric.
The Soft Cosine Measure between vec1 and vec2.
similarity_matrix.dtype
ValueError – When the term similarity matrix is in an unknown format.
See also
gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix()
A term similarity matrix produced from term embeddings.
gensim.similarities.docsim.SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.
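The Soft Cosine Measure generalizes cosine similarity by weighting every pair of terms with their similarity s_ij: (x^T S y) / (sqrt(x^T S x) * sqrt(y^T S y)). The sketch below evaluates that expression directly on BoW vectors. It is illustrative only: `soft_cosine` is a hypothetical name, the similarity is supplied as a Python function instead of a sparse matrix, and the `RELATED` values are made up.

```python
import math


def soft_cosine(vec1, vec2, similarity):
    """Soft cosine of two BoW vectors; `similarity(i, j)` returns s_ij."""
    def bilinear(u, v):
        # u^T S v, summed over all pairs of terms present in u and v.
        return sum(wu * similarity(i, j) * wv for i, wu in u for j, wv in v)

    norm = math.sqrt(bilinear(vec1, vec1)) * math.sqrt(bilinear(vec2, vec2))
    if norm == 0.0:
        return 0.0  # assumption: zero vectors are similar to nothing
    return bilinear(vec1, vec2) / norm


# Hypothetical term similarities: terms 0 and 1 are related, all others are not.
RELATED = {(0, 1): 0.5, (1, 0): 0.5}


def similarity(i, j):
    return 1.0 if i == j else RELATED.get((i, j), 0.0)


query = [(0, 1.0)]
document = [(1, 1.0)]
soft = soft_cosine(query, document, similarity)
plain = soft_cosine(query, document, lambda i, j: float(i == j))
```

The two vectors share no term, so the plain cosine (identity similarity) is zero, while the soft measure still credits the relatedness of terms 0 and 1.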
gensim.matutils.
sparse2full
(doc, length)¶Convert a document in Gensim bag-of-words format into a dense numpy array.
doc (list of (int, number)) – Document in BoW format.
length (int) – Vector dimensionality. This cannot be inferred from the BoW, and you must supply it explicitly. This is typically the vocabulary size or number of topics, depending on how you created doc.
Dense numpy vector for doc.
numpy.ndarray
See also
full2sparse()
Convert dense array to gensim bag-of-words format.
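A short sketch of why length has to be supplied: the BoW form records only the non-zero positions, so trailing zeros cannot be recovered from it. `bow_to_dense` is a hypothetical pure-Python stand-in that returns a list rather than a numpy array.

```python
def bow_to_dense(doc, length):
    """Expand a list of (id, weight) pairs into a dense list of `length` floats."""
    dense = [0.0] * length
    for term_id, weight in doc:
        dense[term_id] = float(weight)
    return dense


doc = [(1, 0.5), (3, 2.0)]
dense5 = bow_to_dense(doc, length=5)
dense4 = bow_to_dense(doc, length=4)
```

The same document expands to vectors of different dimensionality depending on length, which is why the caller must know the vocabulary size or number of topics.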
gensim.matutils.
unitvec
(vec, norm='l2', return_norm=False)¶Scale a vector to unit length.
vec ({numpy.ndarray, scipy.sparse, list of (int, float)}) – Input vector in any format
norm ({'l1', 'l2', 'unique'}, optional) – Metric to normalize in.
return_norm (bool, optional) – Return the length of vector vec, in addition to the normalized vector itself?
{numpy.ndarray, scipy.sparse, list of (int, float)} – Normalized vector in same format as vec.
float – Length of vec before normalization, if return_norm is set.
Notes
Zero-vector will be unchanged.
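The normalization and the zero-vector rule can be sketched for the BoW input format as follows. `unit_vector_bow` is a hypothetical helper; it covers only the 'l1' and 'l2' norms and none of the numpy or scipy input formats.

```python
import math


def unit_vector_bow(vec, norm="l2", return_norm=False):
    """Scale a BoW vector to unit length under the L1 or L2 norm."""
    if norm == "l1":
        length = sum(abs(w) for _, w in vec)
    elif norm == "l2":
        length = math.sqrt(sum(w * w for _, w in vec))
    else:
        raise ValueError("this sketch supports only 'l1' and 'l2'")
    if length > 0.0:
        vec = [(term_id, w / length) for term_id, w in vec]
    # A zero vector falls through unchanged.
    return (vec, length) if return_norm else vec


unit, length = unit_vector_bow([(0, 3.0), (1, 4.0)], return_norm=True)
zero = unit_vector_bow([(0, 0.0)])
```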
gensim.matutils.
veclen
(vec)¶Calculate L2 (Euclidean) length of a vector.
vec (list of (int, number)) – Input vector in sparse bag-of-words format.
Length of vec.
float
gensim.matutils.
zeros_aligned
(shape, dtype, order='C', align=128)¶Get array aligned at align byte boundary in memory.
shape (int or (int, int)) – Shape of array.
dtype (data-type) – Data type of array.
order ({'C', 'F'}, optional) – Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.
align (int, optional) – Boundary for alignment in bytes.
Aligned array.
numpy.ndarray