matutils – Math utils

Math helper functions.

class gensim.matutils.Dense2Corpus(dense, documents_columns=True)

Bases: object

Treat dense numpy array as a streamed Gensim corpus in the bag-of-words format.

Notes

No data copy is made (changes to the underlying matrix imply changes in the streamed corpus).

See also

corpus2dense()

Convert Gensim corpus to dense matrix.

Sparse2Corpus

Convert sparse matrix to Gensim corpus format.

Parameters
  • dense (numpy.ndarray) – Corpus in dense format.

  • documents_columns (bool, optional) – Are the documents in dense stored as columns (True) or as rows (False)?
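
The streaming idea can be sketched without any dependencies. The example below is an illustration only, not gensim's implementation: the helper name stream_dense is hypothetical, and a nested list stands in for the numpy array so that the snippet runs on the standard library alone.

```python
def stream_dense(dense, documents_columns=True, eps=1e-9):
    """Yield one bag-of-words document at a time from a dense 2D structure.

    Hypothetical helper for illustration; `dense` is a nested list here
    (an assumption), whereas the class documented above takes a numpy array.
    """
    # With documents_columns=True each column is a document, so transpose first.
    docs = zip(*dense) if documents_columns else iter(dense)
    for doc in docs:
        # Keep only (term_id, weight) pairs whose weight is not (near) zero.
        yield [(term_id, float(w)) for term_id, w in enumerate(doc) if abs(w) > eps]


# 3 terms (rows) x 2 documents (columns)
dense = [
    [1, 0],
    [0, 2],
    [3, 0],
]
corpus = list(stream_dense(dense))
by_rows = list(stream_dense(dense, documents_columns=False))
```

With documents as columns the matrix yields two documents; with documents_columns=False the same data is read as three one-term documents.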

class gensim.matutils.MmWriter(fname)

Bases: object

Store a corpus in Matrix Market format, using MmCorpus.

Notes

The output is written one document at a time, not the whole matrix at once (unlike e.g. scipy.io.mmread). This allows you to write corpora which are larger than the available RAM.

The output file is created in a single pass through the input corpus, so that the input can be a once-only stream (generator).

To achieve this, a fake MM header is written first, corpus statistics are collected during the pass (shape of the matrix, number of non-zeroes), followed by a seek back to the beginning of the file, rewriting the fake header with the final values.

Parameters

fname (str) – Path to output file.
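
The single-pass technique described in the notes above (placeholder header, stream the documents, seek back, rewrite the header) can be sketched as follows. This is a simplified stand-in, not the MmWriter code: the function name write_mm, the 50-byte width of the statistics line, and the choice to store documents as matrix rows are assumptions made for the example.

```python
import os
import tempfile

HEADER = b"%%MatrixMarket matrix coordinate real general\n"
STATS_WIDTH = 50  # assumed fixed width reserved for the "docs terms nnz" line


def write_mm(path, corpus):
    """Write a streamed BoW corpus in one pass (hypothetical helper)."""
    num_docs = num_terms = num_nnz = 0
    with open(path, "wb") as fout:
        fout.write(HEADER)
        # Placeholder statistics line; overwritten once the real values are known.
        fout.write(b" " * STATS_WIDTH + b"\n")
        for docno, doc in enumerate(corpus):
            for term_id, weight in sorted(doc):
                # Matrix Market coordinates are 1-based.
                line = "%i %i %s\n" % (docno + 1, term_id + 1, weight)
                fout.write(line.encode("utf8"))
                num_terms = max(num_terms, term_id + 1)
            num_docs = docno + 1
            num_nnz += len(doc)
        stats = ("%i %i %i" % (num_docs, num_terms, num_nnz)).encode("utf8")
        if len(stats) > STATS_WIDTH:
            raise ValueError("statistics do not fit into the reserved header space")
        # Seek back to just after the first header line and fill in real values.
        fout.seek(len(HEADER))
        fout.write(stats.ljust(STATS_WIDTH))
    return num_docs, num_terms, num_nnz


# The input is a once-only iterator; it is consumed exactly once.
corpus = iter([[(0, 1.0), (2, 3.0)], [(1, 2.0)]])
path = os.path.join(tempfile.mkdtemp(), "corpus.mm")
stats = write_mm(path, corpus)
```

Because the placeholder and the final statistics line have the same byte length, rewriting the header never shifts the document data that follows it.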

HEADER_LINE = b'%%MatrixMarket matrix coordinate real general\n'
close()

Close self.fout file.

fake_headers(num_docs, num_terms, num_nnz)

Write “fake” headers to file, to be rewritten once we’ve scanned the entire corpus.

Parameters
  • num_docs (int) – Number of documents in corpus.

  • num_terms (int) – Number of terms in corpus.

  • num_nnz (int) – Number of non-zero elements in corpus.

static write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False)

Save the corpus to disk in Matrix Market format.

Parameters
  • fname (str) – Filename of the resulting file.

  • corpus (iterable of list of (int, number)) – Corpus in streamed bag-of-words format.

  • progress_cnt (int, optional) – Print progress for every progress_cnt number of documents.

  • index (bool, optional) – Return offsets?

  • num_terms (int, optional) – Number of terms in the corpus. If provided, the corpus.num_terms attribute (if any) will be ignored.

  • metadata (bool, optional) – Generate a metadata file?

Returns

offsets – List of offsets (if index=True) or nothing.

Return type

{list of int, None}

Notes

Documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.

See also

gensim.corpora.mmcorpus.MmCorpus.save_corpus()

Save corpus to disk.

write_headers(num_docs, num_terms, num_nnz)

Write headers to file.

Parameters
  • num_docs (int) – Number of documents in corpus.

  • num_terms (int) – Number of terms in corpus.

  • num_nnz (int) – Number of non-zero elements in corpus.

write_vector(docno, vector)

Write a single sparse vector to the file.

Parameters
  • docno (int) – Index (number) of the document.

  • vector (list of (int, number)) – Document in BoW format.

Returns

Maximum word index in vector and the length of vector. If vector is empty, return (-1, 0).

Return type

(int, int)

class gensim.matutils.Scipy2Corpus(vecs)

Bases: object

Convert a sequence of dense/sparse vectors into a streamed Gensim corpus object.

See also

corpus2csc()

Convert corpus in Gensim format to scipy.sparse.csc matrix.

Parameters

vecs (iterable of {numpy.ndarray, scipy.sparse}) – Input vectors.

class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True)

Bases: object

Convert a matrix in scipy.sparse format into a streaming Gensim corpus.

See also

corpus2csc()

Convert gensim corpus format to scipy.sparse.csc matrix

Dense2Corpus

Convert dense matrix to gensim corpus.

Parameters
  • sparse (scipy.sparse) – Corpus in scipy sparse format.

  • documents_columns (bool, optional) – Are the documents stored as columns (True) or as rows (False)?

gensim.matutils.any2sparse(vec, eps=1e-09)

Convert a numpy.ndarray or scipy.sparse vector into the Gensim bag-of-words format.

Parameters
  • vec ({numpy.ndarray, scipy.sparse}) – Input vector

  • eps (float, optional) – Threshold value. Coordinates whose absolute value is below eps are omitted from the result.

Returns

Vector in BoW format.

Return type

list of (int, float)

gensim.matutils.argsort(x, topn=None, reverse=False)

Efficiently calculate indices of the topn smallest elements in array x.

Parameters
  • x (array_like) – Array to get the smallest element indices from.

  • topn (int, optional) – Number of indices of the smallest (greatest) elements to be returned. If not given, indices of all elements will be returned in ascending (descending) order.

  • reverse (bool, optional) – Return the topn greatest elements in descending order, instead of smallest elements in ascending order?

Returns

Array of topn indices that sort the array in the requested order.

Return type

numpy.ndarray
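
The point of topn is to avoid sorting the whole array when only a few indices are needed. A rough standard-library sketch of that idea is shown below; the helper name topn_indices is hypothetical, and it uses heapq on a plain list, whereas the function documented above works on numpy arrays.

```python
import heapq


def topn_indices(x, topn=None, reverse=False):
    """Return indices of the smallest (or greatest) elements of x (hypothetical helper)."""
    indices = range(len(x))
    if topn is None:
        # No limit requested: fall back to a full sort of the indices.
        return sorted(indices, key=x.__getitem__, reverse=reverse)
    # A bounded heap selection avoids sorting all len(x) elements.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(topn, indices, key=x.__getitem__)


x = [5.0, 1.0, 4.0, 2.0]
smallest_two = topn_indices(x, topn=2)
greatest_two = topn_indices(x, topn=2, reverse=True)
full_order = topn_indices(x)
```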

gensim.matutils.blas(name, ndarray)

Helper for getting the appropriate BLAS function, using scipy.linalg.get_blas_funcs().

Parameters
  • name (str) – Name(s) of BLAS functions, without the type prefix.

  • ndarray (numpy.ndarray) – Arrays can be given to determine optimal prefix of BLAS routines.

Returns

BLAS function for the needed operation on the given data type.

Return type

object

gensim.matutils.corpus2csc(corpus, num_terms=None, dtype=<class 'numpy.float64'>, num_docs=None, num_nnz=None, printprogress=0)

Convert a streamed corpus in bag-of-words format into a sparse matrix scipy.sparse.csc_matrix, with documents as columns.

Notes

If the number of terms, documents and non-zero elements is known, you can pass them here as parameters and a (much) more memory efficient code path will be taken.

Parameters
  • corpus (iterable of iterable of (int, number)) – Input corpus in BoW format

  • num_terms (int, optional) – Number of terms in corpus. If provided, the corpus.num_terms attribute (if any) will be ignored.

  • dtype (data-type, optional) – Data type of output CSC matrix.

  • num_docs (int, optional) – Number of documents in corpus. If provided, the corpus.num_docs attribute (if any) will be ignored.

  • num_nnz (int, optional) – Number of non-zero elements in corpus. If provided, the corpus.num_nnz attribute (if any) will be ignored.

  • printprogress (int, optional) – Log a progress message at INFO level once every printprogress documents. 0 to turn off progress logging.

Returns

The corpus converted into a sparse CSC matrix.

Return type

scipy.sparse.csc_matrix

See also

Sparse2Corpus

Convert sparse format to Gensim corpus format.
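
To make "documents as columns" concrete, the sketch below builds the three arrays that define a CSC matrix (data, row indices, column pointers) from a streamed corpus. It is an illustration under assumptions: the helper name corpus_to_csc_arrays is hypothetical, plain lists replace scipy structures, and it follows the simple growing-lists path rather than the preallocated path mentioned in the notes.

```python
def corpus_to_csc_arrays(corpus):
    """Build CSC-style arrays from a BoW corpus (hypothetical helper)."""
    data, indices, indptr = [], [], [0]
    num_terms = 0
    for doc in corpus:
        for term_id, weight in sorted(doc):
            indices.append(term_id)      # row index = term id
            data.append(float(weight))   # stored non-zero value
            num_terms = max(num_terms, term_id + 1)
        # Each document closes one column.
        indptr.append(len(data))
    shape = (num_terms, len(indptr) - 1)  # (terms, documents)
    return data, indices, indptr, shape


corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
data, indices, indptr, shape = corpus_to_csc_arrays(corpus)
```

Column j of the matrix is described by the slice indptr[j]:indptr[j + 1] of data and indices, which is why each document maps naturally onto one column.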

gensim.matutils.corpus2dense(corpus, num_terms, num_docs=None, dtype=<class 'numpy.float32'>)

Convert corpus into a dense numpy 2D array, with documents as columns.

Parameters
  • corpus (iterable of iterable of (int, number)) – Input corpus in the Gensim bag-of-words format.

  • num_terms (int) – Number of terms in the dictionary. This is the number of rows of the resulting matrix.

  • num_docs (int, optional) – Number of documents in the corpus. If provided, a slightly more memory-efficient code path is taken. This is the number of columns of the resulting matrix.

  • dtype (data-type, optional) – Data type of the output matrix.

Returns

Dense 2D array that represents the corpus.

Return type

numpy.ndarray

See also

Dense2Corpus

Convert dense matrix to Gensim corpus format.
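
The resulting layout (terms as rows, documents as columns) can be illustrated with a small dependency-free sketch. The helper name corpus_to_dense is hypothetical and a nested list stands in for the numpy array.

```python
def corpus_to_dense(corpus, num_terms):
    """Return a num_terms x num_docs nested list (hypothetical helper)."""
    docs = list(corpus)
    dense = [[0.0] * len(docs) for _ in range(num_terms)]
    for doc_no, doc in enumerate(docs):
        for term_id, weight in doc:
            # Row = term id, column = document number.
            dense[term_id][doc_no] = float(weight)
    return dense


corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
dense = corpus_to_dense(corpus, num_terms=3)
```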

gensim.matutils.cossim(vec1, vec2)

Get cosine similarity between two sparse vectors.

Cosine similarity is a number in the range [-1.0, 1.0]; higher means more similar.

Parameters
  • vec1 (list of (int, float)) – Vector in BoW format.

  • vec2 (list of (int, float)) – Vector in BoW format.

Returns

Cosine similarity between vec1 and vec2.

Return type

float
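
For reference, cosine similarity over two bag-of-words vectors can be computed as in the sketch below. The helper name cosine_bow is hypothetical, and returning 0.0 for empty or all-zero vectors is a choice made for this example.

```python
import math


def cosine_bow(vec1, vec2):
    """Cosine similarity between two BoW vectors (hypothetical helper)."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption of this sketch: no direction, so report zero similarity.
        return 0.0
    # Only term ids present in both vectors contribute to the dot product.
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    return dot / (norm1 * norm2)


a = [(0, 1.0), (1, 1.0)]
b = [(1, 1.0), (2, 1.0)]
similarity = cosine_bow(a, b)
```

The two vectors share one of their two terms, so the similarity is one half.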

gensim.matutils.dense2vec(vec, eps=1e-09)

Convert a dense numpy array into the Gensim bag-of-words format.

Parameters
  • vec (numpy.ndarray) – Dense input vector.

  • eps (float) – Feature weight threshold value. Features with abs(weight) < eps are considered sparse and won’t be included in the BOW result.

Returns

BoW format of vec, with near-zero values omitted (sparse vector).

Return type

list of (int, float)

See also

sparse2full()

Convert a document in Gensim bag-of-words format into a dense numpy array.

gensim.matutils.full2sparse(vec, eps=1e-09)

Convert a dense numpy array into the Gensim bag-of-words format.

Parameters
  • vec (numpy.ndarray) – Dense input vector.

  • eps (float) – Feature weight threshold value. Features with abs(weight) < eps are considered sparse and won’t be included in the BOW result.

Returns

BoW format of vec, with near-zero values omitted (sparse vector).

Return type

list of (int, float)

See also

sparse2full()

Convert a document in Gensim bag-of-words format into a dense numpy array.
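
The thresholding rule above can be sketched in a few lines. The helper name dense_to_bow is hypothetical and a plain list stands in for the numpy array.

```python
def dense_to_bow(vec, eps=1e-9):
    """Convert a dense vector to BoW, dropping near-zero weights (hypothetical helper)."""
    # The test is on the absolute value, so large negative weights are kept.
    return [(i, float(w)) for i, w in enumerate(vec) if abs(w) > eps]


bow = dense_to_bow([0.0, 2.5, 1e-12, -1.0])
```

The exact zero and the value 1e-12 are dropped, while -1.0 survives because its magnitude exceeds eps.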

gensim.matutils.full2sparse_clipped(vec, topn, eps=1e-09)

Like full2sparse(), but only return the topn elements of the greatest magnitude (abs).

This is more efficient than sorting a vector and then taking the greatest values, especially where len(vec) >> topn.

Parameters
  • vec (numpy.ndarray) – Input dense vector.

  • topn (int) – Number of greatest (abs) elements to include in the result.

  • eps (float) – Threshold value. Coordinates of vec whose absolute value is below eps are omitted from the result.

Returns

Clipped vector in BoW format.

Return type

list of (int, float)

See also

full2sparse()

Convert dense array to gensim bag-of-words format.
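
A dependency-free sketch of the clipping idea: select the topn entries by magnitude with a bounded heap instead of sorting everything. The helper name dense_to_bow_clipped is hypothetical, and returning the entries in order of decreasing magnitude is a property of this sketch.

```python
import heapq


def dense_to_bow_clipped(vec, topn, eps=1e-9):
    """Keep only the topn entries of greatest magnitude (hypothetical helper)."""
    if topn <= 0:
        return []
    # Drop near-zero entries first, then pick the largest by absolute value.
    candidates = ((i, float(w)) for i, w in enumerate(vec) if abs(w) > eps)
    return heapq.nlargest(topn, candidates, key=lambda item: abs(item[1]))


clipped = dense_to_bow_clipped([0.1, -0.9, 0.0, 0.5], topn=2)
```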

gensim.matutils.hellinger(vec1, vec2)

Calculate Hellinger distance between two probability distributions.

Parameters
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

Returns

Hellinger distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type

float
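
For two discrete distributions p and q, the Hellinger distance is sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)). The sketch below evaluates that formula on plain lists; the helper name hellinger_distance is hypothetical, and both inputs are assumed to be dense, equally long, and already normalized.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two dense distributions (hypothetical helper)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give 0 and distributions with disjoint support give 1, which matches the stated range.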

gensim.matutils.isbow(vec)

Checks if a vector is in the sparse Gensim bag-of-words format.

Parameters

vec (object) – Object to check.

Returns

Is vec in BoW format.

Return type

bool

gensim.matutils.ismatrix(m)

Check whether m is a 2D numpy.ndarray or scipy.sparse matrix.

Parameters

m (object) – Object to check.

Returns

Is m a 2D numpy.ndarray or scipy.sparse matrix.

Return type

bool

gensim.matutils.jaccard(vec1, vec2)

Calculate Jaccard distance between two vectors.

Parameters
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

Returns

Jaccard distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type

float

gensim.matutils.jaccard_distance(set1, set2)

Calculate Jaccard distance between two sets.

Parameters
  • set1 (set) – Input set.

  • set2 (set) – Input set.

Returns

Jaccard distance between set1 and set2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type

float
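
The set-based Jaccard distance is one minus the ratio of the intersection size to the union size. A minimal sketch follows; the helper name jaccard_set_distance is hypothetical, and returning 1.0 when both sets are empty is a choice made for this example.

```python
def jaccard_set_distance(set1, set2):
    """Jaccard distance between two sets (hypothetical helper)."""
    union_size = len(set1 | set2)
    if union_size == 0:
        # Assumption of this sketch: two empty sets are treated as maximally distant.
        return 1.0
    return 1.0 - len(set1 & set2) / union_size


distance = jaccard_set_distance({1, 2, 3}, {2, 3, 4})
```

The sets share two of four distinct elements, so the distance is 0.5.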

gensim.matutils.jensen_shannon(vec1, vec2, num_features=None)

Calculate Jensen-Shannon distance between two probability distributions using scipy.stats.entropy.

Parameters
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • num_features (int, optional) – Number of features in the vectors.

Returns

Jensen-Shannon distance between vec1 and vec2.

Return type

float

Notes

This is a symmetric and finite “version” of gensim.matutils.kullback_leibler().

gensim.matutils.kullback_leibler(vec1, vec2, num_features=None)

Calculate Kullback-Leibler distance between two probability distributions using scipy.stats.entropy.

Parameters
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.

  • num_features (int, optional) – Number of features in the vectors.

Returns

Kullback-Leibler distance between vec1 and vec2. Value in range [0, +∞) where values closer to 0 mean less distance (higher similarity).

Return type

float
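
The sketch below shows why Kullback-Leibler is neither symmetric nor always finite, and how averaging against the midpoint distribution (the Jensen-Shannon construction) fixes both. It is illustrative only: the helper names kl and js are hypothetical, inputs are assumed to be dense lists that already sum to 1, and js here returns the divergence itself, which may differ in scaling from what the library returns.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for dense distributions (hypothetical helper)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p, q = [0.5, 0.5], [0.9, 0.1]
forward, backward = kl(p, q), kl(q, p)
unbounded = kl([1.0, 0.0], [0.0, 1.0])
bounded = js([1.0, 0.0], [0.0, 1.0])
```

For the disjoint pair, KL is infinite while the Jensen-Shannon value is log(2), its maximum with the natural logarithm.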

gensim.matutils.pad(mat, padrow, padcol)

Add additional rows/columns to mat. The new rows/columns will be initialized with zeros.

Parameters
  • mat (numpy.ndarray) – Input 2D matrix

  • padrow (int) – Number of additional rows

  • padcol (int) – Number of additional columns

Returns

Matrix with needed padding.

Return type

numpy.matrixlib.defmatrix.matrix
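
As a small illustration of the padding layout, the sketch below appends zero-filled columns to every row and then zero-filled rows at the bottom. The helper name pad_zeros is hypothetical and a nested list stands in for the numpy matrix.

```python
def pad_zeros(mat, padrow, padcol):
    """Return a copy of mat with extra zero rows and columns (hypothetical helper)."""
    width = (len(mat[0]) if mat else 0) + padcol
    # Existing rows are extended on the right with zeros.
    padded = [list(row) + [0] * padcol for row in mat]
    # New all-zero rows are appended at the bottom.
    padded.extend([0] * width for _ in range(padrow))
    return padded


padded = pad_zeros([[1, 2], [3, 4]], padrow=1, padcol=2)
```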

gensim.matutils.qr_destroy(la)

Get QR decomposition of la[0].

Parameters

la (list of numpy.ndarray) – Run QR decomposition on the first elements of la. Must not be empty.

Returns

Matrices Q and R.

Return type

(numpy.ndarray, numpy.ndarray)

Notes

Using this function is less memory-intensive than calling scipy.linalg.qr(la[0]), because the memory used in la[0] is reclaimed earlier. This makes a difference when decomposing very large arrays, where every memory copy counts.

Warning

Content of la as well as la[0] gets destroyed in the process. Again, for memory-efficiency reasons.

gensim.matutils.ret_log_normalize_vec(vec, axis=1)

gensim.matutils.ret_normalized_vec(vec, length)

Normalize a vector to unit L2 (Euclidean) norm.

Parameters
  • vec (list of (int, number)) – Input vector in BoW format.

  • length (float) – Length of vector

Returns

L2-normalized vector in BoW format.

Return type

list of (int, number)

gensim.matutils.scipy2scipy_clipped(matrix, topn, eps=1e-09)

Get the ‘topn’ elements of the greatest magnitude (absolute value) from a scipy.sparse vector or matrix.

Parameters
  • matrix (scipy.sparse) – Input vector or matrix (1D or 2D sparse array).

  • topn (int) – Number of greatest elements, in absolute value, to return.

  • eps (float) – Ignored.

Returns

Clipped matrix.

Return type

scipy.sparse.csr.csr_matrix

gensim.matutils.scipy2sparse(vec, eps=1e-09)

Convert a scipy.sparse vector into the Gensim bag-of-words format.

Parameters
  • vec (scipy.sparse) – Sparse vector.

  • eps (float, optional) – Threshold value. Coordinates whose absolute value is below eps are omitted from the result.

Returns

Vector in Gensim bag-of-words format.

Return type

list of (int, float)

gensim.matutils.softcossim(vec1, vec2, similarity_matrix)

Get Soft Cosine Measure between two vectors given a term similarity matrix.

Return Soft Cosine Measure between two sparse vectors given a sparse term similarity matrix in the scipy.sparse.csc_matrix format. The similarity is a number in the range [-1.0, 1.0]; higher means more similar.

Notes

Soft Cosine Measure was perhaps first defined by Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”.

Parameters
  • vec1 (list of (int, float)) – A query vector in the BoW format.

  • vec2 (list of (int, float)) – A document vector in the BoW format.

  • similarity_matrix ({scipy.sparse.csc_matrix, scipy.sparse.csr_matrix}) – A term similarity matrix. If the matrix is scipy.sparse.csr_matrix, it is going to be transposed. If you rely on the fact that there is at most a constant number of non-zero elements in a single column, it is your responsibility to ensure that the matrix is symmetric.

Returns

The Soft Cosine Measure between vec1 and vec2.

Return type

similarity_matrix.dtype

Raises

ValueError – When the term similarity matrix is in an unknown format.

See also

gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix()

A term similarity matrix produced from term embeddings.

gensim.similarities.docsim.SoftCosineSimilarity

A class for performing corpus-based similarity queries with Soft Cosine Measure.
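
The Soft Cosine Measure replaces the plain dot product with a bilinear form over a term similarity matrix S: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)). The sketch below evaluates this on BoW vectors. It is an illustration under assumptions: the helper names are hypothetical and a dense nested list stands in for the scipy sparse matrix.

```python
import math


def bilinear(vec1, vec2, sim):
    """Compute vec1^T * sim * vec2 for BoW vectors (hypothetical helper)."""
    return sum(w1 * sim[i][j] * w2 for i, w1 in vec1 for j, w2 in vec2)


def soft_cosine(vec1, vec2, sim):
    """Soft Cosine Measure between two BoW vectors (hypothetical helper)."""
    denominator = math.sqrt(bilinear(vec1, vec1, sim)) * math.sqrt(bilinear(vec2, vec2, sim))
    if denominator == 0.0:
        return 0.0  # assumption of this sketch for degenerate input
    return bilinear(vec1, vec2, sim) / denominator


query = [(0, 1.0)]      # contains only term 0
document = [(1, 1.0)]   # contains only term 1
identity = [[1.0, 0.0], [0.0, 1.0]]
related = [[1.0, 0.5], [0.5, 1.0]]  # terms 0 and 1 are 50% similar

plain = soft_cosine(query, document, identity)
soft = soft_cosine(query, document, related)
```

With the identity matrix the measure reduces to ordinary cosine similarity, so the two vectors with no shared terms score 0. Once the matrix records that the two terms are related, the same pair scores 0.5.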

gensim.matutils.sparse2full(doc, length)

Convert a document in Gensim bag-of-words format into a dense numpy array.

Parameters
  • doc (list of (int, number)) – Document in BoW format.

  • length (int) – Vector dimensionality. This cannot be inferred from the BoW, and you must supply it explicitly. This is typically the vocabulary size or number of topics, depending on how you created doc.

Returns

Dense numpy vector for doc.

Return type

numpy.ndarray

See also

full2sparse()

Convert dense array to gensim bag-of-words format.
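
The sketch below shows why length must be given explicitly: the BoW pairs only record the positions that are non-zero, so trailing zeros cannot be recovered from them. The helper name bow_to_dense is hypothetical and a plain list stands in for the numpy array.

```python
def bow_to_dense(doc, length):
    """Expand a BoW document into a dense list of the given length (hypothetical helper)."""
    dense = [0.0] * length
    for term_id, weight in doc:
        dense[term_id] = float(weight)
    return dense


doc = [(1, 2.5), (3, -1.0)]
as_five = bow_to_dense(doc, 5)
as_four = bow_to_dense(doc, 4)
```

Both calls are valid for the same document; only the caller knows whether the vocabulary has four or five entries.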

gensim.matutils.unitvec(vec, norm='l2', return_norm=False)

Scale a vector to unit length.

Parameters
  • vec ({numpy.ndarray, scipy.sparse, list of (int, float)}) – Input vector in any format

  • norm ({'l1', 'l2', 'unique'}, optional) – Metric to normalize in.

  • return_norm (bool, optional) – Return the length of vector vec, in addition to the normalized vector itself?

Returns

  • {numpy.ndarray, scipy.sparse, list of (int, float)} – Normalized vector in same format as vec.

  • float – Length of vec before normalization, if return_norm is set.

Notes

A zero vector is returned unchanged.
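
A minimal sketch of L2 normalization for a BoW vector, including the zero-vector case. The helper name unit_bow is hypothetical; the norm reported for a zero vector (0.0 here) is a choice of this sketch and may differ from what the library reports.

```python
import math


def unit_bow(vec, return_norm=False):
    """Scale a BoW vector to unit L2 length (hypothetical helper)."""
    norm = math.sqrt(sum(w * w for _, w in vec))
    if norm > 0.0:
        vec = [(term_id, w / norm) for term_id, w in vec]
    # A zero vector has no direction, so it is passed through unchanged.
    return (vec, norm) if return_norm else vec


unit, norm = unit_bow([(0, 3.0), (1, 4.0)], return_norm=True)
zero = unit_bow([(0, 0.0)])
```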

gensim.matutils.veclen(vec)

Calculate L2 (Euclidean) length of a vector.

Parameters

vec (list of (int, number)) – Input vector in sparse bag-of-words format.

Returns

Length of vec.

Return type

float

gensim.matutils.zeros_aligned(shape, dtype, order='C', align=128)

Get array aligned at align byte boundary in memory.

Parameters
  • shape (int or (int, int)) – Shape of array.

  • dtype (data-type) – Data type of array.

  • order ({'C', 'F'}, optional) – Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.

  • align (int, optional) – Boundary for alignment in bytes.

Returns

Aligned array.

Return type

numpy.ndarray