matutils – Math utils

This module contains math helper functions.

class gensim.matutils.Dense2Corpus(dense, documents_columns=True)

Bases: object

Treat a dense numpy array as a streamed gensim corpus in BoW format.

Notes

No data copy is made (changes to the underlying matrix imply changes in the corpus).

Parameters:
  • dense (numpy.ndarray) – Corpus in dense format.
  • documents_columns (bool, optional) – If True, documents are the columns of the matrix; otherwise they are the rows.
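The effect of documents_columns can be illustrated with a small pure-Python sketch. It is not gensim's implementation: the helper name iter_bow_from_dense is hypothetical, a list of lists stands in for the numpy array, and exact zeros are dropped instead of applying an eps threshold.

```python
def iter_bow_from_dense(dense, documents_columns=True):
    """Yield each document of a dense matrix (a list of rows) as a BoW list.

    Hypothetical helper for illustration only; zeros are skipped so the
    output is sparse.
    """
    num_rows = len(dense)
    num_cols = len(dense[0]) if dense else 0
    if documents_columns:
        # Each column is one document; the row index is the term id.
        for col in range(num_cols):
            yield [(row, float(dense[row][col]))
                   for row in range(num_rows) if dense[row][col] != 0]
    else:
        # Each row is one document; the column index is the term id.
        for row in dense:
            yield [(term, float(value))
                   for term, value in enumerate(row) if value != 0]


# 3 terms (rows) x 2 documents (columns)
matrix = [
    [1, 0],
    [0, 2],
    [3, 0],
]
docs_as_columns = list(iter_bow_from_dense(matrix, documents_columns=True))
docs_as_rows = list(iter_bow_from_dense(matrix, documents_columns=False))
```

With documents as columns the matrix yields two documents; with documents as rows the same matrix yields three.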
class gensim.matutils.MmWriter(fname)

Bases: object

Store a corpus in Matrix Market format, used for MmCorpus.

Notes

Output is written one document at a time, not the whole matrix at once (unlike scipy.io.mmread). This allows us to process corpora which are larger than the available RAM.

The output file is created in a single pass through the input corpus, so that the input can be a once-only stream (iterator). To achieve this, a fake MM header is written first, statistics are collected during the pass (shape of the matrix, number of non-zeroes), followed by a seek back to the beginning of the file, rewriting the fake header with proper values.

Parameters: fname (str) – Path to output file.
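The single-pass strategy from the notes (placeholder header, collect statistics, seek back, rewrite) can be sketched as follows. This is an illustration under stated assumptions, not the MmWriter code: the function name write_mm_single_pass and the 50-character placeholder width are choices made for this example, and an in-memory io.StringIO stands in for the output file.

```python
import io

HEADER_LINE = "%%MatrixMarket matrix coordinate real general\n"
PLACEHOLDER_WIDTH = 50  # assumed width reserved for the statistics line


def write_mm_single_pass(corpus, out):
    """Write a once-only BoW stream to a seekable text file object."""
    out.write(HEADER_LINE)
    stats_pos = out.tell()
    # Fake statistics line: blanks that are overwritten after the pass.
    out.write(" " * PLACEHOLDER_WIDTH + "\n")

    num_docs = num_terms = num_nnz = 0
    for docno, doc in enumerate(corpus):
        for term_id, weight in sorted(doc):
            # Matrix Market indices are 1-based.
            out.write("%d %d %s\n" % (docno + 1, term_id + 1, weight))
            num_terms = max(num_terms, term_id + 1)
            num_nnz += 1
        num_docs = docno + 1

    stats = "%d %d %d" % (num_docs, num_terms, num_nnz)
    if len(stats) > PLACEHOLDER_WIDTH:
        raise ValueError("statistics do not fit into the reserved header space")
    # Seek back and replace the blanks with the real values.
    out.seek(stats_pos)
    out.write(stats.ljust(PLACEHOLDER_WIDTH))
    out.seek(0, io.SEEK_END)
    return num_docs, num_terms, num_nnz


# A generator can be consumed only once, like a streamed corpus.
stream = iter([[(0, 1.0), (2, 3.0)], [(1, 2.0)]])
buffer = io.StringIO()
stats = write_mm_single_pass(stream, buffer)
lines = buffer.getvalue().splitlines()
```

Because the placeholder has a fixed width, rewriting it never shifts the document lines that follow it.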
HEADER_LINE = '%%MatrixMarket matrix coordinate real general\n'
close()

Close self.fout file.

fake_headers(num_docs, num_terms, num_nnz)

Write “fake” headers to file.

Parameters:
  • num_docs (int) – Number of documents in corpus.
  • num_terms (int) – Number of terms in corpus.
  • num_nnz (int) – Number of non-zero elements in corpus.
static write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False)

Save the corpus to disk in Matrix Market format.

Parameters:
  • fname (str) – Filename of the resulting file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • progress_cnt (int, optional) – Print progress every progress_cnt documents.
  • index (bool, optional) – If True, the offsets are returned; otherwise None is returned.
  • num_terms (int, optional) – If provided, the num_terms attribute of the corpus is ignored.
  • metadata (bool, optional) – If True, a metadata file will be generated.
Returns:

offsets – List of offsets if index=True, otherwise None.

Return type:

{list of int, None}

Notes

Documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.

See also

save_corpus()

write_headers(num_docs, num_terms, num_nnz)

Write headers to file.

Parameters:
  • num_docs (int) – Number of documents in corpus.
  • num_terms (int) – Number of terms in corpus.
  • num_nnz (int) – Number of non-zero elements in corpus.
write_vector(docno, vector)

Write a single sparse vector to the file.

Parameters:
  • docno (int) – Number of document.
  • vector (list of (int, number)) – Document in BoW format.
Returns:

Max word index in vector and len of vector. If vector is empty, return (-1, 0).

Return type:

(int, int)

class gensim.matutils.Scipy2Corpus(vecs)

Bases: object

Convert a sequence of dense/sparse vectors into a streamed gensim corpus object.

See also

corpus2csc()

Parameters: vecs (iterable of {numpy.ndarray, scipy.sparse}) – Input vectors.
class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True)

Bases: object

Convert a matrix in scipy.sparse format into a streaming gensim corpus.

Parameters:
  • sparse (scipy.sparse) – Corpus in scipy sparse format.
  • documents_columns (bool, optional) – If True, documents are the columns of the matrix; otherwise they are the rows.
gensim.matutils.any2sparse(vec, eps=1e-09)

Convert a numpy.ndarray or scipy.sparse vector into gensim BoW format.

Parameters:
  • vec ({numpy.ndarray, scipy.sparse}) – Input vector
  • eps (float, optional) – Threshold value; coordinates smaller than eps (in absolute value) are omitted from the result.
Returns:

Vector in BoW format.

Return type:

list of (int, float)

gensim.matutils.argsort(x, topn=None, reverse=False)

Get indices of the topn smallest elements in array x.

Parameters:
  • x (array_like) – Array to sort.
  • topn (int, optional) – Number of indices of the smallest (or greatest) elements to return. If not given, the indices of all elements are returned in ascending (or descending) order.
  • reverse (bool, optional) – If True, return the topn greatest elements in descending order.
Returns:

Array of topn indices that sort the array in the required order.

Return type:

numpy.ndarray
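The interaction of topn and reverse can be shown with a pure-Python sketch. It substitutes the standard-library heapq module for the numpy machinery, so it mirrors the documented behaviour rather than the actual implementation; the name argsort_topn is hypothetical.

```python
import heapq


def argsort_topn(x, topn=None, reverse=False):
    """Return indices of the smallest (or greatest) elements of a sequence."""
    indices = range(len(x))
    if topn is None:
        # Full sort: all indices in ascending (or descending) order of value.
        return sorted(indices, key=x.__getitem__, reverse=reverse)
    if topn <= 0:
        return []
    # A partial selection avoids sorting the whole sequence.
    select = heapq.nlargest if reverse else heapq.nsmallest
    return select(topn, indices, key=x.__getitem__)


values = [0.3, 0.1, 0.9, 0.5]
two_smallest = argsort_topn(values, topn=2)
two_greatest = argsort_topn(values, topn=2, reverse=True)
full_order = argsort_topn(values)
```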

gensim.matutils.blas(name, ndarray)

Helper for getting a BLAS function, using scipy.linalg.get_blas_funcs().

Parameters:
  • name (str) – Name(s) of BLAS functions without type prefix.
  • ndarray (numpy.ndarray) – Arrays can be given to determine optimal prefix of BLAS routines.
Returns:

Fortran function for needed operation.

Return type:

fortran object

gensim.matutils.convert_vec(vec1, vec2, num_features=None)

Convert vectors to dense format.

Parameters:
  • vec1 ({scipy.sparse, list of (int, float)}) – Input vector.
  • vec2 ({scipy.sparse, list of (int, float)}) – Input vector.
  • num_features (int, optional) – Number of features in vector.
Returns:

(vec1, vec2) in dense format.

Return type:

(numpy.ndarray, numpy.ndarray)

gensim.matutils.corpus2csc(corpus, num_terms=None, dtype=numpy.float64, num_docs=None, num_nnz=None, printprogress=0)

Convert a streamed corpus in BoW format into a sparse matrix scipy.sparse.csc_matrix, with documents as columns.

Notes

If the number of terms, documents and non-zero elements is known, you can pass them here as parameters and a more memory efficient code path will be taken.

Parameters:
  • corpus (iterable of iterable of (int, number)) – Input corpus in BoW format.
  • num_terms (int, optional) – If provided, the num_terms attribute of the corpus is ignored.
  • dtype (data-type, optional) – Data type of output matrix.
  • num_docs (int, optional) – If provided, the num_docs attribute of the corpus is ignored.
  • num_nnz (int, optional) – If provided, the num_nnz attribute of the corpus is ignored.
  • printprogress (int, optional) – Print progress every printprogress documents. If 0, nothing is printed.
Returns:

Sparse matrix inferred based on corpus.

Return type:

scipy.sparse.csc_matrix

See also

Sparse2Corpus
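To show why documents map naturally onto CSC columns, here is a pure-Python sketch that builds the three CSC arrays from a streamed BoW corpus. It is an illustration, not the corpus2csc code; the helper name corpus_to_csc_arrays is hypothetical. The resulting arrays have the layout accepted by scipy.sparse.csc_matrix((data, indices, indptr), shape=shape).

```python
def corpus_to_csc_arrays(corpus, num_terms=None):
    """Collect CSC arrays (data, indices, indptr) with one column per document."""
    data, indices, indptr = [], [], [0]
    max_term = -1
    for doc in corpus:
        for term_id, weight in sorted(doc):
            indices.append(term_id)      # row index = term id
            data.append(float(weight))   # stored non-zero value
            max_term = max(max_term, term_id)
        # Each document closes one column.
        indptr.append(len(data))
    if num_terms is None:
        # Infer the number of rows from the largest term id seen.
        num_terms = max_term + 1
    shape = (num_terms, len(indptr) - 1)
    return data, indices, indptr, shape


corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
data, indices, indptr, shape = corpus_to_csc_arrays(corpus)
```

Appending one document only appends to the ends of the arrays, which is why a streamed corpus can be converted in a single pass.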

gensim.matutils.corpus2dense(corpus, num_terms, num_docs=None, dtype=numpy.float32)

Convert corpus into a dense numpy array (documents will be columns).

Parameters:
  • corpus (iterable of iterable of (int, number)) – Input corpus in BoW format.
  • num_terms (int) – Number of terms in the dictionary (used as the size of each output vector).
  • num_docs (int, optional) – Number of documents in corpus.
  • dtype (data-type, optional) – Data type of output matrix.
Returns:

Dense array that represents the corpus.

Return type:

numpy.ndarray

See also

Dense2Corpus

gensim.matutils.cossim(vec1, vec2)

Get cosine similarity between two sparse vectors. The similarity is a number in the range [-1.0, 1.0]; higher means more similar.

Parameters:
  • vec1 (list of (int, float)) – Vector in BoW format.
  • vec2 (list of (int, float)) – Vector in BoW format.
Returns:

Cosine similarity between vec1 and vec2.

Return type:

float
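As a worked illustration of cosine similarity on BoW vectors, the following pure-Python sketch computes the dot product over shared term ids and divides by the two Euclidean norms. The name cosine_bow is hypothetical, and returning 0.0 for an empty or all-zero vector is a convention chosen for this example.

```python
import math


def cosine_bow(vec1, vec2):
    """Cosine similarity between two BoW vectors (lists of (id, weight))."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention used in this sketch
    if len(d2) < len(d1):
        # Iterate over the shorter vector; the dot product is symmetric.
        d1, d2 = d2, d1
    dot = sum(w * d2.get(term_id, 0.0) for term_id, w in d1.items())
    return dot / (norm1 * norm2)


same_direction = cosine_bow([(0, 1.0), (1, 2.0)], [(0, 2.0), (1, 4.0)])
partial_overlap = cosine_bow([(0, 1.0), (1, 1.0)], [(0, 1.0)])
no_overlap = cosine_bow([(0, 1.0)], [(1, 1.0)])
```

Only term ids present in both vectors contribute to the dot product, so vectors with no shared terms have similarity 0.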

gensim.matutils.dense2vec(vec, eps=1e-09)

Convert a dense array into the BoW format.

Parameters:
  • vec (numpy.ndarray) – Input dense vector.
  • eps (float) – Threshold value; coordinates of vec smaller than eps (in absolute value) are omitted from the result.
Returns:

BoW format of vec.

Return type:

list of (int, float)

See also

sparse2full()

gensim.matutils.full2sparse(vec, eps=1e-09)

Convert a dense array into the BoW format.

Parameters:
  • vec (numpy.ndarray) – Input dense vector.
  • eps (float) – Threshold value; coordinates of vec smaller than eps (in absolute value) are omitted from the result.
Returns:

BoW format of vec.

Return type:

list of (int, float)

See also

sparse2full()
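The dense-to-BoW conversion and its inverse can be sketched in pure Python. The helper names dense_to_bow and bow_to_dense are hypothetical and plain lists stand in for numpy arrays; the sketch assumes the threshold is applied to the absolute value of each coordinate.

```python
def dense_to_bow(vec, eps=1e-9):
    """Keep only coordinates whose magnitude exceeds eps, as (index, value)."""
    return [(i, float(v)) for i, v in enumerate(vec) if abs(v) > eps]


def bow_to_dense(doc, length):
    """Expand a BoW document into a dense list of the given length."""
    dense = [0.0] * length
    for term_id, value in doc:
        dense[term_id] = float(value)
    return dense


dense = [0.0, 2.5, 0.0, -1.0, 1e-12]
bow = dense_to_bow(dense)
restored = bow_to_dense(bow, len(dense))
```

The round trip restores every coordinate except those below the threshold, which come back as exact zeros.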

gensim.matutils.full2sparse_clipped(vec, topn, eps=1e-09)

Like full2sparse(), but only return the topn elements of the greatest magnitude (abs).

Parameters:
  • vec (numpy.ndarray) – Input dense vector.
  • topn (int) – Number of elements of greatest magnitude to keep in the result.
  • eps (float) – Threshold value; coordinates of vec smaller than eps (in absolute value) are omitted from the result.
Returns:

Clipped vector in BoW format.

Return type:

list of (int, float)

See also

full2sparse()
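Clipping to the topn entries of greatest magnitude can be sketched as follows. The name dense_to_bow_clipped is hypothetical, and returning the entries ordered by decreasing magnitude is a choice made for this example.

```python
def dense_to_bow_clipped(vec, topn, eps=1e-9):
    """Return at most topn (index, value) pairs with the largest |value|."""
    if topn <= 0:
        return []
    # Drop near-zero coordinates first, then rank the rest by magnitude.
    candidates = [(i, float(v)) for i, v in enumerate(vec) if abs(v) > eps]
    candidates.sort(key=lambda item: abs(item[1]), reverse=True)
    return candidates[:topn]


clipped = dense_to_bow_clipped([0.1, -0.9, 0.0, 0.5], topn=2)
```

Ranking uses the absolute value, so a large negative weight such as -0.9 is kept ahead of a smaller positive one.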

gensim.matutils.hellinger(vec1, vec2)

Calculate Hellinger distance between two probability distributions.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
Returns:

Hellinger distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type:

float
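For reference, the Hellinger distance between discrete distributions p and q is sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2). A pure-Python sketch for BoW inputs follows; the name hellinger_bow is hypothetical, and the inputs are assumed to be probability distributions already (non-negative, summing to 1).

```python
import math


def hellinger_bow(vec1, vec2):
    """Hellinger distance between two distributions given in BoW format."""
    p, q = dict(vec1), dict(vec2)
    total = 0.0
    # A term id missing from one vector has probability 0 there.
    for term_id in set(p) | set(q):
        diff = math.sqrt(p.get(term_id, 0.0)) - math.sqrt(q.get(term_id, 0.0))
        total += diff * diff
    return math.sqrt(0.5 * total)


identical = hellinger_bow([(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)])
disjoint = hellinger_bow([(0, 1.0)], [(1, 1.0)])
```

Identical distributions give 0 and distributions with disjoint support give 1, matching the documented range.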

gensim.matutils.isbow(vec)

Checks if vector passed is in BoW format.

Parameters: vec (object) – Input vector in any format.
Returns: True if vector is in BoW format, False otherwise.
Return type: bool
gensim.matutils.ismatrix(m)

Check whether m is a numpy.ndarray or scipy.sparse matrix.

Parameters: m (object) – Candidate for matrix.
Returns: True if m is a matrix, False otherwise.
Return type: bool
gensim.matutils.jaccard(vec1, vec2)

Calculate Jaccard distance between vectors.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
Returns:

Jaccard distance between vec1 and vec2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type:

float

gensim.matutils.jaccard_distance(set1, set2)

Calculate Jaccard distance between two sets.

Parameters:
  • set1 (set) – Input set.
  • set2 (set) – Input set.
Returns:

Jaccard distance between set1 and set2. Value in range [0, 1], where 0 is min distance (max similarity) and 1 is max distance (min similarity).

Return type:

float
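The set form of the Jaccard distance is 1 - |A ∩ B| / |A ∪ B|. A minimal sketch, with the hypothetical name jaccard_set_distance and with the distance of two empty sets set to 1.0 as a convention for this example:

```python
def jaccard_set_distance(set1, set2):
    """Jaccard distance: one minus intersection size over union size."""
    union = set1 | set2
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return 1.0 - len(set1 & set2) / len(union)


half = jaccard_set_distance({1, 2, 3}, {2, 3, 4})
same = jaccard_set_distance({"a", "b"}, {"a", "b"})
apart = jaccard_set_distance({"a"}, {"b"})
```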

gensim.matutils.jensen_shannon(vec1, vec2, num_features=None)

Calculate Jensen-Shannon distance between two probability distributions using scipy.stats.entropy.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • num_features (int, optional) – Number of features in vector.
Returns:

Jensen-Shannon distance between vec1 and vec2.

Return type:

float

Notes

This is a symmetric and finite version of gensim.matutils.kullback_leibler().

gensim.matutils.kullback_leibler(vec1, vec2, num_features=None)

Calculate Kullback-Leibler distance between two probability distributions using scipy.stats.entropy.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • num_features (int, optional) – Number of features in vector.
Returns:

Kullback-Leibler distance between vec1 and vec2. Value in range [0, +∞) where values closer to 0 mean less distance (and a higher similarity).

Return type:

float
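The relationship between the Kullback-Leibler and Jensen-Shannon measures can be made concrete with a pure-Python sketch. It is not the gensim code: the names kl_divergence and js_divergence are hypothetical, the inputs are dense lists assumed to sum to 1, and the natural logarithm is used, which is also the default base of scipy.stats.entropy.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for dense probability lists of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total


def js_divergence(p, q):
    """Mean of the KL divergences from p and q to their midpoint."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, mid) + kl_divergence(q, mid))


p, q = [0.5, 0.5], [0.9, 0.1]
kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
kl_disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])
```

The Kullback-Leibler value depends on argument order and becomes infinite for disjoint supports, while the Jensen-Shannon value is symmetric and stays bounded by log(2).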

gensim.matutils.pad(mat, padrow, padcol)

Add additional rows/columns to mat. The new rows/columns will be initialized with zeros.

Parameters:
  • mat (numpy.ndarray) – Input 2D matrix
  • padrow (int) – Number of additional rows
  • padcol (int) – Number of additional columns
Returns:

Matrix with needed padding.

Return type:

numpy.matrixlib.defmatrix.matrix

gensim.matutils.qr_destroy(la)

Get QR decomposition of la[0].

Notes

Using this function should be less memory intense than calling scipy.linalg.qr(la[0]), because the memory used in la[0] is reclaimed earlier.

Returns: Matrices Q and R.
Return type: (numpy.ndarray, numpy.ndarray)

Warning

Content of la gets destroyed in the process.

gensim.matutils.ret_log_normalize_vec(vec, axis=1)
gensim.matutils.ret_normalized_vec(vec, length)

Normalize vector.

Parameters:
  • vec (list of (int, number)) – Input vector in BoW format.
  • length (float) – Length of vector
Returns:

Normalized vector in BoW format.

Return type:

list of (int, number)

gensim.matutils.scipy2scipy_clipped(matrix, topn, eps=1e-09)

Get a scipy.sparse vector / matrix consisting of ‘topn’ elements of the greatest magnitude (absolute value).

Parameters:
  • matrix (scipy.sparse) – Input vector / matrix.
  • topn (int) – Number of elements of greatest magnitude (absolute value) to keep in the result.
  • eps (float) – PARAMETER IGNORED.
Returns:

Clipped matrix.

Return type:

scipy.sparse.csr.csr_matrix

gensim.matutils.scipy2sparse(vec, eps=1e-09)

Convert a scipy.sparse vector into BoW format.

Parameters:
  • vec (scipy.sparse) – Sparse vector.
  • eps (float, optional) – Threshold value; coordinates smaller than eps (in absolute value) are omitted from the result.
Returns:

Vector in BoW format.

Return type:

list of (int, float)

gensim.matutils.softcossim(vec1, vec2, similarity_matrix)

Get Soft Cosine Measure between two vectors given a term similarity matrix.

Return Soft Cosine Measure between two sparse vectors given a sparse term similarity matrix in the scipy.sparse.csc_matrix format. The similarity is a number in the range [-1.0, 1.0]; higher means more similar.

Parameters:
  • vec1 (list of (int, float)) – A query vector in the BoW format.
  • vec2 (list of (int, float)) – A document vector in the BoW format.
  • similarity_matrix ({scipy.sparse.csc_matrix, scipy.sparse.csr_matrix}) – A term similarity matrix, typically produced by similarity_matrix().
Returns:

The Soft Cosine Measure between vec1 and vec2.

Return type:

similarity_matrix.dtype

Raises:

ValueError – When the term similarity matrix is in an unknown format.

See also

gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix()
A term similarity matrix produced from term embeddings.
gensim.similarities.docsim.SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.

References

Soft Cosine Measure was perhaps first defined by [sidorovetal14].

[sidorovetal14] Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.
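The Soft Cosine Measure replaces the plain dot product with a bilinear form: for vectors a and b and a term similarity matrix S, it is (a^T S b) / sqrt((a^T S a) (b^T S b)). The pure-Python sketch below illustrates this formula only; the names soft_cosine and toy_similarity are hypothetical, and a callable stands in for the sparse similarity matrix.

```python
import math


def soft_cosine(vec1, vec2, similarity):
    """Soft cosine between BoW vectors; similarity(i, j) plays the role of S[i, j]."""
    def bilinear(a, b):
        # a^T S b, expanded over the non-zero entries of both vectors.
        return sum(wa * similarity(i, j) * wb for i, wa in a for j, wb in b)

    len1, len2 = bilinear(vec1, vec1), bilinear(vec2, vec2)
    if len1 <= 0.0 or len2 <= 0.0:
        return 0.0  # convention used in this sketch
    result = bilinear(vec1, vec2) / math.sqrt(len1 * len2)
    # Guard against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, result))


def toy_similarity(i, j):
    """Assumed toy matrix: terms 0 and 1 are related with similarity 0.5."""
    if i == j:
        return 1.0
    return 0.5 if {i, j} == {0, 1} else 0.0


def identity(i, j):
    return 1.0 if i == j else 0.0


related = soft_cosine([(0, 1.0)], [(1, 1.0)], toy_similarity)
unrelated = soft_cosine([(0, 1.0)], [(1, 1.0)], identity)
```

With the identity matrix the measure reduces to ordinary cosine similarity, so two documents with no shared terms score 0; with related terms they score above 0.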
gensim.matutils.sparse2full(doc, length)

Convert a document in BoW format into dense numpy array.

Parameters:
  • doc (list of (int, number)) – Document in BoW format.
  • length (int) – Length of result vector.
Returns:

Dense variant of doc vector.

Return type:

numpy.ndarray

See also

full2sparse()

gensim.matutils.unitvec(vec, norm='l2')

Scale a vector to unit length.

Parameters:
  • vec ({numpy.ndarray, scipy.sparse, list of (int, float)}) – Input vector in any format.
  • norm ({'l1', 'l2'}, optional) – Normalization to use.
Returns:

Normalized vector in same format as vec.

Return type:

{numpy.ndarray, scipy.sparse, list of (int, float)}

Notes

A zero vector is returned unchanged.

gensim.matutils.veclen(vec)

Calculate length of vector.

Parameters: vec (list of (int, number)) – Input vector in BoW format.
Returns: Length of vec.
Return type: float
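Vector length and unit scaling for BoW vectors can be sketched together in pure Python. The names bow_length and bow_unit are hypothetical, length is taken to mean the Euclidean (l2) norm, and only list-of-tuples input is handled here.

```python
import math


def bow_length(vec):
    """Euclidean length of a BoW vector."""
    return math.sqrt(sum(weight * weight for _, weight in vec))


def bow_unit(vec, norm="l2"):
    """Scale a BoW vector to unit l1 or l2 length; a zero vector is unchanged."""
    if norm == "l1":
        length = sum(abs(weight) for _, weight in vec)
    elif norm == "l2":
        length = bow_length(vec)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        return list(vec)
    return [(term_id, weight / length) for term_id, weight in vec]


vec = [(0, 3.0), (2, 4.0)]
length = bow_length(vec)
unit_l2 = bow_unit(vec)
unit_l1 = bow_unit(vec, norm="l1")
zero = bow_unit([(0, 0.0)])
```

After l2 scaling the vector has Euclidean length 1; after l1 scaling the absolute weights sum to 1.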
gensim.matutils.zeros_aligned(shape, dtype, order='C', align=128)

Get array aligned at align byte boundary.

Parameters:
  • shape (int or (int, int)) – Shape of array.
  • dtype (data-type) – Data type of array.
  • order ({'C', 'F'}, optional) – Whether to store multidimensional data in C- or Fortran-contiguous (row- or column-wise) order in memory.
  • align (int, optional) – Boundary for alignment in bytes.
Returns:

Aligned array.

Return type:

numpy.ndarray