matutils – Math utils

This module contains math helper functions.

class gensim.matutils.Dense2Corpus(dense, documents_columns=True)

Bases: object

Treat a dense NumPy array as a sparse, streamed gensim corpus.

No data copy is made (changes to the underlying matrix imply changes in the corpus).

This is the mirror function to corpus2dense.
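The idea of streaming a dense matrix as sparse documents can be sketched in plain Python. This is an illustration only, not gensim's implementation: `dense2corpus_sketch` is a hypothetical name, nested lists stand in for a NumPy array, and dropping exact zeros is an assumption made for the example.

```python
def dense2corpus_sketch(dense, documents_columns=True):
    """Yield one sparse document (list of (term_id, value)) at a time.

    `dense` is a list of rows. With documents_columns=True each column is
    a document; otherwise each row is a document.
    """
    if documents_columns:
        num_docs = len(dense[0]) if dense else 0
        for j in range(num_docs):
            # Read column j lazily from the underlying matrix (no copy is kept).
            yield [(i, row[j]) for i, row in enumerate(dense) if row[j] != 0]
    else:
        for row in dense:
            yield [(i, v) for i, v in enumerate(row) if v != 0]


# 3 terms (rows) x 2 documents (columns)
matrix = [[1.0, 0.0],
          [0.0, 2.0],
          [3.0, 0.0]]
docs = list(dense2corpus_sketch(matrix))
```

Because the generator reads from `matrix` only when a document is requested, changes made to the matrix before iteration are reflected in the yielded documents.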

class gensim.matutils.MmReader(input, transposed=True)

Bases: object

Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the rows (~documents).

Note that the file is read into memory one document at a time, not the whole matrix at once. This allows us to process corpora which are larger than the available RAM.

Initialize the matrix reader.

The input refers to a file on local filesystem, which is expected to be in the sparse (coordinate) Matrix Market format. Documents are assumed to be rows of the matrix (and document features are columns).

input is either a string (file path) or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
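Reading a coordinate-format file one document at a time might look roughly like the sketch below. It is not MmReader itself: `iter_mm_docs` is a hypothetical helper, and it assumes that entries are sorted by row, that indices are 1-based as in the Matrix Market format, and that documents with no entries are simply skipped.

```python
import io
from itertools import groupby


def iter_mm_docs(fin):
    """Yield (docno, [(term_id, value), ...]) from a Matrix Market text stream."""
    # Drop comment/banner lines, which start with '%'.
    lines = (line for line in fin if not line.startswith("%"))
    next(lines)  # skip the "num_docs num_terms num_nnz" statistics line
    entries = (line.split() for line in lines if line.strip())
    # Consecutive entries with the same row id belong to one document.
    for docno, group in groupby(entries, key=lambda parts: int(parts[0])):
        yield docno - 1, [(int(t) - 1, float(v)) for _, t, v in group]


mm_text = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"
    "1 1 1.0\n"
    "1 3 0.5\n"
    "2 2 2.0\n"
)
mm_docs = list(iter_mm_docs(io.StringIO(mm_text)))
```

Only the entries of the current document are held in memory at any time, which is what makes larger-than-RAM corpora workable.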


Return document at file offset offset (in bytes)


Skip file headers that appear before the first document.

class gensim.matutils.MmWriter(fname)

Bases: object

Store a corpus in Matrix Market format.

Note that the output is written one document at a time, not the whole matrix at once. This allows us to process corpora which are larger than the available RAM.

NOTE: the output file is created in a single pass through the input corpus, so that the input can be a once-only stream (iterator). To achieve this, a fake MM header is written first, statistics are collected during the pass (shape of the matrix, number of non-zeroes), followed by a seek back to the beginning of the file, rewriting the fake header with proper values.
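The single-pass approach described in the note can be sketched with the standard library alone. This is a simplified illustration, not the MmWriter code: `write_mm_single_pass` and the 50-character placeholder width are assumptions made for the example.

```python
import io

HEADER_LINE = "%%MatrixMarket matrix coordinate real general\n"
PLACEHOLDER_WIDTH = 50  # assumed width reserved for the statistics line


def write_mm_single_pass(fout, corpus):
    """Write `corpus` (iterable of [(term_id, value), ...]) to a seekable stream.

    Returns (num_docs, num_terms, num_nnz) collected during the pass.
    """
    fout.write(HEADER_LINE)
    stats_pos = fout.tell()
    # Fake statistics line: blanks that are overwritten after the pass.
    fout.write(" " * PLACEHOLDER_WIDTH + "\n")

    num_docs, num_terms, num_nnz = 0, 0, 0
    for docno, doc in enumerate(corpus):
        for term_id, value in doc:
            # Matrix Market indices are 1-based.
            fout.write("%i %i %s\n" % (docno + 1, term_id + 1, value))
            num_terms = max(num_terms, term_id + 1)
            num_nnz += 1
        num_docs = docno + 1

    # Seek back and replace the placeholder with the real values.
    stats = "%i %i %i" % (num_docs, num_terms, num_nnz)
    assert len(stats) <= PLACEHOLDER_WIDTH
    fout.seek(stats_pos)
    fout.write(stats.ljust(PLACEHOLDER_WIDTH))
    return num_docs, num_terms, num_nnz


buf = io.StringIO()
# A generator, so the corpus can be traversed only once.
stream = (doc for doc in [[(0, 1.0), (2, 0.5)], [(1, 2.0)]])
shape = write_mm_single_pass(buf, stream)
lines = buf.getvalue().splitlines()
```

Padding the statistics line to a fixed width is what lets the rewrite happen in place without shifting the entries that follow it.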

HEADER_LINE = '%%MatrixMarket matrix coordinate real general\n'
fake_headers(num_docs, num_terms, num_nnz)
static write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False)

Save the vector space representation of an entire corpus to disk.

Note that the documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.

write_headers(num_docs, num_terms, num_nnz)
write_vector(docno, vector)

Write a single sparse vector to the file.

A sparse vector is any iterable yielding (field id, field value) pairs.

class gensim.matutils.Scipy2Corpus(vecs)

Bases: object

Convert a sequence of dense/sparse vectors into a streamed gensim corpus object.

This is the mirror function to corpus2csc.

vecs is a sequence of dense and/or sparse vectors, such as a 2d np array, or a scipy.sparse.csc_matrix, or any sequence containing a mix of 1d np/scipy vectors.

class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True)

Bases: object

Convert a matrix in scipy.sparse format into a streaming gensim corpus.

This is the mirror function to corpus2csc.

gensim.matutils.any2sparse(vec, eps=1e-09)

Convert a np/scipy vector into gensim document format (=list of 2-tuples).

gensim.matutils.argsort(x, topn=None, reverse=False)

Return indices of the topn smallest elements in array x, in ascending order.

If reverse is True, return the greatest elements instead, in descending order.
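The described behaviour can be mimicked in plain Python as follows. `argsort_sketch` is a hypothetical stand-in that works on ordinary lists; it makes no claim about how gensim computes the result internally.

```python
def argsort_sketch(x, topn=None, reverse=False):
    """Return indices of the topn smallest (or, if reverse, greatest) items of x."""
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=reverse)
    return order if topn is None else order[:topn]


scores = [0.3, 0.9, 0.1, 0.5]
smallest_two = argsort_sketch(scores, topn=2)               # indices of 0.1 and 0.3
largest_two = argsort_sketch(scores, topn=2, reverse=True)  # indices of 0.9 and 0.5
```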

gensim.matutils.blas(name, ndarray)
gensim.matutils.corpus2csc(corpus, num_terms=None, dtype=<type 'numpy.float64'>, num_docs=None, num_nnz=None, printprogress=0)

Convert a streamed corpus into a sparse matrix, in scipy.sparse.csc_matrix format, with documents as columns.

If the numbers of terms, documents and non-zero elements are known, you can pass them here as parameters and a more memory-efficient code path will be taken.

The input corpus may be a non-repeatable stream (generator).

This is the mirror function to Sparse2Corpus.

gensim.matutils.corpus2dense(corpus, num_terms, num_docs=None, dtype=<type 'numpy.float32'>)

Convert corpus into a dense np array (documents will be columns). You must supply the number of features num_terms, because dimensionality cannot be deduced from the sparse vectors alone.

You can optionally supply num_docs (=the corpus length) as well, so that a more memory-efficient code path is taken.

This is the mirror function to Dense2Corpus.
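The sketch below shows why num_terms has to be supplied: sparse documents list only their non-zero features, so the number of rows cannot be read off the data. It is a plain-Python illustration with a hypothetical name (`corpus2dense_sketch`) and nested lists in place of a NumPy array.

```python
def corpus2dense_sketch(corpus, num_terms):
    """Return a num_terms x num_docs nested list, one column per document."""
    docs = list(corpus)  # the number of columns is known only after the pass
    dense = [[0.0] * len(docs) for _ in range(num_terms)]
    for j, doc in enumerate(docs):
        for term_id, value in doc:
            dense[term_id][j] = value
    return dense


# Feature 2 never occurs, yet it still needs a row in the output.
dense_matrix = corpus2dense_sketch([[(0, 1.0)], [(1, 2.0)]], num_terms=3)
```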

gensim.matutils.cossim(vec1, vec2)

Return cosine similarity between two sparse vectors. The similarity is a number in the range [-1.0, 1.0]; higher means more similar.
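For reference, cosine similarity over vectors in the gensim document format (lists of 2-tuples) can be computed as in the sketch below. `cossim_sketch` is a hypothetical name, and returning 0.0 for empty or all-zero input is an assumption of this example.

```python
import math


def cossim_sketch(vec1, vec2):
    """Cosine similarity of two sparse vectors given as [(id, value), ...]."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for empty or all-zero input
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(v * d2[k] for k, v in d1.items() if k in d2)
    return dot / (norm1 * norm2)


same = cossim_sketch([(0, 1.0), (1, 1.0)], [(0, 1.0), (1, 1.0)])
orthogonal = cossim_sketch([(0, 1.0)], [(1, 1.0)])
opposite = cossim_sketch([(0, 1.0)], [(0, -2.0)])
```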

gensim.matutils.dense2vec(vec, eps=1e-09)

Convert a dense np array into the sparse document format (sequence of 2-tuples).

Values of magnitude < eps are treated as zero (ignored).

This is the mirror function to sparse2full.


For a vector theta~Dir(alpha), compute E[log(theta)].

gensim.matutils.full2sparse(vec, eps=1e-09)

Convert a dense np array into the sparse document format (sequence of 2-tuples).

Values of magnitude < eps are treated as zero (ignored).

This is the mirror function to sparse2full.

gensim.matutils.full2sparse_clipped(vec, topn, eps=1e-09)

Like full2sparse, but only return the topn elements of the greatest magnitude (abs).
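A plain-Python sketch of the clipping idea follows. `full2sparse_clipped_sketch` is a hypothetical name; returning the entries ordered by decreasing magnitude is an assumption of this example rather than a documented guarantee.

```python
def full2sparse_clipped_sketch(vec, topn, eps=1e-9):
    """Keep only the topn entries of greatest magnitude, as (index, value) pairs."""
    if topn <= 0:
        return []
    # Rank positions by absolute value, largest first.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    # Values of magnitude < eps are still treated as zero.
    return [(i, vec[i]) for i in order[:topn] if abs(vec[i]) >= eps]


clipped = full2sparse_clipped_sketch([0.1, -3.0, 0.0, 2.0], topn=2)
```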

gensim.matutils.hellinger(vec1, vec2)

Return the Hellinger distance, a metric that quantifies the similarity between two probability distributions. The distance is a number in the range [0, 1], where 0 is minimum distance (maximum similarity) and 1 is maximum distance (minimum similarity).
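The standard Hellinger formula, sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)), can be sketched for sparse inputs as below. `hellinger_sketch` is a hypothetical name, and the inputs are assumed to be valid probability distributions (non-negative values summing to 1).

```python
import math


def hellinger_sketch(vec1, vec2):
    """Hellinger distance of two distributions given as [(id, prob), ...]."""
    p, q = dict(vec1), dict(vec2)
    # Ids missing from one distribution have probability 0 there.
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
        for k in set(p) | set(q)
    )
    return math.sqrt(0.5 * total)


identical = hellinger_sketch([(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)])
disjoint = hellinger_sketch([(0, 1.0)], [(1, 1.0)])
```

Identical distributions give the minimum distance and distributions with no shared support give the maximum.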


Check whether the vector passed is in bag-of-words representation. vec is considered to be in bag-of-words format if it is in 2-tuple format.

gensim.matutils.jaccard(vec1, vec2)

A distance metric between bag-of-words representations. Returns 1 minus the intersection divided by the union, where the union is the sum of the sizes of the two bags. If the input is not a bag-of-words representation, the union and intersection are calculated in the traditional manner. Returns a value in the range [0, 1], where values closer to 0 mean less distance and thus higher similarity.

gensim.matutils.kullback_leibler(vec1, vec2, num_features=None)

A distance measure between two probability distributions. Returns a non-negative value, where values closer to 0 mean less distance (and a higher similarity). Note that the Kullback-Leibler divergence is not bounded above by 1 and is not symmetric in its arguments. Uses the scipy.stats.entropy method to compute the divergence value. If the distribution draws from a certain number of docs, that value must be passed (as num_features).

gensim.matutils.pad(mat, padrow, padcol)

Add additional rows/columns to a np.matrix mat. The new rows/columns will be initialized with zeros.


Return QR decomposition of la[0]. Content of la gets destroyed in the process.

Using this function should be less memory-intensive than calling scipy.linalg.qr(la[0]), because the memory used in la[0] is reclaimed earlier.

gensim.matutils.ret_log_normalize_vec(vec, axis=1)
gensim.matutils.ret_normalized_vec(vec, length)
gensim.matutils.scipy2sparse(vec, eps=1e-09)

Convert a scipy.sparse vector into gensim document format (=list of 2-tuples).

gensim.matutils.sparse2full(doc, length)

Convert a document in sparse document format (=sequence of 2-tuples) into a dense np array (of size length).

This is the mirror function to full2sparse.
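The round trip between the two formats can be sketched in plain Python. Both function names are hypothetical, and ordinary lists stand in for NumPy arrays; the sketch follows the documented rule that values of magnitude below eps are treated as zero.

```python
def full2sparse_sketch(vec, eps=1e-9):
    """Dense sequence -> [(index, value), ...], ignoring near-zero values."""
    return [(i, v) for i, v in enumerate(vec) if abs(v) >= eps]


def sparse2full_sketch(doc, length):
    """[(index, value), ...] -> dense list of the given length."""
    dense = [0.0] * length
    for i, v in doc:
        dense[i] = v
    return dense


dense_vec = [0.0, 2.5, 0.0, -1.0, 1e-12]
sparse_vec = full2sparse_sketch(dense_vec)
restored = sparse2full_sketch(sparse_vec, len(dense_vec))
```

The length has to be given explicitly on the way back because trailing zeros leave no trace in the sparse form. Values below eps come back as exact zeros, so the round trip is lossy for them.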

gensim.matutils.triu_indices(n, k=0)
gensim.matutils.unitvec(vec, norm='l2')

Scale a vector to unit length. The only exception is the zero vector, which is returned unchanged.

Output will be in the same format as input (i.e., gensim vector=>gensim vector, or np array=>np array, scipy.sparse=>scipy.sparse).
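For the gensim vector case, the scaling can be sketched as follows. `unitvec_sketch` is a hypothetical name that covers only the list-of-2-tuples format; supporting 'l1' alongside the default 'l2' is an assumption of this example.

```python
import math


def unitvec_sketch(vec, norm="l2"):
    """Scale a sparse vector [(id, value), ...] to unit length."""
    if norm == "l1":
        length = sum(abs(v) for _, v in vec)
    elif norm == "l2":
        length = math.sqrt(sum(v * v for _, v in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0.0:
        return list(vec)  # the zero vector is returned unchanged
    return [(i, v / length) for i, v in vec]


unit_l2 = unitvec_sketch([(0, 3.0), (1, 4.0)])
unit_l1 = unitvec_sketch([(0, 1.0), (1, -3.0)], norm="l1")
zero = unitvec_sketch([(0, 0.0)])
```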

gensim.matutils.zeros_aligned(shape, dtype, order='C', align=128)

Like np.zeros(), but the array will be aligned to an align-byte boundary.