matutils – Math utils

This module contains math helper functions.

class gensim.matutils.Dense2Corpus(dense, documents_columns=True)

Treat dense numpy array as a sparse, streamed gensim corpus.

No data copy is made (changes to the underlying matrix imply changes in the corpus).
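
The streaming, copy-free behaviour can be illustrated with a plain-Python sketch. This is not gensim's implementation: `dense_columns_as_corpus` is a hypothetical name, the matrix is a list of lists rather than a numpy array, and the `eps` cut-off is an assumption borrowed from full2sparse.

```python
def dense_columns_as_corpus(dense, eps=1e-9):
    """Yield each column of a row-major matrix as a list of (row_id, value) pairs."""
    num_rows = len(dense)
    num_cols = len(dense[0]) if num_rows else 0
    for col in range(num_cols):
        # Values are read from `dense` at iteration time, so nothing is copied up front.
        yield [(row, float(dense[row][col]))
               for row in range(num_rows)
               if abs(dense[row][col]) > eps]


matrix = [[1.0, 0.0],
          [0.0, 2.0],
          [3.0, 0.0]]  # 3 features x 2 documents, documents as columns
docs = list(dense_columns_as_corpus(matrix))

# Changing the underlying matrix changes what the next iteration yields.
matrix[1][0] = 5.0
first_doc_after_change = next(dense_columns_as_corpus(matrix))
```

Because the generator reads from the matrix lazily, the second pass sees the updated value, which mirrors the "no data copy" note above.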

class gensim.matutils.MmReader(input, transposed=True)

Wrap a term-document matrix stored on disk in Matrix Market format and present it as an object that supports iteration over its rows (documents).

Note that the file is read into memory one document at a time, not the whole matrix at once (unlike scipy.io.mmread). This allows us to process corpora which are larger than the available RAM.

Initialize the matrix reader.

The input refers to a file on the local filesystem, which is expected to be in the sparse (coordinate) Matrix Market format. Documents are assumed to be rows of the matrix (and document features are columns).

input is either a string (file path) or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
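
As a rough illustration of reading one document at a time, the sketch below streams a coordinate-format Matrix Market file from any line iterator. It is not gensim's reader: `iter_mm_documents` is a hypothetical name, entries are assumed to be grouped by document, and documents with no entries are simply skipped.

```python
import io
from itertools import groupby


def iter_mm_documents(stream):
    """Yield (doc_id, [(term_id, value), ...]) one document at a time."""
    for line in stream:
        if not line.startswith("%"):
            break  # first non-comment line: "num_docs num_terms num_nnz"
    # Remaining lines are "doc term value" triples with 1-based indices.
    entries = (line.split() for line in stream if line.strip())
    for doc_id, group in groupby(entries, key=lambda parts: int(parts[0])):
        yield doc_id - 1, [(int(term) - 1, float(value)) for _, term, value in group]


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"
    "1 1 0.5\n"
    "1 3 2.0\n"
    "2 2 1.0\n"
)
streamed = list(iter_mm_documents(sample))
```

Only the entries of the current document are held in memory at any time, which is the property that lets a reader of this kind handle files larger than RAM.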

docbyoffset(offset)

Return the document stored at the given file offset (in bytes).

skip_headers(input_file)

Skip file headers that appear before the first document.

class gensim.matutils.MmWriter(fname)

Store a corpus in Matrix Market format.

Note that the output is written one document at a time, not the whole matrix at once (unlike scipy.io.mmwrite). This allows us to process corpora which are larger than the available RAM.

NOTE: the output file is created in a single pass through the input corpus, so the input can be a once-only stream (iterator). To achieve this, a placeholder MM header is written first and statistics (matrix shape, number of non-zeroes) are collected during the pass; the writer then seeks back to the beginning of the file and overwrites the placeholder header with the actual values.
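
The single-pass technique can be sketched as follows. This is an illustration, not gensim's code: `write_mm_single_pass`, `HEADER` and `STATS_WIDTH` are hypothetical names, and the fixed-width placeholder is one assumed way of reserving room for the statistics line.

```python
import io

HEADER = "%%MatrixMarket matrix coordinate real general\n"
STATS_WIDTH = 50  # assumed fixed width reserved for the "docs terms nnz" line


def write_mm_single_pass(out, corpus):
    """Write `corpus` (any iterable of sparse vectors) to a seekable text stream."""
    out.write(HEADER)
    stats_pos = out.tell()
    out.write(" " * STATS_WIDTH + "\n")  # placeholder, overwritten at the end
    num_docs = num_terms = num_nnz = 0
    for doc_no, doc in enumerate(corpus):
        for term_id, value in sorted(doc):
            out.write("%d %d %s\n" % (doc_no + 1, term_id + 1, value))  # 1-based
            num_terms = max(num_terms, term_id + 1)
            num_nnz += 1
        num_docs = doc_no + 1
    # Seek back and replace the placeholder with the collected statistics.
    out.seek(stats_pos)
    out.write(("%d %d %d" % (num_docs, num_terms, num_nnz)).ljust(STATS_WIDTH))
    return num_docs, num_terms, num_nnz


buf = io.StringIO()
shape = write_mm_single_pass(buf, iter([[(0, 1.0), (2, 2.5)], [(1, 3.0)]]))
```

The corpus is passed as a one-shot iterator here to show that a second pass is never needed.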

static write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False)

Save the vector space representation of an entire corpus to disk.

Note that the documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.

write_vector(docno, vector)

Write a single sparse vector to the file.

A sparse vector is any iterable yielding (field id, field value) pairs.

class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True)

Convert a matrix in scipy.sparse format into a streaming gensim corpus.

gensim.matutils.any2sparse(vec, eps=1e-09)

Convert a numpy/scipy vector into gensim format (list of 2-tuples).

gensim.matutils.argsort(x, topn=None)

Return indices of the topn greatest elements in numpy array x, in order.
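
A pure-Python sketch of the idea, using a heap so that the whole sequence need not be sorted. `top_indices` is a hypothetical name, and the descending result order is an assumption about what "in order" means here.

```python
import heapq


def top_indices(values, topn=None):
    """Return indices of the `topn` greatest values, greatest first."""
    if topn is None:
        topn = len(values)
    if topn <= 0:
        return []
    return heapq.nlargest(topn, range(len(values)), key=values.__getitem__)


scores = [0.1, 0.9, 0.5, 0.7]
best_two = top_indices(scores, topn=2)
everything = top_indices(scores)
```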

gensim.matutils.corpus2csc(corpus, num_terms=None, dtype=numpy.float64, num_docs=None, num_nnz=None, printprogress=0)

Convert corpus into a sparse matrix, in scipy.sparse.csc_matrix format, with documents as columns.

If the numbers of terms, documents and non-zero elements are known, you can pass them here as parameters and a more memory-efficient code path will be taken.
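
To make the "documents as columns" layout concrete, the sketch below builds the three arrays of the compressed sparse column format in plain Python. `corpus_to_csc_arrays` is a hypothetical helper, not part of gensim; the arrays it returns have the form accepted by scipy.sparse.csc_matrix((data, indices, indptr), shape=shape).

```python
def corpus_to_csc_arrays(corpus):
    """Return (data, indices, indptr, shape) with one CSC column per document."""
    data, indices, indptr = [], [], [0]
    num_terms = 0
    for doc in corpus:
        for term_id, value in sorted(doc):
            indices.append(term_id)  # row index = term id
            data.append(value)
            num_terms = max(num_terms, term_id + 1)
        indptr.append(len(data))  # marks where this document's column ends
    return data, indices, indptr, (num_terms, len(indptr) - 1)


data, indices, indptr, shape = corpus_to_csc_arrays([[(0, 1.0), (2, 2.0)], [(1, 3.0)]])
```

The lists grow as documents arrive because the final sizes are unknown in advance; if the counts were known up front, the arrays could be allocated once at their final size, which is presumably what the memory-efficient path relies on.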

gensim.matutils.corpus2dense(corpus, num_terms, num_docs=None, dtype=numpy.float32)

Convert corpus into a dense numpy array (documents will be columns). You must supply the number of features num_terms, because dimensionality cannot be deduced from sparse vectors alone.

You can optionally supply num_docs (the corpus length) as well, so that a more memory-efficient code path is taken.

gensim.matutils.cossim(vec1, vec2)

Return the cosine similarity between two sparse vectors. The similarity is a number in the interval [-1.0, 1.0]; higher means more similar.
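
For reference, cosine similarity over sparse (id, value) vectors can be computed as in the sketch below. `cosine_similarity_sparse` is a hypothetical name, and returning 0.0 for empty or all-zero vectors is an assumption of this sketch.

```python
import math


def cosine_similarity_sparse(vec1, vec2):
    """Cosine of the angle between two sparse vectors given as (id, value) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for empty or zero vectors
    if len(d2) < len(d1):
        d1, d2 = d2, d1  # iterate over the shorter vector
    dot = sum(value * d2.get(idx, 0.0) for idx, value in d1.items())
    return dot / (norm1 * norm2)


same = cosine_similarity_sparse([(0, 1.0), (1, 2.0)], [(0, 1.0), (1, 2.0)])
orthogonal = cosine_similarity_sparse([(0, 1.0)], [(1, 1.0)])
opposite = cosine_similarity_sparse([(0, 1.0)], [(0, -3.0)])
```

Identical directions give a value near 1.0, vectors with no shared ids give 0.0, and opposite directions give -1.0.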

gensim.matutils.dense2vec(vec, eps=1e-09)

Convert a dense numpy array into the sparse corpus format (sequence of 2-tuples).

Values of magnitude < eps are treated as zero (ignored).

gensim.matutils.full2sparse(vec, eps=1e-09)

Convert a dense numpy array into the sparse corpus format (sequence of 2-tuples).

Values of magnitude < eps are treated as zero (ignored).
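
The conversion amounts to keeping only the entries whose magnitude reaches the threshold. A minimal plain-Python sketch, with `dense_to_sparse` as a hypothetical name and a list standing in for the numpy array:

```python
def dense_to_sparse(vec, eps=1e-9):
    """Keep (index, value) pairs whose magnitude is at least `eps`."""
    return [(idx, float(value)) for idx, value in enumerate(vec) if abs(value) >= eps]


sparse = dense_to_sparse([0.0, 1.5, 1e-12, -2.0])
```

Here the exact zero and the 1e-12 entry are both dropped, while the negative value is kept because only its magnitude is compared against eps.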

gensim.matutils.full2sparse_clipped(vec, topn, eps=1e-09)

Like full2sparse, but only return the topn greatest elements (not all).

gensim.matutils.pad(mat, padrow, padcol)

Add rows/columns to a numpy.matrix mat. The new rows/columns will be initialized with zeros.

gensim.matutils.qr_destroy(la)

Return QR decomposition of la[0]. Content of la gets destroyed in the process.

Using this function should be less memory-intensive than calling scipy.linalg.qr(la[0]), because the memory used in la[0] is reclaimed earlier.

gensim.matutils.scipy2sparse(vec, eps=1e-09)

Convert a scipy.sparse vector to gensim format (list of 2-tuples).

gensim.matutils.sparse2full(doc, length)

Convert a document in sparse corpus format (sequence of 2-tuples) into a dense numpy array (of size length).
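
The length has to be given explicitly because a sparse document does not record how many trailing zeros it has. A plain-Python sketch (`sparse_to_dense` is a hypothetical name; a list stands in for the numpy array):

```python
def sparse_to_dense(doc, length):
    """Expand (index, value) pairs into a dense list of the requested length."""
    dense = [0.0] * length
    for idx, value in doc:
        dense[idx] = value
    return dense


doc = [(1, 1.5), (3, -2.0)]
as_four = sparse_to_dense(doc, 4)
as_six = sparse_to_dense(doc, 6)
```

The same sparse document expands to vectors of different sizes depending on the length supplied.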

gensim.matutils.unitvec(vec)

Scale a vector to unit length. The only exception is the zero vector, which is returned back unchanged.

Output will be in the same format as input (i.e., gensim vector => gensim vector, numpy array => numpy array, scipy.sparse => scipy.sparse).
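
For the gensim sparse format only, the scaling can be sketched as below; `unit_vector_sparse` is a hypothetical name and the numpy and scipy.sparse cases are not covered.

```python
import math


def unit_vector_sparse(vec):
    """Scale a list of (id, value) pairs to Euclidean length 1."""
    length = math.sqrt(sum(value * value for _, value in vec))
    if length == 0.0:
        return vec  # the zero vector is returned unchanged
    return [(idx, value / length) for idx, value in vec]


scaled = unit_vector_sparse([(0, 3.0), (1, 4.0)])
zero = unit_vector_sparse([(0, 0.0)])
```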

gensim.matutils.zeros_aligned(shape, dtype, order='C', align=128)

Like numpy.zeros(), but the returned array is aligned on an align-byte boundary.