This module contains math helper functions.
Treat a dense numpy array as a sparse, streamed gensim corpus.
No data copy is made (changes to the underlying matrix imply changes in the corpus).
This is the mirror function to corpus2dense.
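A minimal sketch of the idea, not the library's implementation: the hypothetical helper iter_dense_as_corpus walks the columns of a 2-D numpy array and yields each one as a list of (feature id, value) pairs. It assumes documents are stored as columns (matching corpus2dense); the helper name and the eps threshold are illustrative only.

```python
import numpy as np

def iter_dense_as_corpus(dense, eps=1e-9):
    """Yield each column of `dense` as a list of (feature_id, value) pairs."""
    for col in dense.T:  # rows of the transpose are views onto the columns
        nonzero = np.nonzero(np.abs(col) > eps)[0]
        yield [(int(i), float(col[i])) for i in nonzero]

# 3 features (rows) x 2 documents (columns)
dense = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 0.0]])
corpus = list(iter_dense_as_corpus(dense))
```

Because the generator reads from views of the array, values changed in dense before a document is consumed show up in that document.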
Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the rows (~documents).
Note that the file is read into memory one document at a time, not the whole matrix at once (unlike scipy.io.mmread). This allows us to process corpora which are larger than the available RAM.
Initialize the matrix reader.
The input refers to a file on the local filesystem, which is expected to be in the sparse (coordinate) Matrix Market format. Documents are assumed to be rows of the matrix (and document features are columns).
input is either a string (file path) or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
Return the document stored at byte position offset in the file.
Skip file headers that appear before the first document.
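The streaming read described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reader's actual code: iter_mm_documents is a hypothetical name, the input is an open text file in coordinate format, and entries are assumed to be grouped by row in ascending order.

```python
import io

def iter_mm_documents(fileobj):
    """Stream (doc_id, document) pairs from a coordinate Matrix Market file."""
    lines = iter(fileobj)
    # Skip the banner and comment lines (they start with '%'), then read sizes.
    for line in lines:
        if not line.startswith('%'):
            num_docs, num_terms, num_nnz = map(int, line.split())
            break
    else:
        return  # no size line found: nothing to yield
    prev_id, doc = -1, []
    for line in lines:
        if not line.strip():
            continue
        row, col, val = line.split()
        doc_id, term_id = int(row) - 1, int(col) - 1  # MM indices are 1-based
        if doc_id != prev_id:
            if prev_id >= 0:
                yield prev_id, doc
            # Rows without any entries are empty documents.
            for empty_id in range(prev_id + 1, doc_id):
                yield empty_id, []
            prev_id, doc = doc_id, []
        doc.append((term_id, float(val)))
    if prev_id >= 0:
        yield prev_id, doc
    for empty_id in range(prev_id + 1, num_docs):
        yield empty_id, []

sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "% a comment line\n"
    "3 4 3\n"
    "1 1 0.5\n"
    "1 3 2.0\n"
    "3 2 1.5\n"
)
docs = list(iter_mm_documents(sample))
```

Only one document is held in memory at a time, which is what makes corpora larger than RAM workable.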
Store a corpus in Matrix Market format.
Note that the output is written one document at a time, not the whole matrix at once (unlike scipy.io.mmwrite). This allows us to process corpora which are larger than the available RAM.
NOTE: the output file is created in a single pass through the input corpus, so that the input can be a once-only stream (iterator). To achieve this, a fake MM header is written first, statistics are collected during the pass (shape of the matrix, number of non-zeroes), followed by a seek back to the beginning of the file, rewriting the fake header with proper values.
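The single-pass trick above can be sketched like this. It is a simplified illustration, not the writer's real code: write_mm, BANNER and WIDTH are hypothetical names, and the placeholder is assumed to be a fixed-width blank line so that the real statistics can overwrite it without shifting the rest of the file.

```python
import io

BANNER = "%%MatrixMarket matrix coordinate real general\n"
WIDTH = 50  # characters reserved for the "rows cols nnz" line

def write_mm(corpus, fout):
    """Write `corpus` to a seekable text file in one pass over the input."""
    fout.write(BANNER)
    header_pos = fout.tell()
    fout.write(" " * WIDTH + "\n")  # placeholder header, overwritten below
    num_docs, num_terms, num_nnz = 0, 0, 0
    for doc_no, doc in enumerate(corpus):
        for term_id, value in sorted(doc):
            # Matrix Market indices are 1-based.
            fout.write("%i %i %s\n" % (doc_no + 1, term_id + 1, value))
            num_terms = max(num_terms, term_id + 1)
            num_nnz += 1
        num_docs = doc_no + 1
    stats = "%i %i %i" % (num_docs, num_terms, num_nnz)
    if len(stats) > WIDTH:
        raise ValueError("statistics do not fit into the reserved header")
    fout.seek(header_pos)
    fout.write(stats.ljust(WIDTH))  # same length as the placeholder
    fout.seek(0, io.SEEK_END)
    return num_docs, num_terms, num_nnz

buf = io.StringIO()
stream = iter([[(0, 0.5), (2, 2.0)], [], [(1, 1.5)]])  # a once-only iterator
stats = write_mm(stream, buf)
lines = buf.getvalue().splitlines()
```

Padding the size line with spaces keeps it parseable, since the fields are whitespace-separated.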
Save the vector space representation of an entire corpus to disk.
Note that the documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.
Write a single sparse vector to the file.
Sparse vector is any iterable yielding (field id, field value) pairs.
Convert a sequence of dense/sparse vectors into a streamed gensim corpus object.
This is the mirror function to corpus2csc.
vecs is a sequence of dense and/or sparse vectors, such as a 2d numpy array, or a scipy.sparse.csc_matrix, or any sequence containing a mix of 1d numpy/scipy vectors.
Convert a matrix in scipy.sparse format into a streaming gensim corpus.
This is the mirror function to corpus2csc.
Convert a numpy/scipy vector into gensim document format (=list of 2-tuples).
Return indices of the topn greatest elements in numpy array x, in descending order of value (greatest first).
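One way to get such indices without sorting the whole array is a partial sort. The sketch below uses numpy.argpartition and returns the indices greatest first; top_indices is a hypothetical name, and this is an illustration rather than the module's own code.

```python
import numpy as np

def top_indices(x, topn):
    """Indices of the `topn` largest values of 1-D array `x`, largest first."""
    x = np.asarray(x)
    if topn <= 0:
        return np.array([], dtype=int)
    if topn >= x.size:
        return np.argsort(x)[::-1]
    # argpartition isolates the topn largest values without a full sort;
    # only that small subset is then sorted.
    biggest = np.argpartition(x, -topn)[-topn:]
    return biggest[np.argsort(x[biggest])[::-1]]

best = top_indices([0.1, 0.9, 0.3, 0.7], topn=2)
```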
Convert a streamed corpus into a sparse matrix, in scipy.sparse.csc_matrix format, with documents as columns.
If the numbers of terms, documents and non-zero elements are known, you can pass them here as parameters and a more memory-efficient code path will be taken.
The input corpus may be a non-repeatable stream (generator).
This is the mirror function to Sparse2Corpus.
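To illustrate how a once-only stream can become a column-oriented sparse matrix, the sketch below collects the three standard CSC arrays in a single pass. corpus_to_csc_arrays is a hypothetical helper written in plain Python; it is not the library's code and does not include the preallocated fast path.

```python
def corpus_to_csc_arrays(corpus):
    """Collect CSC arrays (data, indices, indptr) with one column per document."""
    data, indices, indptr = [], [], [0]
    num_terms = 0
    for doc in corpus:  # single pass, so a generator works
        for term_id, value in sorted(doc):
            indices.append(term_id)  # row index = term id
            data.append(value)
            num_terms = max(num_terms, term_id + 1)
        indptr.append(len(data))  # offset at which the next column starts
    shape = (num_terms, len(indptr) - 1)
    return data, indices, indptr, shape

stream = (doc for doc in [[(0, 1.0), (2, 3.0)], [(1, 2.0)]])
data, indices, indptr, shape = corpus_to_csc_arrays(stream)
```

With scipy installed, scipy.sparse.csc_matrix((data, indices, indptr), shape=shape) builds the matrix from these arrays. When the three sizes are known in advance, the lists can presumably be replaced by preallocated arrays, which is the kind of saving the text refers to.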
Convert corpus into a dense numpy array (documents will be columns). You must supply the number of features num_terms, because dimensionality cannot be deduced from the sparse vectors alone.
You can optionally supply num_docs (=the corpus length) as well, so that a more memory-efficient code path is taken.
This is the mirror function to Dense2Corpus.
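A rough sketch of the two code paths, assuming documents become columns as stated above. corpus_to_dense is a hypothetical name and the float32 default is an assumption made for the example.

```python
import numpy as np

def corpus_to_dense(corpus, num_terms, num_docs=None, dtype=np.float32):
    """Return a (num_terms x num_docs) dense array with documents as columns."""
    if num_docs is not None:
        # Known length: fill one preallocated array in place.
        result = np.zeros((num_terms, num_docs), dtype=dtype)
        for doc_no, doc in enumerate(corpus):
            for term_id, value in doc:
                result[term_id, doc_no] = value
        return result
    # Unknown length: build one dense column per document, then stack them.
    columns = []
    for doc in corpus:
        col = np.zeros(num_terms, dtype=dtype)
        for term_id, value in doc:
            col[term_id] = value
        columns.append(col)
    if not columns:
        return np.zeros((num_terms, 0), dtype=dtype)
    return np.column_stack(columns)

docs = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
fast = corpus_to_dense(docs, num_terms=3, num_docs=2)
slow = corpus_to_dense(iter(docs), num_terms=3)
```

The second path temporarily holds both the list of columns and the stacked result, which is why supplying num_docs saves memory.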
Return cosine similarity between two sparse vectors. The similarity is a number in the range [-1.0, 1.0]; higher means more similar.
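For reference, cosine similarity is the dot product divided by the product of the two vector lengths. A minimal sketch for vectors given as (id, value) pairs follows; cosine_similarity is a hypothetical name, and returning 0.0 for an empty or all-zero vector is an assumption of this example.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, value) pairs."""
    vec1, vec2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: similarity with a zero vector is 0
    if len(vec2) < len(vec1):
        vec1, vec2 = vec2, vec1  # iterate over the shorter vector
    dot = sum(value * vec2.get(index, 0.0) for index, value in vec1.items())
    return dot / (norm1 * norm2)

sim = cosine_similarity([(0, 1.0), (1, 1.0)], [(0, 1.0)])
```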
Convert a dense numpy array into the sparse document format (sequence of 2-tuples).
Values of magnitude < eps are treated as zero (ignored).
This is the mirror function to sparse2full.
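A short sketch of the conversion, assuming a 1-D input. dense_to_sparse is a hypothetical name, and the default eps value is chosen for the example only.

```python
import numpy as np

def dense_to_sparse(vec, eps=1e-9):
    """Return (index, value) pairs for entries whose magnitude is at least eps."""
    vec = np.asarray(vec, dtype=float)
    keep = np.nonzero(np.abs(vec) >= eps)[0]
    return list(zip(keep.tolist(), vec[keep].tolist()))

sparse = dense_to_sparse([0.0, 1.5, 1e-12, -2.0])
```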
Like full2sparse, but only return the topn elements of the greatest magnitude (abs).
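As an illustration, the clipped variant can be sketched by ranking entries by absolute value and keeping the first topn. dense_to_sparse_clipped is a hypothetical name; the full argsort is used for simplicity, and returning the pairs largest magnitude first is a choice made for this example.

```python
import numpy as np

def dense_to_sparse_clipped(vec, topn, eps=1e-9):
    """Keep only the `topn` entries of greatest magnitude, as (index, value)."""
    if topn <= 0:
        return []
    vec = np.asarray(vec, dtype=float)
    magnitudes = np.abs(vec)
    order = np.argsort(magnitudes)[::-1][:topn]  # largest magnitude first
    return [(int(i), float(vec[i])) for i in order if magnitudes[i] >= eps]

clipped = dense_to_sparse_clipped([0.1, -0.9, 0.0, 0.5], topn=2)
```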
Add rows/columns to a numpy.matrix mat. The new rows/columns will be initialized with zeros.
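A simple way to picture the padding is to copy the input into the top-left corner of a larger zero array. The sketch below does that with a plain numpy.ndarray rather than numpy.matrix; pad_with_zeros and its parameter names are hypothetical, and the actual function may be implemented differently.

```python
import numpy as np

def pad_with_zeros(mat, padrow, padcol):
    """Return a copy of 2-D `mat` with extra zero rows and columns appended."""
    rows, cols = mat.shape
    padded = np.zeros((rows + padrow, cols + padcol), dtype=mat.dtype)
    padded[:rows, :cols] = mat  # original values stay in the top-left corner
    return padded

padded = pad_with_zeros(np.array([[1, 2], [3, 4]]), padrow=1, padcol=2)
```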
Return QR decomposition of la[0]. Content of la gets destroyed in the process.
Using this function should be less memory-intensive than calling scipy.linalg.qr(la[0]), because the memory used by la[0] is reclaimed earlier.
Convert a scipy.sparse vector into gensim document format (=list of 2-tuples).
Convert a document in sparse document format (=sequence of 2-tuples) into a dense numpy array (of size length).
This is the mirror function to full2sparse.
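The reverse conversion is a matter of scattering the pairs into a zero array. A minimal sketch, with sparse_to_dense as a hypothetical name and float32 as an assumed dtype:

```python
import numpy as np

def sparse_to_dense(doc, length):
    """Return a dense 1-D array of size `length` built from (index, value) pairs."""
    result = np.zeros(length, dtype=np.float32)
    for index, value in doc:
        result[index] = value
    return result

full = sparse_to_dense([(1, 1.5), (3, -2.0)], length=5)
```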
Scale a vector to unit length. The only exception is the zero vector, which is returned unchanged.
Output will be in the same format as input (i.e., gensim vector=>gensim vector, or numpy array=>numpy array, scipy.sparse=>scipy.sparse).
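For the gensim vector format only, the scaling can be sketched as below. unit_vector is a hypothetical name, and the numpy and scipy.sparse cases mentioned above are deliberately left out of this example.

```python
import math

def unit_vector(doc):
    """Scale a list of (index, value) pairs to Euclidean length 1."""
    doc = list(doc)
    length = math.sqrt(sum(value * value for _, value in doc))
    if length == 0.0:
        return doc  # the zero vector is returned unchanged
    return [(index, value / length) for index, value in doc]

scaled = unit_vector([(0, 3.0), (1, 4.0)])
```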
Like numpy.zeros(), but the array's data will start at an address that is a multiple of align bytes.
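A common way to obtain such alignment is to over-allocate a byte buffer and slice it at the first aligned address. The sketch below follows that approach; aligned_zeros is a hypothetical name and the default of 128 bytes is an assumption made for the example.

```python
import numpy as np

def aligned_zeros(shape, dtype, align=128):
    """Return a zero-filled array whose data starts on an `align`-byte boundary."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate by `align` bytes so an aligned start always fits inside.
    buffer = np.zeros(nbytes + align, dtype=np.uint8)
    start = -buffer.ctypes.data % align  # bytes to skip to reach the boundary
    return buffer[start:start + nbytes].view(dtype).reshape(shape)

arr = aligned_zeros((4, 3), np.float64, align=128)
```

The returned array is a view into the larger buffer, so the buffer stays alive for as long as the array does.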