models.normmodel – Normalization model

class gensim.models.normmodel.NormModel(corpus=None, norm='l2')

Bases: gensim.interfaces.TransformationABC

Objects of this class realize the explicit normalization of vectors. Supported norms are ‘l1’ and ‘l2’, with ‘l2’ being the default.

The main methods are:

  1. The constructor, which normalizes the terms in the given corpus document-wise.
  2. The normalize() method, which normalizes a simple count representation.
  3. The [] transformation, which internally calls the self.normalize() method.

>>> from gensim.models.normmodel import NormModel
>>> norm_l2 = NormModel(corpus)
>>> print(norm_l2[some_doc])
>>> norm_l2.save('/tmp/foo.tfidf_model')

Model persistence is achieved via the load() and save() methods.

Compute the ‘l1’ or ‘l2’ normalization by normalizing each document in the corpus separately. Let x_{i, j} be the weight of term ‘i’ in document ‘j’ of a corpus of ‘D’ documents, and let k range over the terms of document ‘j’. The ‘l1’-normalized weight is:

norml1_{i, j} = x_{i, j} / sum_k(abs(x_{k, j}))

The ‘l2’-normalized weight is:

norml2_{i, j} = x_{i, j} / sqrt(sum_k(x_{k, j}^2))

calc_norm(corpus)

Calculates the norm by calling matutils.unitvec with the norm parameter.
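The per-document ‘l1’ and ‘l2’ formulas can be illustrated without gensim. The sketch below is a plain-Python stand-in, not the library's implementation: normalize_bow is a hypothetical helper, and its handling of zero-length documents (returned unchanged) is an assumption made for this example.

```python
import math


def normalize_bow(bow, norm="l2"):
    """Scale a list of (term_id, value) pairs to unit l1 or l2 length.

    Hypothetical helper for illustration; not part of gensim.
    """
    if norm == "l1":
        length = sum(abs(value) for _, value in bow)
    elif norm == "l2":
        length = math.sqrt(sum(value ** 2 for _, value in bow))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        # Assumption for this sketch: leave empty/all-zero documents as-is.
        return list(bow)
    return [(term_id, value / length) for term_id, value in bow]


doc = [(0, 1), (1, 2), (2, 2)]
l1_doc = normalize_bow(doc, "l1")  # divides each value by 1 + 2 + 2 = 5
l2_doc = normalize_bow(doc, "l2")  # divides each value by sqrt(1 + 4 + 4) = 3
```

After ‘l1’ normalization the absolute values of a document sum to 1; after ‘l2’ normalization the squared values sum to 1.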

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

normalize(bow)

Normalizes a simple count representation (a single bag-of-words document).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.