
corpora.hashdictionary – Construct word<->id mappings

This module implements the “hashing trick” – a mapping between words and their integer ids using a fixed, static mapping. The static mapping has a constant memory footprint, regardless of the number of word-types (features) in your corpus, so it’s suitable for processing extremely large corpora.

The ids are computed as hash(word) % id_range, where hash is a user-configurable function (adler32 by default). Using HashDictionary, new words can be represented immediately, without an extra pass through the corpus to collect all the ids first. This is another advantage: HashDictionary can be used with non-repeatable (once-only) streams of documents.

A disadvantage of HashDictionary is that, unlike plain Dictionary, several words may map to the same id, causing hash collisions. The word<->id mapping is no longer a bijection.
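The id computation can be sketched directly with Python's standard library, assuming (as gensim's default does) that tokens are UTF-8 encoded before hashing. The `token_id` helper below is illustrative, not gensim's API:

```python
import zlib

ID_RANGE = 32000  # matches HashDictionary's default id_range

def token_id(token, id_range=ID_RANGE, myhash=zlib.adler32):
    """Map a token to a fixed integer id via the hashing trick."""
    # encode to UTF-8 bytes first: adler32 operates on bytes, not str
    return myhash(token.encode("utf-8")) % id_range

# Any token gets an id immediately, with no prior pass over the corpus.
# Distinct tokens may collide, i.e. map to the same id.
ids = {w: token_id(w) for w in ["human", "interface", "computer"]}
```

With a tiny `id_range`, collisions become certain (every token maps to the same few ids), which makes the non-bijective nature of the mapping easy to observe.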

class gensim.corpora.hashdictionary.HashDictionary(documents=None, id_range=32000, myhash=<built-in function adler32>, debug=True)

Bases: gensim.utils.SaveLoad, dict

HashDictionary encapsulates the mapping between normalized words and their integer ids.

Unlike Dictionary, building a HashDictionary before using it is not a necessary step. Bag-of-words representations can be computed immediately, from an uninitialized HashDictionary, without seeing the rest of the corpus first.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.

By default, HashDictionary keeps track of debug statistics and mappings. If you find yourself running out of memory (or are sure you don’t need the debug info), set debug=False.

add_documents(documents)

Build dictionary from a collection of documents. Each document is a list of tokens = tokenized and normalized utf-8 encoded strings.

This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.

clear() → None. Remove all items from D.
copy() → a shallow copy of D
doc2bow(document, allow_update=False, return_missing=False)

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update or self.allow_update is set, then also update dictionary in the process: update overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.
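A minimal standalone sketch of this behavior (not gensim's implementation): adler32-based hashing as in the default, with a plain dict standing in for the dictionary's document-frequency bookkeeping when updates are allowed.

```python
import zlib
from collections import Counter

def doc2bow(document, id_range=32000, dfs=None):
    """Convert a list of tokens to sorted (token_id, token_count) 2-tuples.

    If `dfs` (a dict) is given, update document frequencies in place,
    mimicking doc2bow(..., allow_update=True).
    """
    counts = Counter(zlib.adler32(w.encode("utf-8")) % id_range
                     for w in document)
    if dfs is not None:
        # each id appearing in this document has its df increased by one,
        # regardless of how often the id occurs within the document
        for tid in counts:
            dfs[tid] = dfs.get(tid, 0) + 1
    return sorted(counts.items())

dfs = {}
bow = doc2bow(["human", "computer", "human"], dfs=dfs)
```

Note that the document frequency of an id goes up by at most one per document, while the within-document count in the returned tuples can be higher.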

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Remove document frequency statistics for tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number);
  3. after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n is None).

Note: since HashDictionary’s id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some supplementary statistics, for easier debugging and a smaller RAM footprint.
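The note above can be illustrated with a plain dict of document frequencies: filtering prunes the supplementary statistics only, while the id of any token stays fixed because it depends solely on the hash. The names and frequencies below are hypothetical, for illustration:

```python
import zlib

def token_id(token, id_range=32000):
    # same fixed mapping as before filtering: hash(word) % id_range
    return zlib.adler32(token.encode("utf-8")) % id_range

# hypothetical stats: id -> number of documents containing that id
dfs = {token_id("rare"): 2, token_id("common"): 40, token_id("stopword"): 95}
total_docs = 100

# drop stats for ids appearing in fewer than no_below documents,
# or in more than no_above (a fraction) of the corpus
no_below, no_above = 5, 0.5
kept = {i: df for i, df in dfs.items()
        if df >= no_below and df / total_docs <= no_above}
```

After filtering, `token_id("rare")` still returns the same id as before; only its entry in the statistics is gone.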

static from_documents(*args, **kwargs)
static fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys()

Return a list of all token ids.

classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap; load large arrays as normal objects.

pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

restricted_hash(token)

Calculate the id of the given token. Also keep track of which words were mapped to which ids, for debugging purposes.

save(fname, separately=None, sep_limit=10485760, ignore=frozenset([]))

Save the object to file (also see load).

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

save_as_text(fname)

Save this HashDictionary to a text file, for easier debugging.

The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].

Note: use save/load to store in binary format instead (pickle).

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update(E, **F) → None. Update D from dict/iterable E and F.

If E has a .keys() method, does: for k in E: D[k] = E[k]. If E lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values