corpora.hashdictionary – Construct word<->id mappings
Implements the “hashing trick” – a mapping between words and their integer ids using a fixed, static mapping (a hash function).
Notes
The static mapping has a constant memory footprint, regardless of the number of word-types (features) in your corpus, so it’s suitable for processing extremely large corpora. The ids are computed as hash(word) % id_range, where hash is a user-configurable function (zlib.adler32 by default).
Advantages:
New words can be represented immediately, without an extra pass through the corpus to collect all the ids first.
Can be used with non-repeatable (once-only) streams of documents.
Able to represent any token (not only those present in training documents).
Disadvantages:
Multiple words may map to the same id, causing hash collisions. The word <-> id mapping is no longer a bijection.
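The core of the hashing trick, and the collision caveat above, can be sketched in a few lines of plain Python. This is a minimal sketch mimicking the default id computation described here (adler32 modulo id_range, with the token UTF-8 encoded first since adler32 requires bytes), not gensim's actual code:

```python
import zlib

def hashing_trick_id(token, id_range=32000, myhash=zlib.adler32):
    # id = myhash(token) % id_range; adler32 expects bytes,
    # so the token is UTF-8 encoded first.
    return myhash(token.encode('utf-8')) % id_range

# Determinism: the same token always maps to the same id,
# with no training pass over the corpus.
assert hashing_trick_id('human') == hashing_trick_id('human')

# Pigeonhole: with more distinct tokens than available ids, collisions
# are guaranteed -- the word <-> id mapping is not a bijection.
tokens = ['token%d' % i for i in range(11)]
ids = {hashing_trick_id(t, id_range=10) for t in tokens}
assert len(ids) < len(tokens)
```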
class gensim.corpora.hashdictionary.HashDictionary(documents=None, id_range=32000, myhash=<built-in function adler32>, debug=True)
Bases: gensim.utils.SaveLoad, dict
Mapping between words and their integer ids, using a hashing function.
Unlike Dictionary, building a HashDictionary before using it isn’t a necessary step. You can start converting words to ids immediately, without training on a corpus.
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=False) # needs no training corpus!
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct.doc2bow(texts[0])
[(10608, 1), (12466, 1), (31002, 1)]
Parameters
documents (iterable of iterable of str, optional) – Iterable of documents. If given, used to collect additional corpus statistics. HashDictionary can work without these statistics (optional parameter).
id_range (int, optional) – Number of hash-values in the table, used as id = myhash(key) % id_range.
myhash (function, optional) – Hash function; should support the interface myhash(str) -> int. Uses zlib.adler32 by default.
debug (bool, optional) – Store which tokens have mapped to a given id? This will use a lot of RAM. If you find yourself running out of memory (or are not sure you really need the raw tokens), keep debug=False.
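To illustrate the myhash parameter, here is a sketch of how a custom hash function could be swapped in for zlib.adler32. The md5_hash and token_id helpers below are hypothetical illustrations (not part of gensim), assuming the token is UTF-8 encoded before hashing, as adler32 requires bytes:

```python
import hashlib
import zlib

def md5_hash(data):
    # A custom hash with the same call shape as zlib.adler32:
    # takes the encoded token bytes, returns an int.
    return int(hashlib.md5(data).hexdigest(), 16)

def token_id(token, id_range=32000, myhash=zlib.adler32):
    # Hypothetical helper showing how myhash and id_range interact:
    # id = myhash(token) % id_range.
    return myhash(token.encode('utf-8')) % id_range

default_id = token_id('computer')                  # adler32-based id
custom_id = token_id('computer', myhash=md5_hash)  # md5-based id
assert 0 <= default_id < 32000 and 0 <= custom_id < 32000
```

Any function with this interface works; the only requirement is that it maps a token deterministically to an integer.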
add_documents(documents)
Collect corpus statistics from a corpus.
Warning
Useful only if debug=True, to build the reverse id=>set(words) mapping.
Notes
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
Parameters
documents (iterable of list of str) – Collection of documents.
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=True) # needs no training corpus!
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> "sparta" in dct.token2id
False
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> "sparta" in dct.token2id
True
clear() → None. Remove all items from D.
copy() → a shallow copy of D
doc2bow(document, allow_update=False, return_missing=False)
Convert a sequence of words document into the bag-of-words format of [(word_id, word_count)] (e.g. [(1, 4), (150, 1), (2005, 2)]).
Notes
Each word is assumed to be a tokenized and normalized string. No further preprocessing is done on the words in document: you have to apply tokenization, stemming etc. before calling this method.
If allow_update or self.allow_update is set, then also update the dictionary in the process: update overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.
Parameters
document (sequence of str) – A sequence of word tokens, i.e. tokenized and normalized strings.
allow_update (bool, optional) – Update corpus statistics and, if debug=True, also the reverse id=>word mapping?
return_missing (bool, optional) – Not used. Only here for compatibility with the Dictionary class.
Returns
Document in bag-of-words (BoW) format.
Return type
list of (int, int)
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary()
>>> dct.doc2bow(["this", "is", "máma"])
[(1721, 1), (5280, 1), (22493, 1)]
filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
Filter tokens in the debug dictionary by their frequency.
Since the HashDictionary id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some internal corpus statistics, for easier debugging and a smaller RAM footprint.
Warning
Only makes sense when debug=True.
Parameters
no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of the total corpus size, not an absolute number).
keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
Notes
Tokens are removed from the debug dictionary if they appear in:
1. Fewer than no_below documents (absolute number), or
2. More than no_above documents (fraction of the total corpus size, not an absolute number).
After (1) and (2), only the first keep_n most frequent tokens are kept (or all of them, if keep_n=None).
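The pruning rule can be sketched in plain Python. The filter_extremes_sketch helper below is a hypothetical illustration operating on a mapping from token id to document frequency, not gensim's implementation:

```python
def filter_extremes_sketch(dfs, num_docs, no_below=5, no_above=0.5,
                           keep_n=100000):
    # Keep ids whose document frequency df satisfies
    # no_below <= df and df / num_docs <= no_above,
    # then keep only the keep_n most frequent of those.
    kept = [tokenid for tokenid, df in dfs.items()
            if no_below <= df and df / num_docs <= no_above]
    kept.sort(key=lambda tokenid: dfs[tokenid], reverse=True)
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

dfs = {1: 1, 2: 6, 3: 9, 4: 10}   # token id -> document frequency
kept = filter_extremes_sketch(dfs, num_docs=10, no_below=2, no_above=0.8)
assert kept == {2}  # id 1 is too rare; ids 3 and 4 are too frequent
```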
from_documents(*args, **kwargs)
fromkeys(iterable, value=None)
Returns a new dict with keys from iterable and values equal to value.
get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items
keys()
Get a list of all token ids.
load(fname, mmap=None)
Load an object previously saved using save() from a file.
Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Returns
Object loaded from fname.
Return type
object
Raises
AttributeError – When called on an object instance instead of the class (this is a class method).
pop(k[, d]) → v, remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.
popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.
restricted_hash(token)
Calculate the id of the given token. Also keep track of which words were mapped to which ids, if debug=True was set in the constructor.
Parameters
token (str) – Input token.
Returns
Hash value of token.
Return type
int
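A minimal sketch of this behavior, assuming UTF-8 encoding before hashing; restricted_hash_sketch is a hypothetical stand-in, not gensim's actual implementation:

```python
import zlib
from collections import defaultdict

def restricted_hash_sketch(token, id2token, id_range=32000, debug=True):
    # Compute the token's id; with debug=True, also record the
    # reverse id -> set(tokens) mapping used for debugging.
    h = zlib.adler32(token.encode('utf-8')) % id_range
    if debug:
        id2token[h].add(token)
    return h

id2token = defaultdict(set)
h = restricted_hash_sketch('máma', id2token)
assert 'máma' in id2token[h] and 0 <= h < 32000
```

This reverse mapping is exactly what makes debug=True memory-hungry: it grows with the vocabulary, while the forward mapping never does.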
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
Parameters
fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
save_as_text(fname)
Save the debug token=>id mapping to a text file.
Warning
Only makes sense when debug=True, for debugging.
Parameters
fname (str) – Path to output file.
Notes
The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].
Examples
>>> from gensim.corpora import HashDictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> data = HashDictionary(corpus)
>>> data.save_as_text(get_tmpfile("dictionary_in_text_format"))
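The output format itself is easy to reproduce. The save_as_text_sketch helper below is a hypothetical illustration of the id[TAB]df[TAB]words layout described above, not gensim's implementation:

```python
import os
import tempfile

def save_as_text_sketch(path, dfs, id2token):
    # One line per id: id<TAB>document frequency<TAB>tab-separated
    # words that map to this id, encoded as UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        for tokenid in sorted(id2token):
            words = '\t'.join(sorted(id2token[tokenid]))
            f.write('%d\t%d\t%s\n' % (tokenid, dfs.get(tokenid, 0), words))

path = os.path.join(tempfile.mkdtemp(), 'dict.txt')
save_as_text_sketch(path, {7: 2}, {7: {'máma'}})
with open(path, encoding='utf-8') as f:
    assert f.read() == '7\t2\tmáma\n'
```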
setdefault(k[, d]) → D.get(k, d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values() → an object providing a view on D's values