corpora.hashdictionary – Construct word<->id mappings

This module implements the “hashing trick” [1]: a mapping between words and their integer ids using a fixed, static hash function instead of a learned vocabulary.

Notes

The static mapping has a constant memory footprint, regardless of the number of word-types (features) in your corpus, so it’s suitable for processing extremely large corpora. The ids are computed as hash(word) % id_range, where hash is a user-configurable function (zlib.adler32 by default).
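
A minimal sketch of this id scheme, where token_id is a hypothetical helper mirroring what HashDictionary computes internally:

>>> import zlib
>>>
>>> def token_id(token, id_range=32000):
...     # Hash the UTF-8 bytes of the token and fold the result into
...     # the fixed id range, i.e. id = myhash(key) % id_range.
...     return zlib.adler32(token.encode('utf-8')) % id_range
>>> 0 <= token_id("human") < 32000  # deterministic: same token, same id
True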

Advantages:

  • New words can be represented immediately, without an extra pass through the corpus to collect all the ids first.
  • Can be used with non-repeatable (once-only) streams of documents.
  • All tokens can be represented, not only those seen in training documents (a typical limitation of Dictionary).

Disadvantages:

  • Multiple words may map to the same id, causing hash collisions; the word <-> id mapping is no longer a bijection (see the illustration below).
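
A quick illustration of the collision risk: with more distinct tokens than available ids, at least two tokens must share an id by the pigeonhole principle. The tiny id range here is purely illustrative:

>>> import zlib
>>>
>>> words = ['human', 'interface', 'computer', 'graph', 'trees', 'system',
...          'user', 'response', 'time', 'survey', 'eps', 'minors']
>>> ids = [zlib.adler32(w.encode('utf-8')) % 10 for w in words]
>>> len(set(ids)) < len(words)  # 12 words, only 10 possible ids -> collision
True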

References

[1] http://en.wikipedia.org/wiki/Hashing-Trick
class gensim.corpora.hashdictionary.HashDictionary(documents=None, id_range=32000, myhash=<built-in function adler32>, debug=True)

Bases: gensim.utils.SaveLoad, dict

Encapsulates the mapping between normalized words and their integer ids.

Notes

Unlike Dictionary, building a HashDictionary before use is not a necessary step. Bag-of-words representations can be computed immediately from an uninitialized HashDictionary, without seeing the rest of the corpus first.

Examples

>>> from gensim.corpora import HashDictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = HashDictionary(texts)
>>> dct.doc2bow(texts[0])
[(10608, 1), (12466, 1), (31002, 1)]
Parameters:
  • documents (iterable of iterable of str) – Iterable of documents. If given, used to initialize the dictionary.
  • id_range (int, optional) – Number of possible ids (size of the hash table); ids are computed as id = myhash(key) % id_range.
  • myhash (function) – Hash function supporting the interface myhash(str) -> int; zlib.adler32 is used by default.
  • debug (bool) – If True, store the raw token <-> id mappings (as str <-> id). If you find yourself running out of memory (or are not sure you really need the raw tokens), set debug=False.
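
A short sketch of these options; the parameter values below are illustrative only:

>>> from gensim.corpora import HashDictionary
>>> import zlib
>>>
>>> # Smaller id range and no debug bookkeeping, to minimize memory.
>>> dct = HashDictionary(id_range=16000, myhash=zlib.adler32, debug=False)
>>> len(dct.doc2bow(['human']))  # usable immediately, no training pass needed
1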
add_documents(documents)

Build dictionary from a collection of documents.

Notes

This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.

Parameters:documents (iterable of list of str) – Collection of documents.

Examples

>>> from gensim.corpora import HashDictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = HashDictionary(corpus)
>>> "sparta" in dct.token2id
False
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])  # add more documents to the dictionary
>>> "sparta" in dct.token2id
True
clear() → None. Remove all items from D.
copy() → a shallow copy of D
doc2bow(document, allow_update=False, return_missing=False)

Convert document into the bag-of-words format, like [(1, 4), (150, 1), (2005, 2)].

Notes

Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is applied to the words in document; apply tokenization, stemming, etc. before calling this method.

If allow_update or self.allow_update is set, also update the dictionary in the process: refresh overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.

Parameters:
  • document (list of str) – A list of tokens (tokenized and normalized strings, either utf8 or unicode).
  • allow_update (bool, optional) – If True, update the dictionary in the process.
  • return_missing (bool, optional) – If True, also return missing tokens with their counts. This is meaningless for HashDictionary: thanks to the hashing trick, every token gets an id, so nothing is ever missing.
Returns:

  • list of (int, int) – Document in Bag-of-words (BoW) format.
  • list of (int, int), dict – If return_missing=True, the document in Bag-of-words (BoW) format plus an empty dictionary (no token is ever missing).

Examples

>>> from gensim.corpora import HashDictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = HashDictionary(corpus)
>>> dct.doc2bow(["this", "is", "máma"])
[(1721, 1), (5280, 1), (22493, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(1721, 1), (5280, 1), (22493, 1)], {})
filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Filter tokens in dictionary by frequency.

Parameters:
  • no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
  • no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
  • keep_n (int, optional) – Keep only the first keep_n most frequent tokens.

Notes

Statistics are cleared for tokens that appear in:

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number).
  3. After (1) and (2), only the first keep_n most frequent tokens are kept (or all tokens, if keep_n=None).

Since HashDictionary id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some supplementary statistics, for easier debugging and a smaller RAM footprint.

Examples

>>> from gensim.corpora import HashDictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = HashDictionary(corpus)
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> dct.token2id
{'maso': 15025}
static from_documents(**kwargs)

Create a new HashDictionary from documents; a convenience alias for the constructor.
fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys()

Get a list of all token ids.

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to the file that contains the saved object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When called on an object instance rather than the class (load() is a class method).
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

restricted_hash(token)

Calculate the id of the given token. Also keep track of which words were mapped to which ids, for debugging purposes (when debug=True).

Parameters:token (str) – Input token.
Returns:Hash value of token.
Return type:int
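
For example (the exact id depends on the hash function and id_range; with the default debug=True, the reverse mapping is recorded in token2id):

>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary()
>>> 0 <= dct.restricted_hash("human") < 32000  # id falls in [0, id_range)
True
>>> "human" in dct.token2id  # debug=True records the raw mapping
True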
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Threshold (in bytes) for automatic array separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized (stored).
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()
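
A typical save/load round trip, using a temporary file from gensim's test utilities (the file name is arbitrary):

>>> from gensim.corpora import HashDictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> dct = HashDictionary([['human', 'interface', 'computer']])
>>> tmp_fname = get_tmpfile("hash_dictionary")  # arbitrary temp file name
>>> dct.save(tmp_fname)
>>>
>>> loaded = HashDictionary.load(tmp_fname)
>>> loaded.doc2bow(['human']) == dct.doc2bow(['human'])  # same static mapping
True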

save_as_text(fname)

Save this HashDictionary to a text file.

Parameters:fname (str) – Path to output file.

Notes

The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].

Examples

>>> from gensim.corpora import HashDictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> data = HashDictionary(corpus)
>>> data.save_as_text(get_tmpfile("dictionary_in_text_format"))
setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, does: for k in E: D[k] = E[k].
If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v.
In either case, this is followed by: for k in F: D[k] = F[k].

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values