corpora.hashdictionary – Construct word<->id mappings
Implements the “hashing trick” – a mapping between words and their integer ids using a fixed, static mapping (hash function).
Notes
The static mapping has a constant memory footprint, regardless of the number of word-types (features) in your corpus, so it's suitable for processing extremely large corpora. The ids are computed as hash(word) % id_range, where hash is a user-configurable function (zlib.adler32 by default).
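For instance, the default id assignment can be reproduced directly with zlib (a quick sketch; 32000 is the default id_range):

>>> import zlib
>>>
>>> zlib.adler32(b"human") % 32000  # the id HashDictionary assigns to "human" by default
31002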
Advantages:
New words can be represented immediately, without an extra pass through the corpus to collect all the ids first.
Can be used with non-repeatable (once-only) streams of documents.
Able to represent any token (not only those present in training documents).
Disadvantages:
Multiple words may map to the same id, causing hash collisions. The word <-> id mapping is no longer a bijection.
- class gensim.corpora.hashdictionary.HashDictionary(documents=None, id_range=32000, myhash=<built-in function adler32>, debug=True)
Bases: SaveLoad, dict
Mapping between words and their integer ids, using a hashing function.
Unlike Dictionary, building a HashDictionary before using it isn't a necessary step. You can start converting words to ids immediately, without training on a corpus.
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=False)  # needs no training corpus!
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct.doc2bow(texts[0])
[(10608, 1), (12466, 1), (31002, 1)]
- Parameters
documents (iterable of iterable of str, optional) – Iterable of documents. If given, used to collect additional corpus statistics. HashDictionary can work without these statistics (optional parameter).
id_range (int, optional) – Number of hash-values in table, used as id = myhash(key) % id_range.
myhash (function, optional) – Hash function; must support the interface myhash(str) -> int. Uses zlib.adler32 by default.
debug (bool, optional) – Store which tokens have mapped to a given id? This will use a lot of RAM. If you find yourself running out of memory (or are not sure you really need the raw tokens), keep debug=False.
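A short sketch of constructing a dictionary with a smaller id range and a custom hash function (the md5-based my_hash below is purely illustrative, not part of gensim):

>>> import hashlib
>>> from gensim.corpora import HashDictionary
>>>
>>> def my_hash(key):
...     # md5 is used here only for illustration; handle both str and bytes input
...     if isinstance(key, str):
...         key = key.encode('utf-8')
...     return int.from_bytes(hashlib.md5(key).digest()[:4], 'little')
...
>>> dct = HashDictionary(id_range=1000, myhash=my_hash, debug=False)
>>> bow = dct.doc2bow(['human', 'interface', 'computer'])
>>> all(0 <= word_id < 1000 for word_id, _ in bow)  # ids always fall within id_range
True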
- add_documents(documents)
Collect corpus statistics from a collection of documents.
Warning
Useful only if debug=True, to build the reverse id=>set(words) mapping.
Notes
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
- Parameters
documents (iterable of list of str) – Collection of documents.
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=True)  # needs no training corpus!
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> "sparta" in dct.token2id
False
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> "sparta" in dct.token2id
True
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object's save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
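Examples
A minimal sketch of recording a custom event (the note payload key is illustrative):

>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=False)
>>> dct.add_lifecycle_event("created", note="built without a training corpus")
>>> dct.lifecycle_events[-1]['event']
'created'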
- clear() → None. Remove all items from D.
- copy() → a shallow copy of D
- doc2bow(document, allow_update=False, return_missing=False)
Convert a sequence of words (document) into the bag-of-words format of [(word_id, word_count)] (e.g. [(1, 4), (150, 1), (2005, 2)]).
Notes
Each word is assumed to be a tokenized and normalized string. No further preprocessing is done on the words in document: you have to apply tokenization, stemming, etc. before calling this method.
If allow_update or self.allow_update is set, then also update the dictionary in the process: update overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.
- Parameters
document (sequence of str) – A sequence of word tokens = tokenized and normalized strings.
allow_update (bool, optional) – Update corpus statistics and, if debug=True, also the reverse id=>word mapping?
return_missing (bool, optional) – Not used. Only here for compatibility with the Dictionary class.
- Returns
Document in Bag-of-words (BoW) format.
- Return type
list of (int, int)
Examples
>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary()
>>> dct.doc2bow(["this", "is", "máma"])
[(1721, 1), (5280, 1), (22493, 1)]
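A sketch of also updating corpus statistics while converting (dfs maps each id to its document frequency; the three tokens hash to distinct ids under the default settings, as shown above):

>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=False)
>>> bow = dct.doc2bow(["this", "is", "máma"], allow_update=True)
>>> [dct.dfs[word_id] for word_id, _ in bow]  # each id seen in exactly one document
[1, 1, 1]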
- filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
Filter tokens in the debug dictionary by their frequency.
Since the HashDictionary id range is fixed and doesn't depend on the number of tokens seen, this doesn't really "remove" anything. It only clears some internal corpus statistics, for easier debugging and a smaller RAM footprint.
Warning
Only makes sense when debug=True.
- Parameters
no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
Notes
The debug statistics are cleared for tokens that appear in:
1. Fewer than no_below documents (absolute number), or
2. More than no_above documents (fraction of the total corpus size, not an absolute number).
After (1) and (2), only the first keep_n most frequent tokens are kept (or all of them, if keep_n=None).
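Examples
A quick sketch with illustrative thresholds (requires debug=True):

>>> from gensim.corpora import HashDictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = HashDictionary(corpus, debug=True)
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=10)  # prune debug statistics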
- static from_documents(*args, **kwargs)
- fromkeys(iterable, value=None, /)
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)
Return the value for key if key is in the dictionary, else default.
- items() → a set-like object providing a view on D's items
- keys()
Get a list of all token ids.
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- pop(k[, d]) → v, remove the specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- restricted_hash(token)
Calculate the id of the given token. Also keep track of which words were mapped to which ids, if debug=True was set in the constructor.
- Parameters
token (str) – Input token.
- Returns
Hash value of token.
- Return type
int
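Examples
With the default settings this matches the adler32 computation shown in the module notes:

>>> from gensim.corpora import HashDictionary
>>>
>>> dct = HashDictionary(debug=False)  # default id_range=32000, myhash=zlib.adler32
>>> dct.restricted_hash("human")
31002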
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to the output file or an already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
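Examples
A minimal save/load round trip (the temporary file name is illustrative):

>>> from gensim.corpora import HashDictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> dct = HashDictionary(debug=False)
>>> tmp_fname = get_tmpfile("hash_dictionary")
>>> dct.save(tmp_fname)
>>> loaded = HashDictionary.load(tmp_fname)
>>> loaded.id_range
32000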
- save_as_text(fname)
Save the debug token=>id mapping to a text file.
Warning
Only makes sense when debug=True, for debugging.
- Parameters
fname (str) – Path to output file.
Notes
The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].
Examples
>>> from gensim.corpora import HashDictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> data = HashDictionary(corpus)
>>> data.save_as_text(get_tmpfile("dictionary_in_text_format"))
- setdefault(key, default=None, /)
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) → None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
- values() → an object providing a view on D's values