
corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

Dictionaries can be created from a corpus and later pruned according to document frequency (removing (un)common words via the Dictionary.filter_extremes() method), saved to and loaded from disk (via the Dictionary.save() and Dictionary.load() methods), merged with another dictionary (Dictionary.merge_with()), etc.

class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)

Bases: gensim.utils.SaveLoad, _abcoll.Mapping

Dictionary encapsulates the mapping between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.

If documents are given, use them to initialize Dictionary (see add_documents()).

add_documents(documents, prune_at=2000000)

Update the dictionary from a collection of documents. Each document is a list of tokens, i.e. tokenized and normalized strings (either utf8 or unicode).

This is a convenience wrapper for calling doc2bow on each document with allow_update=True, which also prunes infrequent words, keeping the total number of unique words <= prune_at. This is to save memory on very large inputs. To disable this pruning, set prune_at=None.

>>> from gensim.corpora import Dictionary
>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)

compactify()

Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
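The gap removal can be pictured with a small standard-library sketch. This is not gensim's implementation: the function name compactify_sketch and the plain token-to-id dict are assumptions made for illustration, and the sketch assumes that surviving ids keep their relative order.

```python
def compactify_sketch(token2id):
    """Illustrative only: renumber ids to 0..n-1, preserving their relative order."""
    # Map each surviving old id to its rank among the sorted old ids.
    old_to_new = {old: new for new, old in enumerate(sorted(token2id.values()))}
    return {token: old_to_new[old] for token, old in token2id.items()}


# Hypothetical mapping: ids 1 and 3 were removed earlier, leaving gaps.
gappy = {"ema": 0, "maso": 2, "mele": 4, "máma": 5}
compact = compactify_sketch(gappy)
```

After the call, the four tokens in this sketch occupy the ids 0 to 3 with no gaps.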

doc2bow(document, allow_update=False, return_missing=False)

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, aka read-only.
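The two modes can be illustrated with a minimal standard-library sketch. It is not gensim's code: doc2bow_sketch, and the plain dicts token2id and dfs, are hypothetical names used only to mirror the description, and unknown words are simply skipped in the read-only case.

```python
from collections import Counter


def doc2bow_sketch(document, token2id, dfs, allow_update=False):
    """Illustrative only: convert a list of tokens to sorted (id, count) pairs."""
    counts = Counter(document)
    if allow_update:
        for token in sorted(counts):
            # Create ids for unseen words; the sketch assumes ids have no gaps.
            if token not in token2id:
                token2id[token] = len(token2id)
            # Each word in this document raises its document frequency by one.
            token_id = token2id[token]
            dfs[token_id] = dfs.get(token_id, 0) + 1
    # Words without an id are skipped; nothing is modified in this step.
    return sorted((token2id[t], c) for t, c in counts.items() if t in token2id)


token2id, dfs = {}, {}
bow = doc2bow_sketch("máma mele maso".split(), token2id, dfs, allow_update=True)
readonly = doc2bow_sketch(["máma", "máma", "ema"], token2id, dfs)
```

In the second call the unknown word "ema" receives no id and the two dicts are left unchanged, which is what read-only means here.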

filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter out tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of total corpus size, not absolute number).

Two further rules apply:

  3. tokens listed in keep_tokens (a list of strings) are kept regardless of the no_below and no_above settings;
  4. after (1), (2) and (3), only the keep_n most frequent tokens are kept (all are kept if keep_n is None).

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
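The filtering rules can be sketched with the standard library alone. The sketch below is an illustration under stated assumptions, not gensim's implementation: filter_extremes_sketch is a hypothetical helper, dfs is a plain dict of document frequencies, and keep_ids takes token ids where the real method takes token strings in keep_tokens.

```python
def filter_extremes_sketch(dfs, num_docs, no_below=5, no_above=0.5,
                           keep_n=100000, keep_ids=()):
    """Illustrative only: return the set of token ids that survive filtering.

    dfs maps token id -> number of documents containing that token.
    """
    max_docs = no_above * num_docs  # convert the fraction to an absolute count
    keep_ids = set(keep_ids)
    survivors = [
        token_id for token_id, df in dfs.items()
        if token_id in keep_ids or no_below <= df <= max_docs
    ]
    # Most frequent first; keep_n=None means no cap.
    survivors.sort(key=lambda token_id: dfs[token_id], reverse=True)
    if keep_n is not None:
        survivors = survivors[:keep_n]
    return set(survivors)


dfs = {0: 1, 1: 4, 2: 6, 3: 10}  # hypothetical document frequencies
kept = filter_extremes_sketch(dfs, num_docs=10, no_below=2, no_above=0.7)
kept_forced = filter_extremes_sketch(dfs, num_docs=10, no_below=2,
                                     no_above=0.7, keep_ids=[0])
kept_top1 = filter_extremes_sketch(dfs, num_docs=10, no_below=2,
                                   no_above=0.7, keep_n=1)
```

With these made-up frequencies, token 0 is too rare and token 3 too common, so only tokens 1 and 2 survive unless token 0 is explicitly protected.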

filter_n_most_frequent(remove_n)

Filter out the remove_n most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from all dictionary mappings, or, keep selected good_ids in the mapping and remove the rest.

bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.
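As a rough illustration of the two selection modes, here is a standard-library sketch over a plain token-to-id dict. The name filter_tokens_sketch is hypothetical, and the sketch returns a new dict rather than modifying a Dictionary in place.

```python
def filter_tokens_sketch(token2id, bad_ids=None, good_ids=None):
    """Illustrative only: drop bad_ids, or keep only good_ids, from a token->id dict."""
    if bad_ids is not None:
        bad_ids = set(bad_ids)
        token2id = {t: i for t, i in token2id.items() if i not in bad_ids}
    if good_ids is not None:
        good_ids = set(good_ids)
        token2id = {t: i for t, i in token2id.items() if i in good_ids}
    return token2id


mapping = {"ema": 0, "maso": 1, "mele": 2}  # hypothetical mapping
without_maso = filter_tokens_sketch(mapping, bad_ids=[1])
only_maso = filter_tokens_sketch(mapping, good_ids=[1])
```

Removing id 1 leaves the ids 0 and 2 in place, so the result has a gap in its id series.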

static from_corpus(corpus, id2word=None)

Create Dictionary from an existing corpus. This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus.

This will scan the term-document count matrix for all word ids that appear in it, then construct and return Dictionary which maps each word_id -> id2word[word_id].

id2word is an optional dictionary that maps the word_id to a token. In case id2word isn’t specified the mapping id2word[word_id] = str(word_id) will be used.
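The scan described here can be sketched without gensim. The helper from_corpus_sketch is hypothetical and returns a plain token-to-id dict rather than a Dictionary; it follows the description literally by collecting only the ids that occur in the corpus.

```python
def from_corpus_sketch(corpus, id2word=None):
    """Illustrative only: build a token->id dict from a bag-of-words corpus."""
    seen_ids = set()
    for document in corpus:
        # Each document is a list of (word_id, count) pairs.
        seen_ids.update(word_id for word_id, _ in document)
    if id2word is None:
        # No mapping supplied: fall back to the string form of each id.
        return {str(word_id): word_id for word_id in seen_ids}
    return {id2word[word_id]: word_id for word_id in seen_ids}


corpus = [[(0, 1), (2, 3)], [(2, 1), (5, 2)]]  # hypothetical BOW corpus
default_names = from_corpus_sketch(corpus)
named = from_corpus_sketch(corpus, id2word={0: "ema", 2: "maso", 5: "mele"})
```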

static from_documents(documents)
get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys()

Return a list of all token ids.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

static load_from_text(fname)

Load a previously stored Dictionary from a text file. Mirror function to save_as_text.

merge_with(other)

Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids. The purpose is to merge two corpora created using two different dictionaries, one from self and one from other.

other can be any id=>word mapping (a dict, a Dictionary object, ...).

Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary (see gensim.interfaces.TransformationABC).

Example:

>>> import itertools
>>> dict1 = Dictionary(some_documents)
>>> dict2 = Dictionary(other_documents)  # ids not compatible with dict1!
>>> dict2_to_dict1 = dict1.merge_with(dict2)
>>> # now we can merge corpora from the two incompatible dictionaries into one
>>> merged_corpus = itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2])

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

save_as_text(fname, sort_by_word=True)

Save this Dictionary to a text file. The first line holds num_docs; each following line has the format id[TAB]word_utf8[TAB]document frequency[NEWLINE]. Lines are sorted by word if sort_by_word is True, otherwise by decreasing word frequency.

Note: the text format should be used for corpus inspection. Use save/load to store the dictionary in binary format (pickle) for improved performance.

values() → list of D's values