corpora.dictionary – Construct word<->id mappings

This module implements the concept of a Dictionary – a mapping between words and their integer ids.

class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)

Bases: gensim.utils.SaveLoad, collections.abc.Mapping

Dictionary encapsulates the mapping between normalized words and their integer ids.

Notable instance attributes:

token2id

token -> token_id, i.e. the reverse mapping to self[token_id].

Type

dict of (str, int)

cfs

Collection frequencies: token_id -> how many instances of this token are contained in the documents.

Type

dict of (int, int)

dfs

Document frequencies: token_id -> how many documents contain this token.

Type

dict of (int, int)

num_docs

Number of documents processed.

Type

int

num_pos

Total number of corpus positions (number of processed words).

Type

int

num_nnz

Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).

Type

int

Parameters
  • documents (iterable of iterable of str, optional) – Documents to be used to initialize the mapping and collect corpus statistics.

  • prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts)  # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more documents (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
add_documents(documents, prune_at=2000000)

Update dictionary from a collection of documents.

Parameters
  • documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.

  • prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = ["máma mele maso".split(), "ema má máma".split()]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> len(dct)
10
add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
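The bookkeeping described above can be sketched in plain Python. This is illustrative only, not gensim's implementation: the Dummy class is a stand-in for any SaveLoad object, and the real method also records the running gensim version, which a sketch cannot know.

```python
import logging
import platform
import sys
from datetime import datetime

def add_lifecycle_event(obj, event_name, log_level=logging.INFO, **event):
    """Sketch: merge the automatic key-values into the caller-supplied
    event dict, append it to obj.lifecycle_events, and optionally log it."""
    event["datetime"] = datetime.now().isoformat()
    event["python"] = sys.version
    event["platform"] = platform.platform()
    event["event"] = event_name
    # Respect the opt-out: lifecycle_events = None disables recording.
    if getattr(obj, "lifecycle_events", None) is not None:
        obj.lifecycle_events.append(event)
    if log_level:
        logging.log(log_level, "%s lifecycle event: %s", type(obj).__name__, event)
    return event

class Dummy:
    def __init__(self):
        self.lifecycle_events = []

obj = Dummy()
rec = add_lifecycle_event(obj, "created", msg="built from 2 documents")
```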

compactify()

Assign new word ids to all words, shrinking any gaps.
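A minimal sketch of the gap-shrinking idea (not gensim's actual implementation): ids keep their relative order but are renumbered consecutively from 0.

```python
def compactify_sketch(token2id):
    # Reassign consecutive ids 0..n-1, preserving the original id order.
    ordered = sorted(token2id.items(), key=lambda kv: kv[1])
    return {token: new_id for new_id, (token, _old) in enumerate(ordered)}

sparse = {'cat': 0, 'dog': 4, 'fish': 9}   # gaps left by earlier filtering
print(compactify_sketch(sparse))           # {'cat': 0, 'dog': 1, 'fish': 2}
```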

doc2bow(document, allow_update=False, return_missing=False)

Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.

Parameters
  • document (list of str) – Input document.

  • allow_update (bool, optional) – Update self, by adding new tokens from document and updating internal corpus statistics.

  • return_missing (bool, optional) – Return missing tokens (tokens present in document but not in self) with frequencies?

Returns

  • list of (int, int) – BoW representation of document.

  • list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.

Examples

>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
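The conversion itself boils down to counting tokens and keeping only those present in the dictionary. A plain-Python sketch of that idea follows; doc2bow_sketch and the token2id mapping below are illustrative assumptions, not gensim internals, so the ids differ from the doctest above.

```python
from collections import Counter

def doc2bow_sketch(document, token2id):
    # Count tokens, then split them into known (-> sorted BoW pairs)
    # and missing (token -> frequency), mirroring return_missing=True.
    counts = Counter(document)
    bow = sorted((token2id[t], c) for t, c in counts.items() if t in token2id)
    missing = {t: c for t, c in counts.items() if t not in token2id}
    return bow, missing

token2id = {'ema': 0, 'maso': 1, 'mele': 2, 'má': 3, 'máma': 4}
print(doc2bow_sketch(['this', 'is', 'máma'], token2id))
# ([(4, 1)], {'this': 1, 'is': 1})
```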
doc2idx(document, unknown_word_index=-1)

Convert document (a list of words) into a list of indexes = list of token_id. Replace all unknown words, i.e. words not in the dictionary, with the index set via unknown_word_index.

Parameters
  • document (list of str) – Input document

  • unknown_word_index (int, optional) – Index to use for words not in the dictionary.

Returns

Token ids for tokens in document, in the same order.

Return type

list of int

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["a", "a", "b"], ["a", "c"]]
>>> dct = Dictionary(corpus)
>>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
[0, 0, 2, -1, 2]
filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter out tokens in the dictionary by their frequency.

Parameters
  • no_below (int, optional) – Keep tokens which are contained in at least no_below documents.

  • no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).

  • keep_n (int, optional) – Keep only the first keep_n most frequent tokens.

  • keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.

Notes

This removes all tokens in the dictionary that are:

  1. Less frequent than no_below documents (absolute number, e.g. 5) or

  2. More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).

  3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function!

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> len(dct)
1
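The three criteria from the Notes above can be sketched directly on the dfs attribute (token_id -> document frequency). This sketch is an illustration under assumed inputs, not gensim's implementation; in particular it ignores keep_tokens and does not renumber the surviving ids.

```python
def filter_extremes_sketch(dfs, num_docs, no_below=5, no_above=0.5, keep_n=100000):
    # Keep ids appearing in at least no_below documents and in no more
    # than no_above (as a fraction of num_docs), then truncate to the
    # keep_n most frequent survivors.
    kept = [tid for tid, df in dfs.items()
            if df >= no_below and df / num_docs <= no_above]
    kept.sort(key=lambda tid: dfs[tid], reverse=True)
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

dfs = {0: 1, 1: 1, 2: 2, 3: 1, 4: 1}   # token_id -> df; id 2 appears in both docs
print(filter_extremes_sketch(dfs, num_docs=2, no_below=1, no_above=0.5, keep_n=1))
# {0}
```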
filter_n_most_frequent(remove_n)

Filter out the remove_n most frequent tokens that appear in the documents.

Parameters

remove_n (int) – Number of the most frequent tokens that will be removed.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_n_most_frequent(2)
>>> len(dct)
3
filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from Dictionary.

Alternatively, keep selected good_ids in Dictionary and remove the rest.

Parameters
  • bad_ids (iterable of int, optional) – Collection of word ids to be removed.

  • good_ids (collection of int, optional) – Keep selected collection of word ids and remove the rest.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> 'ema' in dct.token2id
True
>>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
>>> 'ema' in dct.token2id
False
>>> len(dct)
4
>>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
>>> len(dct)
1
static from_corpus(corpus, id2word=None)

Create Dictionary from an existing corpus.

Parameters
  • corpus (iterable of iterable of (int, number)) – Corpus in BoW format.

  • id2word (dict of (int, object)) – Mapping id -> word. If None, the mapping id2word[word_id] = str(word_id) will be used.

Notes

This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus. This method will scan the term-document count matrix for all word ids that appear in it, then construct a Dictionary which maps each word_id -> id2word[word_id]. If id2word isn't specified, the mapping id2word[word_id] = str(word_id) will be used.

Returns

Inferred dictionary from corpus.

Return type

Dictionary

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
>>> dct = Dictionary.from_corpus(corpus)
>>> len(dct)
3
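The scanning idea can be sketched in plain Python. This is an assumption-laden illustration, not gensim's code: it presumes that every id from 0 up to the largest observed id gets an entry, which matches the doctest above.

```python
def from_corpus_sketch(corpus, id2word=None):
    # Scan the BoW corpus for the largest word id, then map every id
    # up to it onto id2word[word_id] (or str(word_id) when no mapping
    # is given), yielding a token -> id dictionary.
    max_id = -1
    for doc in corpus:
        for word_id, _count in doc:
            max_id = max(max_id, word_id)
    if id2word is None:
        id2word = {i: str(i) for i in range(max_id + 1)}
    return {id2word[i]: i for i in range(max_id + 1)}

corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
print(from_corpus_sketch(corpus))   # {'0': 0, '1': 1, '2': 2}
```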
static from_documents(documents)

Create Dictionary from documents.

Equivalent to Dictionary(documents=documents).

Parameters

documents (iterable of iterable of str) – Input corpus.

Returns

Dictionary initialized from documents.

Return type

Dictionary

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D’s items
iteritems()
iterkeys()

Iterate over all tokens.

itervalues()
keys()

Get all stored ids.

Returns

List of all token ids.

Return type

list of int

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

static load_from_text(fname)

Load a previously stored Dictionary from a text file.

Mirror function to save_as_text().

Parameters

fname (str) – Path to a file produced by save_as_text().

See also

save_as_text()

Save Dictionary to text file.

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
merge_with(other)

Merge another dictionary into this dictionary, mapping the same tokens to the same ids and new tokens to new ids.

Notes

The purpose is to merge two corpora created using two different dictionaries: self and other. other can be any id=>word mapping (a dict, a Dictionary object, …).

Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary.

Parameters

other ({dict, Dictionary}) – Other dictionary.

Returns

Transformation object.

Return type

gensim.models.VocabTransform

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
>>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1)]
>>> transformer = dct_1.merge_with(dct_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1), (3, 2)]
most_common(n: Optional[int] = None) → List[Tuple[str, int]]

Return a list of the n most common words and their counts from the most common to the least.

Words with equal counts are ordered in the increasing order of their ids.

Parameters

n (int or None, optional) – The number of most common words to be returned. If None, all words in the dictionary will be returned. Default is None.

Returns

most_common – The n most common words and their counts from the most common to the least.

Return type

list of (str, int)
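The ordering rule can be sketched on the cfs attribute (token_id -> collection frequency). The helper and the toy mappings below are illustrative assumptions, not gensim internals.

```python
def most_common_sketch(cfs, id2token, n=None):
    # Sort by descending collection frequency, breaking ties by
    # ascending token id, as described above.
    ordered = sorted(cfs.items(), key=lambda kv: (-kv[1], kv[0]))
    if n is not None:
        ordered = ordered[:n]
    return [(id2token[tid], count) for tid, count in ordered]

cfs = {0: 3, 1: 1, 2: 1}                       # token_id -> collection frequency
id2token = {0: 'a', 1: 'b', 2: 'c'}
print(most_common_sketch(cfs, id2token, n=2))  # [('a', 3), ('b', 1)]
```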

patch_with_special_tokens(special_token_dict)

Patch token2id and id2token using a dictionary of special tokens.

Use case: when doing sequence modeling (e.g. named entity recognition), one may want to specify special tokens that behave differently than others. One example is the “unknown” token, and another is the padding token. It is usual to set the padding token to have index 0, and patching the dictionary with {‘<PAD>’: 0} would be one way to specify this.

Parameters

special_token_dict (dict of (str, int)) – dict containing the special tokens as keys and their wanted indices as values.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>>
>>> special_tokens = {'pad': 0, 'space': 1}
>>> print(dct.token2id)
{'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
>>>
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

save_as_text(fname, sort_by_word=True)

Save Dictionary to a text file.

Parameters
  • fname (str) – Path to output file.

  • sort_by_word (bool, optional) – Sort words in lexicographical order before writing them out?

Notes

Format:

num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]

This text format is great for corpus inspection and debugging. As plaintext, it’s also easily portable to other tools and frameworks. For better performance and to store the entire object state, including collected corpus statistics, use save() and load() instead.
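As an illustration of how portable the format is, here is a minimal pure-Python parser for it. The helper name and the sample lines are hypothetical; use load_from_text() for real files.

```python
def parse_dictionary_text(lines):
    # Minimal parser for the format above: the first line is num_docs,
    # each following line is "<id>\t<word>\t<document_frequency>".
    it = iter(lines)
    num_docs = int(next(it))
    token2id, dfs = {}, {}
    for line in it:
        token_id, word, df = line.rstrip('\n').split('\t')
        token2id[word] = int(token_id)
        dfs[int(token_id)] = int(df)
    return num_docs, token2id, dfs

lines = ["2", "0\tmaso\t1", "1\tmele\t1", "2\tmáma\t2"]
print(parse_dictionary_text(lines))
```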

See also

load_from_text()

Load Dictionary from text file.

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
values() → an object providing a view on D’s values