corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)

Bases: gensim.utils.SaveLoad, _abcoll.Mapping

Dictionary encapsulates the mapping between normalized words and their integer ids.

token2id

dict of (str, int) – token -> tokenId.

id2token

dict of (int, str) – Reverse mapping for token2id, initialized in lazy manner to save memory.

dfs

dict of (int, int) – Document frequencies: token_id -> number of documents containing this token.

num_docs

int – Number of documents processed.

num_pos

int – Total number of corpus positions (number of processed words).

num_nnz

int – Total number of non-zeroes in the BOW matrix.

Parameters:
  • documents (iterable of iterable of str, optional) – Documents used to initialize the mapping.
  • prune_at (int, optional) – Keep at most prune_at unique words in the mapping, to limit RAM footprint.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts)  # fit dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # update dictionary with new documents
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
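
A brief continuation of the session above, inspecting the bookkeeping attributes (the values follow from the three documents just added):

>>> dct.num_docs  # three documents processed
3
>>> dct.num_pos  # seven token positions in total
7
>>> dct.dfs[dct.token2id['dog']]  # 'dog' appears in one document
1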
add_documents(documents, prune_at=2000000)

Update dictionary from a collection of documents.

Parameters:
  • documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.
  • prune_at (int, optional) – Keep at most prune_at unique words in the mapping, to limit RAM footprint.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = ["máma mele maso".split(), "ema má máma".split()]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> len(dct)
10
compactify()

Assign new word ids to all words, shrinking gaps.
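
Examples

compactify() is called automatically by the filtering methods; a minimal sketch of its effect when the id range has gaps (the manual deletion below is for illustration only):

>>> from gensim.corpora import Dictionary
>>>
>>> dct = Dictionary([["a", "b", "c"]])
>>> del dct.token2id['b']  # create a gap in the id range (illustration only)
>>> dct.compactify()  # reassign contiguous ids 0..len(dct)-1
>>> sorted(dct.token2id.values())
[0, 1]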

doc2bow(document, allow_update=False, return_missing=False)

Convert document into the bag-of-words (BoW) format = list of (token_id, token_count).

Parameters:
  • document (list of str) – Input document.
  • allow_update (bool, optional) – If True - update dictionary in the process (i.e. add new tokens and update frequencies).
  • return_missing (bool, optional) – Also return tokens that are not present in the current dictionary.
Returns:

  • list of (int, int) – BoW representation of document
  • list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.

Examples

>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
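
Continuing the session, allow_update=True adds unseen tokens on the fly (the new ids shown assume the five-token dictionary just built):

>>> dct.doc2bow(["this", "is", "máma"], allow_update=True)  # 'this' and 'is' get fresh ids
[(2, 1), (5, 1), (6, 1)]
>>> len(dct)
7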
doc2idx(document, unknown_word_index=-1)

Convert a document (a list of words) into a list of token ids.

Notes

All unknown words, i.e. words not in the dictionary, are replaced with the index set via unknown_word_index.

Parameters:
  • document (list of str) – Input document
  • unknown_word_index (int, optional) – Index to use for words not in the dictionary.
Returns:

Indexes in the dictionary for words in the document (preserving the order of words).

Return type:

list of int

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["a", "a", "b"], ["a", "c"]]
>>> dct = Dictionary(corpus)
>>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
[0, 0, 2, -1, 2]
filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter tokens in dictionary by frequency.

Parameters:
  • no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
  • no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
  • keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
  • keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.

Notes

This removes all tokens in the dictionary that appear in:

  1. Fewer than no_below documents (absolute number), or
  2. More than no_above documents (fraction of the total corpus size, not an absolute number).
  3. After (1) and (2), only the first keep_n most frequent tokens are kept (or all, if keep_n=None).

After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function!

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> len(dct)
1
filter_n_most_frequent(remove_n)

Filter out the remove_n most frequent tokens that appear in the documents.

Parameters: remove_n (int) – Number of the most frequent tokens that will be removed.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_n_most_frequent(2)
>>> len(dct)
3
filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from the dictionary, or alternatively keep only the selected good_ids and remove the rest.

Parameters:
  • bad_ids (iterable of int, optional) – Collection of word ids to be removed.
  • good_ids (collection of int, optional) – Keep selected collection of word ids and remove the rest.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> 'ema' in dct.token2id
True
>>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
>>> 'ema' in dct.token2id
False
>>> len(dct)
4
>>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
>>> len(dct)
1
static from_corpus(corpus, id2word=None)

Create Dictionary from an existing corpus.

Parameters:
  • corpus (iterable of iterable of (int, number)) – Corpus in BoW format.
  • id2word (dict of (int, object)) – Mapping id -> word. If None, the mapping id2word[word_id] = str(word_id) will be used.

Notes

This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus. The method scans the term-document count matrix for all word ids that appear in it, then constructs a Dictionary that maps each word_id to id2word[word_id]. If id2word is not specified, the mapping id2word[word_id] = str(word_id) is used.

Returns: Inferred dictionary from corpus.
Return type: Dictionary

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
>>> dct = Dictionary.from_corpus(corpus)
>>> len(dct)
3
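
Continuing the session, an id2word mapping can be passed through (the mapping here is hypothetical):

>>> id2word = {0: 'a', 1: 'b', 2: 'c'}  # hypothetical id -> word mapping
>>> dct = Dictionary.from_corpus(corpus, id2word=id2word)
>>> print(dct[1])
b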
static from_documents(documents)

Create a Dictionary from a collection of documents. Equivalent to Dictionary(documents).

Parameters: documents (iterable of iterable of str) – Input corpus.
Returns: Dictionary initialized from the documents.
Return type: Dictionary
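
Examples

A minimal sketch (equivalent to calling the constructor directly):

>>> from gensim.corpora import Dictionary
>>>
>>> dct = Dictionary.from_documents([["a", "b"], ["b", "c"]])
>>> len(dct)
3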
get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys()

Get all stored ids.

Returns: List of all token ids.
Return type: list of int
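
Examples

The inherited Mapping interface maps ids to tokens, so the keys are the integer ids; a small sketch:

>>> from gensim.corpora import Dictionary
>>>
>>> dct = Dictionary([["a", "b"], ["c"]])
>>> sorted(dct.keys())
[0, 1, 2]
>>> print(dct[dct.token2id['b']])  # Mapping lookup: id -> token
b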
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When called on an object instance instead of the class (this is a class method).
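
Examples

A save/load round trip in the style of the other examples:

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> dct = Dictionary([["máma", "mele", "maso"]])
>>> dct.save(tmp_fname)
>>> loaded_dct = Dictionary.load(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id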
static load_from_text(fname)

Load a previously stored Dictionary from a text file. Mirror function to save_as_text().

Parameters: fname (str) – Path to a file produced by save_as_text().

See also

save_as_text()

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
merge_with(other)

Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids.

Notes

The purpose is to merge two corpora created using two different dictionaries: self and other. other can be any id=>word mapping (a dict, a Dictionary object, …).

Get a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary.

Warning

This method modifies self (the dictionary it is called on) in place.

Parameters: other (Dictionary) – Other dictionary.
Returns: Transformation object.
Return type: gensim.models.VocabTransform

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
>>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1)]
>>> transformer = dct_1.merge_with(dct_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1), (3, 2)]
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Byte-size threshold for the automatic separation: arrays smaller than this are not stored separately.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized (stored).
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

save_as_text(fname, sort_by_word=True)

Save Dictionary to a text file.

Parameters:
  • fname (str) – Path to output file.
  • sort_by_word (bool, optional) – If True, sort by word in lexicographical order.

Notes

Format:

num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]

Warning

The text format should be used for corpus inspection only. Use save() and load() to store in binary format (pickle) for better performance.

See also

load_from_text()

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
values() → list of D's values