corpora.dictionary – Construct word<->id mappings

This module implements the concept of a Dictionary – a mapping between words and their integer ids.

class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)

Bases: gensim.utils.SaveLoad, _abcoll.Mapping

Dictionary encapsulates the mapping between normalized words and their integer ids.

Notable instance attributes:

token2id

dict of (str, int) – token -> tokenId.

id2token

dict of (int, str) – Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).

dfs

dict of (int, int) – Document frequencies: token_id -> how many documents contain this token.

num_docs

int – Number of documents processed.

num_pos

int – Total number of corpus positions (number of processed words).

num_nnz

int – Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).

Parameters:
  • documents (iterable of iterable of str, optional) – Documents to be used to initialize the mapping and collect corpus statistics.
  • prune_at (int, optional) – Dictionary will keep no more than prune_at words in its mapping, to limit its RAM footprint.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts)  # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more documents (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
add_documents(documents, prune_at=2000000)

Update dictionary from a collection of documents.

Parameters:
  • documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.
  • prune_at (int, optional) – Dictionary will keep no more than prune_at words in its mapping, to limit its RAM footprint.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = ["máma mele maso".split(), "ema má máma".split()]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> len(dct)
10
compactify()

Assign new word ids to all words, shrinking any gaps.

doc2bow(document, allow_update=False, return_missing=False)

Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.

Parameters:
  • document (list of str) – Input document.
  • allow_update (bool, optional) – Update self, by adding new tokens from document and updating internal corpus statistics.
  • return_missing (bool, optional) – Return missing tokens (tokens present in document but not in self) with frequencies?
Returns:

  • list of (int, int) – BoW representation of document.
  • list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.

Examples

>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
doc2idx(document, unknown_word_index=-1)

Convert document (a list of words) into a list of token ids. Replace all unknown words, i.e. words not in the dictionary, with the index set via unknown_word_index.

Parameters:
  • document (list of str) – Input document
  • unknown_word_index (int, optional) – Index to use for words not in the dictionary.
Returns:

Token ids for tokens in document, in the same order.

Return type:

list of int

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["a", "a", "b"], ["a", "c"]]
>>> dct = Dictionary(corpus)
>>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
[0, 0, 2, -1, 2]
filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter out tokens in the dictionary by their frequency.

Parameters:
  • no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
  • no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
  • keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
  • keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.

Notes

This removes all tokens in the dictionary that are:

  1. Contained in fewer than no_below documents (absolute number, e.g. 5) or
  2. Contained in more than no_above documents (fraction of the total corpus size, e.g. 0.3).
  3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function!

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> len(dct)
1
filter_n_most_frequent(remove_n)

Filter out the ‘remove_n’ most frequent tokens that appear in the documents.

Parameters: remove_n (int) – Number of the most frequent tokens that will be removed.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_n_most_frequent(2)
>>> len(dct)
3
filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from Dictionary.

Alternatively, keep selected good_ids in Dictionary and remove the rest.

Parameters:
  • bad_ids (iterable of int, optional) – Collection of word ids to be removed.
  • good_ids (collection of int, optional) – Keep selected collection of word ids and remove the rest.

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> 'ema' in dct.token2id
True
>>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
>>> 'ema' in dct.token2id
False
>>> len(dct)
4
>>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
>>> len(dct)
1
static from_corpus(corpus, id2word=None)

Create Dictionary from an existing corpus.

Parameters:
  • corpus (iterable of iterable of (int, number)) – Corpus in BoW format.
  • id2word (dict of (int, object)) – Mapping id -> word. If None, the mapping id2word[word_id] = str(word_id) will be used.

Notes

This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus. This method scans the term-document count matrix for all word ids that appear in it, then constructs a Dictionary which maps each word_id -> id2word[word_id]. If id2word isn’t specified, the mapping id2word[word_id] = str(word_id) is used.

Returns: Dictionary inferred from corpus.
Return type: Dictionary

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
>>> dct = Dictionary.from_corpus(corpus)
>>> len(dct)
3
static from_documents(documents)

Create Dictionary from documents.

Equivalent to Dictionary(documents=documents).

Parameters: documents (iterable of iterable of str) – Input corpus.
Returns: Dictionary initialized from documents.
Return type: Dictionary
get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys()

Get all stored ids.

Returns: List of all token ids.
Return type: list of int
load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), mmap must be None.

See also

save()
Save object to file.
Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of a class (this is a class method).
static load_from_text(fname)

Load a previously stored Dictionary from a text file.

Mirror function to save_as_text().

Parameters: fname (str) – Path to a file produced by save_as_text().

See also

save_as_text()
Save Dictionary to text file.

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
merge_with(other)

Merge another dictionary into this dictionary, mapping the same tokens to the same ids and new tokens to new ids.

Notes

The purpose is to merge two corpora created using two different dictionaries: self and other. other can be any id=>word mapping (a dict, a Dictionary object, …).

Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary.

Parameters: other ({dict, Dictionary}) – Other dictionary.
Returns: Transformation object.
Return type: gensim.models.VocabTransform

Examples

>>> from gensim.corpora import Dictionary
>>>
>>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
>>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1)]
>>> transformer = dct_1.merge_with(dct_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1), (3, 2)]
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this value (in bytes) separately.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
save_as_text(fname, sort_by_word=True)

Save Dictionary to a text file.

Parameters:
  • fname (str) – Path to output file.
  • sort_by_word (bool, optional) – Sort words in lexicographical order before writing them out?

Notes

Format:

num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]

This text format is great for corpus inspection and debugging. As plaintext, it’s also easily portable to other tools and frameworks. For better performance and to store the entire object state, including collected corpus statistics, use save() and load() instead.

See also

load_from_text()
Load Dictionary from text file.

Examples

>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
values() → list of D's values