corpora.dictionary – Construct word<->id mappings

This module implements the concept of a Dictionary – a mapping between words and their integer ids.

gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)
Bases: gensim.utils.SaveLoad, collections.abc.Mapping
Dictionary encapsulates the mapping between normalized words and their integer ids.
Notable instance attributes:
token2id (dict of (str, int)) – token -> token_id.
id2token (dict of (int, str)) – Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
cfs (dict of (int, int)) – Collection frequencies: token_id -> how many instances of this token are contained in the documents.
dfs (dict of (int, int)) – Document frequencies: token_id -> how many documents contain this token.
num_docs (int) – Number of documents processed.
num_pos (int) – Total number of corpus positions (number of processed words).
num_nnz (int) – Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
documents (iterable of iterable of str, optional) – Documents to be used to initialize the mapping and collect corpus statistics.
prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts) # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more documents (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
add_documents(documents, prune_at=2000000)
Update dictionary from a collection of documents.
documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.
prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = ["máma mele maso".split(), "ema má máma".split()]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> len(dct)
10
compactify()
Assign new word ids to all words, shrinking any gaps.
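Gap shrinking amounts to remapping the sorted old ids onto a contiguous 0..N-1 range. A minimal pure-Python sketch of that idea (not gensim's actual implementation, which also remaps the cfs and dfs statistics):

```python
def shrink_gaps(token2id):
    """Remap arbitrary ids onto a contiguous 0..N-1 range, preserving id order."""
    idmap = {old_id: new_id for new_id, old_id in enumerate(sorted(token2id.values()))}
    return {token: idmap[old_id] for token, old_id in token2id.items()}

print(shrink_gaps({"cat": 0, "emu": 2, "dog": 5}))  # {'cat': 0, 'emu': 1, 'dog': 2}
```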
doc2bow(document, allow_update=False, return_missing=False)
Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.
document (list of str) – Input document.
allow_update (bool, optional) – Update self, by adding new tokens from document and updating internal corpus statistics.
return_missing (bool, optional) – Return missing tokens (tokens present in document but not in self) with frequencies?
list of (int, int) – BoW representation of document.
list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.
Examples
>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {'this': 1, 'is': 1})
doc2idx(document, unknown_word_index=-1)
Convert document (a list of words) into a list of indexes = list of token ids. Replace all unknown words, i.e. words not in the dictionary, with the index set via unknown_word_index.
document (list of str) – Input document.
unknown_word_index (int, optional) – Index to use for words not in the dictionary.
Token ids for tokens in document, in the same order.
list of int
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["a", "a", "b"], ["a", "c"]]
>>> dct = Dictionary(corpus)
>>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
[0, 0, 2, -1, 2]
filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
Filter out tokens in the dictionary by their frequency.
no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.
Notes
This removes all tokens in the dictionary that are:

1. Less frequent than no_below documents (absolute number, e.g. 5), or
2. More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).
3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function!
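The rules above reduce to a simple predicate over document frequencies. A hypothetical sketch of that predicate (gensim's real implementation differs in details such as keep_tokens handling and the gap shrinking that follows):

```python
def surviving_ids(dfs, num_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Apply the no_below / no_above / keep_n rules to a document-frequency map."""
    no_above_abs = int(no_above * num_docs)  # fraction -> absolute document count
    ids = [tid for tid, df in dfs.items() if no_below <= df <= no_above_abs]
    ids.sort(key=lambda tid: dfs[tid], reverse=True)  # most frequent first
    return ids if keep_n is None else ids[:keep_n]

# token 0 appears in too few docs; of the survivors, keep only the most frequent
print(surviving_ids({0: 1, 1: 3, 2: 5}, num_docs=10, no_below=2, keep_n=1))  # [2]
```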
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> len(dct)
1
filter_n_most_frequent(remove_n)
Filter out the remove_n most frequent tokens that appear in the documents.
remove_n (int) – Number of the most frequent tokens that will be removed.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_n_most_frequent(2)
>>> len(dct)
3
filter_tokens(bad_ids=None, good_ids=None)
Remove the selected bad_ids tokens from Dictionary. Alternatively, keep selected good_ids in Dictionary and remove the rest.
bad_ids (iterable of int, optional) – Collection of word ids to be removed.
good_ids (collection of int, optional) – Keep selected collection of word ids and remove the rest.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> 'ema' in dct.token2id
True
>>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
>>> 'ema' in dct.token2id
False
>>> len(dct)
4
>>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
>>> len(dct)
1
from_corpus(corpus, id2word=None)
Create Dictionary from an existing corpus.
corpus (iterable of iterable of (int, number)) – Corpus in BoW format.
id2word (dict of (int, object)) – Mapping id -> word. If None, the mapping id2word[word_id] = str(word_id) will be used.
Notes
This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus. This method will scan the term-document count matrix for all word ids that appear in it, then construct a Dictionary which maps each word_id -> id2word[word_id]. If id2word isn't specified, the mapping id2word[word_id] = str(word_id) will be used.
Dictionary – Inferred dictionary from corpus.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
>>> dct = Dictionary.from_corpus(corpus)
>>> len(dct)
3
from_documents(documents)
Create Dictionary from documents. Equivalent to Dictionary(documents=documents).
documents (iterable of iterable of str) – Input corpus.
Dictionary initialized from documents.
get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items.
iteritems()
iterkeys() – Iterate over all tokens.
itervalues()
keys() – Get all stored ids.
list of int – List of all token ids.
load(fname, mmap=None)
Load an object previously saved using save() from a file.
fname (str) – Path to file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save() – Save object to file.
object – Object loaded from fname.
AttributeError – When called on an object instance instead of class (this is a class method).
load_from_text(fname)
Load a previously stored Dictionary from a text file. Mirror function to save_as_text().
fname (str) – Path to a file produced by save_as_text().
See also
save_as_text() – Save Dictionary to a text file.
Examples
>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
merge_with(other)
Merge another dictionary into this dictionary, mapping the same tokens to the same ids and new tokens to new ids.
Notes
The purpose is to merge two corpora created using two different dictionaries: self and other. other can be any id=>word mapping (a dict, a Dictionary object, …).
Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary.
other ({dict, Dictionary}) – Other dictionary.
Transformation object.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
>>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1)]
>>> transformer = dct_1.merge_with(dct_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1), (3, 2)]
patch_with_special_tokens(special_token_dict)
Patch token2id and id2token using a dictionary of special tokens.
Use case: when doing sequence modeling (e.g. named entity recognition), one may want to specify special tokens that behave differently from others. One example is the "unknown" token; another is the padding token. It is usual to set the padding token to have index 0, and patching the dictionary with {'<PAD>': 0} would be one way to specify this.
special_token_dict (dict of (str, int)) – dict containing the special tokens as keys and their wanted indices as values.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>>
>>> special_tokens = {'pad': 0, 'space': 1}
>>> print(dct.token2id)
{'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
>>>
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing of the large arrays in RAM between multiple processes. If a list of str: store these attributes into separate files; the automated size check is not performed in this case.
sep_limit (int, optional) – Don't store arrays smaller than this (in bytes) separately.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load() – Load object from file.
save_as_text(fname, sort_by_word=True)
Save Dictionary to a text file.
fname (str) – Path to output file.
sort_by_word (bool, optional) – Sort words in lexicographical order before writing them out?
Notes
Format:
num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable to other tools and frameworks. For better performance and to store the entire object state, including collected corpus statistics, use save() and load() instead.
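As an illustration of how simple the layout above is to consume from other tools, here is a hypothetical parser for it (load_from_text() is the supported way to read these files):

```python
def parse_dictionary_text(lines):
    """Parse the layout above: first line num_docs, then id<TAB>word<TAB>doc_freq rows."""
    it = iter(lines)
    num_docs = int(next(it))
    entries = {}
    for line in it:
        token_id, word, doc_freq = line.rstrip("\n").split("\t")
        entries[word] = (int(token_id), int(doc_freq))
    return num_docs, entries

sample = ["2\n", "0\tmaso\t1\n", "1\tmele\t1\n", "2\tmáma\t2\n"]
print(parse_dictionary_text(sample))
```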
See also
load_from_text() – Load Dictionary from a text file.
Examples
>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
values() → an object providing a view on D's values.