corpora.dictionary – Construct word<->id mappings¶
This module implements the concept of a Dictionary – a mapping between words and their integer ids.
- class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)¶
Bases: SaveLoad, Mapping
Dictionary encapsulates the mapping between normalized words and their integer ids.
Notable instance attributes:
- token2id¶
token -> token_id. I.e. the reverse mapping to self[token_id].
- Type
dict of (str, int)
- cfs¶
Collection frequencies: token_id -> how many instances of this token are contained in the documents.
- Type
dict of (int, int)
- dfs¶
Document frequencies: token_id -> how many documents contain this token.
- Type
dict of (int, int)
- num_docs¶
Number of documents processed.
- Type
int
- num_pos¶
Total number of corpus positions (number of processed words).
- Type
int
- num_nnz¶
Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
- Type
int
- Parameters
documents (iterable of iterable of str, optional) – Documents to be used to initialize the mapping and collect corpus statistics.
prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts)  # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more documents (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
- add_documents(documents, prune_at=2000000)¶
Update dictionary from a collection of documents.
- Parameters
documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.
prune_at (int, optional) – Dictionary will try to keep no more than prune_at words in its mapping, to limit its RAM footprint; correctness is not guaranteed. Use filter_extremes() to perform proper filtering.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = ["máma mele maso".split(), "ema má máma".split()]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
>>> len(dct)
10
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object's save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- compactify()¶
Assign new word ids to all words, shrinking any gaps.
- doc2bow(document, allow_update=False, return_missing=False)¶
Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.
- Parameters
document (list of str) – Input document.
allow_update (bool, optional) – Update self, by adding new tokens from document and updating internal corpus statistics.
return_missing (bool, optional) – Return missing tokens (tokens present in document but not in self) with frequencies?
- Returns
list of (int, int) – BoW representation of document.
list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.
Examples
>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
>>> dct.doc2bow(["this", "is", "máma"])
[(2, 1)]
>>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
([(2, 1)], {u'this': 1, u'is': 1})
- doc2idx(document, unknown_word_index=-1)¶
Convert document (a list of words) into a list of indexes (token ids). Unknown words, i.e. words not in the dictionary, are replaced with unknown_word_index.
- Parameters
document (list of str) – Input document
unknown_word_index (int, optional) – Index to use for words not in the dictionary.
- Returns
Token ids for tokens in document, in the same order.
- Return type
list of int
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["a", "a", "b"], ["a", "c"]]
>>> dct = Dictionary(corpus)
>>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
[0, 0, 2, -1, 2]
- filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)¶
Filter out tokens in the dictionary by their frequency.
- Parameters
no_below (int, optional) – Keep tokens which are contained in at least no_below documents.
no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
keep_n (int, optional) – Keep only the first keep_n most frequent tokens.
keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.
Notes
This removes all tokens in the dictionary that are:
1. Less frequent than no_below documents (absolute number, e.g. 5) or
2. More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).
3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).
After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function! See gensim.models.VocabTransform and the dedicated FAQ entry on how to transform a corpus built with a dictionary before pruning.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
>>> len(dct)
1
- filter_n_most_frequent(remove_n)¶
Filter out the ‘remove_n’ most frequent tokens that appear in the documents.
- Parameters
remove_n (int) – Number of the most frequent tokens that will be removed.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> len(dct)
5
>>> dct.filter_n_most_frequent(2)
>>> len(dct)
3
- filter_tokens(bad_ids=None, good_ids=None)¶
Remove the selected bad_ids tokens from Dictionary. Alternatively, keep selected good_ids in Dictionary and remove the rest.
- Parameters
bad_ids (iterable of int, optional) – Collection of word ids to be removed.
good_ids (collection of int, optional) – Keep selected collection of word ids and remove the rest.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>> 'ema' in dct.token2id
True
>>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
>>> 'ema' in dct.token2id
False
>>> len(dct)
4
>>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
>>> len(dct)
1
- static from_corpus(corpus, id2word=None)¶
Create Dictionary from an existing corpus.
- Parameters
corpus (iterable of iterable of (int, number)) – Corpus in BoW format.
id2word (dict of (int, object)) – Mapping id -> word. If None, the mapping id2word[word_id] = str(word_id) will be used.
Notes
This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus. This method will scan the term-document count matrix for all word ids that appear in it, then construct a Dictionary which maps each word_id -> id2word[word_id]. id2word is an optional dictionary mapping a word_id to a token; if it isn't specified, the mapping id2word[word_id] = str(word_id) is used.
- Returns
Inferred dictionary from corpus.
- Return type
Dictionary
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
>>> dct = Dictionary.from_corpus(corpus)
>>> len(dct)
3
- static from_documents(documents)¶
Create Dictionary from documents.
Equivalent to Dictionary(documents=documents).
- Parameters
- Parameters
documents (iterable of iterable of str) – Input corpus.
- Returns
Dictionary initialized from documents.
- Return type
Dictionary
- get(k[, d]) → D[k] if k in D, else d (d defaults to None)¶
- items() → a set-like object providing a view on D's items¶
- iteritems()¶
- iterkeys()¶
Iterate over all tokens.
- itervalues()¶
- keys()¶
Get all stored ids.
- Returns
List of all token ids.
- Return type
list of int
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- static load_from_text(fname)¶
Load a previously stored Dictionary from a text file.
Mirror function to save_as_text().
- Parameters
fname (str) – Path to a file produced by save_as_text().
Examples
>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
- merge_with(other)¶
Merge another dictionary into this dictionary, mapping the same tokens to the same ids and new tokens to new ids.
Notes
The purpose is to merge two corpora created using two different dictionaries: self and other. other can be any id=>word mapping (a dict, a Dictionary object, …).
Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary.
- Parameters
other ({dict, Dictionary}) – Other dictionary.
- Returns
Transformation object.
- Return type
gensim.models.VocabTransform
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
>>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1)]
>>> transformer = dct_1.merge_with(dct_2)
>>> dct_1.doc2bow(corpus_2[0])
[(0, 1), (3, 2)]
- most_common(n: Optional[int] = None) → List[Tuple[str, int]]¶
Return a list of the n most common words and their counts from the most common to the least.
Words with equal counts are ordered in the increasing order of their ids.
- Parameters
n (int or None, optional) – The number of most common words to be returned. If None, all words in the dictionary will be returned. Default is None.
- Returns
most_common – The n most common words and their counts from the most common to the least.
- Return type
list of (str, int)
- patch_with_special_tokens(special_token_dict)¶
Patch token2id and id2token using a dictionary of special tokens.
Use case: when doing sequence modeling (e.g. named entity recognition), you may want to specify special tokens that behave differently than others. One example is the "unknown" token; another is the padding token. It is usual to set the padding token to have index 0, and patching the dictionary with {'<PAD>': 0} would be one way to specify this.
- Parameters
special_token_dict (dict of (str, int)) – dict containing the special tokens as keys and their wanted indices as values.
Examples
>>> from gensim.corpora import Dictionary
>>>
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>> dct = Dictionary(corpus)
>>>
>>> special_tokens = {'pad': 0, 'space': 1}
>>> print(dct.token2id)
{'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
>>>
>>> dct.patch_with_special_tokens(special_tokens)
>>> print(dct.token2id)
{'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- save_as_text(fname, sort_by_word=True)¶
Save Dictionary to a text file.
- Parameters
fname (str) – Path to output file.
sort_by_word (bool, optional) – Sort words in lexicographical order before writing them out?
Notes
Format:
num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable to other tools and frameworks. For better performance, and to store the entire object state including collected corpus statistics, use save() and load() instead.
Examples
>>> from gensim.corpora import Dictionary
>>> from gensim.test.utils import get_tmpfile
>>>
>>> tmp_fname = get_tmpfile("dictionary")
>>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
>>>
>>> dct = Dictionary(corpus)
>>> dct.save_as_text(tmp_fname)
>>>
>>> loaded_dct = Dictionary.load_from_text(tmp_fname)
>>> assert dct.token2id == loaded_dct.token2id
- values() → an object providing a view on D's values¶