utils – Various utility functions

This module contains various general utility functions.

class gensim.utils.ClippedCorpus(corpus, max_docs=None)

Bases: gensim.utils.SaveLoad

Wrap a corpus and return at most the first max_docs documents from it.

Parameters:
  • corpus (iterable of iterable of (int, int)) – Input corpus.
  • max_docs (int) – Maximal number of documents in result corpus.

Warning

Any documents after max_docs are ignored. This effectively limits the length of the returned corpus to <= max_docs. Set max_docs=None for “no limit”, effectively wrapping the entire input corpus.
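
Example (an illustrative sketch with a tiny hypothetical corpus):

>>> from gensim.utils import ClippedCorpus
>>> corpus = [[(1, 0.5)], [], [(2, 1.0)]]  # 3 documents in BoW format
>>> list(ClippedCorpus(corpus, max_docs=2))  # keep only the first 2
[[(1, 0.5)], []]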

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows efficiently mmap’ing large arrays back on load. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Size limit (in bytes) for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.utils.FakeDict(num_terms)

Bases: object

Objects of this class act as dictionaries that map integer -> str(integer), for the range of integers [0, num_terms).

This is meant to avoid allocating real dictionaries when num_terms is huge, which is a waste of memory.

Parameters:num_terms (int) – Number of terms.
get(val, default=None)
iteritems()

Iterate over all keys and values.

Yields:(int, str) – Pair of (id, token).
keys()

Override the dict.keys(), which is used to determine the maximum internal id of a corpus, i.e. the vocabulary dimensionality.

Returns:Highest id, packed in list.
Return type:list of int

Warning

To avoid materializing the whole range(0, self.num_terms), this returns the highest id = [self.num_terms - 1] only.
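
Example (an illustrative sketch):

>>> from gensim.utils import FakeDict
>>> d = FakeDict(5)
>>> d.get(3)  # any id in [0, 5) maps to its string form
'3'
>>> d.keys()  # only the highest id, to avoid materializing the whole range
[4]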

class gensim.utils.InputQueue(q, corpus, chunksize, maxsize, as_numpy)

Bases: multiprocessing.process.Process

authkey
daemon

Return whether process is a daemon

exitcode

Return exit code of process or None if it has yet to stop

ident

Return identifier (PID) of process or None if it has yet to start

is_alive()

Return whether process is alive

join(timeout=None)

Wait until child process terminates

name
pid

Return identifier (PID) of process or None if it has yet to start

run()

Method to be run in sub-process; can be overridden in sub-class

start()

Start child process

terminate()

Terminate process; sends SIGTERM signal or uses TerminateProcess()

class gensim.utils.RepeatCorpus(corpus, reps)

Bases: gensim.utils.SaveLoad

Wrap a corpus as another corpus of length reps. This is achieved by repeating documents from corpus over and over again, until the requested length len(result) == reps is reached. Repetition is done on the fly (efficiently, via itertools).

Examples

>>> from gensim.utils import RepeatCorpus
>>>
>>> corpus = [[(1, 2)], []] # 2 documents
>>> list(RepeatCorpus(corpus, 5)) # repeat 2.5 times to get 5 documents
[[(1, 2)], [], [(1, 2)], [], [(1, 2)]]
Parameters:
  • corpus (iterable of iterable of (int, int)) – Input corpus.
  • reps (int) – Number of repeats for documents from corpus.
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows efficiently mmap’ing large arrays back on load. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Size limit (in bytes) for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.utils.RepeatCorpusNTimes(corpus, n)

Bases: gensim.utils.SaveLoad

Wrap a corpus and repeat it n times.

Examples

>>> from gensim.utils import RepeatCorpusNTimes
>>>
>>> corpus = [[(1, 0.5)], []]
>>> list(RepeatCorpusNTimes(corpus, 3)) # repeat 3 times
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)], []]
Parameters:
  • corpus (iterable of iterable of (int, int)) – Input corpus.
  • n (int) – Number of repeats for corpus.
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows efficiently mmap’ing large arrays back on load. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Size limit (in bytes) for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.utils.SaveLoad

Bases: object

Classes that inherit from this class have save() and load() methods, which un/pickle them to disk.

Warning

This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.
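
Example (a minimal sketch; the subclass and file path are hypothetical):

>>> from gensim.utils import SaveLoad
>>>
>>> class Vocab(SaveLoad):  # any picklable subclass gains save()/load()
...     def __init__(self, words):
...         self.words = words
...
>>> Vocab(['hello', 'world']).save('/tmp/vocab.gensim')
>>> Vocab.load('/tmp/vocab.gensim').words
['hello', 'world']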

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows efficiently mmap’ing large arrays back on load. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Size limit (in bytes) for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.utils.SlicedCorpus(corpus, slice_)

Bases: gensim.utils.SaveLoad

Wrap a corpus and return a slice of it.

Parameters:
  • corpus (iterable of iterable of (int, int)) – Input corpus.
  • slice_ (slice or iterable) – Slice to apply to corpus.

Notes

Negative slicing can only be used if the corpus is indexable, otherwise, the corpus will be iterated over. Slice can also be a np.ndarray to support fancy indexing.

Calculating the size of a SlicedCorpus is expensive when using a slice as the corpus has to be iterated over once. Using a list or np.ndarray does not have this drawback, but consumes more memory.

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows efficiently mmap’ing large arrays back on load. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Size limit (in bytes) for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

gensim.utils.any2unicode(text, encoding='utf8', errors='strict')

Convert text to unicode.

Parameters:
  • text (str) – Input text.
  • errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
  • encoding (str, optional) – Encoding of text for unicode function (python2 only).
Returns:

Unicode version of text.

Return type:

str

gensim.utils.any2utf8(text, errors='strict', encoding='utf8')

Convert text to bytestring in utf8.

Parameters:
  • text (str) – Input text.
  • errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
  • encoding (str, optional) – Encoding of text for unicode function (python2 only).
Returns:

Bytestring in utf8.

Return type:

str
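
Example (an illustrative round trip between any2utf8() and any2unicode()):

>>> from gensim.utils import any2unicode, any2utf8
>>> utf8_bytes = any2utf8(u'svět')       # unicode -> utf8-encoded bytestring
>>> any2unicode(utf8_bytes) == u'svět'   # and back again
True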

gensim.utils.call_on_class_only(*args, **kwargs)

Helper that raises AttributeError when a method should be called on the class itself, not on an instance.

Parameters:
  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
Raises:

AttributeError – If the method is called on an instance.

gensim.utils.check_output(stdout=-1, *popenargs, **kwargs)

Run command with arguments and return its output as a byte string. Backported from Python 2.7, where it's implemented as pure Python in the stdlib, with a small modification. Widely used by gensim.models.wrappers.

Very similar to [6].

Examples

>>> from gensim.utils import check_output
>>> check_output(args=['echo', '1'])
'1\n'
Raises:KeyboardInterrupt – If Ctrl+C pressed.

References

[6]https://docs.python.org/2/library/subprocess.html#subprocess.check_output
gensim.utils.chunkize(corpus, chunksize, maxsize=0, as_numpy=False)

Split corpus into fixed-size chunks, using chunkize_serial().

Parameters:
  • corpus (iterable of object) – Any iterable object.
  • chunksize (int) – Size of chunk from result.
  • maxsize (int, optional) – Size of the advance-chunk queue; see Notes. Ignored on platforms without os.fork (e.g. Windows), where chunking is always serial.
  • as_numpy (bool, optional) – If True, yield chunks as np.ndarray, otherwise as lists.

Notes

Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok, chunking is done efficiently via itertools.

If maxsize > 1, don’t wait idly in between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate process, and is meant to reduce I/O delays, which can be significant when corpus comes from a slow medium (like HDD).

If maxsize == 0, don't use any parallelism and simply yield chunks serially via chunkize_serial() (no I/O optimizations).

Yields:list of object OR np.ndarray – Groups based on iterable
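
Example (an illustrative sketch; with maxsize=0 the chunks are yielded serially):

>>> from gensim.utils import chunkize
>>> for chunk in chunkize(range(8), chunksize=3, maxsize=0):
...     print(chunk)
[0, 1, 2]
[3, 4, 5]
[6, 7]
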
gensim.utils.chunkize_serial(iterable, chunksize, as_numpy=False)

Yield elements from iterable in lists of length chunksize. The last returned chunk may be smaller (if the length of the collection is not divisible by chunksize).

Parameters:
  • iterable (iterable of object) – Any iterable.
  • chunksize (int) – Size of chunk from result.
  • as_numpy (bool, optional) – If True, yield chunks as np.ndarray, otherwise as lists.
Yields:

list of object OR np.ndarray – Groups based on iterable

Examples

>>> from gensim.utils import chunkize_serial
>>> print(list(chunkize_serial(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

Note that grouper() is an alias for this function.

gensim.utils.copytree_hardlink(source, dest)

Recursively copy a directory à la shutil.copytree, but hardlink files instead of copying.

Parameters:
  • source (str) – Path to source directory
  • dest (str) – Path to destination directory

Warning

Available on UNIX systems only.

gensim.utils.deaccent(text)

Remove accentuation from the given string.

Parameters:text (str) – Input string.
Returns:Unicode string without accentuation.
Return type:str

Examples

>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
gensim.utils.decode_htmlentities(text)

Decode HTML entities in text, coded as hex, decimal or named. This function is based on [3].

Parameters:text (str) – Input html text.

Examples

>>> from gensim.utils import decode_htmlentities
>>>
>>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
>>> print(decode_htmlentities(u).encode('UTF-8'))
E tu vivrai nel terrore - L'aldilà (1981)
>>> print(decode_htmlentities("l&#39;eau"))
l'eau
>>> print(decode_htmlentities("foo &lt; bar"))
foo < bar

References

[3]http://github.com/sku/python-twitter-ircbot/blob/321d94e0e40d0acc92f5bf57d126b57369da70de/html_decode.py
gensim.utils.deprecated(reason)

Decorator which can be used to mark functions as deprecated.

Parameters:reason (str) – Reason of deprecation.
Returns:Decorated function
Return type:function

Notes

It will result in a warning being emitted when the function is used, base code from [4].

References

[4]https://stackoverflow.com/a/40301488/8001386
gensim.utils.dict_from_corpus(corpus)

Scan corpus for all word ids that appear in it, then construct a mapping which maps each word_id -> str(word_id).

Parameters:corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
Returns:id2word – “Fake” mapping which maps each word_id -> str(word_id).
Return type:FakeDict

Warning

This function is used whenever words need to be displayed (as opposed to just their ids) but no word_id -> word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest word_id found.
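
Example (an illustrative sketch with a toy corpus):

>>> from gensim.utils import dict_from_corpus
>>> corpus = [[(1, 1.0)], [(3, 2.0)]]  # highest word_id is 3
>>> id2word = dict_from_corpus(corpus)
>>> id2word.get(3)
'3'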

gensim.utils.file_or_filename(input)

Open file with smart_open.

Parameters:input (str or file-like) – Filename or file-like object.
Returns:input – An opened file if input was a filename, otherwise input itself, with its position reset to byte 0.
Return type:file-like object
gensim.utils.flatten(nested_list)

Recursively flatten out a nested list.

Parameters:nested_list (list) – Possibly nested list.
Returns:Flattened version of input, where any list elements have been unpacked into the top-level list in a recursive fashion.
Return type:list
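
Example (an illustrative sketch):

>>> from gensim.utils import flatten
>>> flatten([[1, 2], [3, [4, 5]], 6])
[1, 2, 3, 4, 5, 6]
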
gensim.utils.getNS(host=None, port=None, broadcast=True, hmac_key=None)

Get a Pyro4 name server proxy.

Parameters:
  • host (str, optional) – Hostname of the name server.
  • port (int, optional) – Port of the name server.
  • broadcast (bool, optional) – If True, use the broadcast mechanism (reach all Pyro nodes in the local network); otherwise don't.
  • hmac_key (str, optional) – Private key.
Raises:

RuntimeError – when Pyro name server is not found

Returns:

Proxy from Pyro4.

Return type:

Pyro4.core.Proxy

gensim.utils.get_max_id(corpus)

Get the highest feature id that appears in the corpus.

Parameters:corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
Returns:Highest feature id.
Return type:int

Notes

Returns -1 for an empty corpus.
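
Example (an illustrative sketch):

>>> from gensim.utils import get_max_id
>>> get_max_id([[(1, 2)], [(5, 1), (2, 3)]])
5
>>> get_max_id([])  # empty corpus
-1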

gensim.utils.get_my_ip()

Try to obtain our external ip (from the Pyro4 nameserver’s point of view)

Returns:IP address.
Return type:str

Warning

This tries to sidestep the issue of bogus /etc/hosts entries and other local misconfiguration, which often mess up hostname resolution. If all else fails, fall back to simple socket.gethostbyname() lookup.

gensim.utils.get_random_state(seed)

Generate numpy.random.RandomState based on input seed.

Parameters:seed ({None, int, array_like}) – Seed for random state.
Returns:Random state.
Return type:numpy.random.RandomState
Raises:AttributeError – If seed is not {None, int, array_like}.

Notes

Method originally from [1] and written by @joshloyal.

References

[1]https://github.com/maciejkula/glove-python
gensim.utils.grouper(iterable, chunksize, as_numpy=False)

Yield elements from iterable in lists of length chunksize. The last returned chunk may be smaller (if the length of the collection is not divisible by chunksize). This function is an alias for chunkize_serial().

Parameters:
  • iterable (iterable of object) – Any iterable.
  • chunksize (int) – Size of chunk from result.
  • as_numpy (bool, optional) – If True, yield chunks as np.ndarray, otherwise as lists.
Yields:

list of object OR np.ndarray – Groups based on iterable

Examples

>>> print(list(grouper(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
gensim.utils.has_pattern()

Check whether the pattern [5] package is installed.

Returns:True if pattern installed, False otherwise.
Return type:bool

References

[5] https://github.com/clips/pattern
gensim.utils.identity(p)

Identity function, for workflows that don't accept a lambda (pickling etc).

Parameters:p (object) – Input parameter.
Returns:Same as p.
Return type:object
gensim.utils.is_corpus(obj)

Check whether obj is a corpus.

Parameters:obj (object) – Something iterable of iterables that contains (int, int).
Returns:Pair of (is_corpus, obj): is_corpus is True if obj is a corpus. The returned obj is the input, possibly re-wrapped so that documents peeked at during the check are not lost.
Return type:(bool, object)

Warning

An “empty” corpus (empty input sequence) is ambiguous, so in this case the result is forcefully defined as (False, obj).
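
Example (an illustrative sketch; a plain list of BoW documents qualifies, a string does not):

>>> from gensim.utils import is_corpus
>>> is_corpus([[(1, 1.0)], [(2, 0.5)]])
(True, [[(1, 1.0)], [(2, 0.5)]])
>>> is_corpus('definitely not a corpus')
(False, 'definitely not a corpus')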

gensim.utils.iter_windows(texts, window_size, copy=False, ignore_below_size=True, include_doc_num=False)

Produce a generator over the given texts using a sliding window of window_size. The windows produced are views of some subsequence of a text. To use deep copies instead, pass copy=True.

Parameters:
  • texts (list of str) – List of string sentences.
  • window_size (int) – Size of sliding window.
  • copy (bool, optional) – If True, produce deep copies instead of views.
  • ignore_below_size (bool, optional) – If True, ignore documents that are not at least window_size in length.
  • include_doc_num (bool, optional) – If True, yield the document number (doc_num) along with each window.
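
Example (an illustrative sketch; texts here is a single pre-tokenized document, and each window is converted to a plain list for readability, since the raw windows are numpy views):

>>> from gensim.utils import iter_windows
>>> [list(w) for w in iter_windows([['a', 'b', 'c', 'd']], 2)]
[['a', 'b'], ['b', 'c'], ['c', 'd']]
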
gensim.utils.keep_vocab_item(word, count, min_count, trim_rule=None)

Check whether word should be kept in the vocabulary or removed.

Parameters:
  • word (str) – Input word.
  • count (int) – Number of times the word appears in the corpus.
  • min_count (int) – Frequency threshold for word.
  • trim_rule (function, optional) – Custom rule for deciding whether to keep or discard the word; if None, the default rule count >= min_count applies.
Returns:

True if word should stay, False otherwise.

Return type:

bool
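
Example (an illustrative sketch of the default rule count >= min_count):

>>> from gensim.utils import keep_vocab_item
>>> keep_vocab_item('apple', count=3, min_count=2)
True
>>> keep_vocab_item('apple', count=1, min_count=2)
False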

gensim.utils.lazy_flatten(nested_list)

Lazy version of flatten().

Parameters:nested_list (list) – Possibly nested list.
Yields:object – Element of list
gensim.utils.lemmatize(content, allowed_tags=<_sre.SRE_Pattern object>, light=False, stopwords=frozenset([]), min_length=2, max_length=15)

Use the English lemmatizer from pattern [5] to extract UTF8-encoded tokens in their base form (lemma), e.g. “are, is, being” -> “be” etc. This is a smarter version of stemming, taking word context into account.

Parameters:
  • content (str) – Input string
  • allowed_tags (_sre.SRE_Pattern, optional) – Compiled regexp selecting the POS tags to keep. By default only nouns, verbs, adjectives and adverbs are considered (all other lemmas are discarded).
  • light (bool, optional) – DEPRECATED FLAG, no longer supported by pattern.
  • stopwords (frozenset) – Set of words that will be removed from output.
  • min_length (int) – Minimal token length in output (inclusive).
  • max_length (int) – Maximal token length in output (inclusive).
Returns:

List with tokens with POS tag.

Return type:

list of str

Warning

This function is only available when the optional ‘pattern’ package is installed.

Examples

>>> from gensim.utils import lemmatize
>>> lemmatize('Hello World! How is it going?! Nonexistentword, 21')
['world/NN', 'be/VB', 'go/VB', 'nonexistentword/NN']
>>> lemmatize('The study ranks high.')
['study/NN', 'rank/VB', 'high/JJ']
>>> lemmatize('The ranks study hard.')
['rank/NN', 'study/VB', 'hard/RB']
gensim.utils.mock_data(n_items=1000, dim=1000, prob_nnz=0.5, lam=1.0)

Create a random gensim-style corpus (BoW), using mock_data_row().

Parameters:
  • n_items (int) – Size of corpus
  • dim (int) – Dimension of vector, used for mock_data_row().
  • prob_nnz (float, optional) – Probability that each coordinate will be nonzero (nonzero values are drawn from a Poisson distribution); used for mock_data_row().
  • lam (float, optional) – Parameter for Poisson distribution, used for mock_data_row().
Returns:

Gensim-style corpus.

Return type:

list of list of (int, float)

gensim.utils.mock_data_row(dim=1000, prob_nnz=0.5, lam=1.0)

Create a random gensim BoW vector.

Parameters:
  • dim (int, optional) – Dimension of vector.
  • prob_nnz (float, optional) – Probability that each coordinate will be nonzero (nonzero values are drawn from a Poisson distribution).
  • lam (float, optional) – Parameter for Poisson distribution.
Returns:

Vector in BoW format.

Return type:

list of (int, float)

gensim.utils.open_file(*args, **kwds)

Provide “with-like” behaviour except closing the file object.

Parameters:input (str or file-like) – Filename or file-like object.
Yields:file – File-like object based on input (or input if this already file-like).
gensim.utils.pickle(obj, fname, protocol=2)

Pickle object obj to file fname.

Parameters:
  • obj (object) – Any python object.
  • fname (str) – Path to pickle file.
  • protocol (int, optional) – Pickle protocol number; the default of 2 keeps pickles compatible across Python 2.x and 3.x.
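
Example (an illustrative round trip with unpickle(); the file path is hypothetical):

>>> from gensim.utils import pickle, unpickle
>>> pickle({'a': 1}, '/tmp/example.pkl')
>>> unpickle('/tmp/example.pkl')
{'a': 1}
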
gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None)

Remove all entries from the vocab dictionary with count smaller than min_reduce.

Modifies vocab in place, returns the sum of all counts that were pruned.

Parameters:
  • vocab (dict) – Input dictionary.
  • min_reduce (int) – Frequency threshold for tokens in vocab.
  • trim_rule (function, optional) – Custom rule for deciding whether to keep or discard an entry; the default behaviour discards entries whose count is smaller than min_reduce.

Returns:result – Sum of all counts that were pruned.
Return type:int
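
Example (an illustrative sketch; 'rare' falls below the threshold and its count of 1 is returned as the pruned total):

>>> from gensim.utils import prune_vocab
>>> vocab = {'the': 10, 'rare': 1, 'word': 3}
>>> prune_vocab(vocab, min_reduce=2)
1
>>> sorted(vocab.items())
[('the', 10), ('word', 3)]
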
gensim.utils.pyro_daemon(name, obj, random_suffix=False, ip=None, port=None, ns_conf=None)

Register object with the name server (starting the name server if not running yet) and block until the daemon is terminated. The object is registered under name, or name plus a random suffix if random_suffix is set.

gensim.utils.qsize(queue)

Get the (approximate) queue size where available.

Parameters:queue (queue.Queue) – Input queue.
Returns:Queue size, -1 if qsize method isn’t implemented (OS X).
Return type:int
gensim.utils.randfname(prefix='gensim')

Generate a path with a random filename.

Parameters:prefix (str) – Prefix of filename.
Returns:Full path with random filename (in temporary folder).
Return type:str
gensim.utils.revdict(d)

Reverse a dictionary mapping, i.e. {1: 2, 3: 4} -> {2: 1, 4: 3}.

Parameters:d (dict) – Input dictionary.
Returns:Reversed dictionary mapping.
Return type:dict

Notes

When two keys map to the same value, only one of them will be kept in the result (which one is kept is arbitrary).

Examples

>>> from gensim.utils import revdict
>>> d = {1: 2, 3: 4}
>>> revdict(d)
{2: 1, 4: 3}
gensim.utils.safe_unichr(intval)

A safe variant of unichr(): convert an integer code point to the corresponding unicode character, even for values where a plain unichr() may fail (e.g. on narrow Python 2 builds).

Parameters:intval (int) – Integer code point of the character.
Returns:Unicode string containing the character.
Return type:str
gensim.utils.sample_dict(d, n=10, use_random=True)

Pick n items from dictionary d.

Parameters:
  • d (dict) – Input dictionary.
  • n (int, optional) – Number of items that will be picked.
  • use_random (bool, optional) – If True - pick items randomly, otherwise - according to natural dict iteration.
Returns:

Picked items from dictionary, represented as list.

Return type:

list of (object, object)
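
Example (an illustrative sketch; with use_random=False items follow the dict's natural iteration order, so the exact output may vary across Python versions):

>>> from gensim.utils import sample_dict
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> sample_dict(d, n=2, use_random=False)
[('a', 1), ('b', 2)]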

gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

Convert a document into a list of lowercase tokens, optionally removing accents, using tokenize().

Parameters:
  • doc (str) – Input document.
  • deacc (bool, optional) – If True - remove accentuation from string by deaccent().
  • min_len (int, optional) – Minimal length of token in result (inclusive).
  • max_len (int, optional) – Maximal length of token in result (inclusive).
Returns:

Tokens extracted from doc.

Return type:

list of str
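
Example (an illustrative sketch; tokens are lowercased and purely numeric tokens are dropped):

>>> from gensim.utils import simple_preprocess
>>> simple_preprocess('Hello, World! The year is 2018.')
['hello', 'world', 'the', 'year', 'is']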

gensim.utils.simple_tokenize(text)

Tokenize input text using gensim.utils.PAT_ALPHABETIC.

Parameters:text (str) – Input text.
Yields:str – Tokens from text.
gensim.utils.smart_extension(fname, ext)

Append the extension ext to fname, keeping any compression suffix (‘.gz’ or ‘.bz2’) last.

Parameters:
  • fname (str) – Path to file.
  • ext (str) – File extension.
Returns:

New path to file with ext.

Return type:

str
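
Example (an illustrative sketch; the new extension is inserted before a trailing compression suffix):

>>> from gensim.utils import smart_extension
>>> smart_extension("corpus.pkl.gz", ".vectors")
'corpus.pkl.vectors.gz'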

gensim.utils.strided_windows(ndarray, window_size)

Produce a numpy.ndarray of windows, as from a sliding window.

Parameters:
  • ndarray (numpy.ndarray) – Input array
  • window_size (int) – Sliding window size.
Returns:

Subsequences produced by sliding a window of the given size over the ndarray. Since this uses striding, the individual arrays are views rather than copies of ndarray. Changes to one view modify the others and the original.

Return type:

numpy.ndarray

Examples

>>> import numpy as np
>>> from gensim.utils import strided_windows
>>> strided_windows(np.arange(5), 2)
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4]])
>>> strided_windows(np.arange(10), 5)
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])
gensim.utils.synchronous(tlockname)

A decorator to place an instance-based lock around a method.

Notes

Adapted from [2]

References

[2]http://code.activestate.com/recipes/577105-synchronization-decorator-for-class-methods/
gensim.utils.to_unicode(text, encoding='utf8', errors='strict')

Convert text to unicode.

Parameters:
  • text (str) – Input text.
  • errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
  • encoding (str, optional) – Encoding of text for unicode function (python2 only).
Returns:

Unicode version of text.

Return type:

str

gensim.utils.to_utf8(text, errors='strict', encoding='utf8')

Convert text to bytestring in utf8.

Parameters:
  • text (str) – Input text.
  • errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
  • encoding (str, optional) – Encoding of text for unicode function (python2 only).
Returns:

Bytestring in utf8.

Return type:

str

gensim.utils.tokenize(text, lowercase=False, deacc=False, encoding='utf8', errors='strict', to_lower=False, lower=False)

Iteratively yield tokens as unicode strings, optionally removing accent marks and lowercasing the string if any of lowercase, to_lower or lower is set to True.

Parameters:
  • text (str) – Input string.
  • lowercase (bool, optional) – If True - lowercase input string.
  • deacc (bool, optional) – If True - remove accentuation from string by deaccent().
  • encoding (str, optional) – Encoding of input string, used as parameter for to_unicode().
  • errors (str, optional) – Error handling behaviour, used as parameter for to_unicode().
  • to_lower (bool, optional) – Same as lowercase.
  • lower (bool, optional) – Same as lowercase.
Yields:

str – Contiguous sequences of alphabetic characters (no digits!), using simple_tokenize()

Examples

>>> from gensim.utils import tokenize
>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
gensim.utils.toptexts(*args, **kwargs)

Debug function to help inspect the top n most similar documents (according to a similarity index index), to see if they are actually related to the query.

Parameters:
  • query (list) – Query vector OR BoW document (list of tuples).
  • texts (object) – Object that can return something insightful for each document via texts[docid], such as its fulltext or snippet.
  • index (any) – A class from gensim.similarities.docsim.
Returns:

A list of 3-tuples (docid, the doc’s similarity to the query, texts[docid]).

Return type:

list

gensim.utils.unpickle(fname)

Load object from fname.

Parameters:fname (str) – Path to pickle file.
Returns:Python object loaded from fname.
Return type:object
gensim.utils.upload_chunked(*args, **kwargs)

Memory-friendly upload of documents to a SimServer (or Pyro SimServer proxy).

Notes

Use this function to train or index large collections – avoid sending the entire corpus over the wire as a single Pyro in-memory object. The documents will be sent in smaller chunks, of chunksize documents each.