downloader – Downloader API for gensim

This module provides an API for downloading datasets/models, getting information about them, and loading them.

Get information about the available models/datasets:

>>> import gensim.downloader as api
>>>
>>> api.info()  # return dict with info about available models/datasets
>>> api.info("text8")  # return dict with info about "text8" dataset

Model example:

>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25")  # load glove vectors
>>> model.most_similar("cat")  # show the words most similar to 'cat'

Dataset example:

>>> import gensim.downloader as api
>>> from gensim.models import Word2Vec
>>>
>>> dataset = api.load("text8")  # load dataset as iterable
>>> model = Word2Vec(dataset)  # train w2v model

This API is also available from the command line:

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
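If the CLI needs to be driven from a script, one option is to build the same command lines using the interpreter that is currently running. The sketch below is illustrative only: `build_downloader_command` and `run_downloader` are hypothetical helpers, not part of gensim, and they rely only on the `--info` and `--download` flags shown above.

```python
import subprocess
import sys


def build_downloader_command(flag, dataname):
    """Build the argument list for `python -m gensim.downloader`.

    Hypothetical helper (not part of gensim); only the two flags
    documented above are accepted.
    """
    if flag not in ("--info", "--download"):
        raise ValueError("unsupported flag: %r" % (flag,))
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "gensim.downloader", flag, dataname]


def run_downloader(flag, dataname):
    """Run the CLI and return its exit code (needs gensim and network access)."""
    return subprocess.call(build_downloader_command(flag, dataname))
```

For example, `run_downloader("--info", "text8")` would run the first command line above for the "text8" dataset.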

gensim.downloader.info(name=None, show_only_latest=True)

Provide the information related to model/dataset.

Parameters:
  • name (str, optional) – Name of the model/dataset. If not set, information about all available data is returned.
  • show_only_latest (bool, optional) – If the storage contains several versions of a dataset/model, this flag hides the outdated ones. Only has an effect when name is None.
Returns:

Detailed information about one or all models/datasets. If name is specified, full information about that dataset/model is returned; otherwise, information about all available datasets/models is returned.

Return type:

dict

Raises:

Exception – If the given name is incorrect.

Examples

>>> import gensim.downloader as api
>>> api.info("text8")  # retrieve information about text8 dataset
{u'checksum': u'68799af40b6bda07dfa47a32612e5364',
 u'description': u'Cleaned small sample from wikipedia',
 u'file_name': u'text8.gz',
 u'parts': 1,
 u'source': u'http://mattmahoney.net/dc/text8.zip'}
>>>
>>> api.info()  # retrieve information about all available datasets and models
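The full listing returned by `api.info()` can be large, so it is often handy to reduce it to names only. The sketch below assumes that the dict groups entries under top-level keys such as "corpora" and "models", each mapping a name to its description dict. That layout is an assumption not stated on this page, `list_names` is a hypothetical helper, and the sample dict is made up for illustration.

```python
def list_names(info, section):
    """Return the sorted names found under one section of an info dict.

    Assumes `info` looks like {"corpora": {name: {...}}, "models": {name: {...}}};
    this layout is an assumption, so a missing section yields an empty list.
    """
    return sorted(info.get(section, {}))


# Made-up stand-in for the result of api.info(), shaped as assumed above.
sample_info = {
    "corpora": {"text8": {"parts": 1}},
    "models": {"glove-twitter-25": {"parts": 1}},
}

print(list_names(sample_info, "corpora"))  # ['text8']
print(list_names(sample_info, "models"))   # ['glove-twitter-25']
```

With gensim installed, `list_names(api.info(), "models")` would be the corresponding call, under the same assumption about the dict layout.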

gensim.downloader.load(name, return_path=False)

Download the dataset/model if needed and load it into memory (unless return_path is set).

Parameters:
  • name (str) – Name of the model/dataset.
  • return_path (bool, optional) – If True, return full path to file, otherwise, return loaded model / iterable dataset.
Returns:

  • Model – The requested model, if name refers to a model and return_path == False.
  • Dataset (iterable) – The requested dataset, if name refers to a dataset and return_path == False.
  • str – Path to the file with the dataset/model, only when return_path == True.

Raises:

Exception – Raised if name is incorrect.
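Because `load` raises for an incorrect name, callers sometimes wrap it to fall back to a default. The following is a minimal sketch, not gensim's own behaviour: `load_or_default` is a hypothetical helper, the loader is passed in as an argument so that `api.load` can be supplied in real use, and `fake_loader` with its made-up return values exists only to keep the example runnable offline.

```python
def load_or_default(loader, name, default=None, **kwargs):
    """Call loader(name, **kwargs) and return `default` if it raises.

    Hypothetical helper. `loader` is expected to be gensim.downloader.load
    in real use; the docs above only promise a plain Exception for an
    incorrect name, so that is what is caught here.
    """
    try:
        return loader(name, **kwargs)
    except Exception as err:
        print("could not load %r: %s" % (name, err))
        return default


def fake_loader(name, return_path=False):
    """Stand-in for api.load with made-up results, for offline illustration."""
    if name != "text8":
        raise Exception("incorrect name: %s" % name)
    return "/tmp/text8.gz" if return_path else ["anarchism", "originated"]


print(load_or_default(fake_loader, "text8", return_path=True))
print(load_or_default(fake_loader, "no-such-name", default=[]))
```

Catching a bare `Exception` is broad and would also hide unrelated failures such as network errors, so a real wrapper may want to log or re-raise instead of silently returning the default.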

Examples

Model example:

>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25")  # load glove vectors
>>> model.most_similar("cat")  # show the words most similar to 'cat'

Dataset example:

>>> import gensim.downloader as api
>>>
>>> wiki = api.load("wiki-en")  # load extracted Wikipedia dump, around 6 GB
>>> for article in wiki:  # iterate over all Wikipedia articles
...     ...

Download only example:

>>> import gensim.downloader as api
>>>
>>> print(api.load("wiki-en", return_path=True))  # output: /home/user/gensim-data/wiki-en/wiki-en.gz
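With `return_path=True` nothing is loaded, so the caller decides how to read the file. As a sketch, assuming the returned path points at a gzip-compressed file (as the `.gz` suffix in the output above suggests), the standard library can peek at its first decompressed bytes. `peek_gzip` is a hypothetical helper, and the content layout of any particular dataset is not specified here.

```python
import gzip


def peek_gzip(path, num_bytes=80):
    """Return the first `num_bytes` decompressed bytes of a gzip file.

    Hypothetical helper; assumes `path` is a gzip file, e.g. a path
    obtained from api.load(name, return_path=True).
    """
    with gzip.open(path, "rb") as handle:
        return handle.read(num_bytes)
```

A call such as `peek_gzip(api.load("text8", return_path=True))` would then show the start of the file without loading the whole dataset.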