downloader – Downloader API for gensim
This module is an API for downloading datasets/models, getting information about them, and loading them.
See the RaRe-Technologies/gensim-data repository for more information about the available models and datasets and how to add new ones.
Get information about available models/datasets:
>>> import gensim.downloader as api
>>>
>>> api.info() # return dict with info about available models/datasets
>>> api.info("text8") # return dict with info about "text8" dataset
Model example:
>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25") # load glove vectors
>>> model.most_similar("cat") # show words that are similar to the word 'cat'
Dataset example:
>>> import gensim.downloader as api
>>> from gensim.models import Word2Vec
>>>
>>> dataset = api.load("text8") # load dataset as iterable
>>> model = Word2Vec(dataset) # train w2v model
This API is also available via the CLI:
python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info name # same as api.info(name_only=True)
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
You may specify the local subdirectory for saving gensim data using the GENSIM_DATA_DIR environment variable. For example:
$ export GENSIM_DATA_DIR=/tmp/gensim-data
$ python -m gensim.downloader --download <dataname>
By default, this subdirectory is ~/gensim-data.
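The lookup described above can be sketched in a few lines. This is an illustration only, not gensim's source: `resolve_data_dir` is a hypothetical helper, and its fallback simply mirrors the `~/gensim-data` default stated here.

```python
import os


def resolve_data_dir(environ=None):
    """Return GENSIM_DATA_DIR if set, else the documented default ~/gensim-data.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    if environ is None:
        environ = os.environ
    default_dir = os.path.expanduser("~/gensim-data")
    return environ.get("GENSIM_DATA_DIR", default_dir)


# With the variable set, the override wins.
print(resolve_data_dir({"GENSIM_DATA_DIR": "/tmp/gensim-data"}))
# Without it, the default under the home directory is used.
print(resolve_data_dir({}))
```

If BASE_DIR is computed when the module is imported, which its being a module-level value suggests, the variable would need to be set before `import gensim.downloader` runs.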
gensim.downloader.BASE_DIR = '/home/misha/gensim-data'
The default location to store downloaded data.
You may override this with the GENSIM_DATA_DIR environment variable.
gensim.downloader.info(name=None, show_only_latest=True, name_only=False)
Provide information about a model/dataset.
Parameters:
name (str, optional) – Name of the model/dataset. If not set, information about all available data is returned.
show_only_latest (bool, optional) – If the storage contains several versions of the same dataset/model, this flag hides the outdated versions. Takes effect only if name is None.
name_only (bool, optional) – If True, return only the names of the available models and corpora.
Returns: Detailed information about one or all models/datasets. If name is specified, full information about that dataset/model is returned; otherwise, information about all available datasets/models is returned.
Return type: dict
Raises: Exception – If the name that was passed is incorrect.
Examples
>>> import gensim.downloader as api
>>> api.info("text8") # retrieve information about text8 dataset
{u'checksum': u'68799af40b6bda07dfa47a32612e5364',
u'description': u'Cleaned small sample from wikipedia',
u'file_name': u'text8.gz',
u'parts': 1,
u'source': u'http://mattmahoney.net/dc/text8.zip'}
>>>
>>> api.info() # retrieve information about all available datasets and models
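The dictionary returned by `api.info()` can be reduced to a plain listing of names. The sketch below works on a hand-written stand-in so it runs without network access. It assumes the full result is grouped under the top-level keys `"corpora"` and `"models"`, which the text above does not spell out, and `names_by_kind` is a hypothetical helper.

```python
def names_by_kind(info):
    """Map each top-level group to a sorted list of the names it contains."""
    return {kind: sorted(entries) for kind, entries in info.items()}


# Stand-in for the result of api.info(); the real call needs network access.
# ASSUMPTION: the top-level keys are "corpora" and "models".
# The per-entry details are trimmed placeholders, not real metadata.
sample_info = {
    "corpora": {"text8": {"parts": 1}, "wiki-en": {}},
    "models": {"glove-twitter-25": {}},
}

print(names_by_kind(sample_info))
```

With gensim installed, `names_by_kind(api.info())` should give a comparable listing if that assumption holds; `api.info(name_only=True)` is the built-in way to request names only.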
gensim.downloader.load(name, return_path=False)
Download the dataset/model (if needed) and load it into memory (unless return_path is set).
Parameters:
name (str) – Name of the model/dataset.
return_path (bool, optional) – If True, return the full path to the file; otherwise, return the loaded model / iterable dataset.
Returns:
Model – Requested model, if name refers to a model and return_path == False.
Dataset (iterable) – Requested dataset, if name refers to a dataset and return_path == False.
str – Path to the file with the dataset / model, only when return_path == True.
Raises: Exception – Raised if name is incorrect.
Examples
Model example:
>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25") # load glove vectors
>>> model.most_similar("cat") # show words that are similar to the word 'cat'
Dataset example:
>>> import gensim.downloader as api
>>>
>>> wiki = api.load("wiki-en") # load extracted Wikipedia dump, around 6 Gb
>>> for article in wiki: # iterate over all Wikipedia articles
...     pass
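For a dataset of this size, it can help to inspect a few records before committing to a full pass. A minimal sketch, using a stand-in generator in place of the real download; `peek` and `fake_dataset` are hypothetical names, and the record shape is invented for the demo.

```python
from itertools import islice


def peek(dataset, n=3):
    """Return the first n records of an iterable without reading the rest."""
    return list(islice(dataset, n))


def fake_dataset():
    """Stand-in for a large iterable dataset; the records are made up."""
    for i in range(1_000_000):
        yield {"id": i}


# Only two records are produced; the generator is never exhausted.
print(peek(fake_dataset(), 2))
```

Since `api.load` is described as returning an iterable dataset, `peek(api.load("wiki-en"))` should work the same way, although the download itself still has to complete first.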
Download only example:
>>> import gensim.downloader as api
>>>
>>> print(api.load("wiki-en", return_path=True)) # output: /home/user/gensim-data/wiki-en/wiki-en.gz
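A returned path can be checked against the `checksum` field shown in the `api.info("text8")` output. The sketch below assumes that field is an MD5 hex digest of the downloaded file; its 32-character form suggests this, but the text here does not confirm it. `md5_of_file` and `matches_checksum` are hypothetical helpers, demonstrated on a throwaway file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Demo on a temporary file rather than a real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    print(matches_checksum(tmp.name, hashlib.md5(b"hello").hexdigest()))
finally:
    os.remove(tmp.name)
```

With gensim installed, the two arguments could come from `api.load("text8", return_path=True)` and `api.info("text8")["checksum"]`, provided the MD5 assumption holds for that entry.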