downloader – Downloader API for gensim
This module is an API for downloading, getting information and loading datasets/models.
See RaRe-Technologies/gensim-data repo for more information about models/datasets/how-to-add-new/etc.
Get information about available models/datasets:
>>> import gensim.downloader as api
>>>
>>> api.info() # return dict with info about available models/datasets
>>> api.info("text8") # return dict with info about "text8" dataset
Model example:
>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25") # load glove vectors
>>> model.most_similar("cat") # show the words most similar to 'cat'
Dataset example:
>>> import gensim.downloader as api
>>> from gensim.models import Word2Vec
>>>
>>> dataset = api.load("text8") # load dataset as iterable
>>> model = Word2Vec(dataset) # train w2v model
This API is also available via the CLI:
python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info name # same as api.info(name_only=True)
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
You may specify the local subdirectory for saving gensim data using the GENSIM_DATA_DIR environment variable. For example:
$ export GENSIM_DATA_DIR=/tmp/gensim-data
$ python -m gensim.downloader --download <dataname>
By default, this subdirectory is ~/gensim-data.
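The lookup described above can be sketched in a few lines. The helper `resolve_data_dir` is hypothetical (it is not part of gensim); it only mirrors the documented behaviour: use GENSIM_DATA_DIR when it is set, otherwise fall back to ~/gensim-data.

```python
import os


def resolve_data_dir(environ=None):
    """Hypothetical helper mirroring the documented lookup; not a gensim function."""
    # Allow a plain dict to be passed in so the lookup can be checked
    # without touching the real process environment.
    env = os.environ if environ is None else environ
    default_dir = os.path.expanduser("~/gensim-data")
    return env.get("GENSIM_DATA_DIR", default_dir)


# With the variable set, the override wins.
print(resolve_data_dir({"GENSIM_DATA_DIR": "/tmp/gensim-data"}))
# Without it, the documented default under the home directory is used.
print(resolve_data_dir({}))
```

When setting the variable from Python rather than from the shell, it is probably safest to set it before `gensim.downloader` is first imported, on the assumption that the module reads the variable at import time.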
- gensim.downloader.BASE_DIR = '/home/misha/gensim-data'
The default location to store downloaded data.
You may override this with the GENSIM_DATA_DIR environment variable.
- gensim.downloader.info(name=None, show_only_latest=True, name_only=False)
Provide the information related to model/dataset.
- Parameters
name (str, optional) – Name of model/dataset. If not set, information about all available data is returned.
show_only_latest (bool, optional) – If the storage contains several versions of one dataset/model, this flag hides the outdated versions. Has an effect only if name is None.
name_only (bool, optional) – If True, will return only the names of available models and corpora.
- Returns
Detailed information about one or all models/datasets. If name is specified, returns full information about that dataset/model; otherwise, returns information about all available datasets/models.
- Return type
dict
- Raises
Exception – If the name that was passed is incorrect.
Examples
>>> import gensim.downloader as api
>>> api.info("text8") # retrieve information about text8 dataset
{u'checksum': u'68799af40b6bda07dfa47a32612e5364',
 u'description': u'Cleaned small sample from wikipedia',
 u'file_name': u'text8.gz',
 u'parts': 1,
 u'source': u'https://mattmahoney.net/dc/text8.zip'}
>>>
>>> api.info() # retrieve information about all available datasets and models
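As a small illustration of working with the returned dict, the sketch below formats a single record into a one-line summary. The function `summarize_entry` is a hypothetical helper, and the sample record is copied from the `api.info("text8")` output shown in the example; in real use the record would come from `api.info(name)`.

```python
def summarize_entry(name, entry):
    """Hypothetical helper: format one record of the kind returned by api.info(name)."""
    # Use .get() with fallbacks, since not every record is guaranteed
    # to carry every key shown in the text8 example.
    description = entry.get("description", "(no description)")
    parts = entry.get("parts", 1)
    file_name = entry.get("file_name", "?")
    return "{}: {} [{} part(s), file {}]".format(name, description, parts, file_name)


# Record copied from the documented api.info("text8") output.
sample = {
    "checksum": "68799af40b6bda07dfa47a32612e5364",
    "description": "Cleaned small sample from wikipedia",
    "file_name": "text8.gz",
    "parts": 1,
    "source": "https://mattmahoney.net/dc/text8.zip",
}

print(summarize_entry("text8", sample))
```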
- gensim.downloader.load(name, return_path=False)
Download (if needed) a dataset/model and load it into memory (unless return_path is set).
- Parameters
name (str) – Name of the model/dataset.
return_path (bool, optional) – If True, return full path to file, otherwise, return loaded model / iterable dataset.
- Returns
Model – Requested model, if name refers to a model and return_path == False.
Dataset (iterable) – Requested dataset, if name refers to a dataset and return_path == False.
str – Path to file with dataset / model, only when return_path == True.
- Raises
Exception – Raised if name is incorrect.
Examples
Model example:
>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25") # load glove vectors
>>> model.most_similar("cat") # show the words most similar to 'cat'
Dataset example:
>>> import gensim.downloader as api
>>>
>>> wiki = api.load("wiki-en") # load extracted Wikipedia dump, around 6 GB
>>> for article in wiki: # iterate over all articles
...     pass
Download only example:
>>> import gensim.downloader as api
>>>
>>> print(api.load("wiki-en", return_path=True)) # output: /home/user/gensim-data/wiki-en/wiki-en.gz
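Since `load` is documented to raise an Exception for an incorrect name, a caller may want to guard the download-only call. The following is a minimal sketch under stated assumptions: `fetch_path` is a hypothetical wrapper, and `fake_load` is a stand-in with a made-up path so the example runs offline. In real use one would pass `api.load` as the `loader` argument.

```python
def fetch_path(name, loader):
    """Hypothetical wrapper: return the local path for `name`, or None on failure.

    `loader` is expected to behave like gensim.downloader.load.
    """
    try:
        return loader(name, return_path=True)
    except Exception as err:  # the documentation only promises a generic Exception
        print("could not fetch {!r}: {}".format(name, err))
        return None


def fake_load(name, return_path=False):
    # Offline stand-in for api.load; the path below is invented for illustration.
    if name != "text8":
        raise Exception("Incorrect model/corpus name")
    return "/tmp/gensim-data/text8/text8.gz"


print(fetch_path("text8", fake_load))         # path returned by the stand-in
print(fetch_path("no-such-data", fake_load))  # None, after reporting the error
```

With gensim installed, the equivalent call would be `fetch_path("text8", api.load)` after `import gensim.downloader as api`.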