models.fasttext – FastText model

class gensim.models.fasttext.FastText(sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, loss='ns', sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000)

Bases: gensim.models.word2vec.Word2Vec
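For context, a minimal usage sketch (the toy corpus, dimensions and hyperparameter values below are illustrative; parameter names follow the signature above and may differ in other gensim releases):

from gensim.models.fasttext import FastText

# Toy corpus: each sentence is a list of unicode tokens.
sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

# Train a small skip-gram FastText model.
model = FastText(sentences, size=50, window=3, min_count=1, sg=1, iter=10)

# Character n-grams allow vectors for words outside the training vocabulary.
oov_vector = model.wv["computing"]
print(model.wv.most_similar("computer", topn=3))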

accuracy(questions, restrict_vocab=30000, most_similar=None, case_insensitive=True)
build_vocab(sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False)
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build the model vocabulary from a dictionary of word frequencies, i.e. a mapping of (word, word count) pairs. Words must be unicode strings.

Parameters:
  • word_freq (dict) – A dictionary mapping each word to its count.
  • keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
  • corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
  • trim_rule – Vocabulary trimming rule; specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
  • update (bool) – If true, the new words provided in the word_freq dict will be added to the model’s vocab.
Return type:

None

Examples

>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)
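A slightly fuller sketch of the documented call, assuming the frequency counts were gathered elsewhere (the counts, dimensions and corpus_count below are illustrative, and exact behaviour may vary between gensim releases):

from gensim.models.fasttext import FastText

# Word counts collected by some external process.
word_freq = {u"human": 50, u"computer": 120, u"interface": 30}

model = FastText(size=50, min_count=5)                   # no corpus passed yet
model.build_vocab_from_freq(word_freq, corpus_count=200) # 200 sentences assumed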
clear_sims()

Remove all L2-normalized word vectors from the model. You will have to recompute them using the init_sims() method.

create_binary_tree()

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().

delete_temporary_training_data(replace_word_vectors_with_normalized=False)

Discard parameters that are used in training and scoring. Use if you’re sure you’re done training a model. If replace_word_vectors_with_normalized is set, forget the original vectors and only keep the normalized ones, which saves a lot of memory.
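A hedged sketch of the intended workflow (model is assumed to be a fully trained FastText instance; the filename is illustrative):

# Once training is finished, shrink the model before persisting it.
# With replace_word_vectors_with_normalized=True only the L2-normalized
# vectors are kept: similarity queries still work, further training does not.
model.delete_temporary_training_data(replace_word_vectors_with_normalized=True)
model.save("fasttext_query_only.model")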

doesnt_match(words)

Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.KeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)

Estimate required memory for a model using current settings and provided vocabulary size.

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.KeyedVectors.evaluate_word_pairs

finalize_vocab(update=False)

Build tables and model weights based on final vocabulary settings.

get_latest_training_loss()
get_vocab_word_vecs()
init_ngrams(update=False)
init_sims(replace=False)

init_sims() resides in KeyedVectors because it mainly deals with syn0. However, syn1 is not an attribute of KeyedVectors, so it has to be deleted in this class, while the normalization of syn0 happens inside KeyedVectors.
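A short sketch of the typical call, assuming model is a trained FastText instance:

# Precompute L2-normalized vectors. With replace=True the normalized vectors
# overwrite the originals, saving memory but making further training impossible.
model.init_sims(replace=True)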

initialize_word_vectors()
intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')

Merge the input-hidden weight matrix from the original C word2vec-tool format given, where it intersects with the current vocabulary. (No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)

binary is a boolean indicating whether the data is in binary word2vec format.

lockf is a lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.
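A hedged sketch of one way to combine this with your own corpus; "pretrained.bin" is a placeholder path to a C-format word2vec file whose dimensionality must match size, and sentences is a corpus iterable as in the class-level example above:

model = FastText(size=300, min_count=5)
model.build_vocab(sentences)                 # build vocabulary from our own corpus

# Words present in both vocabularies adopt the pretrained weights;
# lockf=1.0 lets those vectors keep updating during subsequent training.
model.intersect_word2vec_format("pretrained.bin", binary=True, lockf=1.0)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)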

load(*args, **kwargs)
classmethod load_fasttext_format(*args, **kwargs)
load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)

Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

log_accuracy(section)
log_evaluate_word_pairs(pearson, spearman, oov, pairs)

Deprecated. Use self.wv.log_evaluate_word_pairs() instead. Refer to the documentation for gensim.models.KeyedVectors.log_evaluate_word_pairs

make_cum_table(power=0.75, domain=2147483647)

Create a cumulative-distribution table using stored vocabulary word counts for drawing random words in the negative-sampling training routines.

To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]), then find that integer’s sorted insertion point (as if by bisect_left or ndarray.searchsorted()). That insertion point is the drawn index, which comes up in proportion equal to the increment at that slot.

Called internally from ‘build_vocab()’.
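The drawing procedure can be illustrated with a toy table (the increments below are made up; in the real model the table lives in model.cum_table):

import numpy as np

# Cumulative table for a 4-word vocabulary, already scaled into integers.
cum_table = np.array([100, 250, 700, 1000])

rng = np.random.RandomState(1)
draw = rng.randint(cum_table[-1])            # random integer below cum_table[-1]
word_index = cum_table.searchsorted(draw)    # insertion point = sampled word index
# word 2 owns the largest increment (700 - 250), so it is drawn most often.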

most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Deprecated. Use self.wv.most_similar() instead. Refer to the documentation for gensim.models.KeyedVectors.most_similar

most_similar_cosmul(positive=None, negative=None, topn=10)

Deprecated. Use self.wv.most_similar_cosmul() instead. Refer to the documentation for gensim.models.KeyedVectors.most_similar_cosmul

n_similarity(ws1, ws2)

Deprecated. Use self.wv.n_similarity() instead. Refer to the documentation for gensim.models.KeyedVectors.n_similarity

predict_output_word(context_words_list, topn=10)

Report the probability distribution of the center word given the context words as input to the trained model.
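A minimal sketch of a call, assuming model was trained with negative sampling (negative > 0) on a corpus containing these words:

# Probability distribution over likely center words for the given context.
print(model.predict_output_word(["graph", "trees"], topn=5))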

reset_from(other_model)

Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

reset_ngram_weights()
reset_weights()

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

save(*args, **kwargs)
save_word2vec_format(fname, fvocab=None, binary=False)

Deprecated. Use model.wv.save_word2vec_format instead.

scale_vocab(min_count=None, sample=None, dry_run=False, keep_raw_vocab=False, trim_rule=None, update=False)

Apply vocabulary settings for min_count (discarding less-frequent words) and sample (controlling the downsampling of more-frequent words).

Calling with dry_run=True will only simulate the provided settings and report the size of the retained vocabulary, effective corpus length, and estimated memory requirements. Results are both printed via logging and returned as a dict.

Delete the raw vocabulary after the scaling is done to free up RAM, unless keep_raw_vocab is set.
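A hedged sketch of a dry run; scan_vocab() must have collected the raw counts first, and sentences, min_count and sample values are illustrative:

model = FastText(size=50)
model.scan_vocab(sentences)        # collect raw word counts, no trimming yet

# Simulate stricter settings without modifying the model; the returned dict
# reports retained vocabulary size, effective corpus length and memory estimates.
report = model.scale_vocab(min_count=10, sample=1e-4, dry_run=True)
print(report)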

scan_vocab(sentences, progress_per=10000, trim_rule=None)

Do an initial scan of all words appearing in sentences.

score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)

Score the log probability for a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings. This does not change the fitted model in any way (see Word2Vec.train() for that).

We have currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; scoring will run into problems if you ask to score more sentences than this, but setting the value too high is inefficient.

See the article by [1] and the gensim demo at [2] for examples of how to use such scores in document classification.

[1]Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
[2]https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
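A hedged sketch following the documented usage; the corpus and the scored sentence are illustrative, and the model must have been trained with hs=1 and negative=0:

# Hierarchical softmax is required for scoring.
model = FastText(sentences, size=50, min_count=1, hs=1, negative=0, iter=10)

# Log probability of each sentence; total_sentences caps how many will be scored.
log_probs = model.score([["human", "computer", "interface"]], total_sentences=1)
print(log_probs)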
seeded_vector(seed_string)

Create one ‘random’ vector (but deterministic by seed_string)

similar_by_vector(vector, topn=10, restrict_vocab=None)

Deprecated. Use self.wv.similar_by_vector() instead. Refer to the documentation for gensim.models.KeyedVectors.similar_by_vector

similar_by_word(word, topn=10, restrict_vocab=None)

Deprecated. Use self.wv.similar_by_word() instead. Refer to the documentation for gensim.models.KeyedVectors.similar_by_word

similarity(w1, w2)

Deprecated. Use self.wv.similarity() instead. Refer to the documentation for gensim.models.KeyedVectors.similarity

sort_vocab()

Sort the vocabulary so the most frequent words have the lowest indexes.

train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0)
update_weights()

Copy all the existing weights, and reset the weights for the newly added vocabulary.

wmdistance(document1, document2)

Deprecated. Use self.wv.wmdistance() instead. Refer to the documentation for gensim.models.KeyedVectors.wmdistance

word_vec(word, use_norm=False)
gensim.models.fasttext.train_batch_cbow(model, sentences, alpha, work=None, neu1=None)
gensim.models.fasttext.train_batch_sg(model, sentences, alpha, work=None)