FastText Model
Introduces Gensim’s fastText model and demonstrates its use on the Lee Corpus.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Here, we’ll learn to work with the fastText library for training word-embedding models, saving and loading them, and performing similarity operations and vector lookups analogous to Word2Vec.
When to use fastText?
The main principle behind fastText is that the morphological structure of a word carries important information about the meaning of the word. Such structure is not taken into account by traditional word embeddings like Word2Vec, which train a unique word embedding for every individual word. This is especially significant for morphologically rich languages (German, Turkish) in which a single word can have a large number of morphological forms, each of which might occur rarely, thus making it hard to train good word embeddings.
fastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
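The decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Gensim's actual implementation: the helper name `char_ngrams` is hypothetical, and wrapping the word in `<` and `>` boundary markers is assumed here, following the convention of the original fastText tool.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, shortest first."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


ngrams = char_ngrams("night")
print(len(ngrams), ngrams[:5])
```

Each of these n-grams would own a trainable vector, and the word vector is built from them.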
According to a detailed comparison of Word2Vec and fastText in this notebook, fastText does significantly better on syntactic tasks as compared to the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms fastText on semantic tasks though. The differences grow smaller as the size of the training corpus increases.
fastText can obtain vectors even for out-of-vocabulary (OOV) words by summing the vectors of their component char-ngrams, provided at least one of the char-ngrams was present in the training data.
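As a rough sketch of that OOV lookup, assume a small table of already-trained n-gram vectors. The table, its 2-dimensional values, and the helper `oov_vector` are all made up for illustration, and only 3-grams are used to keep the example short. A real implementation may also normalise the result, for example by averaging instead of summing.

```python
# Toy table of learned n-gram vectors (made-up values, 2 dimensions).
ngram_vectors = {
    "<ni": [0.5, 0.0],
    "nig": [0.25, 0.5],
    "igh": [0.0, 0.25],
    "ght": [0.25, 0.25],
}


def oov_vector(word, table, n=3):
    """Sum the vectors of the word's n-grams that appear in `table`."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    known = [table[g] for g in grams if g in table]
    if not known:
        # No n-gram was seen during training, so no vector can be built.
        raise KeyError(f"no known n-grams for {word!r}")
    # Add the known vectors component by component.
    return [sum(component) for component in zip(*known)]


print(oov_vector("nights", ngram_vectors))
```

The n-grams `hts` and `ts>` are missing from the toy table and are simply skipped. A word sharing no n-gram with the table raises `KeyError`.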
Training models
For the following examples, we’ll use the Lee Corpus (which you already have if you’ve installed Gensim) for training our model.
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath
# Set the file name for the training data
corpus_file = datapath('lee_background.cor')
model = FastText(vector_size=100)
# build the vocabulary
model.build_vocab(corpus_file=corpus_file)
# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)
print(model)
2022-10-23 11:05:20,779 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-10-23 11:05:20,779 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2022-10-23 11:05:20,782 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2022-10-23T11:05:20.780094', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'created'}
2022-10-23 11:05:20,858 : INFO : FastText lifecycle event {'params': 'FastText<vocab=0, vector_size=100, alpha=0.025>', 'datetime': '2022-10-23T11:05:20.858457', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'created'}
2022-10-23 11:05:20,858 : INFO : collecting all words and their counts
2022-10-23 11:05:20,858 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-10-23 11:05:20,874 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2022-10-23 11:05:20,874 : INFO : Creating a fresh vocabulary
2022-10-23 11:05:20,882 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2022-10-23T11:05:20.882842', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'prepare_vocab'}
2022-10-23 11:05:20,882 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2022-10-23T11:05:20.882944', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'prepare_vocab'}
2022-10-23 11:05:20,892 : INFO : deleting the raw counts dictionary of 10781 items
2022-10-23 11:05:20,892 : INFO : sample=0.001 downsamples 45 most-common words
2022-10-23 11:05:20,893 : INFO : FastText lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2022-10-23T11:05:20.893011', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'prepare_vocab'}
2022-10-23 11:05:20,927 : INFO : estimated required memory for 1762 words, 2000000 buckets and 100 dimensions: 802597824 bytes
2022-10-23 11:05:20,927 : INFO : resetting layer weights
2022-10-23 11:05:22,169 : INFO : FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-10-23T11:05:22.169699', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'build_vocab'}
2022-10-23 11:05:22,169 : INFO : FastText lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2022-10-23T11:05:22.169966', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'train'}
2022-10-23 11:05:22,273 : INFO : EPOCH 0: training on 60387 raw words (32958 effective words) took 0.1s, 355842 effective words/s
2022-10-23 11:05:22,369 : INFO : EPOCH 1: training on 60387 raw words (32906 effective words) took 0.1s, 369792 effective words/s
2022-10-23 11:05:22,466 : INFO : EPOCH 2: training on 60387 raw words (32863 effective words) took 0.1s, 361340 effective words/s
2022-10-23 11:05:22,563 : INFO : EPOCH 3: training on 60387 raw words (32832 effective words) took 0.1s, 363904 effective words/s
2022-10-23 11:05:22,662 : INFO : EPOCH 4: training on 60387 raw words (32827 effective words) took 0.1s, 355536 effective words/s
2022-10-23 11:05:22,662 : INFO : FastText lifecycle event {'msg': 'training on 301935 raw words (164386 effective words) took 0.5s, 333704 effective words/s', 'datetime': '2022-10-23T11:05:22.662680', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'train'}
<gensim.models.fasttext.FastText object at 0x7f112f39db70>
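Training hyperparameters
Hyperparameters for training the model follow the same pattern as Word2Vec. Note that Gensim’s FastText class exposes the architecture, loss and thread settings as sg, hs and workers, as the training log above shows. FastText supports the following parameters from the original word2vec: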
model: Training architecture. Allowed values: cbow, skipgram (Default cbow)
vector_size: Dimensionality of vector embeddings to be learnt (Default 100)
alpha: Initial learning rate (Default 0.025)
window: Context window size (Default 5)
min_count: Ignore words with number of occurrences below this (Default 5)
loss: Training objective. Allowed values: ns, hs, softmax (Default ns)
sample: Threshold for downsampling higher-frequency words (Default 0.001)
negative: Number of negative words to sample, for ns (Default 5)
epochs: Number of epochs (Default 5)
sorted_vocab: Sort vocab by descending frequency (Default 1)
threads: Number of threads to use (Default 12)
In addition, fastText has three parameters of its own:
min_n: min length of char ngrams (Default 3)
max_n: max length of char ngrams (Default 6)
bucket: number of buckets used for hashing ngrams (Default 2000000)
Parameters min_n and max_n control the lengths of character ngrams that each word is broken down into while training and looking up embeddings. If max_n is set to 0, or to be less than min_n, no character ngrams are used, and the model effectively reduces to Word2Vec.
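The names in the lists above do not all match the keyword arguments of Gensim's `FastText` class, which uses `sg`, `hs`, `negative` and `workers` (the training log earlier reports `sg=0 hs=0` and the worker count). The helper below is a hypothetical translation layer, not part of Gensim. It takes its defaults from the list above and handles only the `ns` and `hs` losses.

```python
def to_gensim_kwargs(model="cbow", loss="ns", threads=12, negative=5):
    """Translate the option names listed above into Gensim-style kwargs."""
    if model not in ("cbow", "skipgram"):
        raise ValueError(f"unknown model: {model!r}")
    if loss not in ("ns", "hs"):
        # Other loss values are not handled in this sketch.
        raise ValueError(f"unsupported loss in this sketch: {loss!r}")
    return {
        "sg": 1 if model == "skipgram" else 0,       # 1 = skip-gram, 0 = CBOW
        "hs": 1 if loss == "hs" else 0,              # 1 = hierarchical softmax
        "negative": negative if loss == "ns" else 0,  # negative samples for ns
        "workers": threads,
    }


print(to_gensim_kwargs(model="skipgram", loss="hs", threads=4))
```

The resulting dictionary could then be unpacked into the constructor, for example `FastText(vector_size=100, **to_gensim_kwargs())`.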
To bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers in 1 to K. For hashing these character sequences, the Fowler-Noll-Vo hashing function (FNV-1a variant) is employed.
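To make the hashing step concrete, here is a sketch of the standard 32-bit FNV-1a hash followed by a modulo into `num_buckets` slots. The function names are illustrative and the bucket index is zero-based. The values produced by fastText or Gensim may differ in low-level details such as byte signedness, so treat the output as illustrative only.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the hash within 32 bits
    return h


def ngram_bucket(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map an n-gram to a bucket index in the range [0, num_buckets)."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets


for gram in ("<ni", "nig", "ght"):
    print(gram, ngram_bucket(gram))
```

Because the number of buckets is fixed, the n-gram matrix has a bounded size no matter how many distinct n-grams occur. The price is that unrelated n-grams can collide and share a vector.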
Note: You can continue to train your model while using Gensim’s native implementation of fastText.
Saving/loading models
Models can be saved and loaded via the load and save methods, just like any other model in Gensim.
# Save a model trained via Gensim's fastText implementation to temp.
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])
# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)
os.unlink(tmp.name) # demonstration complete, don't need the temp file anymore
2022-10-23 11:05:22,826 : INFO : FastText lifecycle event {'fname_or_handle': '/tmp/saved_model_gensim-grsw1xyt', 'separately': '[]', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-23T11:05:22.826086', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'saving'}
2022-10-23 11:05:22,827 : INFO : storing np array 'vectors_ngrams' to /tmp/saved_model_gensim-grsw1xyt.wv.vectors_ngrams.npy
2022-10-23 11:05:24,259 : INFO : not storing attribute vectors
2022-10-23 11:05:24,259 : INFO : not storing attribute buckets_word
2022-10-23 11:05:24,260 : INFO : not storing attribute cum_table
2022-10-23 11:05:24,289 : INFO : saved /tmp/saved_model_gensim-grsw1xyt
2022-10-23 11:05:24,289 : INFO : loading FastText object from /tmp/saved_model_gensim-grsw1xyt
2022-10-23 11:05:24,292 : INFO : loading wv recursively from /tmp/saved_model_gensim-grsw1xyt.wv.* with mmap=None
2022-10-23 11:05:24,292 : INFO : loading vectors_ngrams from /tmp/saved_model_gensim-grsw1xyt.wv.vectors_ngrams.npy with mmap=None
2022-10-23 11:05:24,594 : INFO : setting ignored attribute vectors to None
2022-10-23 11:05:24,594 : INFO : setting ignored attribute buckets_word to None
2022-10-23 11:05:24,673 : INFO : setting ignored attribute cum_table to None
2022-10-23 11:05:24,689 : INFO : FastText lifecycle event {'fname': '/tmp/saved_model_gensim-grsw1xyt', 'datetime': '2022-10-23T11:05:24.689800', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'loaded'}
<gensim.models.fasttext.FastText object at 0x7f112f022620>
The save_word2vec_format method is also available for fastText models, but it causes all ngram vectors to be lost. As a result, a model loaded this way will behave as a regular word2vec model.
Word vector lookup
All information necessary for looking up fastText words (incl. OOV words) is contained in its model.wv attribute.
If you don’t need to continue training your model, you can export & save this .wv attribute and discard model, to save space and RAM.
wv = model.wv
print(wv)
#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)
<gensim.models.fasttext.FastTextKeyedVectors object at 0x7f112f39c2e0>
True
print('nights' in wv.key_to_index)
False
print(wv['night'])
array([-0.19996722, 0.1813906 , -0.2631422 , -0.09450997, 0.0605551 ,
0.38595745, 0.30778143, 0.5067505 , 0.23698695, -0.23913051,
0.02506454, -0.15320891, -0.2434152 , 0.52560467, -0.38980618,
-0.55800015, 0.19291814, -0.23110117, -0.43341738, -0.53108984,
-0.4688596 , -0.04782811, -0.46767992, -0.1137548 , -0.20153292,
-0.31324366, -0.6708753 , -0.10945056, -0.31843412, 0.26011363,
-0.32820454, 0.32238692, 0.8404276 , -0.2502807 , 0.19792764,
0.37759355, 0.40180317, -0.09189364, -0.36985794, -0.33649284,
0.46887243, -0.43174997, 0.04100857, -0.39025533, -0.51651365,
-0.32087606, -0.05997978, 0.14294061, 0.360094 , -0.02155857,
0.37047735, -0.44327876, 0.28450134, -0.4054028 , -0.19731535,
-0.21376207, -0.1685454 , -0.12901361, 0.03528974, -0.35231775,
-0.35454988, -0.43326724, -0.21185161, 0.3519939 , -0.11108 ,
0.69391364, 0.05785353, 0.05663215, 0.42399758, 0.24977471,
-0.24918619, 0.3934391 , 0.5109367 , -0.6553013 , 0.33610865,
-0.09825795, 0.25878346, -0.03377685, 0.06902322, 0.37547323,
0.17450804, -0.5030028 , -0.82190335, -0.15457787, -0.12070727,
-0.78729135, 0.49075758, 0.19234893, -0.01774574, -0.28116694,
-0.02472195, 0.40292844, -0.14185381, 0.07625303, -0.20744859,
0.59728205, -0.2217386 , -0.29148448, -0.01873052, -0.2401561 ],
dtype=float32)
print(wv['nights'])
array([-0.17333212, 0.15747589, -0.22726758, -0.08140025, 0.05103909,
0.33196837, 0.2670658 , 0.43939307, 0.205082 , -0.20810795,
0.02336278, -0.13075203, -0.21126968, 0.45168898, -0.33789524,
-0.48235178, 0.16582203, -0.19900155, -0.3727986 , -0.4591713 ,
-0.401847 , -0.04239817, -0.40366223, -0.09961417, -0.17264459,
-0.26896393, -0.57774097, -0.09225026, -0.27459562, 0.22605109,
-0.28136173, 0.27779424, 0.72365224, -0.21562205, 0.17094932,
0.3253317 , 0.34816158, -0.07930711, -0.31941393, -0.29101238,
0.40383977, -0.3717381 , 0.03487907, -0.33628452, -0.4465965 ,
-0.27571818, -0.0488493 , 0.12399682, 0.31216368, -0.01752434,
0.32131058, -0.38280696, 0.24619998, -0.34979105, -0.16987896,
-0.18326469, -0.14740779, -0.1095791 , 0.03177686, -0.30144197,
-0.30499157, -0.37426412, -0.18248272, 0.3032632 , -0.09528783,
0.59990335, 0.05005969, 0.04626458, 0.36565247, 0.21673569,
-0.2155152 , 0.33764148, 0.4421136 , -0.56542957, 0.29158652,
-0.08375975, 0.22272962, -0.02998246, 0.05934277, 0.3240713 ,
0.1511237 , -0.43450487, -0.7087094 , -0.13446207, -0.10318276,
-0.6806781 , 0.42355484, 0.1661925 , -0.01327086, -0.2432955 ,
-0.02126789, 0.34654808, -0.12292334, 0.06645596, -0.1795192 ,
0.5156855 , -0.19275527, -0.24794976, -0.01581961, -0.2081413 ],
dtype=float32)
Similarity operations
Similarity operations work the same way as word2vec. Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.
print("nights" in wv.key_to_index)
False
print("night" in wv.key_to_index)
True
print(wv.similarity("night", "nights"))
0.9999918
Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided here.
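The amount of overlap is easy to check for the pair used above. The sketch below is an illustration, not Gensim code. It assumes n-gram lengths 3 to 6 and `<`/`>` boundary markers, and `ngram_set` is a hypothetical helper. It counts how many character n-grams "night" and "nights" have in common.

```python
def ngram_set(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


night, nights = ngram_set("night"), ngram_set("nights")
shared = night & nights
# Jaccard overlap: shared n-grams divided by all distinct n-grams.
jaccard = len(shared) / len(night | nights)
print(f"{len(shared)} shared n-grams, Jaccard overlap {jaccard:.2f}")
```

Most of the n-grams of "night" also occur in "nights", so the two word vectors are built largely from the same components.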
Other similarity operations
The example training corpus is a toy corpus, so results are not expected to be good; it serves as a proof of concept only.
print(wv.most_similar("nights"))
[('night', 0.9999918341636658),
('rights', 0.9999877214431763),
('flights', 0.9999877214431763),
('overnight', 0.999987006187439),
('fighting', 0.9999857544898987),
('fighters', 0.9999855160713196),
('fight', 0.9999852180480957),
('entered', 0.9999851584434509),
('fighter', 0.999984860420227),
('eight', 0.999984622001648)]
print(wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))
0.99993986
print(wv.doesnt_match("breakfast cereal dinner lunch".split()))
'lunch'
print(wv.most_similar(positive=['baghdad', 'england'], negative=['london']))
[('find', 0.9996394515037537),
('capital,', 0.999639093875885),
('findings', 0.9996339082717896),
('seekers.', 0.9996323585510254),
('field', 0.9996322393417358),
('finding', 0.9996311664581299),
('had', 0.9996305704116821),
('abuse', 0.9996281862258911),
('storm', 0.9996268153190613),
('heading', 0.9996247291564941)]
print(wv.evaluate_word_analogies(datapath('questions-words.txt')))
2022-10-23 11:05:26,790 : INFO : Evaluating word analogies for top 300000 words in the model on /home/thomas/Documents/FOSS/gensim-tlouf/gensim/test/test_data/questions-words.txt
2022-10-23 11:05:26,814 : INFO : family: 0.0% (0/2)
2022-10-23 11:05:26,822 : INFO : gram3-comparative: 8.3% (1/12)
2022-10-23 11:05:26,827 : INFO : gram4-superlative: 33.3% (4/12)
2022-10-23 11:05:26,832 : INFO : gram5-present-participle: 45.0% (9/20)
2022-10-23 11:05:26,845 : INFO : gram6-nationality-adjective: 30.0% (6/20)
2022-10-23 11:05:26,851 : INFO : gram7-past-tense: 5.0% (1/20)
2022-10-23 11:05:26,856 : INFO : gram8-plural: 33.3% (4/12)
2022-10-23 11:05:26,858 : INFO : Quadruplets with out-of-vocabulary words: 99.5%
2022-10-23 11:05:26,859 : INFO : NB: analogies containing OOV words were skipped from evaluation! To change this behavior, use "dummy4unknown=True"
2022-10-23 11:05:26,859 : INFO : Total accuracy: 25.5% (25/98)
(0.25510204081632654,
[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
{'correct': [], 'incorrect': [], 'section': 'capital-world'},
{'correct': [], 'incorrect': [], 'section': 'currency'},
{'correct': [], 'incorrect': [], 'section': 'city-in-state'},
{'correct': [],
'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
'section': 'family'},
{'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
{'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
{'correct': [('LONG', 'LONGER', 'GREAT', 'GREATER')],
'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER')],
'section': 'gram3-comparative'},
{'correct': [('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST')],
'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST')],
'section': 'gram4-superlative'},
{'correct': [('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING')],
'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'RUN', 'RUNNING')],
'section': 'gram5-present-participle'},
{'correct': [('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN')],
'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI')],
'section': 'gram6-nationality-adjective'},
{'correct': [('PAYING', 'PAID', 'SAYING', 'SAID')],
'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID')],
'section': 'gram7-past-tense'},
{'correct': [('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN')],
'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CAR', 'CARS')],
'section': 'gram8-plural'},
{'correct': [], 'incorrect': [], 'section': 'gram9-plural-verbs'},
{'correct': [('LONG', 'LONGER', 'GREAT', 'GREATER'),
('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING'),
('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
('PAYING', 'PAID', 'SAYING', 'SAID'),
('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN')],
'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
('HIS', 'HER', 'HE', 'SHE'),
('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER'),
('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'RUN', 'RUNNING'),
('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID'),
('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CAR', 'CARS')],
'section': 'Total accuracy'}])
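The total accuracy in the log is just the pooled ratio of correct analogies over all evaluated ones. The short check below recomputes it from the per-section counts, copied by hand from the log lines above. Sections with no evaluated quadruplets are omitted.

```python
# (correct, evaluated) per section, copied from the evaluation log above.
sections = {
    "family": (0, 2),
    "gram3-comparative": (1, 12),
    "gram4-superlative": (4, 12),
    "gram5-present-participle": (9, 20),
    "gram6-nationality-adjective": (6, 20),
    "gram7-past-tense": (1, 20),
    "gram8-plural": (4, 12),
}

correct = sum(c for c, _ in sections.values())
evaluated = sum(n for _, n in sections.values())
accuracy = correct / evaluated
print(f"{correct}/{evaluated} = {accuracy:.1%}")
```

Only 98 analogies are scored because, as the log notes, quadruplets containing out-of-vocabulary words are skipped by default.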
Word Movers distance
You’ll need the optional POT library for this section: pip install POT.
Let’s start with two sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
Remove their stopwords.
from gensim.parsing.preprocessing import STOPWORDS
sentence_obama = [w for w in sentence_obama if w not in STOPWORDS]
sentence_president = [w for w in sentence_president if w not in STOPWORDS]
Compute the Word Movers Distance between the two sentences.
distance = wv.wmdistance(sentence_obama, sentence_president)
print(f"Word Movers Distance is {distance} (lower means closer)")
2022-10-23 11:05:27,139 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-10-23 11:05:27,140 : INFO : built Dictionary<8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...> from 2 documents (total 8 corpus positions)
2022-10-23 11:05:27,140 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...> from 2 documents (total 8 corpus positions)", 'datetime': '2022-10-23T11:05:27.140129', 'gensim': '4.2.1.dev0', 'python': '3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]', 'platform': 'Linux-5.19.0-76051900-generic-x86_64-with-glibc2.35', 'event': 'created'}
'Word Movers Distance is 0.01600033861640832 (lower means closer)'
That’s all! You’ve made it to the end of this tutorial.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('fasttext-logo-color-web.png')
imgplot = plt.imshow(img)
_ = plt.axis('off')