core/run_topics_and_transformations.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nTopics and Transformations\n===========================\n\nIntroduces transformations and demonstrates their use on a toy corpus.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, I will show how to transform documents from one vector representation\ninto another. This process serves two goals:\n\n1. To bring out hidden structure in the corpus, discover relationships between\n words and use them to describe the documents in a new and\n (hopefully) more semantic way.\n2. To make the document representation more compact. This improves both efficiency\n (the new representation consumes fewer resources) and efficacy (marginal data\n trends are ignored, reducing noise).\n\nCreating the Corpus\n-------------------\n\nFirst, we need to create a corpus to work with.\nThis step is the same as in the previous tutorial;\nif you completed it, feel free to skip to the next section.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import defaultdict\nfrom gensim import corpora\n\ndocuments = [\n \"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A survey\",\n]\n\n# remove common words and tokenize\nstoplist = set('for a of the and to in'.split())\ntexts = [\n [word for word in document.lower().split() if word not in stoplist]\n for document in documents\n]\n\n# remove words that appear only once\nfrequency = defaultdict(int)\nfor text in texts:\n for token in text:\n frequency[token] += 1\n\ntexts = [\n [token for token in text if frequency[token] > 1]\n for text in texts\n]\n\ndictionary = corpora.Dictionary(texts)\ncorpus = [dictionary.doc2bow(text) for text in texts]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a transformation\n++++++++++++++++++++++++++\n\nThe transformations are standard Python objects, typically initialized by means of\na :dfn:`training corpus`:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import models\n\ntfidf = models.TfidfModel(corpus) # step 1 -- initialize a model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different\ntransformations may require different initialization parameters; in case of TfIdf, the\n\"training\" consists simply of going through the supplied corpus once and computing document frequencies\nof all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet\nAllocation, is much more involved and, consequently, takes much more time.\n\n.. note::\n  Transformations always convert between two specific vector\n  spaces. The same vector space (= the same set of feature ids) must be used for training\n  as well as for subsequent vector transformations. Failure to use the same input\n  feature space, such as applying a different string preprocessing, using different\n  feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will\n  result in feature mismatch during transformation calls and consequently in either\n  garbage output and/or runtime exceptions.\n\nTransforming vectors\n+++++++++++++++++++++\n\nFrom now on, ``tfidf`` is treated as a read-only object that can be used to convert\nany vector from the old representation (bag-of-words integer counts) to the new representation\n(TfIdf real-valued weights):\n\n"
]
},
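The document-frequency "training" described above can be sketched in a few lines of plain Python. This is a stdlib-only illustration of the default idf weighting gensim documents (`idf = log2(total_docs / doc_freq)`); `toy_corpus` below is a made-up stand-in, not the corpus built earlier:

```python
import math
from collections import defaultdict

def train_idf(corpus):
    """One pass over the corpus: count document frequencies, derive idf weights."""
    doc_freq = defaultdict(int)
    num_docs = 0
    for bow in corpus:  # each document is a list of (token_id, count) pairs
        num_docs += 1
        for token_id, _count in bow:
            doc_freq[token_id] += 1
    # default gensim-style weighting: idf = log2(total_docs / doc_freq)
    return {tid: math.log2(num_docs / df) for tid, df in doc_freq.items()}

toy_corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]
idfs = train_idf(toy_corpus)
# token 1 appears in both documents, so its idf is log2(2/2) = 0.0
```

Note how a token present in every document gets zero weight: tf-idf deliberately discounts features that carry no discriminating power.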
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"doc_bow = [(0, 1), (1, 1)]\nprint(tfidf[doc_bow]) # step 2 -- use the model to transform vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or to apply a transformation to a whole corpus:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus_tfidf = tfidf[corpus]\nfor doc in corpus_tfidf:\n print(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this particular case, we are transforming the same corpus that we used\nfor training, but this is only incidental. Once the transformation model has been initialized,\nit can be used on any vectors (provided they come from the same vector space, of course),\neven if they were not used in the training corpus at all. This is achieved by a process called\nfolding-in for LSA, by topic inference for LDA etc.\n\n.. note::\n  Calling ``model[corpus]`` only creates a wrapper around the old ``corpus``\n  document stream -- actual conversions are done on-the-fly, during document iteration.\n  We cannot convert the entire corpus at the time of calling ``corpus_transformed = model[corpus]``,\n  because that would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence.\n  If you will be iterating over the transformed ``corpus_transformed`` multiple times, and the\n  transformation is costly, serialize the resulting corpus to disk first and continue\n  using that.\n\nTransformations can also be serialized, one on top of another, in a sort of chain:\n\n"
]
},
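The on-the-fly behaviour described in the note can be mimicked with a tiny stand-in class. This is a hypothetical, stdlib-only sketch of the lazy-wrapper idea, not gensim's actual wrapper class:

```python
class LazyTransformedCorpus:
    """Applies `transform` to each document only when iterated over."""
    def __init__(self, transform, corpus):
        self.transform = transform  # stored, not applied -- nothing computed yet
        self.corpus = corpus
    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)  # conversion happens here, one doc at a time

# double every weight in a toy sparse corpus
double = LazyTransformedCorpus(
    lambda doc: [(i, 2 * w) for i, w in doc],
    [[(0, 1)], [(1, 3)]],
)
print(list(double))  # [[(0, 2)], [(1, 6)]]
```

Constructing the wrapper is free; the cost is paid (again) on every iteration, which is why serializing an expensive transformed corpus to disk pays off when you iterate more than once.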
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation\ncorpus_lsi = lsi_model[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing `_\ninto a latent 2-D space (2-D because we set ``num_topics=2``). Now you're probably wondering: what do these two latent\ndimensions stand for? Let's inspect with :func:`models.LsiModel.print_topics`:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"lsi_model.print_topics(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(the topics are printed to log -- see the note at the top of this page about activating\nlogging)\n\nIt appears that according to LSI, \"trees\", \"graph\" and \"minors\" are all related\nwords (and contribute the most to the direction of the first topic), while the\nsecond topic practically concerns itself with all the other words. As expected,\nthe first five documents are more strongly related to the second topic, while the\nremaining four documents relate to the first topic:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly\nfor doc, as_text in zip(corpus_lsi, documents):\n print(doc, as_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model persistency is achieved with the :func:`save` and :func:`load` functions:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\nimport tempfile\n\nwith tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:\n lsi_model.save(tmp.name) # same for tfidf, lda, ...\n\nloaded_lsi_model = models.LsiModel.load(tmp.name)\n\nos.unlink(tmp.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next question might be: just how exactly similar are those documents to each other?\nIs there a way to formalize the similarity, so that for a given input document, we can\norder some other set of documents according to their similarity? Similarity queries\nare covered in the next tutorial (`sphx_glr_auto_examples_core_run_similarity_queries.py`).\n\n\nAvailable transformations\n--------------------------\n\nGensim implements several popular Vector Space Model algorithms:\n\n* `Term Frequency * Inverse Document Frequency, Tf-Idf `_\n expects a bag-of-words (integer values) training corpus during initialization.\n During transformation, it will take a vector and return another vector of the\n same dimensionality, except that features which were rare in the training corpus\n will have their value increased.\n It therefore converts integer-valued vectors into real-valued ones, while leaving\n the number of dimensions intact. It can also optionally normalize the resulting\n vectors to (Euclidean) unit length.\n\n .. sourcecode:: pycon\n\n model = models.TfidfModel(corpus, normalize=True)\n\n* `Latent Semantic Indexing, LSI (or sometimes LSA) `_\n transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into\n a latent space of a lower dimensionality. For the toy corpus above we used only\n 2 latent dimensions, but on real corpora, target dimensionality of 200--500 is recommended\n as a \"golden standard\" [1]_.\n\n .. sourcecode:: pycon\n\n model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)\n\n LSI training is unique in that we can continue \"training\" at any point, simply\n by providing more training documents. This is done by incremental updates to\n the underlying model, in a process called `online training`. Because of this feature, the\n input document stream may even be infinite -- just keep feeding LSI new documents\n as they arrive, while using the computed transformation model as read-only in the meanwhile!\n\n .. sourcecode:: pycon\n\n model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus\n lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model\n\n model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents\n lsi_vec = model[tfidf_vec]\n\n See the :mod:`gensim.models.lsimodel` documentation for details on how to make\n LSI gradually \"forget\" old observations in infinite streams. If you want to get dirty,\n there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical\n precision of the LSI algorithm.\n\n `gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),\n which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm\n from Halko et al. [4]_ internally, to accelerate the in-core part\n of the computations.\n See also `wiki` for further speed-ups by distributing the computation across\n a cluster of computers.\n\n* `Random Projections, RP `_ aim to\n reduce vector space dimensionality. This is a very efficient (both memory- and\n CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.\n Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.\n\n .. sourcecode:: pycon\n\n model = models.RpModel(tfidf_corpus, num_topics=500)\n\n* `Latent Dirichlet Allocation, LDA `_\n is yet another transformation from bag-of-words counts into a topic space of lower\n dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA),\n so LDA's topics can be interpreted as probability distributions over words. These distributions are,\n just like with LSA, inferred automatically from a training corpus. Documents\n are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).\n\n .. sourcecode:: pycon\n\n model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)\n\n `gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,\n modified to run in `distributed mode ` on a cluster of computers.\n\n* `Hierarchical Dirichlet Process, HDP `_\n is a non-parametric Bayesian method (note the missing number of requested topics):\n\n .. sourcecode:: pycon\n\n model = models.HdpModel(corpus, id2word=dictionary)\n\n `gensim` uses a fast, online implementation based on [3]_.\n The HDP model is a new addition to `gensim`, and still rough around its academic edges -- use with care.\n\nAdding new :abbr:`VSM (Vector Space Model)` transformations (such as different weighting schemes) is rather trivial;\nsee the `apiref` or directly the `Python code `_\nfor more info and examples.\n\nIt is worth repeating that these are all unique, **incremental** implementations,\nwhich do not require the whole training corpus to be present in main memory all at once.\nWith memory taken care of, I am now improving `distributed`,\nto improve CPU efficiency, too.\nIf you feel you could contribute by testing, providing use-cases or code, see the `Gensim Developer guide `__.\n\nWhat Next?\n----------\n\nContinue on to the next tutorial on `sphx_glr_auto_examples_core_run_similarity_queries.py`.\n\nReferences\n----------\n\n.. [1] Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.\n\n.. [2] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.\n\n.. [3] Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.\n\n.. [4] Halko, Martinsson, Tropp. 2009. Finding structure with randomness.\n\n.. [5] \u0158eh\u016f\u0159ek. 2011. Subspace tracking for Latent Semantic Analysis.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('run_topics_and_transformations.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
core/run_similarity_queries.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nSimilarity Queries\n==================\n\nDemonstrates querying a corpus for similar documents.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the Corpus\n-------------------\n\nFirst, we need to create a corpus to work with.\nThis step is the same as in the previous tutorial;\nif you completed it, feel free to skip to the next section.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import defaultdict\nfrom gensim import corpora\n\ndocuments = [\n \"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A survey\",\n]\n\n# remove common words and tokenize\nstoplist = set('for a of the and to in'.split())\ntexts = [\n [word for word in document.lower().split() if word not in stoplist]\n for document in documents\n]\n\n# remove words that appear only once\nfrequency = defaultdict(int)\nfor text in texts:\n for token in text:\n frequency[token] += 1\n\ntexts = [\n [token for token in text if frequency[token] > 1]\n for text in texts\n]\n\ndictionary = corpora.Dictionary(texts)\ncorpus = [dictionary.doc2bow(text) for text in texts]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarity interface\n--------------------\n\nIn the previous tutorials on\n`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py`\nand\n`sphx_glr_auto_examples_core_run_topics_and_transformations.py`,\nwe covered what it means to create a corpus in the Vector Space Model and how\nto transform it between different vector spaces. A common reason for such a\ncharade is that we want to determine **similarity between pairs of\ndocuments**, or the **similarity between a specific document and a set of\nother documents** (such as a user query vs. indexed documents).\n\nTo show how this can be done in gensim, let us consider the same corpus as in the\nprevious examples (which really originally comes from Deerwester et al.'s\n`\"Indexing by Latent Semantic Analysis\" `_\nseminal 1990 article).\nTo follow Deerwester's example, we first use this tiny corpus to define a 2-dimensional\nLSI space:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import models\nlsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the purposes of this tutorial, there are only two things you need to know about LSI.\nFirst, it's just another transformation: it transforms vectors from one space to another.\nSecond, the benefit of LSI is that it enables identifying patterns and relationships between terms (in our case, words in a document) and topics.\nOur LSI space is two-dimensional (`num_topics = 2`) so there are two topics, but this is arbitrary.\nIf you're interested, you can read more about LSI here: `Latent Semantic Indexing `_.\n\nNow suppose a user typed in the query `\"Human computer interaction\"`. We would\nlike to sort our nine corpus documents in decreasing order of relevance to this query.\nUnlike modern search engines, here we only concentrate on a single aspect of possible\nsimilarities---on apparent semantic relatedness of their texts (words). No hyperlinks,\nno random-walk static ranks, just a semantic extension over the boolean keyword match:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"doc = \"Human computer interaction\"\nvec_bow = dictionary.doc2bow(doc.lower().split())\nvec_lsi = lsi[vec_bow] # convert the query to LSI space\nprint(vec_lsi)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we will be considering `cosine similarity `_\nto determine the similarity of two vectors. Cosine similarity is a standard measure\nin Vector Space Modeling, but wherever the vectors represent probability distributions,\n`different similarity measures `_\nmay be more appropriate.\n\nInitializing query structures\n++++++++++++++++++++++++++++++++\n\nTo prepare for similarity queries, we need to enter all documents which we want\nto compare against subsequent queries. In our case, they are the same nine documents\nused for training LSI, converted to 2-D LSA space. But that's only incidental, we\nmight also be indexing a different corpus altogether.\n\n"
]
},
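For intuition, cosine similarity over dense vectors takes only a few lines of plain Python. This is a stdlib-only sketch of the measure itself, not gensim's optimized index:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Because the measure depends only on the angle between vectors, document length drops out: a short query and a long document can still score as highly similar.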
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import similarities\nindex = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".. warning::\n  The class :class:`similarities.MatrixSimilarity` is only appropriate when the whole\n  set of vectors fits into memory. For example, a corpus of one million documents\n  would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.\n\n  Without 2GB of free RAM, you would need to use the :class:`similarities.Similarity` class.\n  This class operates in fixed memory, by splitting the index across multiple files on disk, called shards.\n  It uses :class:`similarities.MatrixSimilarity` and :class:`similarities.SparseMatrixSimilarity` internally,\n  so it is still fast, although slightly more complex.\n\nIndex persistency is handled via the standard :func:`save` and :func:`load` functions:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"index.save('/tmp/deerwester.index')\nindex = similarities.MatrixSimilarity.load('/tmp/deerwester.index')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is true for all similarity indexing classes (:class:`similarities.Similarity`,\n:class:`similarities.MatrixSimilarity` and :class:`similarities.SparseMatrixSimilarity`).\nAlso in the following, `index` can be an object of any of these. When in doubt,\nuse :class:`similarities.Similarity`, as it is the most scalable version, and it also\nsupports adding more documents to the index later.\n\nPerforming queries\n++++++++++++++++++\n\nTo obtain similarities of our query document against the nine indexed documents:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sims = index[vec_lsi] # perform a similarity query against the corpus\nprint(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cosine measure returns similarities in the range `<-1, 1>` (the greater, the more similar),\nso that the first document has a score of 0.99809301 etc.\n\nWith some standard Python magic we sort these similarities into descending\norder, and obtain the final answer to the query `\"Human computer interaction\"`:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sims = sorted(enumerate(sims), key=lambda item: -item[1])\nfor doc_position, doc_score in sims:\n print(doc_score, documents[doc_position])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The thing to note here is that documents no. 2 (``\"The EPS user interface management system\"``)\nand 4 (``\"Relation of user perceived response time to error measurement\"``) would never be returned by\na standard boolean fulltext search, because they do not share any common words with ``\"Human\ncomputer interaction\"``. However, after applying LSI, we can observe that both of\nthem received quite high similarity scores (no. 2 is actually the most similar!),\nwhich corresponds better to our intuition of\nthem sharing a \"computer-human\" related topic with the query. In fact, this semantic\ngeneralization is the reason why we apply transformations and do topic modelling\nin the first place.\n\nWhere next?\n------------\n\nCongratulations, you have finished the tutorials -- now you know how gensim works :-)\nTo delve into more details, you can browse through the `apiref`,\nsee the `wiki` or perhaps check out `distributed` in `gensim`.\n\nGensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production.\nThat doesn't mean it's perfect though:\n\n* there are parts that could be implemented more efficiently (in C, for example), or make better use of parallelism (multiple machines / cores)\n* new algorithms are published all the time; help gensim keep up by `discussing them `_ and `contributing code `_\n* your **feedback is most welcome** and appreciated (and it's not just the code!):\n `bug reports `_ or\n `user stories and general questions `_.\n\nGensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields.\nIts mission is to help NLP practitioners try out popular topic modelling algorithms\non large datasets easily, and to facilitate prototyping of new algorithms for researchers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('run_similarity_queries.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
core/run_core_concepts.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nCore Concepts\n=============\n\nThis tutorial introduces Documents, Corpora, Vectors and Models: the basic concepts and terms needed to understand and use gensim.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core concepts of ``gensim`` are:\n\n1. `core_concepts_document`: some text.\n2. `core_concepts_corpus`: a collection of documents.\n3. `core_concepts_vector`: a mathematically convenient representation of a document.\n4. `core_concepts_model`: an algorithm for transforming vectors from one representation to another.\n\nLet's examine each of these in slightly more detail.\n\n\nDocument\n--------\n\nIn Gensim, a *document* is an object of the `text sequence type `_ (commonly known as ``str`` in Python 3).\nA document could be anything from a short 140 character tweet, a single\nparagraph (i.e., journal article abstract), a news article, or a book.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"document = \"Human machine interface for lab abc computer applications\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nCorpus\n------\n\nA *corpus* is a collection of `core_concepts_document` objects.\nCorpora serve two roles in Gensim:\n\n1. Input for training a `core_concepts_model`.\n During training, the models use this *training corpus* to look for common\n themes and topics, initializing their internal model parameters.\n\n Gensim focuses on *unsupervised* models so that no human intervention,\n such as costly annotations or tagging documents by hand, is required.\n\n2. Documents to organize.\n After training, a topic model can be used to extract topics from new\n documents (documents not seen in the training corpus).\n\n Such corpora can be indexed for\n `sphx_glr_auto_examples_core_run_similarity_queries.py`,\n queried by semantic similarity, clustered etc.\n\nHere is an example corpus.\nIt consists of 9 documents, where each document is a string consisting of a single sentence.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"text_corpus = [\n \"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A survey\",\n]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".. Important::\n The above example loads the entire corpus into memory.\n In practice, corpora may be very large, so loading them into memory may be impossible.\n Gensim intelligently handles such corpora by *streaming* them one document at a time.\n See `corpus_streaming_tutorial` for details.\n\nThis is a particularly small example of a corpus for illustration purposes.\nAnother example could be a list of all the plays written by Shakespeare, list\nof all wikipedia articles, or all tweets by a particular person of interest.\n\nAfter collecting our corpus, there are typically a number of preprocessing\nsteps we want to undertake. We'll keep it simple and just remove some\ncommonly used English words (such as 'the') and words that occur only once in\nthe corpus. In the process of doing so, we'll tokenize our data.\nTokenization breaks up the documents into words (in this case using space as\na delimiter).\n\n.. Important::\n There are better ways to perform preprocessing than just lower-casing and\n splitting by space. Effective preprocessing is beyond the scope of this\n tutorial: if you're interested, check out the\n :py:func:`gensim.utils.simple_preprocess` function.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Create a set of frequent words\nstoplist = set('for a of the and to in'.split(' '))\n# Lowercase each document, split it by white space and filter out stopwords\ntexts = [[word for word in document.lower().split() if word not in stoplist]\n for document in text_corpus]\n\n# Count word frequencies\nfrom collections import defaultdict\nfrequency = defaultdict(int)\nfor text in texts:\n for token in text:\n frequency[token] += 1\n\n# Only keep words that appear more than once\nprocessed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]\npprint.pprint(processed_corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before proceeding, we want to associate each word in the corpus with a unique\ninteger ID. We can do this using the :py:class:`gensim.corpora.Dictionary`\nclass. This dictionary defines the vocabulary of all words that our\nprocessing knows about.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import corpora\n\ndictionary = corpora.Dictionary(processed_corpus)\nprint(dictionary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because our corpus is small, there are only 12 different tokens in this\n:py:class:`gensim.corpora.Dictionary`. For larger corpora, dictionaries containing\nhundreds of thousands of tokens are quite common.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nVector\n------\n\nTo infer the latent structure in our corpus we need a way to represent\ndocuments that we can manipulate mathematically. One approach is to represent\neach document as a vector of *features*.\nFor example, a single feature may be thought of as a question-answer pair:\n\n1. How many times does the word *splonge* appear in the document? Zero.\n2. How many paragraphs does the document consist of? Two.\n3. How many fonts does the document use? Five.\n\nThe question is usually represented only by its integer id (such as `1`, `2` and `3`).\nThe representation of this document then becomes a series of pairs like ``(1, 0.0), (2, 2.0), (3, 5.0)``.\nThis is known as a *dense vector*, because it contains an explicit answer to each of the above questions.\n\nIf we know all the questions in advance, we may leave them implicit\nand simply represent the document as ``(0, 2, 5)``.\nThis sequence of answers is the **vector** for our document (in this case a 3-dimensional dense vector).\nFor practical purposes, only questions to which the answer is (or\ncan be converted to) a *single floating point number* are allowed in Gensim.\n\nIn practice, vectors often consist of many zero values.\nTo save memory, Gensim omits all vector elements with value 0.0.\nThe above example thus becomes ``(2, 2.0), (3, 5.0)``.\nThis is known as a *sparse vector* or *bag-of-words vector*.\nThe values of all missing features in this sparse representation can be unambiguously resolved to zero, ``0.0``.\n\nAssuming the questions are the same, we can compare the vectors of two different documents to each other.\nFor example, assume we are given two vectors ``(0.0, 2.0, 5.0)`` and ``(0.1, 1.9, 4.9)``.\nBecause the vectors are very similar to each other, we can conclude that the documents corresponding to those vectors are similar, too.\nOf course, the correctness of that conclusion depends on how well we picked the questions in the first place.\n\nAnother approach to represent a document as a vector is the *bag-of-words\nmodel*.\nUnder the bag-of-words model each document is represented by a vector\ncontaining the frequency counts of each word in the dictionary.\nFor example, assume we have a dictionary containing the words\n``['coffee', 'milk', 'sugar', 'spoon']``.\nA document consisting of the string ``\"coffee milk coffee\"`` would then\nbe represented by the vector ``[2, 1, 0, 0]`` where the entries of the vector\nare (in order) the occurrences of \"coffee\", \"milk\", \"sugar\" and \"spoon\" in\nthe document. The length of the vector is the number of entries in the\ndictionary. One of the main properties of the bag-of-words model is that it\ncompletely ignores the order of the tokens in the document that is encoded,\nwhich is where the name bag-of-words comes from.\n\nOur processed corpus has 12 unique words in it, which means that each\ndocument will be represented by a 12-dimensional vector under the\nbag-of-words model. We can use the dictionary to turn tokenized documents\ninto these 12-dimensional vectors. We can see what these IDs correspond to:\n\n\n"
]
},
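{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the coffee/milk example above, we can compute the same vector in plain Python (a toy sketch; the ``toy_dictionary`` and ``toy_document`` names are made up for this illustration):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy bag-of-words: count how often each dictionary word occurs in the document\ntoy_dictionary = ['coffee', 'milk', 'sugar', 'spoon']\ntoy_document = \"coffee milk coffee\".split()\ntoy_vector = [toy_document.count(word) for word in toy_dictionary]\nprint(toy_vector)  # [2, 1, 0, 0]"
]
},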
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pprint  # not imported earlier in this notebook\n\npprint.pprint(dictionary.token2id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, suppose we wanted to vectorize the phrase \"Human computer\ninteraction\" (note that this phrase was not in our original corpus). We can\ncreate the bag-of-words representation for a document using the ``doc2bow``\nmethod of the dictionary, which returns a sparse representation of the word\ncounts:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"new_doc = \"Human computer interaction\"\nnew_vec = dictionary.doc2bow(new_doc.lower().split())\nprint(new_vec)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first entry in each tuple corresponds to the ID of the token in the\ndictionary, the second corresponds to the count of this token.\n\nNote that \"interaction\" did not occur in the original corpus and so it was\nnot included in the vectorization. Also note that this vector only contains\nentries for words that actually appeared in the document. Because any given\ndocument will only contain a few words out of the many words in the\ndictionary, words that do not appear in the vectorization are represented as\nimplicitly zero as a space saving measure.\n\nWe can convert our entire original corpus to a list of vectors:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pprint\n\n# the tokenized documents are stored in ``texts`` in this notebook\nbow_corpus = [dictionary.doc2bow(text) for text in texts]\npprint.pprint(bow_corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that while this list lives entirely in memory, in most applications you\nwill want a more scalable solution. Luckily, ``gensim`` allows you to use any\niterator that returns a single document vector at a time. See the\ndocumentation for more details.\n\n.. Important::\n The distinction between a document and a vector is that the former is text,\n and the latter is a mathematically convenient representation of the text.\n Sometimes, people will use the terms interchangeably: for example, given\n some arbitrary document ``D``, instead of saying \"the vector that\n corresponds to document ``D``\", they will just say \"the vector ``D``\" or\n the \"document ``D``\". This achieves brevity at the cost of ambiguity.\n\n As long as you remember that documents exist in document space, and that\n vectors exist in vector space, the above ambiguity is acceptable.\n\n.. Important::\n Depending on how the representation was obtained, two different documents\n may have the same vector representations.\n\n\nModel\n-----\n\nNow that we have vectorized our corpus we can begin to transform it using\n*models*. We use model as an abstract term referring to a *transformation* from\none document representation to another. In ``gensim`` documents are\nrepresented as vectors so a model can be thought of as a transformation\nbetween two vector spaces. The model learns the details of this\ntransformation during training, when it reads the training\n`core_concepts_corpus`.\n\nOne simple example of a model is `tf-idf\n`_. The tf-idf model\ntransforms vectors from the bag-of-words representation to a vector space\nwhere the frequency counts are weighted according to the relative rarity of\neach word in the corpus.\n\nHere's a simple example. Let's initialize the tf-idf model, training it on\nour corpus and transforming the string \"system minors\":\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import models\n\n# train the model\ntfidf = models.TfidfModel(bow_corpus)\n\n# transform the \"system minors\" string\nwords = \"system minors\".lower().split()\nprint(tfidf[dictionary.doc2bow(words)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``tfidf`` model again returns a list of tuples, where the first entry is\nthe token ID and the second entry is the tf-idf weighting. Note that the ID\ncorresponding to \"system\" (which occurred 4 times in the original corpus) has\nbeen weighted lower than the ID corresponding to \"minors\" (which only\noccurred twice).\n\nYou can save trained models to disk and later load them back, either to\ncontinue training on new training documents or to transform new documents.\n\n``gensim`` offers a number of different models/transformations.\nFor more, see `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\nOnce you've created the model, you can do all sorts of cool stuff with it.\nFor example, to transform the whole corpus via TfIdf and index it, in\npreparation for similarity queries:\n\n\n"
]
},
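{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, to persist the ``tfidf`` model trained above and load it back later (the file path below is just an illustration):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\nimport tempfile\n\n# save the trained tf-idf model to disk...\ntmp_path = os.path.join(tempfile.gettempdir(), 'model.tfidf')\ntfidf.save(tmp_path)\n\n# ...and load it back, ready to transform new documents\nloaded_tfidf = models.TfidfModel.load(tmp_path)\nprint(loaded_tfidf[dictionary.doc2bow(\"system minors\".lower().split())])"
]
},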
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import similarities\n\nindex = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and to query the similarity of our query document ``query_document`` against every document in the corpus:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"query_document = 'system engineering'.split()\nquery_bow = dictionary.doc2bow(query_document)\nsims = index[tfidf[query_bow]]\nprint(list(enumerate(sims)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How to read this output?\nDocument 3 has a similarity score of 0.718, i.e. about 72%; document 2 has a similarity score of 42%, etc.\nWe can make this slightly more readable by sorting:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):\n print(document_number, score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summary\n-------\n\nThe core concepts of ``gensim`` are:\n\n1. `core_concepts_document`: some text.\n2. `core_concepts_corpus`: a collection of documents.\n3. `core_concepts_vector`: a mathematically convenient representation of a document.\n4. `core_concepts_model`: an algorithm for transforming vectors from one representation to another.\n\nWe saw these concepts in action.\nFirst, we started with a corpus of documents.\nNext, we transformed these documents to a vector space representation.\nAfter that, we created a model that transformed our original vector representation to TfIdf.\nFinally, we used our model to calculate the similarity between some query document and all documents in the corpus.\n\nWhat Next?\n----------\n\nThere's still much more to learn about `sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py`.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('run_core_concepts.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZO2,M ,M ( core/run_corpora_and_vector_spaces.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nCorpora and Vector Spaces\n=========================\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nFrom Strings to Vectors\n------------------------\n\nFirst, let\u2019s create a small corpus of nine short documents [1]_.\nThis time, let's start from documents represented as strings:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"documents = [\n \"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A survey\",\n]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a tiny corpus of nine documents, each consisting of only a single sentence.\n\nFirst, let's tokenize the documents, remove common words (using a toy stoplist)\nas well as words that only appear once in the corpus:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pprint import pprint # pretty-printer\nfrom collections import defaultdict\n\n# remove common words and tokenize\nstoplist = set('for a of the and to in'.split())\ntexts = [\n [word for word in document.lower().split() if word not in stoplist]\n for document in documents\n]\n\n# remove words that appear only once\nfrequency = defaultdict(int)\nfor text in texts:\n for token in text:\n frequency[token] += 1\n\ntexts = [\n [token for token in text if frequency[token] > 1]\n for text in texts\n]\n\npprint(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your way of processing the documents will likely vary; here, I only split on whitespace\nto tokenize, followed by lowercasing each word. In fact, I use this particular\n(simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.'s\noriginal LSA article [1]_.\n\nThe ways to process documents are so varied and application- and language-dependent that I\ndecided to *not* constrain them by any interface. Instead, a document is represented\nby the features extracted from it, not by its \"surface\" string form: how you get to\nthe features is up to you. Below I describe one common, general-purpose approach (called\n:dfn:`bag-of-words`), but keep in mind that different application domains call for\ndifferent features, and, as always, it's `garbage in, garbage out `_...\n\nTo convert documents to vectors, we'll use a document representation called\n`bag-of-words `_. In this representation,\neach document is represented by one vector where each vector element represents\na question-answer pair, in the style of:\n\n- Question: How many times does the word `system` appear in the document?\n- Answer: Once.\n\nIt is advantageous to represent the questions only by their (integer) ids. The mapping\nbetween the questions and ids is called a dictionary:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim import corpora\ndictionary = corpora.Dictionary(texts)\ndictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference\nprint(dictionary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we assigned a unique integer id to all words appearing in the corpus with the\n:class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts\nand relevant statistics. In the end, we see there are twelve distinct words in the\nprocessed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector).\nTo see the mapping between words and their ids:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(dictionary.token2id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To actually convert tokenized documents to vectors:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"new_doc = \"Human computer interaction\"\nnew_vec = dictionary.doc2bow(new_doc.lower().split())\nprint(new_vec) # the word \"interaction\" does not appear in the dictionary and is ignored"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function :func:`doc2bow` simply counts the number of occurrences of\neach distinct word, converts the word to its integer word id\nand returns the result as a sparse vector. The sparse vector ``[(0, 1), (1, 1)]``\ntherefore reads: in the document `\"Human computer interaction\"`, the words `computer`\n(id 0) and `human` (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus = [dictionary.doc2bow(text) for text in texts]\ncorpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use\nprint(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\nCorpus Streaming -- One Document at a Time\n-------------------------------------------\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus be able to return one document vector at a time:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus(object):\n def __iter__(self):\n for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full power of Gensim comes from the fact that a corpus doesn't have to be\na ``list``, or a ``NumPy`` array, or a ``Pandas`` dataframe, or whatever.\nGensim *accepts any object that, when iterated over, successively yields\ndocuments*.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# This flexibility allows you to create your own corpus classes that stream the\n# documents directly from disk, network, database, dataframes... The models\n# in Gensim are implemented such that they don't require all vectors to reside\n# in RAM at once. You can even create the documents on the fly!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n"
]
},
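{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, a corpus that walks a directory of plain-text files (one document per file) might look like this. This is only a sketch: the ``TxtDirCorpus`` class and its one-file-per-document layout are assumptions for this illustration, not part of Gensim:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n\n\nclass TxtDirCorpus(object):\n    \"\"\"Stream one bag-of-words vector per .txt file in a directory.\"\"\"\n\n    def __init__(self, dirname, dictionary):\n        self.dirname = dirname\n        self.dictionary = dictionary\n\n    def __iter__(self):\n        for fname in sorted(os.listdir(self.dirname)):\n            if fname.endswith('.txt'):\n                with open(os.path.join(self.dirname, fname)) as fin:\n                    # one document per file: tokenize, then yield its sparse vector\n                    yield self.dictionary.doc2bow(fin.read().lower().split())"
]
},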
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!\nprint(corpus_memory_friendly)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Corpus is now an object. We didn't define any way to print it, so `print` just outputs the address\nof the object in memory. Not very useful. To see the constituent vectors, let's\niterate over the corpus and print each document vector (one at a time):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for vector in corpus_memory_friendly: # load one vector into memory at a time\n print(vector)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the output is the same as for the plain Python list, the corpus is now much\nmore memory friendly, because at most one vector resides in RAM at a time. Your\ncorpus can now be as large as you want.\n\nSimilarly, to construct the dictionary without loading all texts into memory:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n    dictionary.token2id[stopword]\n    for stopword in stoplist\n    if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once\ndictionary.compactify()  # remove gaps in id sequence after words that were removed\nprint(dictionary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\nCorpus Formats\n---------------\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Matrix Market format `_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it\n\ncorpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other formats include `Joachim's SVMlight format `_,\n`Blei's LDA-C format `_ and\n`GibbsLDA++ format `_.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)\ncorpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)\ncorpora.LowCorpus.serialize('/tmp/corpus.low', corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conversely, to load a corpus iterator from a Matrix Market file:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus = corpora.MmCorpus('/tmp/corpus.mm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Corpus objects are streams, so typically you won't be able to print them directly:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead, to view the contents of a corpus:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# one way of printing a corpus: load it entirely into memory\nprint(list(corpus)) # calling list() will convert any sequence to a plain Python list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# another way of doing it: print one document at a time, making use of the streaming interface\nfor doc in corpus:\n print(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second way is obviously more memory-friendly, but for testing and development\npurposes, nothing beats the simplicity of calling ``list(corpus)``.\n\nTo save the same Matrix Market document stream in Blei's LDA-C format,\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy; check out the `code for the SVMlight corpus\n`_ for an example.\n\nCompatibility with NumPy and SciPy\n----------------------------------\n\nGensim also contains `efficient utility functions `_\nto help convert from/to NumPy matrices:\n\n"
]
},
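{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete sketch of such a conversion, we can stream the Matrix Market file saved earlier straight into SVMlight format (the output path below is illustrative):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# load a document stream in one format and immediately save it in another;\n# at most one document resides in memory at any time\nmm_corpus = corpora.MmCorpus('/tmp/corpus.mm')\ncorpora.SvmLightCorpus.serialize('/tmp/corpus_converted.svmlight', mm_corpus)"
]
},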
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim\nimport numpy as np\nnumpy_matrix = np.random.randint(10, size=[5, 2]) # random matrix as an example\ncorpus = gensim.matutils.Dense2Corpus(numpy_matrix)\n# numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and from/to `scipy.sparse` matrices\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import scipy.sparse\nscipy_sparse_matrix = scipy.sparse.random(5, 2) # random sparse matrix as example\ncorpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)\nscipy_csc_matrix = gensim.matutils.corpus2csc(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What Next\n---------\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\nReferences\n----------\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n `Deerwester et al. (1990): Indexing by Latent Semantic Analysis `_, Table 2.\n\n"
]
},
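{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small taste of that API, pruning a dictionary is a one-liner with ``Dictionary.filter_extremes`` (the thresholds below are illustrative, not recommendations):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# drop tokens that appear in fewer than 2 documents or in more than half of them,\n# then keep at most the 100,000 most frequent of the surviving tokens\ndictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)\nprint(dictionary)"
]
},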
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we show an illustrative image so that our gallery picks it up as a thumbnail.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('run_corpora_and_vector_spaces.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZO'] ] howtos/run_doc2vec_imdb.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nHow to Apply Doc2Vec to Reproduce the 'Paragraph Vector' paper\n==============================================================\n\nShows how to reproduce results of the \"Distributed Representations of Sentences and Documents\" paper by Le and Mikolov using Gensim.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Introduction\n------------\n\nThis guide shows you how to reproduce the results of the paper by `Le and\nMikolov 2014 `_ using Gensim. While the\nentire paper is worth reading (it's only 9 pages), we will be focusing on\nSection 3.2: \"Beyond One Sentence - Sentiment Analysis with the IMDB\ndataset\".\n\nThis guide takes the following steps:\n\n#. Load the IMDB dataset\n#. Train a variety of Doc2Vec models on the dataset\n#. Evaluate the performance of each model using a logistic regression\n#. Examine some of the results directly\n\nWhen examining results, we will look for answers to the following questions:\n\n#. Are inferred vectors close to the precalculated ones?\n#. Do close documents seem more related than distant ones?\n#. Do the word vectors show useful similarities?\n#. Are the word vectors from this dataset any good at analogies?\n\nLoad corpus\n-----------\n\nOur data for the tutorial will be the `IMDB archive\n`_.\nIf you're not familiar with this dataset, then here's a brief intro: it\ncontains 100,000 movie reviews.\n\nEach review is a single line of text containing multiple sentences, for example:\n\n```\nOne of the best movie-dramas I have ever seen. We do a lot of acting in the\nchurch and this is one that can be used as a resource that highlights all the\ngood things that actors can do in their work. I highly recommend this one,\nespecially for those who have an interest in acting, as a \"must see.\"\n```\n\nThese reviews will be the **documents** that we will work with in this tutorial.\nThey break down as follows:\n\n#. 
25k reviews for training (12.5k positive, 12.5k negative)\n#. 25k reviews for testing (12.5k positive, 12.5k negative)\n#. 50k unlabeled reviews\n\nOut of 100k reviews, 50k have a label: either positive (the reviewer liked\nthe movie) or negative.\nThe remaining 50k are unlabeled.\n\nOur first task will be to prepare the dataset.\n\nMore specifically, we will:\n\n#. Download the tar.gz file (it's only 84MB, so this shouldn't take too long)\n#. Unpack it and extract each movie review\n#. Split the reviews into training and test datasets\n\nFirst, let's define a convenient datatype for holding data for a single document:\n\n* words: The text of the document, as a ``list`` of words.\n* tags: Used to keep the index of the document in the entire dataset.\n* split: one of ``train``\\ , ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).\n* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).\n\nThis data type is helpful for later evaluation and reporting.\nIn particular, the ``tags`` member (which holds the document's index) will help us quickly and easily retrieve the vectors for a document from a model.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n\nSentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now proceed with loading the corpus.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import io\nimport re\nimport tarfile\nimport os.path\n\nimport smart_open\nimport gensim.utils\n\ndef download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):\n fname = url.split('/')[-1]\n\n if os.path.isfile(fname):\n return fname\n\n # Download the file to local storage first.\n # We can't read it on the fly because of\n # https://github.com/RaRe-Technologies/smart_open/issues/331\n with smart_open.open(url, \"rb\", ignore_ext=True) as fin:\n with smart_open.open(fname, 'wb', ignore_ext=True) as fout:\n while True:\n buf = fin.read(io.DEFAULT_BUFFER_SIZE)\n if not buf:\n break\n fout.write(buf)\n\n return fname\n\ndef create_sentiment_document(name, text, index):\n _, split, sentiment_str, _ = name.split('/')\n sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]\n\n if sentiment is None:\n split = 'extra'\n\n tokens = gensim.utils.to_unicode(text).split()\n return SentimentDocument(tokens, [index], split, sentiment)\n\ndef extract_documents():\n fname = download_dataset()\n\n index = 0\n\n with tarfile.open(fname, mode='r:gz') as tar:\n for member in tar.getmembers():\n if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\\d+_\\d+.txt$', member.name):\n member_bytes = tar.extractfile(member).read()\n member_text = member_bytes.decode('utf-8', errors='replace')\n assert member_text.count('\\n') == 0\n yield create_sentiment_document(member.name, member_text, index)\n index += 1\n\nalldocs = list(extract_documents())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what a single document looks like\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(alldocs[27])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract our documents and split into training/test sets\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"train_docs = [doc for doc in alldocs if doc.split == 'train']\ntest_docs = [doc for doc in alldocs if doc.split == 'test']\nprint('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set-up Doc2Vec Training & Evaluation Models\n-------------------------------------------\nWe approximate the experiment of Le & Mikolov `\"Distributed Representations\nof Sentences and Documents\"\n`_ with guidance from\nMikolov's `example go.sh\n`_::\n\n ./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1\n\nWe vary the following parameter choices:\n\n* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of\n memory and, in our tests of this task, don't seem to offer much benefit\n* Similarly, frequent word subsampling seems to decrease sentiment-prediction\n accuracy, so it's left out\n* ``cbow=0`` means skip-gram which is equivalent to the paper's 'PV-DBOW'\n mode, matched in gensim with ``dm=0``\n* Added to that DBOW model are two DM models, one which averages context\n vectors (\\ ``dm_mean``\\ ) and one which concatenates them (\\ ``dm_concat``\\ ,\n resulting in a much larger, slower, more data-hungry model)\n* A ``min_count=2`` saves quite a bit of model memory, discarding only words\n that appear in a single doc (and are thus no more expressive than the\n unique-to-each doc vectors themselves)\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import multiprocessing\nfrom collections import OrderedDict\n\nimport gensim.models.doc2vec\nassert gensim.models.doc2vec.FAST_VERSION > -1, \"This will be painfully slow otherwise\"\n\nfrom gensim.models.doc2vec import Doc2Vec\n\ncommon_kwargs = dict(\n vector_size=100, epochs=20, min_count=2,\n sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0,\n)\n\nsimple_models = [\n # PV-DBOW plain\n Doc2Vec(dm=0, **common_kwargs),\n # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes\n Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),\n # PV-DM w/ concatenation - big, slow, experimental mode\n # window=5 (both sides) approximates paper's apparent 10-word total window size\n Doc2Vec(dm=1, dm_concat=1, window=5, **common_kwargs),\n]\n\nfor model in simple_models:\n model.build_vocab(alldocs)\n print(\"%s vocabulary scanned & state initialized\" % model)\n\nmodels_by_name = OrderedDict((str(model), model) for model in simple_models)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Le and Mikolov note that combining a paragraph vector from Distributed Bag of\nWords (DBOW) and Distributed Memory (DM) improves performance. We will\nfollow, pairing the models together for evaluation. Here, we concatenate the\nparagraph vectors obtained from each model with the help of a thin wrapper\nclass included in a gensim test module. (Note that this a separate, later\nconcatenation of output-vectors than the kind of input-window-concatenation\nenabled by the ``dm_concat=1`` mode above.)\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.test.test_doc2vec import ConcatenatedDoc2Vec\nmodels_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])\nmodels_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictive Evaluation Methods\n-----------------------------\n\nGiven a document, our ``Doc2Vec`` models output a vector representation of the document.\nHow useful is a particular model?\nIn the case of sentiment analysis, we want the output vector to reflect the sentiment in the input document.\nSo, in vector space, positive documents should be distant from negative documents.\n\nWe train a logistic regression from the training set:\n\n - regressors (inputs): document vectors from the Doc2Vec model\n - targets (outputs): sentiment labels\n\nSo, this logistic regression will be able to predict sentiment given a document vector.\n\nNext, we test our logistic regression on the test set, and measure the rate of errors (incorrect predictions).\nIf the document vectors from the Doc2Vec model reflect the actual sentiment well, the error rate will be low.\n\nTherefore, the error rate of the logistic regression is an indication of *how well* the given Doc2Vec model represents documents as vectors.\nWe can then compare different ``Doc2Vec`` models by looking at their error rates.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\nimport statsmodels.api as sm\nfrom random import sample\n\ndef logistic_predictor_from_data(train_targets, train_regressors):\n \"\"\"Fit a statsmodel logistic predictor on supplied data\"\"\"\n logit = sm.Logit(train_targets, train_regressors)\n predictor = logit.fit(disp=0)\n # print(predictor.summary())\n return predictor\n\ndef error_rate_for_model(test_model, train_set, test_set):\n \"\"\"Report error rate on test_doc sentiments, using supplied model and train_docs\"\"\"\n\n train_targets = [doc.sentiment for doc in train_set]\n train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]\n train_regressors = sm.add_constant(train_regressors)\n predictor = logistic_predictor_from_data(train_targets, train_regressors)\n\n test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_set]\n test_regressors = sm.add_constant(test_regressors)\n\n # Predict & evaluate\n test_predictions = predictor.predict(test_regressors)\n corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])\n errors = len(test_predictions) - corrects\n error_rate = float(errors) / len(test_predictions)\n return (error_rate, errors, len(test_predictions), predictor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bulk Training & Per-Model Evaluation\n------------------------------------\n\nNote that doc-vector training is occurring on *all* documents of the dataset,\nwhich includes all TRAIN/TEST/DEV docs. Because the native document-order\nhas similar-sentiment documents in large clumps \u2013 which is suboptimal for\ntraining \u2013 we work with a once-shuffled copy of the full document set.\n\nWe evaluate each model's sentiment-predictive power by its error rate on the\ntest set.\n\n(On a 4-core 2.6GHz Intel Core i7, training and evaluating the 3 main models\nover these 20 passes takes about an hour.)\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import defaultdict\nerror_rates = defaultdict(lambda: 1.0) # To selectively print only best errors achieved"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from random import shuffle\nshuffled_alldocs = alldocs[:]\nshuffle(shuffled_alldocs)\n\nfor model in simple_models:\n print(\"Training %s\" % model)\n model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)\n\n print(\"\\nEvaluating %s\" % model)\n err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n error_rates[str(model)] = err_rate\n print(\"\\n%f %s\\n\" % (err_rate, model))\n\nfor model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:\n print(\"\\nEvaluating %s\" % model)\n err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n error_rates[str(model)] = err_rate\n print(\"\\n%f %s\\n\" % (err_rate, model))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Achieved Sentiment-Prediction Accuracy\n--------------------------------------\nCompare error rates achieved, best-to-worst\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"Err_rate Model\")\nfor rate, name in sorted((rate, name) for name, rate in error_rates.items()):\n print(\"%f %s\" % (rate, name))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our testing, contrary to the results of the paper, on this problem,\nPV-DBOW alone performs as well as anything else. Concatenating vectors from\ndifferent models only sometimes offers a tiny predictive improvement \u2013 and\nstays generally close to the best-performing solo model included.\n\nThe best results achieved here are just around 10% error rate, still a long\nway from the paper's reported 7.42% error rate.\n\n(Other trials not shown, with larger vectors and other changes, also don't\ncome close to the paper's reported value. Others around the net have reported\na similar inability to reproduce the paper's best numbers. The PV-DM/C mode\nimproves a bit with many more training epochs \u2013 but doesn't reach parity with\nPV-DBOW.)\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examining Results\n-----------------\n\nLet's look for answers to the following questions:\n\n#. Are inferred vectors close to the precalculated ones?\n#. Do close documents seem more related than distant ones?\n#. Do the word vectors show useful similarities?\n#. Are the word vectors from this dataset any good at analogies?\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Are inferred vectors close to the precalculated ones?\n-----------------------------------------------------\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"doc_id = np.random.randint(simple_models[0].docvecs.count) # Pick random doc; re-run cell for more examples\nprint('for doc %d...' % doc_id)\nfor model in simple_models:\n inferred_docvec = model.infer_vector(alldocs[doc_id].words)\n print('%s:\\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Yes, here the stored vector from 20 epochs of training is usually one of the\nclosest to a freshly-inferred vector for the same words. Default inference\nparameters may benefit from tuning for each dataset and model.)\n\n\n"
]
},
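{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of such tuning (the ``alpha`` and ``epochs`` values below are illustrative assumptions for a recent gensim, not recommendations), you can pass inference parameters explicitly to ``infer_vector`` and check whether the stored vector still ranks among the closest:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# re-infer the same document with more inference epochs and a smaller starting alpha\nmodel = simple_models[0]\ntuned_vec = model.infer_vector(alldocs[doc_id].words, alpha=0.025, epochs=100)\nprint(model.docvecs.most_similar([tuned_vec], topn=3))"
]
},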
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do close documents seem more related than distant ones?\n-------------------------------------------------------\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import random\n\ndoc_id = np.random.randint(simple_models[0].docvecs.count) # pick random doc, re-run cell for more examples\nmodel = random.choice(simple_models) # and a random model\nsims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count) # get *all* similar documents\nprint(u'TARGET (%d): \u00ab%s\u00bb\\n' % (doc_id, ' '.join(alldocs[doc_id].words)))\nprint(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\nfor label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n s = sims[index]\n i = sims[index][0]\n words = ' '.join(alldocs[i].words)\n print(u'%s %s: \u00ab%s\u00bb\\n' % (label, s, words))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Somewhat, in terms of reviewer tone, movie genre, etc... the MOST\ncosine-similar docs usually seem more like the TARGET than the MEDIAN or\nLEAST... especially if the MOST has a cosine-similarity > 0.5. Re-run the\ncell to try another random target document.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the word vectors show useful similarities?\n---------------------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import random\n\nword_models = simple_models[:]\n\ndef pick_random_word(model, threshold=10):\n    # pick a random word with a suitable number of occurrences\n    while True:\n        word = random.choice(model.wv.index2word)\n        if model.wv.vocab[word].count > threshold:\n            return word\n\ntarget_word = pick_random_word(word_models[0])\n# or uncomment the line below to pick a word from the relevant domain:\n# target_word = 'comedy/drama'\n\nfor model in word_models:\n    print('target_word: %r model: %s similar words:' % (target_word, model))\n    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):\n        print('    %d. %.2f %r' % (i, sim, word))\n    print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the DBOW words look meaningless? That's because the gensim DBOW model\ndoesn't train word vectors \u2013 they remain at their randomly-initialized values \u2013\nunless you ask with the ``dbow_words=1`` initialization parameter. Concurrent\nword-training slows DBOW mode significantly, and offers little improvement\n(and sometimes a little worsening) of the error rate on this IMDB\nsentiment-prediction task, but may be appropriate on other tasks, or if you\nalso need word-vectors.\n\nWords from DM models tend to show meaningfully similar words when there are\nmany examples in the training data (as with 'plot' or 'actor'). (All DM modes\ninherently involve word-vector training concurrent with doc-vector training.)\n\n\n"
]
},
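{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do need word vectors from a DBOW model, a minimal sketch (re-using ``common_kwargs`` from the setup above; the model is only vocabulary-initialized here, not trained) looks like this:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# PV-DBOW with concurrent skip-gram word training; roughly doubles training time\ndbow_words_model = Doc2Vec(dm=0, dbow_words=1, comment='dbow_words=1', **common_kwargs)\ndbow_words_model.build_vocab(alldocs)\nprint(dbow_words_model)"
]
},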
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Are the word vectors from this dataset any good at analogies?\n-------------------------------------------------------------\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\nimport smart_open\n\n# grab the file if not already local\nquestions_filename = 'questions-words.txt'\nif not os.path.isfile(questions_filename):\n    # download the analogy questions file\n    print(\"Downloading analogy questions file...\")\n    url = u'https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt'\n    with smart_open.open(url, 'rb') as fin:\n        with smart_open.open(questions_filename, 'wb') as fout:\n            fout.write(fin.read())\nassert os.path.isfile(questions_filename), \"questions-words.txt unavailable\"\nprint(\"Success, questions-words.txt is available for next steps.\")\n\n# Note: this analysis takes many minutes\nfor model in word_models:\n    score, sections = model.wv.evaluate_word_analogies('questions-words.txt')\n    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n    print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though this is a tiny, domain-specific dataset, it shows some meager\ncapability on the general word analogies \u2013 at least for the DM/mean and\nDM/concat models which actually train word vectors. (The untrained\nrandom-initialized words of the DBOW model of course fail miserably.)\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

howtos/run_compare_lda.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nHow to Compare LDA Models\n=========================\n\nDemonstrates how you can compare a topic model with itself or other models.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sphinx_gallery_thumbnail_number = 2\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, clean up the 20 Newsgroups dataset. We will use it to fit LDA.\n---------------------------------------------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from string import punctuation\nfrom nltk import RegexpTokenizer\nfrom nltk.stem.porter import PorterStemmer\nfrom nltk.corpus import stopwords\nfrom sklearn.datasets import fetch_20newsgroups\n\n\nnewsgroups = fetch_20newsgroups()\neng_stopwords = set(stopwords.words('english'))\n\ntokenizer = RegexpTokenizer(r'\\s+', gaps=True)\nstemmer = PorterStemmer()\ntranslate_tab = {ord(p): u\" \" for p in punctuation}\n\ndef text2tokens(raw_text):\n    \"\"\"Convert a raw text to a list of stemmed tokens.\"\"\"\n    clean_text = raw_text.lower().translate(translate_tab)\n    tokens = [token.strip() for token in tokenizer.tokenize(clean_text)]\n    tokens = [token for token in tokens if token not in eng_stopwords]\n    stemmed_tokens = [stemmer.stem(token) for token in tokens]\n    return [token for token in stemmed_tokens if len(token) > 2]  # skip short tokens\n\ndataset = [text2tokens(txt) for txt in newsgroups['data']]  # convert each document to a list of tokens\n\nfrom gensim.corpora import Dictionary\ndictionary = Dictionary(documents=dataset, prune_at=None)\ndictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)  # filter out too-rare and too-frequent tokens\ndictionary.compactify()\n\nd2b_dataset = [dictionary.doc2bow(doc) for doc in dataset]  # convert each list of tokens to a bag-of-words representation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Second, fit two LDA models.\n---------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import LdaMulticore\nnum_topics = 15\n\nlda_fst = LdaMulticore(\n corpus=d2b_dataset, num_topics=num_topics, id2word=dictionary,\n workers=4, eval_every=None, passes=10, batch=True\n)\n\nlda_snd = LdaMulticore(\n corpus=d2b_dataset, num_topics=num_topics, id2word=dictionary,\n workers=4, eval_every=None, passes=20, batch=True\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time to visualize, yay!\n-----------------------\n\nWe use two slightly different visualization methods depending on how you're running this tutorial.\nIf you're running via a Jupyter notebook, then you'll get a nice interactive Plotly heatmap.\nIf you're viewing the static version of the page, you'll get a similar matplotlib heatmap, but it won't be interactive.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def plot_difference_plotly(mdiff, title=\"\", annotation=None):\n    \"\"\"Plot the difference between models.\n\n    Uses plotly as the backend.\"\"\"\n    import plotly.graph_objs as go\n    import plotly.offline as py\n\n    annotation_html = None\n    if annotation is not None:\n        annotation_html = [\n            [\n                \"+++ {}<br>--- {}\".format(\", \".join(int_tokens), \", \".join(diff_tokens))\n                for (int_tokens, diff_tokens) in row\n            ]\n            for row in annotation\n        ]\n\n    data = go.Heatmap(z=mdiff, colorscale='RdBu', text=annotation_html)\n    layout = go.Layout(width=950, height=950, title=title, xaxis=dict(title=\"topic\"), yaxis=dict(title=\"topic\"))\n    py.iplot(dict(data=[data], layout=layout))\n\n\ndef plot_difference_matplotlib(mdiff, title=\"\", annotation=None):\n    \"\"\"Helper function to plot difference between models.\n\n    Uses matplotlib as the backend.\"\"\"\n    import matplotlib.pyplot as plt\n    fig, ax = plt.subplots(figsize=(18, 14))\n    data = ax.imshow(mdiff, cmap='RdBu_r', origin='lower')\n    plt.title(title)\n    plt.colorbar(data)\n\n\ntry:\n    get_ipython()\n    import plotly.offline as py\nexcept Exception:\n    #\n    # Fall back to matplotlib if we're not in a notebook, or if plotly is\n    # unavailable for whatever reason.\n    #\n    plot_difference = plot_difference_matplotlib\nelse:\n    py.init_notebook_mode()\n    plot_difference = plot_difference_plotly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gensim can help you visualise the differences between topics. For this purpose, you can use the ``diff()`` method of LdaModel.\n\n``diff()`` returns a matrix with distances **mdiff** and a matrix with annotations **annotation**. Read the docstring for more detailed info.\n\nIn each **mdiff[i][j]** cell you'll find a distance between **topic_i** from the first model and **topic_j** from the second model.\n\nIn each **annotation[i][j]** cell you'll find **[tokens from intersection, tokens from difference]** between **topic_i** from the first model and **topic_j** from the second model.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(LdaMulticore.diff.__doc__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Case 1: How topics within ONE model correlate with each other.\n--------------------------------------------------------------\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Short description:\n\n* x-axis - topic;\n\n* y-axis - topic;\n\n* almost red cell - strongly decorrelated topics;\n\n* almost blue cell - strongly correlated topics.\n\nIn an ideal world, we would like to see different topics decorrelated from each other. In this case, our matrix would look like this:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\nmdiff = np.ones((num_topics, num_topics))\nnp.fill_diagonal(mdiff, 0.)\nplot_difference(mdiff, title=\"Topic difference (one model) in ideal world\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, in real life, not everything is so good, and the matrix looks different.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Short description (interactive annotations only):\n\n* ``+++ make, world, well`` - words from the intersection of topics = present in both topics;\n\n* ``--- money, day, still`` - words from the symmetric difference of topics = present in one topic but not the other.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mdiff, annotation = lda_fst.diff(lda_fst, distance='jaccard', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (one model) [jaccard distance]\", annotation=annotation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you compare a model with itself, you want to see as many red elements as possible (except the diagonal). With this picture, you can look at the not-very-red elements and understand which topics in the model are very similar and why (you can read the annotation if you hover your pointer over a cell).\n\nJaccard is a stable and robust distance function, but it is not sensitive enough for some purposes. Let's try the Hellinger distance now.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mdiff, annotation = lda_fst.diff(lda_fst, distance='hellinger', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (one model)[hellinger distance]\", annotation=annotation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything looks worse now, but remember that it all depends on the task.\n\nYou need to choose a distance function that matches your own view of topic similarity and suits your task (in my experience, Jaccard is fine).\n\n\n"
]
},
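{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides ``jaccard`` and ``hellinger``, ``diff()`` also accepts ``kullback_leibler`` and ``jensen_shannon`` distances (check the docstring for the exact options in your gensim version). For example, with Jensen-Shannon:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mdiff, annotation = lda_fst.diff(lda_fst, distance='jensen_shannon', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (one model) [jensen_shannon distance]\", annotation=annotation)"
]
},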
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Case 2: How topics from DIFFERENT models correlate with each other.\n-------------------------------------------------------------------\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, we want to look at the patterns between two different models and compare them.\n\nYou can do this by constructing a matrix with the difference.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mdiff, annotation = lda_fst.diff(lda_snd, distance='jaccard', num_words=50)\nplot_difference(mdiff, title=\"Topic difference (two models)[jaccard distance]\", annotation=annotation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at this matrix, you can find similar and different topics (and relevant tokens which describe the intersection and difference).\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

howtos/run_doc.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nHow to Author Gensim Documentation\n==================================\n\nSome tips of how to author documentation for ``gensim``.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Background\n----------\n\nGensim is a large project with a wide range of functionality.\nUnfortunately, not all of this functionality is documented **well**, and some of it is not documented at all.\nWithout good documentation, users are unable to unlock Gensim's full potential.\nTherefore, authoring new documentation and improving existing documentation is of great value to the Gensim project.\n\nIf you implement new functionality in Gensim, please include **helpful** documentation.\nBy \"helpful\", we mean that your documentation answers questions that Gensim users may have.\nFor example:\n\n- What is this new functionality?\n- **Why** is it important?\n- **How** is it relevant to Gensim?\n- **What** can I do with it? What are some real-world applications?\n- **How** do I use it to achieve those things?\n- ... and others (if you can think of them, please add them here)\n\nBefore you author documentation, I suggest reading\n`\"What nobody tells you about documentation\" `__\nor watching its `accompanying video `__\n(or even both, if you're really keen).\n\nThe summary of the above presentation is: there are four distinct kinds of documentation, and you really need them all:\n\n1. Tutorials\n2. Howto guides\n3. Explanations\n4. References\n\nEach kind has its own intended audience, purpose, and writing style.\nWhen you make a PR with new functionality, please consider authoring each kind of documentation.\nAt the very least, you will (indirectly) author reference documentation through module, class and function docstrings.\n\nMechanisms\n----------\n\nWe keep our documentation as individual Python scripts.\nThese scripts live under :file:`docs/src/gallery` in one of several subdirectories:\n\n- core: core tutorials.  We try to keep this part small, avoid putting stuff here.\n- tutorials: tutorials.\n- howtos: howto guides.\n\nPick a subdirectory and save your script under it.\nPrefix the name of the script with ``run_``: this way, the documentation builder will run your script each time it builds our docs.\n\nThe contents of the script are straightforward.\nAt the very top, you need a docstring describing what your script does.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r\"\"\"\nTitle\n=====\n\nBrief description.\n\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The title is what will show up in the gallery.\nKeep this short and descriptive.\n\nThe description will appear as a tooltip in the gallery.\nWhen people mouse-over the title, they will see the description.\nKeep this short too.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest of the script is Python, formatted in a special way so that Sphinx Gallery can parse it.\nThe most important properties of this format are:\n\n- Sphinx Gallery will split your script into blocks\n- A block can be Python source or RST-formatted comments\n- To indicate that a block is in RST, prefix it with a line of 80 hash (#) characters.\n- All other blocks will be interpreted as Python source\n\nRead `this link `__ for more details.\nIf you need further examples, check out other ``gensim`` tutorials and guides.\nAll of them (including this one!) have a download link at the bottom of the page, which exposes the Python source they were generated from.\n\nYou should be able to run your script directly from the command line::\n\n python myscript.py\n\nand it should run to completion without error, occasionally printing stuff to standard output.\n\n\n"
]
},
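{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the block structure concrete, here is a hypothetical minimal gallery-style script (contents invented for illustration); the long line of hash characters separates an RST comment block from the surrounding Python source:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r\"\"\"\nMy Hypothetical Tutorial\n========================\n\nBrief description (illustration only).\n\"\"\"\nprint(\"this is a Python source block\")\n\n################################################################################\n# This comment block is rendered as RST because it follows a line of hash\n# characters.  Ordinary Python source resumes after the block ends.\nprint(\"this is another Python source block\")"
]
},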
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Authoring Workflow\n------------------\n\nThere are several ways to author documentation.\nThe simplest and most straightforward is to author your ``script.py`` from scratch.\nYou'll have the following cycle:\n\n1. Make changes\n2. Run ``python script.py``\n3. Check standard output, standard error and return code\n4. If everything works well, stop.\n5. Otherwise, go back to step 1).\n\nIf the above is not your cup of tea, you can also author your documentation as a Jupyter notebook.\nThis is a more flexible approach that enables you to tweak parts of the documentation and re-run them as necessary.\n\nOnce you're happy with the notebook, convert it to a script.py.\nThere's a helpful `script `__ that will do it for you.\nTo use it::\n\n python to_python.py < notebook.ipynb > script.py\n\nYou may have to touch up the resulting ``script.py``.\nMore specifically:\n\n- Update the title\n- Update the description\n- Fix any issues that the markdown-to-RST converter could not deal with\n\nOnce your script.py works, put it in a suitable subdirectory.\nPlease don't include your original Jupyter notebook in the repository - we won't be using it.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Correctness\n-----------\n\nIncorrect documentation can be worse than no documentation at all.\nTake the following steps to ensure correctness:\n\n- Run Python's doctest module on your docstrings\n- Run your documentation scripts from scratch, removing any temporary files/results\n\nUsing data in your documentation\n--------------------------------\n\nSome parts of the documentation require real-world data to be useful.\nFor example, you may need more than just a toy example to demonstrate the benefits of one model over another.\nThis subsection provides some tips for including data in your documentation.\n\nIf possible, use data available via Gensim's\n`downloader API `__.\nThis will reduce the risk of your documentation becoming obsolete because required data is no longer available.\n\nUse the smallest possible dataset: avoid making people unnecessarily load large datasets and models.\nThis will make your documentation faster to run and easier for people to use (they can modify your examples and re-run them quickly).\n\nFinalizing your contribution\n----------------------------\n\nFirst, get Sphinx Gallery to build your documentation::\n\n    make -C docs/src html\n\nThis can take a while if your documentation uses a large dataset, or if you've changed many other tutorials or guides.\nOnce this completes successfully, open ``docs/auto_examples/index.html`` in your browser.\nYou should see your new tutorial or guide in the gallery.\n\nOnce your documentation script is working correctly, it's time to add it to the git repository::\n\n    git add docs/src/gallery/tutorials/run_example.py\n    git add docs/src/auto_examples/tutorials/run_example.{py,py.md5,rst,ipynb}\n    git add docs/src/auto_examples/howtos/sg_execution_times.rst\n    git commit -m \"enter a helpful commit message here\"\n    git push origin branchname\n\n.. Note::\n    You may be wondering what all those other files are.\n    Sphinx Gallery puts a copy of your Python script in ``auto_examples/tutorials``.\n    The .md5 file contains the MD5 hash of the script to enable easy detection of modifications.\n    Gallery also generates .rst (RST for Sphinx) and .ipynb (Jupyter notebook) files from the script.\n    Finally, ``sg_execution_times.rst`` contains the time taken to run each example.\n\nFinally, make a PR on `github `__.\nOne of our friendly maintainers will review it, make suggestions, and eventually merge it.\nYour documentation will then appear in the gallery alongside the rest of the examples.\nAt that stage, give yourself a pat on the back: you're done!\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

howtos/run_downloader_api.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nHow to download pre-trained models and corpora\n==============================================\n\nDemonstrates simple and quick access to common corpora, models, and other data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of Gensim's features is simple and easy access to some common data.\nThe `gensim-data `_ project stores a variety of corpora, models and other data.\nGensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.\nThe module leverages a local cache that ensures data is downloaded at most once.\n\nThis tutorial:\n\n* Retrieves the text8 corpus, unless it is already on your local machine\n* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)\n* Leverages the model to calculate word similarity\n* Demonstrates using the API to load other models and corpora\n\nLet's start by importing the api module.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim.downloader as api"
]
},
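{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloader caches data under a ``gensim-data`` directory in your home folder by default (an assumption based on current Gensim behaviour; the exact location may vary between versions), so a second ``api.load`` of the same resource reads from disk instead of re-downloading. A quick, hedged sketch of where that would be:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n\n# Default download location used by the downloader API (assumed, not queried from gensim).\nprint(os.path.join(os.path.expanduser('~'), 'gensim-data'))"
]
},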
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's download the text8 corpus and load it into memory (the download happens automatically)\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"corpus = api.load('text8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, ``corpus`` is an iterable.\nIf you look under the covers, it has the following definition:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import inspect\nprint(inspect.getsource(corpus.__class__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details, look inside the file that defines the Dataset class for your particular resource.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(inspect.getfile(corpus.__class__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the corpus has been downloaded and loaded, let's use it to train a word2vec model.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.word2vec import Word2Vec\nmodel = Word2Vec(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our word2vec model, let's find words that are similar to 'tree'\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.most_similar('tree'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use the API to download many corpora and models. You can list all of the available models and corpora using the code below:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import json\ninfo = api.info()\nprint(json.dumps(info, indent=4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two types of data: corpora and models.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(info.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at the available corpora:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for corpus_name, corpus_data in sorted(info['corpora'].items()):\n print(\n '%s (%d records): %s' % (\n corpus_name,\n corpus_data.get('num_records', -1),\n corpus_data['description'][:40] + '...',\n )\n )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and the same for models:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for model_name, model_data in sorted(info['models'].items()):\n print(\n '%s (%d records): %s' % (\n model_name,\n model_data.get('num_records', -1),\n model_data['description'][:40] + '...',\n )\n )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get detailed information about the model/corpus, use:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fake_news_info = api.info('fake-news')\nprint(json.dumps(fake_news_info, indent=4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes you do not want to load a model into memory; you just want its path on disk. For that, use:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(api.load('glove-wiki-gigaword-50', return_path=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to load the model to memory, then:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = api.load(\"glove-wiki-gigaword-50\")\nmodel.most_similar(\"glass\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For corpora, the data is never fully loaded into memory: every corpus is wrapped in a special ``Dataset`` class that provides an ``__iter__`` method for streaming.\n\n\n"
]
}
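,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The streaming behaviour can be sketched as follows: we peek at the first few documents of ``corpus`` (each one a list of tokens) without materializing the whole corpus in memory.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import itertools\n\n# Stream the first 3 documents; each is a list of string tokens.\nfor doc in itertools.islice(corpus, 3):\n print(len(doc))"
]
}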
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZORbU U $ tutorials/run_distance_metrics.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nDistance Metrics\n================\n\nIntroduces the concept of distance between two bags of words or distributions, and demonstrates its calculation using gensim.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you simply want to calculate the similarity between documents, then you\nmay want to check out the `Similarity Queries Tutorial\n`_ and the `API reference\n`_. The current\ntutorial shows the building blocks of these larger methods: a small\nsuite of distance metrics.\n\nHere's a brief summary of this tutorial:\n\n1. Set up a small corpus consisting of documents belonging to one of two topics\n2. Train an LDA model to distinguish between the two topics\n3. Use the model to obtain distributions for some sample words\n4. Compare the distributions to each other using a variety of distance metrics:\n\n * Hellinger\n * Kullback-Leibler\n * Jaccard\n\n5. Discuss the concept of distance metrics in slightly more detail\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.corpora import Dictionary\n\n# you can use any corpus, this is just illustratory\ntexts = [\n ['bank','river','shore','water'],\n ['river','water','flow','fast','tree'],\n ['bank','water','fall','flow'],\n ['bank','bank','water','rain','river'],\n ['river','water','mud','tree'],\n ['money','transaction','bank','finance'],\n ['bank','borrow','money'], \n ['bank','finance'],\n ['finance','money','sell','bank'],\n ['borrow','sell'],\n ['bank','loan','sell'],\n]\n\ndictionary = Dictionary(texts)\ncorpus = [dictionary.doc2bow(text) for text in texts]\n\nimport numpy\nnumpy.random.seed(1) # setting random seed to get the same results each time.\n\nfrom gensim.models import ldamodel\nmodel = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, minimum_probability=1e-8)\nmodel.show_topics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's call the first topic the **water** topic and the second topic the **finance** topic.\n\nLet's take a few sample documents and get them ready to test our distance functions.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"doc_water = ['river', 'water', 'shore']\ndoc_finance = ['finance', 'money', 'sell']\ndoc_bank = ['finance', 'bank', 'tree', 'water']\n\n# now let's make these into a bag of words format\nbow_water = model.id2word.doc2bow(doc_water) \nbow_finance = model.id2word.doc2bow(doc_finance) \nbow_bank = model.id2word.doc2bow(doc_bank) \n\n# we can now get the LDA topic distributions for these\nlda_bow_water = model[bow_water]\nlda_bow_finance = model[bow_finance]\nlda_bow_bank = model[bow_bank]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hellinger\n---------\n\nWe're now ready to apply our distance metrics.\n\nLet's start with the popular Hellinger distance. For two probability distributions, it gives an output in the range [0,1], with values closer to 0 indicating a smaller 'distance' and therefore a larger similarity.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.matutils import hellinger\nprint(hellinger(lda_bow_water, lda_bow_finance))\nprint(hellinger(lda_bow_finance, lda_bow_bank))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Makes sense, right? In the first example, Document 1 and Document 2 are hardly similar, so we get a large value of roughly 0.5.\n\nIn the second case, the documents are semantically a lot more similar, so the trained model assigns them a much smaller distance value.\n\n\n"
]
},
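{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the definition concrete, here is a small sketch (with made-up toy numbers, not taken from the model above) that computes the Hellinger distance by hand for two dense probability distributions:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\np = np.array([0.6, 0.4])\nq = np.array([0.3, 0.7])\n\n# Hellinger distance: sqrt(0.5 * sum((sqrt(p) - sqrt(q)) ** 2))\nprint(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))"
]
},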
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kullback\u2013Leibler\n----------------\n\nLet's run similar examples with the Kullback\u2013Leibler divergence.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.matutils import kullback_leibler\n\nprint(kullback_leibler(lda_bow_water, lda_bow_bank))\nprint(kullback_leibler(lda_bow_finance, lda_bow_bank))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".. important::\n KL is not a Distance Metric in the mathematical sense, and hence is not\n symmetrical. This means that ``kullback_leibler(lda_bow_finance,\n lda_bow_bank)`` is not equal to ``kullback_leibler(lda_bow_bank,\n lda_bow_finance)``. \n\nAs you can see, the values are not equal. We'll get more into the details of\nthis later on in the notebook.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(kullback_leibler(lda_bow_bank, lda_bow_finance))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our previous examples we saw that there were lower distance values between\nbank and finance than for bank and water, even if it wasn't by a huge margin.\nWhat does this mean?\n\nThe ``bank`` document is a combination of both water- and finance-related\nterms - but as bank in this context is likely to belong to the finance topic,\nthe distance values are smaller between the finance and bank bows.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# just to confirm our suspicion that the bank bow is more to do with finance:\nmodel.get_document_topics(bow_bank)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's evident that while the distribution isn't too skewed, it leans more towards the finance topic.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Distance metrics (also referred to as similarity metrics), as suggested in\nthe examples above, are mainly for probability distributions, but the methods\ncan accept a variety of input formats. You can do some further reading on\n`Kullback Leibler `_ and `Hellinger\n`_ to figure out what suits\nyour needs.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jaccard\n-------\n\nLet us now look at the `Jaccard Distance\n`_ metric for similarity between\nbags of words (i.e., documents).\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.matutils import jaccard\n\nprint(jaccard(bow_water, bow_bank))\nprint(jaccard(doc_water, doc_bank))\nprint(jaccard(['word'], ['word']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three examples above feature two different input methods.\n\nIn the first case, we pass ``jaccard`` document vectors already in bag-of-words\nformat. The distance is defined as 1 minus the size of the intersection\ndivided by the size of the union of the vectors.\n\nWe can see (on manual inspection as well) that the distance is likely to be\nhigh - and it is.\n\nThe last two examples illustrate that ``jaccard`` also accepts plain lists of\ntokens (i.e., documents) as inputs.\n\nIn the last case, because the two vectors are identical, the value returned is\n0 - the distance is 0 and the two documents are the same.\n\n\n"
]
},
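{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect this to the definition, here is a hand-rolled sketch of the Jaccard distance using plain Python sets over the token lists from earlier (for these particular documents, which contain no repeated tokens, this matches the bag-of-words computation):\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1, s2 = set(doc_water), set(doc_bank)\n\n# Jaccard distance = 1 - |intersection| / |union|\nprint(1 - len(s1 & s2) / len(s1 | s2))"
]
},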
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Distance Metrics for Topic Distributions\n----------------------------------------\n\nWhile there are already standard methods to identify similarity of documents,\nour distance metrics have one more interesting use case: topic distributions. \n\nLet's say we want to find out how similar our two topics, water and finance, are.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"topic_water, topic_finance = model.show_topics()\n\n# some pre-processing to get the topics into a format acceptable to our distance metrics\n\ndef parse_topic_string(topic):\n # takes the string returned by model.show_topics()\n # split on '+' to get the individual 'probability*word' terms\n topic = topic.split('+')\n # list to store topic bows\n topic_bow = []\n for word in topic:\n # split probability and word\n prob, word = word.split('*')\n # get rid of spaces and quote marks\n word = word.replace(\" \",\"\").replace('\"', '')\n # convert the word to its dictionary id\n word = model.id2word.doc2bow([word])[0][0]\n topic_bow.append((word, float(prob)))\n return topic_bow\n\nfinance_distribution = parse_topic_string(topic_finance[1])\nwater_distribution = parse_topic_string(topic_water[1])\n\n# the finance topic in bag of words format looks like this:\nprint(finance_distribution)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've got our topics in a format more acceptable by our functions,\nlet's use a Distance metric to see how similar the word distributions in the\ntopics are.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(hellinger(water_distribution, finance_distribution))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our value of roughly 0.36 means that the topics are not TOO distant with\nrespect to their word distributions.\n\nThis makes sense again, because of overlapping words like ``bank`` and the\nsmall size of the dictionary.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kullback-Leibler Gotchas\n------------------------\n\nIn our previous example we didn't use Kullback Leibler to test for similarity\nfor a reason - KL is not a Distance 'Metric' in the technical sense (you can\nsee what a metric is `here\n`_\\ ). Its mathematical\nnature also means we must be a little careful when using it: because it\ninvolves the log function, a zero probability can mess things up. For\nexample:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# 16 here is the number of features the probability distribution draws from\nprint(kullback_leibler(water_distribution, finance_distribution, 16))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That wasn't very helpful, right? This just means that we have to be a bit\ncareful about our inputs. Our old example didn't work out because there were\nsome missing values for some words (``show_topics()`` only returned the top\n10 words for each topic). \n\nThis can be remedied, though.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# return ALL the words in the dictionary for the topic-word distribution.\ntopic_water, topic_finance = model.show_topics(num_words=len(model.id2word))\n\n# do our bag of words transformation again\nfinance_distribution = parse_topic_string(topic_finance[1])\nwater_distribution = parse_topic_string(topic_water[1])\n\n# and voila!\nprint(kullback_leibler(water_distribution, finance_distribution))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that the distance for this is quite small, indicating a high\nsimilarity. This may be a bit off because of the small size of the corpus,\nwhere all topics are likely to contain a decent overlap of word\nprobabilities. You will likely get a better value for a bigger corpus.\n\nSo, just remember, if you intend to use KL as a metric to measure similarity\nor distance between two distributions, avoid zeros by returning the ENTIRE\ndistribution. Since it's unlikely any probability distribution will ever have\nabsolute zeros for any feature/word, returning all the values like we did\nshould keep you safe.\n\n\n"
]
},
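{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zero problem can be seen directly with a toy example (made-up distributions, not from the model above): when the second distribution assigns zero probability to a feature that the first one uses, KL blows up.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"p = [(0, 0.5), (1, 0.5)]\nq = [(0, 1.0)] # feature 1 has zero probability here\n\n# Expect a very large (or infinite) value because of log(0).\nprint(kullback_leibler(p, q, 2))"
]
},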
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are Distance Metrics?\n--------------------------\n\nHaving seen the practical usages of these measures (i.e., to find similarity),\nlet's learn a little about what exactly Distance Measures and Metrics are. \n\nI mentioned in the previous section that KL was not a distance metric. There\nare four conditions for a distance measure to be a metric:\n\n1. d(x,y) >= 0\n2. d(x,y) = 0 <=> x = y\n3. d(x,y) = d(y,x)\n4. d(x,z) <= d(x,y) + d(y,z)\n\nThat is: it must be non-negative; if x and y are the same, the distance must be\nzero; it must be symmetric; and it must obey the triangle inequality. \n\nSimple enough, right? \n\nLet's test these out for our measures.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# normal Hellinger\na = hellinger(water_distribution, finance_distribution)\nb = hellinger(finance_distribution, water_distribution)\nprint(a)\nprint(b)\nprint(a == b)\n\n# if we pass the same values, it is zero.\nprint(hellinger(water_distribution, water_distribution))\n\n# for triangle inequality let's use LDA document distributions\nprint(hellinger(lda_bow_finance, lda_bow_bank))\n\n# Triangle inequality works too!\nprint(hellinger(lda_bow_finance, lda_bow_water) + hellinger(lda_bow_water, lda_bow_bank))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So Hellinger is indeed a metric. Let's check out KL. \n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a = kullback_leibler(finance_distribution, water_distribution)\nb = kullback_leibler(water_distribution, finance_distribution)\nprint(a)\nprint(b)\nprint(a == b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We immediately notice that when we swap the arguments, the values aren't equal!\nFailing even one of the four conditions is enough for a measure not to be a metric. \n\nHowever, just because it is not a metric (strictly in the mathematical\nsense) does not mean that it is not useful for figuring out the distance between\ntwo probability distributions. KL Divergence is widely used for this purpose,\nand is probably the most 'famous' distance measure in fields like Information\nTheory.\n\nFor a nice review of the mathematical differences between Hellinger and KL,\n`this\n`__\nlink does a very good job. \n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualizing Distance Metrics\n----------------------------\n\nLet's plot a graph of our toy dataset using the popular `networkx\n`_ library. \n\nEach node will be a document, where the color of the node will be its topic\naccording to the LDA model. Edges will connect documents to each other, where\nthe *weight* of the edge will be inversely proportional to the Jaccard\ndistance between the two documents. We will also annotate the edges to further\naid visualization: **strong** edges will connect similar documents, and\n**weak (dashed)** edges will connect dissimilar documents.\n\nIn summary, similar documents will be closer together, different documents\nwill be further apart.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import itertools\nimport networkx as nx\n\ndef get_most_likely_topic(doc):\n bow = model.id2word.doc2bow(doc)\n topics, probabilities = zip(*model.get_document_topics(bow))\n max_p = max(probabilities)\n topic = topics[probabilities.index(max_p)]\n return topic\n\ndef get_node_color(i):\n return 'skyblue' if get_most_likely_topic(texts[i]) == 0 else 'pink'\n\nG = nx.Graph()\nfor i, _ in enumerate(texts):\n G.add_node(i)\n \nfor (i1, i2) in itertools.combinations(range(len(texts)), 2):\n bow1, bow2 = texts[i1], texts[i2]\n distance = jaccard(bow1, bow2)\n G.add_edge(i1, i2, weight=1/distance)\n \n#\n# https://networkx.github.io/documentation/networkx-1.9/examples/drawing/weighted_graph.html\n#\npos = nx.spring_layout(G)\n\nthreshold = 1.25\nelarge=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > threshold]\nesmall=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] <= threshold]\n\nnode_colors = [get_node_color(i) for (i, _) in enumerate(texts)]\nnx.draw_networkx_nodes(G, pos, node_size=700, node_color=node_colors)\nnx.draw_networkx_edges(G,pos,edgelist=elarge, width=2)\nnx.draw_networkx_edges(G,pos,edgelist=esmall, width=2, alpha=0.2, edge_color='b', style='dashed')\nnx.draw_networkx_labels(G, pos, font_size=20, font_family='sans-serif')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make several observations from this graph.\n\nFirst, the graph consists of two connected components (if you ignore the weak edges).\nNodes 0, 1, 2, 3, 4 (which all belong to the water topic) form the first connected component.\nThe other nodes, which all belong to the finance topic, form the second connected component.\n\nSecond, the LDA model didn't do a very good job of classifying our documents into topics.\nThere were many misclassifications, as you can confirm in the summary below:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('id\\ttopic\\tdoc')\nfor i, t in enumerate(texts):\n print('%d\\t%d\\t%s' % (i, get_most_likely_topic(t), ' '.join(t)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is mostly because the corpus used to train the LDA model is so small.\nUsing a larger corpus should give you much better results, but that is beyond\nthe scope of this tutorial.\n\nConclusion\n----------\n\nThat brings us to the end of this small tutorial.\nTo recap, here's what we covered:\n\n1. Set up a small corpus consisting of documents belonging to one of two topics\n2. Train an LDA model to distinguish between the two topics\n3. Use the model to obtain distributions for some sample words\n4. Compare the distributions to each other using a variety of distance metrics: Hellinger, Kullback-Leibler, Jaccard\n5. Discuss the concept of distance metrics in slightly more detail\n\nThe scope for adding new similarity metrics is large, as there exists an even\nlarger suite of metrics and methods to add to the ``matutils.py`` file.\nFor more details, see `Similarity Measures for Text Document Clustering\n`_\nby A. Huang.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZOْ]? ]? tutorials/run_lda.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nLDA Model\n=========\n\nIntroduces Gensim's LDA model and demonstrates its use on the NIPS corpus.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this tutorial is to demonstrate training an LDA model and\nobtaining good results.\n\nIn this tutorial we will:\n\n* Load data.\n* Pre-process data.\n* Transform documents to a vectorized form.\n* Train an LDA model.\n\nThis tutorial will **not**:\n\n* Explain how Latent Dirichlet Allocation works\n* Explain how the LDA model performs inference\n* Teach you how to use Gensim's LDA implementation in its entirety\n\nIf you are not familiar with the LDA model or how to use it in Gensim, I\nsuggest you read up on that before continuing with this tutorial. Basic\nunderstanding of the LDA model should suffice. Examples:\n\n* `Introduction to Latent Dirichlet Allocation `_\n* Gensim tutorial: `sphx_glr_auto_examples_core_run_topics_and_transformations.py`\n* Gensim's LDA model API docs: :py:class:`gensim.models.LdaModel`\n\nI would also encourage you to consider each step when applying the model to\nyour data, instead of just blindly applying my solution. The different steps\nwill depend on your data and possibly your goal with the model.\n\nData\n----\n\nI have used a corpus of NIPS papers in this tutorial, but if you're following\nthis tutorial just to learn about LDA I encourage you to consider picking a\ncorpus on a subject that you are familiar with. Qualitatively evaluating the\noutput of an LDA model is challenging and can require you to understand the\nsubject matter of your corpus (depending on your goal with the model).\n\nNIPS (Neural Information Processing Systems) is a machine learning conference\nso the subject matter should be well suited for most of the target audience\nof this tutorial. You can download the original data from Sam Roweis'\n`website `_. The code below will\nalso do that for you.\n\n.. Important::\n The corpus contains 1740 documents, and not particularly long ones.\n So keep in mind that this tutorial is not geared towards efficiency, and be\n careful before applying the code to a large dataset.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import io\nimport os.path\nimport re\nimport tarfile\n\nimport smart_open\n\ndef extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):\n fname = url.split('/')[-1]\n \n # Download the file to local storage first.\n # We can't read it on the fly because of \n # https://github.com/RaRe-Technologies/smart_open/issues/331\n if not os.path.isfile(fname):\n with smart_open.open(url, \"rb\") as fin:\n with smart_open.open(fname, 'wb') as fout:\n while True:\n buf = fin.read(io.DEFAULT_BUFFER_SIZE)\n if not buf:\n break\n fout.write(buf)\n \n with tarfile.open(fname, mode='r:gz') as tar:\n # Ignore directory entries, as well as files like README, etc.\n files = [\n m for m in tar.getmembers()\n if m.isfile() and re.search(r'nipstxt/nips\\d+/\\d+\\.txt', m.name)\n ]\n for member in sorted(files, key=lambda x: x.name):\n member_bytes = tar.extractfile(member).read()\n yield member_bytes.decode('utf-8', errors='replace')\n\ndocs = list(extract_documents())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we have a list of 1740 documents, where each document is a Unicode string. \nIf you're thinking about using your own corpus, then you need to make sure\nthat it's in the same format (list of Unicode strings) before proceeding\nwith the rest of this tutorial.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(len(docs))\nprint(docs[0][:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pre-process and vectorize the documents\n---------------------------------------\n\nAs part of preprocessing, we will:\n\n* Tokenize (split the documents into tokens).\n* Lemmatize the tokens.\n* Compute bigrams.\n* Compute a bag-of-words representation of the data.\n\nFirst we tokenize the text using a regular expression tokenizer from NLTK. We\nremove numeric tokens and tokens that are only a single character, as they\ndon't tend to be useful, and the dataset contains a lot of them.\n\n.. Important::\n\n This tutorial uses the nltk library for preprocessing, although you can\n replace it with something else if you want.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Tokenize the documents.\nfrom nltk.tokenize import RegexpTokenizer\n\n# Split the documents into tokens.\ntokenizer = RegexpTokenizer(r'\\w+')\nfor idx in range(len(docs)):\n docs[idx] = docs[idx].lower() # Convert to lowercase.\n docs[idx] = tokenizer.tokenize(docs[idx]) # Split into words.\n\n# Remove numbers, but not words that contain numbers.\ndocs = [[token for token in doc if not token.isnumeric()] for doc in docs]\n\n# Remove words that are only one character.\ndocs = [[token for token in doc if len(token) > 1] for doc in docs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a\nstemmer in this case because it produces more readable words. Output that is\neasy to read is very desirable in topic modelling.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Lemmatize the documents.\nfrom nltk.stem.wordnet import WordNetLemmatizer\n\nlemmatizer = WordNetLemmatizer()\ndocs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We find bigrams in the documents. Bigrams are pairs of adjacent words.\nUsing bigrams we can get phrases like \"machine_learning\" in our output\n(spaces are replaced with underscores); without bigrams we would only get\n\"machine\" and \"learning\".\n\nNote that in the code below, we find bigrams and then add them to the\noriginal data, because we would like to keep the words \"machine\" and\n\"learning\" as well as the bigram \"machine_learning\".\n\n.. Important::\n Computing n-grams of a large dataset can be very computationally\n and memory intensive.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Compute bigrams.\nfrom gensim.models import Phrases\n\n# Add bigrams to docs (only ones that appear 20 times or more).\nbigram = Phrases(docs, min_count=20)\nfor idx in range(len(docs)):\n    for token in bigram[docs[idx]]:\n        if '_' in token:\n            # Token is a bigram, add to document.\n            docs[idx].append(token)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We remove rare words and common words based on their *document frequency*.\nBelow we remove words that appear in fewer than 20 documents or in more than\n50% of the documents. Consider trying to remove words only based on their\nfrequency, or maybe combining that with this approach.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Remove rare and common tokens.\nfrom gensim.corpora import Dictionary\n\n# Create a dictionary representation of the documents.\ndictionary = Dictionary(docs)\n\n# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.\ndictionary.filter_extremes(no_below=20, no_above=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we transform the documents to a vectorized form. We simply compute\nthe frequency of each word, including the bigrams.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Bag-of-words representation of the documents.\ncorpus = [dictionary.doc2bow(doc) for doc in docs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how many tokens and documents we have to train on.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('Number of unique tokens: %d' % len(dictionary))\nprint('Number of documents: %d' % len(corpus))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training\n--------\n\nWe are ready to train the LDA model. We will first discuss how to set some of\nthe training parameters.\n\nFirst of all, the elephant in the room: how many topics do I need? There is\nreally no easy answer for this, it will depend on both your data and your\napplication. I have used 10 topics here because I wanted to have a few topics\nthat I could interpret and \"label\", and because that turned out to give me\nreasonably good results. You might not need to interpret all your topics, so\nyou could use a large number of topics, for example 100.\n\n``chunksize`` controls how many documents are processed at a time in the\ntraining algorithm. Increasing chunksize will speed up training, at least as\nlong as the chunk of documents easily fits into memory. I've set ``chunksize =\n2000``, which is more than the number of documents, so I process all the\ndata in one go. Chunksize can however influence the quality of the model, as\ndiscussed in Hoffman and co-authors [2], but the difference was not\nsubstantial in this case.\n\n``passes`` controls how many times we train the model on the entire corpus.\nAnother word for passes might be \"epochs\". ``iterations`` is somewhat\ntechnical, but essentially it controls how many times we repeat a particular\nloop over each document. It is important to set the number of \"passes\" and\n\"iterations\" high enough.\n\nI suggest the following way to choose iterations and passes. First, enable\nlogging (as described in many Gensim tutorials), and set ``eval_every = 1``\nin ``LdaModel``. When training the model look for a line in the log that\nlooks something like this::\n\n    2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations\n\nIf you set ``passes = 20`` you will see this line 20 times. Make sure that by\nthe final passes, most of the documents have converged. So you want to choose\nboth passes and iterations to be high enough for this to happen.\n\nWe set ``alpha = 'auto'`` and ``eta = 'auto'``. Again this is somewhat\ntechnical, but essentially we are automatically learning two parameters in\nthe model that we usually would have to specify explicitly.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train LDA model.\nfrom gensim.models import LdaModel\n\n# Set training parameters.\nnum_topics = 10\nchunksize = 2000\npasses = 20\niterations = 400\neval_every = None  # Don't evaluate model perplexity, takes too much time.\n\n# Make an index-to-word dictionary.\ntemp = dictionary[0]  # This is only to \"load\" the dictionary.\nid2word = dictionary.id2token\n\nmodel = LdaModel(\n    corpus=corpus,\n    id2word=id2word,\n    chunksize=chunksize,\n    alpha='auto',\n    eta='auto',\n    iterations=iterations,\n    num_topics=num_topics,\n    passes=passes,\n    eval_every=eval_every\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can compute the topic coherence of each topic. Below we display the\naverage topic coherence and print the topics in order of topic coherence.\n\nNote that we use the \"Umass\" topic coherence measure here (see\n:py:func:`gensim.models.ldamodel.LdaModel.top_topics`). Gensim has recently\nobtained an implementation of the \"AKSW\" topic coherence measure (see\naccompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).\n\nIf you are familiar with the subject of the articles in this dataset, you can\nsee that the topics below make a lot of sense. However, they are not without\nflaws. We can see that there is substantial overlap between some topics,\nothers are hard to interpret, and most of them have at least some terms that\nseem out of place. If you were able to do better, feel free to share your\nmethods on the blog at http://rare-technologies.com/lda-training-tips/ !\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"top_topics = model.top_topics(corpus)\n\n# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.\navg_topic_coherence = sum(t[1] for t in top_topics) / num_topics\nprint('Average topic coherence: %.4f.' % avg_topic_coherence)\n\nfrom pprint import pprint\npprint(top_topics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Things to experiment with\n-------------------------\n\n* ``no_above`` and ``no_below`` parameters in ``filter_extremes`` method.\n* Adding trigrams or even higher order n-grams.\n* Consider whether using a hold-out set or cross-validation is the way to go for you.\n* Try other datasets.\n\nWhere to go from here\n---------------------\n\n* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).\n* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).\n* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n* If you haven't already, read [1] and [2] (see references).\n\nReferences\n----------\n\n1. \"Latent Dirichlet Allocation\", Blei et al. 2003.\n2. \"Online Learning for Latent Dirichlet Allocation\", Hoffman et al. 2010.\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZOEAĽ" " tutorials/run_wmd.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nWord Mover's Distance\n=====================\n\nDemonstrates using Gensim's implementation of the Word Mover's Distance.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word Mover's Distance (WMD) is a promising new tool in machine learning that\nallows us to submit a query and return the most relevant documents. This\ntutorial introduces WMD and shows how you can compute the WMD\nbetween two documents using ``wmdistance``.\n\nWMD Basics\n----------\n\nWMD enables us to assess the \"distance\" between two documents in a meaningful\nway, even when they have no words in common. It uses `word2vec\n`_ [4] vector embeddings of\nwords. It has been shown to outperform many of the state-of-the-art methods in\n*k*\\ -nearest neighbors classification [3].\n\nWMD is illustrated below for two very similar sentences (illustration taken\nfrom `Vlad Niculae's blog\n`_\\ ). The sentences\nhave no words in common, but by matching the relevant words, WMD is able to\naccurately measure the (dis)similarity between the two sentences. The method\nalso uses the bag-of-words representation of the documents (simply put, the\nword's frequencies in the documents), noted as $d$ in the figure below. The\nintuition behind the method is that we find the minimum \"traveling distance\"\nbetween documents, in other words the most efficient way to \"move\" the\ndistribution of document 1 to the distribution of document 2.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Image from https://vene.ro/images/wmd-obama.png\nimport matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('wmd-obama.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This method was introduced in the article \"From Word Embeddings To Document\nDistances\" by Matt Kusner et al. (\\ `link to PDF\n`_\\ ). It is inspired\nby the \"Earth Mover's Distance\", and employs a solver of the \"transportation\nproblem\".\n\nIn this tutorial, we will learn how to use Gensim's WMD functionality, which\nconsists of the ``wmdistance`` method for distance computation, and the\n``WmdSimilarity`` class for corpus based similarity queries.\n\n.. Important::\n If you use Gensim's WMD functionality, please consider citing [1], [2] and [3].\n\nComputing the Word Mover's Distance\n-----------------------------------\n\nTo use WMD, you need some existing word embeddings.\nYou could train your own Word2Vec model, but that is beyond the scope of this tutorial\n(check out `sphx_glr_auto_examples_tutorials_run_word2vec.py` if you're interested).\nFor this tutorial, we'll be using an existing Word2Vec model.\n\nLet's take some sentences to compute the distance between.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Initialize logging.\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n\nsentence_obama = 'Obama speaks to the media in Illinois'\nsentence_president = 'The president greets the press in Chicago'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These sentences have very similar content, and as such the WMD should be low.\nBefore we compute the WMD, we want to remove stopwords (\"the\", \"to\", etc.),\nas these do not contribute a lot to the information in the sentences.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import and download stopwords from NLTK.\nfrom nltk.corpus import stopwords\nfrom nltk import download\ndownload('stopwords') # Download stopwords list.\nstop_words = stopwords.words('english')\n\ndef preprocess(sentence):\n return [w for w in sentence.lower().split() if w not in stop_words]\n\nsentence_obama = preprocess(sentence_obama)\nsentence_president = preprocess(sentence_president)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, as mentioned earlier, we will be using some downloaded pre-trained\nembeddings. We load these into a Gensim Word2Vec model class.\n\n.. Important::\n The embeddings we have chosen here require a lot of memory.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim.downloader as api\nmodel = api.load('word2vec-google-news-300')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So let's compute WMD using the ``wmdistance`` method.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"distance = model.wmdistance(sentence_obama, sentence_president)\nprint('distance = %.4f' % distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sentence_orange = preprocess('Oranges are my favorite fruit')\ndistance = model.wmdistance(sentence_obama, sentence_orange)\nprint('distance = %.4f' % distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing word2vec vectors\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nWhen using the ``wmdistance`` method, it is beneficial to normalize the\nword2vec vectors first, so they all have equal length. To do this, simply\ncall ``model.init_sims(replace=True)`` and Gensim will take care of that for\nyou.\n\nUsually, one measures the distance between two word2vec vectors using the\ncosine distance (see `cosine similarity\n`_\\ ), which measures the\nangle between vectors. WMD, on the other hand, uses the Euclidean distance.\nThe Euclidean distance between two vectors might be large because their\nlengths differ, but the cosine distance is small because the angle between\nthem is small; we can mitigate some of this by normalizing the vectors.\n\n.. Important::\n Note that normalizing the vectors can take some time, especially if you have\n a large vocabulary and/or large vectors.\n\n\n"
]
},
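{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small numeric sketch (using numpy; not part of the original tutorial): for unit-length vectors, squared Euclidean distance is a monotonic function of cosine similarity, since ``||u - v||^2 = 2 - 2 * cos(u, v)``. This is why normalizing the word vectors makes WMD's Euclidean ground distance behave more like the familiar cosine distance.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\nu = np.array([3.0, 4.0])\nv = np.array([8.0, 6.0])\n\n# Normalize both vectors to unit length.\nu = u / np.linalg.norm(u)\nv = v / np.linalg.norm(v)\n\ncosine_similarity = u.dot(v)\nsquared_euclidean = np.sum((u - v) ** 2)\nprint(np.isclose(squared_euclidean, 2 - 2 * cosine_similarity))  # True"
]
},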
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.\n\ndistance = model.wmdistance(sentence_obama, sentence_president)  # Compute WMD as normal.\nprint('distance = %.4f' % distance)\n\ndistance = model.wmdistance(sentence_obama, sentence_orange)\nprint('distance = %.4f' % distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"References\n----------\n\n1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*\\ , 2008.\n2. Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*\\ , 2009.\n3. Matt Kusner et al. *From Word Embeddings To Document Distances*\\ , 2015.\n4. Tomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*\\ , 2013.\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZO{nL nL tutorials/run_doc2vec_lee.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nDoc2Vec Model\n=============\n\nIntroduces Gensim's Doc2Vec model and demonstrates its use on the Lee Corpus.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Doc2Vec is a `core_concepts_model` that represents each\n`core_concepts_document` as a `core_concepts_vector`. This\ntutorial introduces the model and demonstrates how to train and assess it.\n\nHere's a list of what we'll be doing:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\nReview: Bag-of-words\n--------------------\n\n.. Note:: Feel free to skip these review sections if you're already familiar with the models.\n\nYou may be familiar with the `bag-of-words model\n`_ from the\n`core_concepts_vector` section.\nThis model transforms each document to a fixed-length vector of integers.\nFor example, given the sentences:\n\n- ``John likes to watch movies. Mary likes movies too.``\n- ``John also likes to watch football games. Mary hates football.``\n\nThe model outputs the vectors:\n\n- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n\nEach vector has 11 elements, where each element counts the number of times a\nparticular word occurred in the document.\nThe order of elements is arbitrary.\nIn the example above, the order of the elements corresponds to the words:\n``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n\nBag-of-words models are surprisingly effective, but have several weaknesses.\n\nFirst, they lose all information about word order: \"John likes Mary\" and\n\"Mary likes John\" correspond to identical vectors. There is a solution: bag\nof `n-grams `__\nmodels consider word phrases of length *n*, representing documents as\nfixed-length vectors that capture local word order, but they suffer from data\nsparsity and high dimensionality.\n\nSecond, the model does not attempt to learn the meaning of the underlying\nwords, and as a consequence, the distance between vectors doesn't always\nreflect the difference in meaning. The ``Word2Vec`` model addresses this\nsecond problem.\n\nReview: ``Word2Vec`` Model\n--------------------------\n\n``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\nvector space using a shallow neural network. The result is a set of\nword-vectors where vectors close together in vector space have similar\nmeanings based on context, and word-vectors distant from each other have\ndiffering meanings. For example, ``strong`` and ``powerful`` would be close\ntogether and ``strong`` and ``Paris`` would be relatively far.\n\nGensim's :py:class:`~gensim.models.word2vec.Word2Vec` class implements this model.\n\nWith the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.\nBut what if we want to calculate a vector for the **entire document**\\ ?\nWe could average the vectors for each word in the document - while this is quick and crude, it can often be useful.\nHowever, there is a better way...\n\nIntroducing: Paragraph Vector\n-----------------------------\n\n.. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.\n\nLe and Mikolov in 2014 introduced the `Doc2Vec algorithm `__, which usually outperforms such simple averaging of ``Word2Vec`` vectors.\n\nThe basic idea is: act as if a document has another floating word-like\nvector, which contributes to all training predictions, and is updated like\nother word-vectors, but we will call it a doc-vector. Gensim's\n:py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.\n\nThere are two implementations:\n\n1. Paragraph Vector - Distributed Memory (PV-DM)\n2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n\n.. Important::\n    Don't let the implementation details below scare you.\n    They're advanced material: if it's too much, then move on to the next section.\n\nPV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a center word based on an\naverage of both context word-vectors and the full document's doc-vector.\n\nPV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a target word just from\nthe full document's doc-vector. (It is also common to combine this with\nskip-gram training, using both the doc-vector and nearby word-vectors to\npredict a single target word, but only one at a time.)\n\nPrepare the Training and Test Data\n----------------------------------\n\nFor this tutorial, we'll be training our model using the `Lee Background\nCorpus\n`_\nincluded in gensim. This corpus contains 314 documents selected from the\nAustralian Broadcasting Corporation\u2019s news mail service, which provides text\ne-mails of headline stories and covers a number of broad topics.\n\nAnd we'll test our model by eye using the much shorter `Lee Corpus\n`_\nwhich contains 50 documents.\n\n\n"
]
},
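{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside (not part of the original tutorial), the first bag-of-words count vector from the review section above can be reproduced with a plain ``Counter``:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reproduce the first bag-of-words vector from the example above.\nfrom collections import Counter\n\nvocab = [\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]\nsentence = \"John likes to watch movies. Mary likes movies too.\"\ncounts = Counter(sentence.replace('.', '').split())\nprint([counts[w] for w in vocab])  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]"
]
},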
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\nimport gensim\n# Set file names for train and test data\ntest_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\nlee_train_file = os.path.join(test_data_dir, 'lee_background.cor')\nlee_test_file = os.path.join(test_data_dir, 'lee.cor')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a Function to Read and Preprocess Text\n---------------------------------------------\n\nBelow, we define a function to:\n\n- open the train/test file (with latin encoding)\n- read the file line-by-line\n- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc)\n\nThe file we're reading is a **corpus**.\nEach line of the file is a **document**.\n\n.. Important::\n To train the model, we'll need to associate a tag/number with each document\n of the training corpus. In our case, the tag is simply the zero-based line\n number.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import smart_open\n\ndef read_corpus(fname, tokens_only=False):\n with smart_open.open(fname, encoding=\"iso-8859-1\") as f:\n for i, line in enumerate(f):\n tokens = gensim.utils.simple_preprocess(line)\n if tokens_only:\n yield tokens\n else:\n # For training data, add tags\n yield gensim.models.doc2vec.TaggedDocument(tokens, [i])\n\ntrain_corpus = list(read_corpus(lee_train_file))\ntest_corpus = list(read_corpus(lee_test_file, tokens_only=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the training corpus\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(train_corpus[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the testing corpus looks like this:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(test_corpus[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the testing corpus is just a list of lists and does not contain\nany tags.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training the Model\n------------------\n\nNow, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions and\niterate over the training corpus 40 times. We set the minimum word count to\n2 in order to discard words with very few occurrences. (Without a variety of\nrepresentative examples, retaining such infrequent words can often make a\nmodel worse!) Typical iteration counts in the published `Paragraph Vector paper `__\nresults, using 10s-of-thousands to millions of docs, are 10-20. More\niterations take more time and eventually reach a point of diminishing\nreturns.\n\nHowever, this is a very small dataset (300 documents) with shortish\ndocuments (a few hundred words). Adding training passes can sometimes help\nwith such small datasets.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build a vocabulary\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.build_vocab(train_corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Essentially, the vocabulary is a dictionary (accessible via\n``model.wv.vocab``\\ ) of all of the unique words extracted from the training\ncorpus along with the count (e.g., ``model.wv.vocab['penalty'].count`` for\ncounts for the word ``penalty``\\ ).\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, train the model on the corpus.\nIf the BLAS library is being used, this should take no more than 3 seconds.\nIf the BLAS library is not being used, this should take no more than 2\nminutes, so use BLAS if you value your time.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can use the trained model to infer a vector for any piece of text\nby passing a list of words to the ``model.infer_vector`` function. This\nvector can then be compared with other vectors via cosine similarity.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])\nprint(vector)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that ``infer_vector()`` does *not* take a string, but rather a list of\nstring tokens, which should have already been tokenized the same way as the\n``words`` property of original training document objects.\n\nAlso note that because the underlying training/inference algorithms are an\niterative approximation problem that makes use of internal randomization,\nrepeated inferences of the same text will return slightly different vectors.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assessing the Model\n-------------------\n\nTo assess our new model, we'll first infer new vectors for each document of\nthe training corpus, compare the inferred vectors with the training corpus,\nand then return the rank of the document based on self-similarity.\nBasically, we're pretending as if the training corpus is some new unseen data\nand then seeing how it compares with the trained model. The expectation is\nthat we've likely overfit our model (i.e., all of the ranks will be less than\n2) and so we should be able to find similar documents very easily.\nAdditionally, we'll keep track of the second ranks for a comparison of less\nsimilar documents.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ranks = []\nsecond_ranks = []\nfor doc_id in range(len(train_corpus)):\n inferred_vector = model.infer_vector(train_corpus[doc_id].words)\n sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))\n rank = [docid for docid, sim in sims].index(doc_id)\n ranks.append(rank)\n\n second_ranks.append(sims[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's count how each document ranks with respect to the training corpus.\n\nNB. Results vary between runs due to random seeding and the very small corpus.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n\ncounter = collections.Counter(ranks)\nprint(counter)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basically, more than 95% of the inferred documents are found to be most\nsimilar to themselves, while about 5% of the time a document is mistakenly\nfound most similar to another document. Checking the inferred-vector against a\ntraining-vector is a sort of 'sanity check' as to whether the model is\nbehaving in a usefully consistent manner, though not a real 'accuracy' value.\n\nThis is great and not entirely surprising. We can take a look at an example:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('Document ({}): \u00ab{}\u00bb\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\nprint(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\nfor label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n print(u'%s %s: \u00ab%s\u00bb\\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice above that the most similar document (usually the same text) has a\nsimilarity score approaching 1.0. However, the similarity score for the\nsecond-ranked documents should be significantly lower (assuming the documents\nare in fact different) and the reasoning becomes obvious when we examine the\ntext itself.\n\nWe can run the next cell repeatedly to see a sampling of other target-document\ncomparisons.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pick a random document from the corpus and infer a vector from the model\nimport random\ndoc_id = random.randint(0, len(train_corpus) - 1)\n\n# Compare and print the second-most-similar document\nprint('Train Document ({}): \u00ab{}\u00bb\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\nsim_id = second_ranks[doc_id]\nprint('Similar Document {}: \u00ab{}\u00bb\\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Testing the Model\n-----------------\n\nUsing the same approach as above, we'll infer the vector for a randomly chosen\ntest document, and compare the document to our model by eye.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pick a random document from the test corpus and infer a vector from the model\ndoc_id = random.randint(0, len(test_corpus) - 1)\ninferred_vector = model.infer_vector(test_corpus[doc_id])\nsims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))\n\n# Compare and print the most/median/least similar documents from the train corpus\nprint('Test Document ({}): \u00ab{}\u00bb\\n'.format(doc_id, ' '.join(test_corpus[doc_id])))\nprint(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\nfor label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n print(u'%s %s: \u00ab%s\u00bb\\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion\n----------\n\nLet's review what we've seen in this tutorial:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\nThat's it! Doc2Vec is a great way to explore relationships between documents.\n\nAdditional Resources\n--------------------\n\nIf you'd like to know more about the subject matter of this tutorial, check out the links below.\n\n* `Word2Vec Paper `_\n* `Doc2Vec Paper `_\n* `Dr. Michael D. Lee's Website `_\n* `Lee Corpus `__\n* `IMDB Doc2Vec Tutorial `_\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZOtRg, , $ tutorials/run_pivoted_doc_norm.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nPivoted Document Length Normalization\n=====================================\n\nThis tutorial demonstrates using Pivoted Document Length Normalization to\ncounter the effect of short document bias when working with TfIdf, thereby\nincreasing the classification accuracy.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In many cases, normalizing the tfidf weights for each term biases the weights in favor of terms from shorter documents. The *pivoted document length normalization* scheme counters this short-document bias by making tfidf less dependent on document length.\n\nThis is achieved by *tilting* the normalization curve at a user-defined pivot point, with a given slope, roughly following the equation:\n\n``pivoted_norm = (1 - slope) * pivot + slope * old_norm``\n\nThis scheme was proposed in the paper `Pivoted Document Length Normalization `_ by Singhal, Buckley and Mitra.\n\nOverall, this approach can increase the accuracy of a model trained on a corpus whose document lengths vary widely.\n\nIntroduction\n------------\n\nThis guide demonstrates how to perform pivoted document length normalization.\n\nWe will train a logistic regression to distinguish between text from two different newsgroups.\n\nOur results will show that using pivoted document length normalization yields a better model (higher classification accuracy).\n\n\n"
]
},
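{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for the formula, here is a minimal numeric sketch (illustrative numbers only, not part of gensim). With ``pivot = 10`` and ``slope = 0.5``, documents whose old (cosine) norm is below the pivot get a *larger* normalizer, documents above the pivot get a *smaller* one, and a document exactly at the pivot is unchanged. This is precisely how the scheme counters the short-document bias:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy illustration of ``pivoted_norm = (1 - slope) * pivot + slope * old_norm``.\n# ``old_norm`` stands in for the cosine norm of a document's tfidf vector.\npivot, slope = 10, 0.5\nfor old_norm in (2.0, 10.0, 40.0):\n    pivoted_norm = (1 - slope) * pivot + slope * old_norm\n    print('old norm %5.1f  ->  pivoted norm %5.1f' % (old_norm, pivoted_norm))"
]
},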
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#\n# Download our dataset\n#\nimport gensim.downloader as api\nnws = api.load(\"20-newsgroups\")\n\n#\n# Pick texts from relevant newsgroups, split into training and test set.\n#\ncat1, cat2 = ('sci.electronics', 'sci.space')\n\n#\n# X_* contain the actual texts as strings.\n# y_* contain labels, 0 for cat1 (sci.electronics) and 1 for cat2 (sci.space)\n#\nX_train = []\nX_test = []\ny_train = []\ny_test = []\n\nfor item in nws:\n    if item[\"set\"] == \"train\" and item[\"topic\"] == cat1:\n        X_train.append(item[\"data\"])\n        y_train.append(0)\n    elif item[\"set\"] == \"train\" and item[\"topic\"] == cat2:\n        X_train.append(item[\"data\"])\n        y_train.append(1)\n    elif item[\"set\"] == \"test\" and item[\"topic\"] == cat1:\n        X_test.append(item[\"data\"])\n        y_test.append(0)\n    elif item[\"set\"] == \"test\" and item[\"topic\"] == cat2:\n        X_test.append(item[\"data\"])\n        y_test.append(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocess the data\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.parsing.preprocessing import preprocess_string\nfrom gensim.corpora import Dictionary\n\nid2word = Dictionary([preprocess_string(doc) for doc in X_train])\ntrain_corpus = [id2word.doc2bow(preprocess_string(doc)) for doc in X_train]\ntest_corpus = [id2word.doc2bow(preprocess_string(doc)) for doc in X_test]\n\nprint(len(X_train), len(X_test))\n\n# We perform our analysis on the top k documents, roughly the top 10% of the test set by score\nk = len(X_test) // 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare our evaluation function\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.sklearn_api.tfidf import TfIdfTransformer\nfrom sklearn.linear_model import LogisticRegression\nfrom gensim.matutils import corpus2csc\n\n# This function returns the model accuracy and individual document confidence scores using\n# gensim's TfIdfTransformer and sklearn's LogisticRegression\ndef get_tfidf_scores(kwargs):\n    tfidf_transformer = TfIdfTransformer(**kwargs).fit(train_corpus)\n\n    X_train_tfidf = corpus2csc(tfidf_transformer.transform(train_corpus), num_terms=len(id2word)).T\n    X_test_tfidf = corpus2csc(tfidf_transformer.transform(test_corpus), num_terms=len(id2word)).T\n\n    clf = LogisticRegression().fit(X_train_tfidf, y_train)\n\n    model_accuracy = clf.score(X_test_tfidf, y_test)\n    doc_scores = clf.decision_function(X_test_tfidf)\n\n    return model_accuracy, doc_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get TFIDF scores for corpus without pivoted document length normalisation\n-------------------------------------------------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"params = {}\nmodel_accuracy, doc_scores = get_tfidf_scores(params)\nprint(model_accuracy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the bias towards shorter documents\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\n# Sort the documents by score and return the sorted scores together with\n# the corresponding document lengths.\ndef sort_length_by_score(doc_scores, X_test):\n    doc_scores = sorted(enumerate(doc_scores), key=lambda x: x[1])\n    doc_leng = np.empty(len(doc_scores))\n    ds = np.empty(len(doc_scores))\n\n    for i, (doc_index, score) in enumerate(doc_scores):\n        doc_leng[i] = len(X_test[doc_index])\n        ds[i] = score\n\n    return ds, doc_leng\n\n\nprint(\n    \"Normal cosine normalisation favors short documents: our top {} \"\n    \"docs have a smaller mean doc length of {:.3f} compared to the corpus mean doc length of {:.3f}\"\n    .format(\n        k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(),\n        sort_length_by_score(doc_scores, X_test)[1].mean()\n    )\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get TFIDF scores for corpus with pivoted document length normalisation\n----------------------------------------------------------------------\n\nTest various values of alpha (slope) and pick the best one.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"best_model_accuracy = 0\noptimum_slope = 0\nfor slope in np.arange(0, 1.1, 0.1):\n    params = {\"pivot\": 10, \"slope\": slope}\n\n    model_accuracy, doc_scores = get_tfidf_scores(params)\n\n    if model_accuracy > best_model_accuracy:\n        best_model_accuracy = model_accuracy\n        optimum_slope = slope\n\n    print(\"Score for slope {} is {}\".format(slope, model_accuracy))\n\nprint(\"We get the best score of {} at slope {}\".format(best_model_accuracy, optimum_slope))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate the model with optimum slope\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"params = {\"pivot\": 10, \"slope\": optimum_slope}\nmodel_accuracy, doc_scores = get_tfidf_scores(params)\nprint(model_accuracy)\n\nprint(\n \"With pivoted normalisation top {} docs have mean length of {:.3f} \"\n \"which is much closer to the corpus mean doc length of {:.3f}\"\n .format(\n k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(), \n sort_length_by_score(doc_scores, X_test)[1].mean()\n )\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualizing the pivoted normalization\n-------------------------------------\n\nCosine normalization favors the retrieval of short documents. In the plot below,\nwhen the slope is 1 (i.e. pivoted normalisation is not applied), short documents\nwith a length of around 500 receive very good scores, so the bias towards short\ndocuments is clearly visible. As we vary the slope from 1 towards 0, we introduce\na counteracting bias towards long documents. At some intermediate slope the two\neffects balance out and the overall accuracy of the model is at its highest; in\nour experiment this optimum is at a slope of 0.2.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as py\n\nbest_model_accuracy = 0\noptimum_slope = 0\n\nf, axarr = py.subplots(1, 2, figsize=(15, 7))\n\nfor it, slope in enumerate([1, 0.2]):\n    params = {\"pivot\": 10, \"slope\": slope}\n\n    model_accuracy, doc_scores = get_tfidf_scores(params)\n\n    if model_accuracy > best_model_accuracy:\n        best_model_accuracy = model_accuracy\n        optimum_slope = slope\n\n    doc_scores, doc_leng = sort_length_by_score(doc_scores, X_test)\n\n    y = abs(doc_scores[:k, np.newaxis])\n    x = doc_leng[:k, np.newaxis]\n\n    py.subplot(1, 2, it + 1).bar(x, y, width=20, linewidth=0)\n    py.title(\"slope = \" + str(slope) + \" Model accuracy = \" + str(model_accuracy))\n    py.ylim([0, 4.5])\n    py.xlim([0, 3200])\n    py.xlabel(\"document length\")\n    py.ylabel(\"confidence score\")\n\npy.tight_layout()\npy.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above histogram plot helps us visualize the effect of ``slope``. For the top\nk documents we have document length on the x axis and their respective scores\nof belonging to a specific class on the y axis.\n\nAs we decrease the slope, the density of bins shifts from short documents\n(lengths of roughly 250-500) towards documents longer than ~500. This suggests\nthat the positive bias towards short documents seen at ``slope=1`` (i.e. with\nregular tfidf) is reduced. We get the optimum slope, i.e. the maximum model\naccuracy, at ``slope=0.2``.\n\nConclusion\n==========\n\nUsing pivoted document normalization improved the classification accuracy significantly:\n\n* Before (slope=1, identical to default cosine normalization): 0.9682\n* After (slope=0.2): 0.9771\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK PRaO@UX߄ ߄ tutorials/run_word2vec.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nWord2Vec Model\n==============\n\nIntroduces Gensim's Word2Vec model and demonstrates its use on the Lee Corpus.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case you missed the buzz, word2vec is widely featured as a member of the\n\u201cnew wave\u201d of machine learning algorithms based on neural networks, commonly\nreferred to as \"deep learning\" (though word2vec itself is rather shallow).\nUsing large amounts of unannotated plain text, word2vec learns relationships\nbetween words automatically. The output is a set of vectors, one vector per word,\nwith remarkable linear relationships that allow us to do things like:\n\n* vec(\"king\") - vec(\"man\") + vec(\"woman\") =~ vec(\"queen\")\n* vec(\"Montreal Canadiens\") \u2013 vec(\"Montreal\") + vec(\"Toronto\") =~ vec(\"Toronto Maple Leafs\").\n\nWord2vec is very useful in `automatic text tagging\n`_\\ , recommender\nsystems and machine translation.\n\nThis tutorial:\n\n#. Introduces ``Word2Vec`` as an improvement over traditional bag-of-words\n#. Shows off a demo of ``Word2Vec`` using a pre-trained model\n#. Demonstrates training a new model from your own data\n#. Demonstrates loading and saving models\n#. Introduces several training parameters and demonstrates their effect\n#. Discusses memory requirements\n#. Visualizes Word2Vec embeddings by applying dimensionality reduction\n\nReview: Bag-of-words\n--------------------\n\n.. Note:: Feel free to skip these review sections if you're already familiar with the models.\n\nYou may be familiar with the `bag-of-words model\n`_ from the\n`core_concepts_vector` section.\nThis model transforms each document to a fixed-length vector of integers.\nFor example, given the sentences:\n\n- ``John likes to watch movies. Mary likes movies too.``\n- ``John also likes to watch football games. 
Mary hates football.``\n\nThe model outputs the vectors:\n\n- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n\nEach vector has 11 elements, where each element counts the number of times a\nparticular word occurred in the document.\nThe order of elements is arbitrary.\nIn the example above, the order of the elements corresponds to the words:\n``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n\nBag-of-words models are surprisingly effective, but have several weaknesses.\n\nFirst, they lose all information about word order: \"John likes Mary\" and\n\"Mary likes John\" correspond to identical vectors. There is a solution: bag\nof `n-grams `__\nmodels represent documents as fixed-length vectors of word phrases of length\nn, capturing local word order at the cost of data sparsity and high\ndimensionality.\n\nSecond, the model does not attempt to learn the meaning of the underlying\nwords, and as a consequence, the distance between vectors doesn't always\nreflect the difference in meaning. The ``Word2Vec`` model addresses this\nsecond problem.\n\nIntroducing: the ``Word2Vec`` Model\n-----------------------------------\n\n``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\nvector space using a shallow neural network. The result is a set of\nword-vectors where vectors close together in vector space have similar\nmeanings based on context, and word-vectors distant to each other have\ndiffering meanings. For example, ``strong`` and ``powerful`` would be close\ntogether and ``strong`` and ``Paris`` would be relatively far.\n\nThere are two versions of this model, and the :py:class:`~gensim.models.word2vec.Word2Vec`\nclass implements them both:\n\n1. Skip-grams (SG)\n2. Continuous-bag-of-words (CBOW)\n\n.. 
Important::\n Don't let the implementation details below scare you.\n They're advanced material: if it's too much, then move on to the next section.\n\nThe `Word2Vec Skip-gram `__\nmodel, for example, takes in pairs (word1, word2) generated by moving a\nwindow across text data, and trains a 1-hidden-layer neural network based on\nthe synthetic task of predicting, for a given input word, a probability\ndistribution over the words near it. A virtual `one-hot\n`__ encoding of words\ngoes through a 'projection layer' to the hidden layer; these projection\nweights are later interpreted as the word embeddings. So if the hidden layer\nhas 300 neurons, this network will give us 300-dimensional word embeddings.\n\nContinuous-bag-of-words Word2vec is very similar to the skip-gram model. It\nis also a 1-hidden-layer neural network. The synthetic training task now uses\nthe average of multiple input context words, rather than a single word as in\nskip-gram, to predict the center word. Again, the projection weights that\nturn one-hot words into averageable vectors, of the same width as the hidden\nlayer, are interpreted as the word embeddings.\n\n\n"
]
},
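{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the two vectors above can be reproduced with a few lines of plain Python (a sketch independent of gensim; the vocabulary order is the one listed above):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Reproduce the bag-of-words vectors from the review section above.\nvocab = ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too',\n         'also', 'football', 'games', 'hates']\n\ndef bow(doc):\n    # strip the full stops, split on whitespace, count each vocabulary word\n    tokens = doc.replace('.', '').split()\n    return [tokens.count(word) for word in vocab]\n\nprint(bow('John likes to watch movies. Mary likes movies too.'))\nprint(bow('John also likes to watch football games. Mary hates football.'))"
]
},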
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word2Vec Demo\n-------------\n\nTo see what ``Word2Vec`` can do, let's download a pre-trained model and play\naround with it. We will fetch the Word2Vec model trained on part of the\nGoogle News dataset, covering approximately 3 million words and phrases. Such\na model can take hours to train, but since it's already available,\ndownloading and loading it with Gensim takes minutes.\n\n.. Important::\n The model is approximately 2GB, so you'll need a decent network connection\n to proceed. Otherwise, skip ahead to the \"Training Your Own Model\" section\n below.\n\nYou may also check out an `online word2vec demo\n`_ where you can try\nthis vector algebra for yourself. That demo runs ``word2vec`` on the\n**entire** Google News dataset, of **about 100 billion words**.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim.downloader as api\nwv = api.load('word2vec-google-news-300')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common operation is to retrieve the vocabulary of a model. That is trivial:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for i, word in enumerate(wv.vocab):\n if i == 10:\n break\n print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can easily obtain vectors for terms the model is familiar with:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"vec_king = wv['king']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, the model is unable to infer vectors for unfamiliar words.\nThis is one limitation of Word2Vec: if this limitation matters to you, check\nout the FastText model.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"try:\n vec_cameroon = wv['cameroon']\nexcept KeyError:\n print(\"The word 'cameroon' does not appear in this model\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Moving on, ``Word2Vec`` supports several word similarity tasks out of the\nbox. You can see how the similarity intuitively decreases as the words get\nless and less similar.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pairs = [\n ('car', 'minivan'), # a minivan is a kind of car\n ('car', 'bicycle'), # still a wheeled vehicle\n ('car', 'airplane'), # ok, no wheels, but still a vehicle\n ('car', 'cereal'), # ... and so on\n ('car', 'communism'),\n]\nfor w1, w2 in pairs:\n print('%r\\t%r\\t%.2f' % (w1, w2, wv.similarity(w1, w2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the 5 most similar words to \"car\" or \"minivan\"\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(wv.most_similar(positive=['car', 'minivan'], topn=5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which of the below does not belong in the sequence?\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training Your Own Model\n-----------------------\n\nTo start, you'll need some data for training the model. For the following\nexamples, we'll use the `Lee Corpus\n`_\n(which you already have if you've installed gensim).\n\nThis corpus is small enough to fit entirely in memory, but we'll implement a\nmemory-friendly iterator that reads it line-by-line to demonstrate how you\nwould handle a larger corpus.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.test.utils import datapath\nfrom gensim import utils\n\nclass MyCorpus(object):\n    \"\"\"An iterator that yields sentences (lists of str).\"\"\"\n\n    def __iter__(self):\n        corpus_path = datapath('lee_background.cor')\n        for line in open(corpus_path):\n            # assume there's one document per line, tokens separated by whitespace\n            yield utils.simple_preprocess(line)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to do any custom preprocessing, e.g. decode a non-standard\nencoding, lowercase, remove numbers, extract named entities... All of this can\nbe done inside the ``MyCorpus`` iterator and ``word2vec`` doesn\u2019t need to\nknow. All that is required is that the input yields one sentence (list of\nutf8 words) after another.\n\nLet's go ahead and train a model on our corpus. Don't worry about the\ntraining parameters much for now, we'll revisit them later.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim.models\n\nsentences = MyCorpus()\nmodel = gensim.models.Word2Vec(sentences=sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have our model, we can use it in the same way as in the demo above.\n\nThe main part of the model is ``model.wv``\\ , where \"wv\" stands for \"word vectors\".\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"vec_king = model.wv['king']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieving the vocabulary works the same way:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for i, word in enumerate(model.wv.vocab):\n if i == 10:\n break\n print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Storing and loading models\n--------------------------\n\nYou'll notice that training non-trivial models can take time. Once you've\ntrained your model and it works as expected, you can save it to disk. That\nway, you don't have to spend time training it all over again later.\n\nYou can store/load models using the standard gensim methods:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import tempfile\n\nwith tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:\n temporary_filepath = tmp.name\n model.save(temporary_filepath)\n #\n # The model is now safely stored in the filepath.\n # You can copy it to other machines, share it with others, etc.\n #\n # To load a saved model:\n #\n new_model = gensim.models.Word2Vec.load(temporary_filepath)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"which uses pickle internally, optionally ``mmap``\\ \u2018ing the model\u2019s internal\nlarge NumPy matrices into virtual memory directly from disk files, for\ninter-process memory sharing.\n\nIn addition, you can load models created by the original C tool, both using\nits text and binary formats::\n\n model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)\n # using gzipped/bz2 input works too, no need to unzip\n model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training Parameters\n-------------------\n\n``Word2Vec`` accepts several parameters that affect both training speed and quality.\n\nmin_count\n---------\n\n``min_count`` is for pruning the internal dictionary. Words that appear only\nonce or twice in a billion-word corpus are probably uninteresting typos and\ngarbage. In addition, there\u2019s not enough data to make any meaningful training\non those words, so it\u2019s best to ignore them.\n\nThe default value is ``min_count=5``:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = gensim.models.Word2Vec(sentences, min_count=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"size\n----\n\n``size`` is the number of dimensions (N) of the N-dimensional space that\ngensim Word2Vec maps the words onto.\n\nBigger size values require more training data, but can lead to better (more\naccurate) models. Reasonable values are in the tens to hundreds.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# default value of size=100\nmodel = gensim.models.Word2Vec(sentences, size=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"workers\n-------\n\n``workers`` , the last of the major parameters (full list `here\n`_)\nis for training parallelization, to speed up training:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# default value of workers=3\nmodel = gensim.models.Word2Vec(sentences, workers=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``workers`` parameter only has an effect if you have `Cython\n`_ installed. Without Cython, you\u2019ll only be able to use\none core because of the `GIL\n`_ (and ``word2vec``\ntraining will be `miserably slow\n`_\\ ).\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Memory\n------\n\nAt its core, ``word2vec`` model parameters are stored as matrices (NumPy\narrays). Each array is **#vocabulary** (controlled by min_count parameter)\ntimes **#size** (size parameter) of floats (single precision aka 4 bytes).\n\nThree such matrices are held in RAM (work is underway to reduce that number\nto two, or even one). So if your input contains 100,000 unique words, and you\nasked for layer ``size=200``\\ , the model will require approx.\n``100,000*200*4*3 bytes = ~229MB``.\n\nThere\u2019s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.\n\n\n"
]
},
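{
"cell_type": "markdown",
"metadata": {},
"source": [
"The back-of-the-envelope estimate above is easy to reproduce yourself (a sketch; the factor of three reflects the three matrices mentioned, and the vocabulary tree is excluded):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rough memory estimate for the model's matrices.\nvocab_size = 100000   # unique words surviving min_count\nvector_size = 200     # the ``size`` parameter\nbytes_per_float = 4   # single precision\nnum_matrices = 3      # matrices held in RAM during training\n\ntotal_bytes = vocab_size * vector_size * bytes_per_float * num_matrices\nprint('approx. %.0f MB' % (total_bytes / 1024 ** 2))"
]
},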
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluating\n----------\n\n``Word2Vec`` training is an unsupervised task, so there\u2019s no good way to\nobjectively evaluate the result. Evaluation depends on your end application.\n\nGoogle has released their testing set of about 20,000 syntactic and semantic\ntest examples, following the \u201cA is to B as C is to D\u201d task. It is provided in\nthe 'datasets' folder.\n\nFor example, a syntactic analogy of the comparative type is bad:worse;good:?.\nThere are a total of 9 types of syntactic comparisons in the dataset, such as\nplural nouns and nouns of opposite meaning.\n\nThe semantic questions contain five types of semantic analogies, such as\ncapital cities (Paris:France;Tokyo:?) or family members\n(brother:sister;dad:?).\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gensim supports the same evaluation set, in exactly the same format:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.accuracy('./datasets/questions-words.txt')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This ``accuracy`` takes an `optional parameter\n`_\n``restrict_vocab`` which limits which test examples are to be considered.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.\n\nBy default it uses the academic WS-353 dataset, but you can create a dataset\nspecific to your business based on it. The dataset contains word pairs together\nwith human-assigned similarity judgments, measuring the relatedness or\nco-occurrence of two words. For example, 'coast' and 'shore' are very similar\nas they appear in the same context, while 'clothes' and 'closet'\nare less similar because they are related but not interchangeable.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.evaluate_word_pairs(datapath('wordsim353.tsv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".. Important::\n Good performance on Google's or WS-353 test set doesn\u2019t mean word2vec will\n work well in your application, or vice versa. It\u2019s always best to evaluate\n directly on your intended task. For an example of how to use word2vec in a\n classifier pipeline, see this `tutorial\n `_.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Online training / Resuming training\n-----------------------------------\n\nAdvanced users can load a model and continue training it with more sentences\nand `new vocabulary words `_:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = gensim.models.Word2Vec.load(temporary_filepath)\nmore_sentences = [\n ['Advanced', 'users', 'can', 'load', 'a', 'model',\n 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']\n]\nmodel.build_vocab(more_sentences, update=True)\nmodel.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)\n\n# cleaning up temporary file\nimport os\nos.remove(temporary_filepath)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may need to tweak the ``total_words`` parameter to ``train()``,\ndepending on what learning rate decay you want to simulate.\n\nNote that it\u2019s not possible to resume training with models generated by the C\ntool, ``KeyedVectors.load_word2vec_format()``. You can still use them for\nquerying/similarity, but information vital for training (the vocab tree) is\nmissing there.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training Loss Computation\n-------------------------\n\nThe parameter ``compute_loss`` can be used to toggle computation of loss\nwhile training the Word2Vec model. The computed loss is stored in the model\nattribute ``running_training_loss`` and can be retrieved using the function\n``get_latest_training_loss`` as follows:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# instantiating and training the Word2Vec model\nmodel_with_loss = gensim.models.Word2Vec(\n sentences,\n min_count=1,\n compute_loss=True,\n hs=0,\n sg=1,\n seed=42\n)\n\n# getting the training loss value\ntraining_loss = model_with_loss.get_latest_training_loss()\nprint(training_loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Benchmarks\n----------\n\nLet's run some benchmarks to see effect of the training loss computation code\non training time.\n\nWe'll use the following data for the benchmarks:\n\n#. Lee Background corpus: included in gensim's test data\n#. Text8 corpus. To demonstrate the effect of corpus size, we'll look at the\n first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import io\nimport os\n\nimport gensim.models.word2vec\nimport gensim.downloader as api\nimport smart_open\n\n\ndef head(path, size):\n with smart_open.open(path) as fin:\n return io.StringIO(fin.read(size))\n\n\ndef generate_input_data():\n lee_path = datapath('lee_background.cor')\n ls = gensim.models.word2vec.LineSentence(lee_path)\n ls.name = '25kB'\n yield ls\n\n text8_path = api.load('text8').fn\n labels = ('1MB', '10MB', '50MB', '100MB')\n sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)\n for l, s in zip(labels, sizes):\n ls = gensim.models.word2vec.LineSentence(head(text8_path, s))\n ls.name = l\n yield ls\n\n\ninput_data = list(generate_input_data())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now compare the training time taken for different combinations of input\ndata and model training parameters like ``hs`` and ``sg``.\n\nFor each combination, we repeat the test several times to obtain the mean and\nstandard deviation of the test duration.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Temporarily reduce logging verbosity\nlogging.root.level = logging.ERROR\n\nimport time\nimport numpy as np\nimport pandas as pd\n\ntrain_time_values = []\nseed_val = 42\nsg_values = [0, 1]\nhs_values = [0, 1]\n\nfast = True\nif fast:\n input_data_subset = input_data[:3]\nelse:\n input_data_subset = input_data\n\n\nfor data in input_data_subset:\n for sg_val in sg_values:\n for hs_val in hs_values:\n for loss_flag in [True, False]:\n time_taken_list = []\n for i in range(3):\n start_time = time.time()\n w2v_model = gensim.models.Word2Vec(\n data,\n compute_loss=loss_flag,\n sg=sg_val,\n hs=hs_val,\n seed=seed_val,\n )\n time_taken_list.append(time.time() - start_time)\n\n time_taken_list = np.array(time_taken_list)\n time_mean = np.mean(time_taken_list)\n time_std = np.std(time_taken_list)\n\n model_result = {\n 'train_data': data.name,\n 'compute_loss': loss_flag,\n 'sg': sg_val,\n 'hs': hs_val,\n 'train_time_mean': time_mean,\n 'train_time_std': time_std,\n }\n print(\"Word2vec model #%i: %s\" % (len(train_time_values), model_result))\n train_time_values.append(model_result)\n\ntrain_times_table = pd.DataFrame(train_time_values)\ntrain_times_table = train_times_table.sort_values(\n by=['train_data', 'sg', 'hs', 'compute_loss'],\n ascending=[False, False, True, False],\n)\nprint(train_times_table)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding Word2Vec \"model to dict\" method to production pipeline\n-------------------------------------------------------------\n\nSuppose, we still want more performance improvement in production.\n\nOne good way is to cache all the similar words in a dictionary.\n\nSo that next time when we get the similar query word, we'll search it first in the dict.\n\nAnd if it's a hit then we will show the result directly from the dictionary.\n\notherwise we will query the word and then cache it so that it doesn't miss next time.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# re-enable logging\nlogging.root.level = logging.INFO\n\nmost_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}\nfor i, (key, value) in enumerate(most_similars_precalc.items()):\n if i == 3:\n break\n print(key, value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparison with and without caching\n-----------------------------------\n\nfor time being lets take 4 words randomly\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import time\nwords = ['voted', 'few', 'their', 'around']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without caching\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"start = time.time()\nfor word in words:\n result = model.wv.most_similar(word)\n print(result)\nend = time.time()\nprint(end - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now with caching\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"start = time.time()\nfor word in words:\n if 'voted' in most_similars_precalc:\n result = most_similars_precalc[word]\n print(result)\n else:\n result = model.wv.most_similar(word)\n most_similars_precalc[word] = result\n print(result)\n\nend = time.time()\nprint(end - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly you can see the improvement but this difference will be even larger\nwhen we take more words in the consideration.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualising the Word Embeddings\n-------------------------------\n\nThe word embeddings made by the model can be visualised by reducing\ndimensionality of the words to 2 dimensions using tSNE.\n\nVisualisations can be used to notice semantic and syntactic trends in the data.\n\nExample:\n\n* Semantic: words like cat, dog, cow, etc. have a tendency to lie close by\n* Syntactic: words like run, running or cut, cutting lie close together.\n\nVector relations like vKing - vMan = vQueen - vWoman can also be noticed.\n\n.. Important::\n The model used for the visualisation is trained on a small corpus. Thus\n some of the relations might not be so clear.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.decomposition import IncrementalPCA # inital reduction\nfrom sklearn.manifold import TSNE # final reduction\nimport numpy as np # array handling\n\n\ndef reduce_dimensions(model):\n num_dimensions = 2 # final num dimensions (2D, 3D, etc)\n\n vectors = [] # positions in vector space\n labels = [] # keep track of words to label our data again later\n for word in model.wv.vocab:\n vectors.append(model.wv[word])\n labels.append(word)\n\n # convert both lists into numpy vectors for reduction\n vectors = np.asarray(vectors)\n labels = np.asarray(labels)\n\n # reduce using t-SNE\n vectors = np.asarray(vectors)\n tsne = TSNE(n_components=num_dimensions, random_state=0)\n vectors = tsne.fit_transform(vectors)\n\n x_vals = [v[0] for v in vectors]\n y_vals = [v[1] for v in vectors]\n return x_vals, y_vals, labels\n\n\nx_vals, y_vals, labels = reduce_dimensions(model)\n\ndef plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):\n from plotly.offline import init_notebook_mode, iplot, plot\n import plotly.graph_objs as go\n\n trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)\n data = [trace]\n\n if plot_in_notebook:\n init_notebook_mode(connected=True)\n iplot(data, filename='word-embedding-plot')\n else:\n plot(data, filename='word-embedding-plot.html')\n\n\ndef plot_with_matplotlib(x_vals, y_vals, labels):\n import matplotlib.pyplot as plt\n import random\n\n random.seed(0)\n\n plt.figure(figsize=(12, 12))\n plt.scatter(x_vals, y_vals)\n\n #\n # Label randomly subsampled 25 data points\n #\n indices = list(range(len(labels)))\n selected_indices = random.sample(indices, 25)\n for i in selected_indices:\n plt.annotate(labels[i], (x_vals[i], y_vals[i]))\n\ntry:\n get_ipython()\nexcept Exception:\n plot_function = plot_with_matplotlib\nelse:\n plot_function = plot_with_plotly\n\nplot_function(x_vals, y_vals, labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion\n----------\n\nIn this tutorial we learned how to train word2vec models on your custom data\nand also how to evaluate it. Hope that you too will find this popular tool\nuseful in your Machine Learning tasks!\n\nLinks\n-----\n\n- API docs: :py:mod:`gensim.models.word2vec`\n- `Original C toolkit and word2vec papers by Google `_.\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZObk
J J tutorials/run_annoy.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nSimilarity Queries with Annoy and Word2Vec\n==========================================\n\nIntroduces the annoy library for similarity queries using a Word2Vec model.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"LOGS = False\nif LOGS:\n import logging\n logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Annoy Approximate Nearest Neighbors Oh Yeah\n`_ library enables similarity queries with\na Word2Vec model. The current implementation for finding k nearest neighbors\nin a vector space in gensim has linear complexity via brute force in the\nnumber of indexed documents, although with extremely low constant factors.\nThe retrieved results are exact, which is an overkill in many applications:\napproximate results retrieved in sub-linear time may be enough. Annoy can\nfind approximate nearest neighbors much faster.\n\nOutline\n-------\n\n1. Download Text8 Corpus\n2. Train the Word2Vec model\n3. Construct AnnoyIndex with model & make a similarity query\n4. Compare to the traditional indexer\n5. Persist indices to disk\n6. Save memory by via memory-mapping indices saved to disk\n7. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n8. Work with Google's word2vec C formats\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Download Text8 corpus\n------------------------\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gensim.downloader as api\ntext8_path = api.load('text8', return_path=True)\ntext8_path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Train the Word2Vec model\n---------------------------\n\nFor more details, see `sphx_glr_auto_examples_tutorials_run_word2vec.py`.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import Word2Vec, KeyedVectors\nfrom gensim.models.word2vec import Text8Corpus\n\n# Using params from Word2Vec_FastText_Comparison\nparams = {\n 'alpha': 0.05,\n 'size': 100,\n 'window': 5,\n 'iter': 5,\n 'min_count': 5,\n 'sample': 1e-4,\n 'sg': 1,\n 'hs': 0,\n 'negative': 5\n}\nmodel = Word2Vec(Text8Corpus(text8_path), **params)\nprint(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Construct AnnoyIndex with model & make a similarity query\n------------------------------------------------------------\n\nAn instance of ``AnnoyIndexer`` needs to be created in order to use Annoy in gensim. The ``AnnoyIndexer`` class is located in ``gensim.similarities.index``\n\n``AnnoyIndexer()`` takes two parameters:\n\n* **model**: A ``Word2Vec`` or ``Doc2Vec`` model\n* **num_trees**: A positive integer. ``num_trees`` effects the build\n time and the index size. **A larger value will give more accurate results,\n but larger indexes**. More information on what trees in Annoy do can be found\n `here `__. The relationship\n between ``num_trees``\\ , build time, and accuracy will be investigated later\n in the tutorial. \n\nNow that we are ready to make a query, lets find the top 5 most similar words\nto \"science\" in the Text8 corpus. To make a similarity query we call\n``Word2Vec.most_similar`` like we would traditionally, but with an added\nparameter, ``indexer``. The only supported indexer in gensim as of now is\nAnnoy. \n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.similarities.index import AnnoyIndexer\n\n# 100 trees are being used in this example\nannoy_index = AnnoyIndexer(model, 100)\n# Derive the vector for the word \"science\" in our model\nvector = model.wv[\"science\"]\n# The instance of AnnoyIndexer we just created is passed \napproximate_neighbors = model.wv.most_similar([vector], topn=11, indexer=annoy_index)\n# Neatly print the approximate_neighbors and their corresponding cosine similarity values\nprint(\"Approximate Neighbors\")\nfor neighbor in approximate_neighbors:\n print(neighbor)\n\nnormal_neighbors = model.wv.most_similar([vector], topn=11)\nprint(\"\\nNormal (not Annoy-indexed) Neighbors\")\nfor neighbor in normal_neighbors:\n print(neighbor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The closer the cosine similarity of a vector is to 1, the more similar that\nword is to our query, which was the vector for \"science\". There are some\ndifferences in the ranking of similar words and the set of words included\nwithin the 10 most similar words.\n\n"
]
},
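{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, the cosine similarity scores above can be reproduced by hand: take the dot product of the two vectors and divide by the product of their norms. The following is a toy sketch (an illustration only, not part of the original tutorial) using made-up 3-d vectors rather than real word vectors.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\ndef cosine_similarity(a, b):\n    # dot product of the two vectors divided by the product of their norms\n    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n\n# Parallel toy vectors: cosine similarity is (up to rounding) 1.0, i.e. maximally similar\nprint(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))\n# Orthogonal toy vectors: cosine similarity is 0.0, i.e. unrelated\nprint(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))"
]
},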
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Compare to the traditional indexer\n-------------------------------------\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Set up the model and vector that we are using in the comparison\nmodel.init_sims()\nannoy_index = AnnoyIndexer(model, 100)\n\n# Dry run to make sure both indices are fully in RAM\nvector = model.wv.vectors_norm[0]\nmodel.wv.most_similar([vector], topn=5, indexer=annoy_index)\nmodel.wv.most_similar([vector], topn=5)\n\nimport time\nimport numpy as np\n\ndef avg_query_time(annoy_index=None, queries=1000):\n \"\"\"\n Average query time of a most_similar method over 1000 random queries,\n uses annoy if given an indexer\n \"\"\"\n total_time = 0\n for _ in range(queries):\n rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv.vocab))]\n start_time = time.process_time()\n model.wv.most_similar([rand_vec], topn=5, indexer=annoy_index)\n total_time += time.process_time() - start_time\n return total_time / queries\n\nqueries = 10000\n\ngensim_time = avg_query_time(queries=queries)\nannoy_time = avg_query_time(annoy_index, queries=queries)\nprint(\"Gensim (s/query):\\t{0:.5f}\".format(gensim_time))\nprint(\"Annoy (s/query):\\t{0:.5f}\".format(annoy_time))\nspeed_improvement = gensim_time / annoy_time\nprint (\"\\nAnnoy is {0:.2f} times faster on average on this particular run\".format(speed_improvement))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**This speedup factor is by no means constant** and will vary greatly from\nrun to run and is particular to this data set, BLAS setup, Annoy\nparameters(as tree size increases speedup factor decreases), machine\nspecifications, among other factors.\n\n.. Important::\n Initialization time for the annoy indexer was not included in the times.\n The optimal knn algorithm for you to use will depend on how many queries\n you need to make and the size of the corpus. If you are making very few\n similarity queries, the time taken to initialize the annoy indexer will be\n longer than the time it would take the brute force method to retrieve\n results. If you are making many queries however, the time it takes to\n initialize the annoy indexer will be made up for by the incredibly fast\n retrieval times for queries once the indexer has been initialized\n\n.. Important::\n Gensim's 'most_similar' method is using numpy operations in the form of\n dot product whereas Annoy's method isnt. If 'numpy' on your machine is\n using one of the BLAS libraries like ATLAS or LAPACK, it'll run on\n multiple cores (only if your machine has multicore support ). Check `SciPy\n Cookbook\n `_\n for more details.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Persisting indices to disk\n-----------------------------\n\nYou can save and load your indexes from/to disk to prevent having to\nconstruct them each time. This will create two files on disk, *fname* and\n*fname.d*. Both files are needed to correctly restore all attributes. Before\nloading an index, you will have to create an empty AnnoyIndexer object.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fname = '/tmp/mymodel.index'\n\n# Persist index to disk\nannoy_index.save(fname)\n\n# Load index back\nimport os.path\nif os.path.exists(fname):\n annoy_index2 = AnnoyIndexer()\n annoy_index2.load(fname)\n annoy_index2.model = model\n\n# Results should be identical to above\nvector = model.wv[\"science\"]\napproximate_neighbors2 = model.wv.most_similar([vector], topn=11, indexer=annoy_index2)\nfor neighbor in approximate_neighbors2:\n print(neighbor)\n \nassert approximate_neighbors == approximate_neighbors2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Be sure to use the same model at load that was used originally, otherwise you\nwill get unexpected behaviors.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. Save memory via memory-mapping indices saved to disk\n-------------------------------------------------------\n\nAnnoy library has a useful feature that indices can be memory-mapped from\ndisk. It saves memory when the same index is used by several processes.\n\nBelow are two snippets of code. First one has a separate index for each\nprocess. The second snipped shares the index between two processes via\nmemory-mapping. The second example uses less total RAM as it is shared.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Remove verbosity from code below (if logging active)\nif LOGS:\n logging.disable(logging.CRITICAL)\n\nfrom multiprocessing import Process\nimport os\nimport psutil"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bad example: two processes load the Word2vec model from disk and create there\nown Annoy indices from that model.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.save('/tmp/mymodel.pkl')\n\ndef f(process_id):\n print('Process Id: {}'.format(os.getpid()))\n process = psutil.Process(os.getpid())\n new_model = Word2Vec.load('/tmp/mymodel.pkl')\n vector = new_model.wv[\"science\"]\n annoy_index = AnnoyIndexer(new_model,100)\n approximate_neighbors = new_model.wv.most_similar([vector], topn=5, indexer=annoy_index)\n print('\\nMemory used by process {}: {}\\n---'.format(os.getpid(), process.memory_info()))\n\n# Creating and running two parallel process to share the same index file.\np1 = Process(target=f, args=('1',))\np1.start()\np1.join()\np2 = Process(target=f, args=('2',))\np2.start()\np2.join()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good example: two processes load both the Word2vec model and index from disk\nand memory-map the index\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model.save('/tmp/mymodel.pkl')\n\ndef f(process_id):\n print('Process Id: {}'.format(os.getpid()))\n process = psutil.Process(os.getpid())\n new_model = Word2Vec.load('/tmp/mymodel.pkl')\n vector = new_model.wv[\"science\"]\n annoy_index = AnnoyIndexer()\n annoy_index.load('/tmp/mymodel.index')\n annoy_index.model = new_model\n approximate_neighbors = new_model.wv.most_similar([vector], topn=5, indexer=annoy_index)\n print('\\nMemory used by process {}: {}\\n---'.format(os.getpid(), process.memory_info()))\n\n# Creating and running two parallel process to share the same index file.\np1 = Process(target=f, args=('1',))\np1.start()\np1.join()\np2 = Process(target=f, args=('2',))\np2.start()\np2.join()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n-----------------------------------------------------------------------------\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build dataset of Initialization times and accuracy measures:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"exact_results = [element[0] for element in model.wv.most_similar([model.wv.vectors_norm[0]], topn=100)]\n\nx_values = []\ny_values_init = []\ny_values_accuracy = []\n\nfor x in range(1, 300, 10):\n x_values.append(x)\n start_time = time.time()\n annoy_index = AnnoyIndexer(model, x)\n y_values_init.append(time.time() - start_time)\n approximate_results = model.wv.most_similar([model.wv.vectors_norm[0]], topn=100, indexer=annoy_index)\n top_words = [result[0] for result in approximate_results]\n y_values_accuracy.append(len(set(top_words).intersection(exact_results)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot results:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(1, figsize=(12, 6))\nplt.subplot(121)\nplt.plot(x_values, y_values_init)\nplt.title(\"num_trees vs initalization time\")\nplt.ylabel(\"Initialization time (s)\")\nplt.xlabel(\"num_trees\")\nplt.subplot(122)\nplt.plot(x_values, y_values_accuracy)\nplt.title(\"num_trees vs accuracy\")\nplt.ylabel(\"% accuracy\")\nplt.xlabel(\"num_trees\")\nplt.tight_layout()\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the above, we can see that the initialization time of the annoy indexer\nincreases in a linear fashion with num_trees. Initialization time will vary\nfrom corpus to corpus, in the graph above the lee corpus was used\n\nFurthermore, in this dataset, the accuracy seems logarithmically related to\nthe number of trees. We see an improvement in accuracy with more trees, but\nthe relationship is nonlinear. \n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7. Work with Google word2vec files\n----------------------------------\n\nOur model can be exported to a word2vec C format. There is a binary and a\nplain text word2vec format. Both can be read with a variety of other\nsoftware, or imported back into gensim as a ``KeyedVectors`` object.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# To export our model as text\nmodel.wv.save_word2vec_format('/tmp/vectors.txt', binary=False)\n\nfrom smart_open import open\n# View the first 3 lines of the exported file\n\n# The first line has the total number of entries and the vector dimension count. \n# The next lines have a key (a string) followed by its vector.\nwith open('/tmp/vectors.txt') as myfile:\n for i in range(3):\n print(myfile.readline().strip())\n\n# To import a word2vec text model\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)\n\n# To export our model as binary\nmodel.wv.save_word2vec_format('/tmp/vectors.bin', binary=True)\n\n# To import a word2vec binary model\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)\n\n# To create and save Annoy Index from a loaded `KeyedVectors` object (with 100 trees)\nannoy_index = AnnoyIndexer(wv, 100)\nannoy_index.save('/tmp/mymodel.index')\n\n# Load and test the saved word vectors and saved annoy index\nwv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)\nannoy_index = AnnoyIndexer()\nannoy_index.load('/tmp/mymodel.index')\nannoy_index.model = wv\n\nvector = wv[\"cat\"]\napproximate_neighbors = wv.most_similar([vector], topn=11, indexer=annoy_index)\n# Neatly print the approximate_neighbors and their corresponding cosine similarity values\nprint(\"Approximate Neighbors\")\nfor neighbor in approximate_neighbors:\n print(neighbor)\n\nnormal_neighbors = wv.most_similar([vector], topn=11)\nprint(\"\\nNormal (not Annoy-indexed) Neighbors\")\nfor neighbor in normal_neighbors:\n print(neighbor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recap\n-----\n\nIn this notebook we used the Annoy module to build an indexed approximation\nof our word embeddings. To do so, we did the following steps:\n\n1. Download Text8 Corpus\n2. Train Word2Vec Model\n3. Construct AnnoyIndex with model & make a similarity query\n4. Persist indices to disk\n5. Save memory by via memory-mapping indices saved to disk\n6. Evaluate relationship of ``num_trees`` to initialization time and accuracy\n7. Work with Google's word2vec C formats\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}PK tZOd{8 {8 ! tutorials/run_summarization.ipynb{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nText Summarization\n==================\n\nDemonstrates summarizing text by extracting the most important sentences from it.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This module automatically summarizes the given text, by extracting one or\nmore important sentences from the text. In a similar way, it can also extract\nkeywords. This tutorial will teach you to use this summarization module via\nsome examples. First, we will try a small example, then we will try two\nlarger ones, and then we will review the performance of the summarizer in\nterms of speed.\n\nThis summarizer is based on the , from an `\"TextRank\" algorithm by Mihalcea\net al `_.\nThis algorithm was later improved upon by `Barrios et al.\n`_,\nby introducing something called a \"BM25 ranking function\". \n\n.. important::\n Gensim's summarization only works for English for now, because the text\n is pre-processed so that stopwords are removed and the words are stemmed,\n and these processes are language-dependent.\n\nSmall example\n-------------\n\nFirst of all, we import the :py:func:`gensim.summarization.summarize` function.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pprint import pprint as print\nfrom gensim.summarization import summarize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will try summarizing a small toy example; later we will use a larger piece of text. In reality, the text is too small, but it suffices as an illustrative example.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"text = (\n \"Thomas A. Anderson is a man living two lives. By day he is an \"\n \"average computer programmer and by night a hacker known as \"\n \"Neo. Neo has always questioned his reality, but the truth is \"\n \"far beyond his imagination. Neo finds himself targeted by the \"\n \"police when he is contacted by Morpheus, a legendary computer \"\n \"hacker branded a terrorist by the government. Morpheus awakens \"\n \"Neo to the real world, a ravaged wasteland where most of \"\n \"humanity have been captured by a race of machines that live \"\n \"off of the humans' body heat and electrochemical energy and \"\n \"who imprison their minds within an artificial reality known as \"\n \"the Matrix. As a rebel against the machines, Neo must return to \"\n \"the Matrix and confront the agents: super-powerful computer \"\n \"programs devoted to snuffing out Neo and the entire human \"\n \"rebellion. \"\n)\nprint(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To summarize this text, we pass the **raw string data** as input to the\nfunction \"summarize\", and it will return a summary.\n\nNote: make sure that the string does not contain any newlines where the line\nbreaks in a sentence. A sentence with a newline in it (i.e. a carriage\nreturn, \"\\n\") will be treated as two sentences.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(summarize(text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the \"split\" option if you want a list of strings instead of a single string.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(summarize(text, split=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can adjust how much text the summarizer outputs via the \"ratio\" parameter\nor the \"word_count\" parameter. Using the \"ratio\" parameter, you specify what\nfraction of sentences in the original text should be returned as output.\nBelow we specify that we want 50% of the original text (the default is 20%).\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(summarize(text, ratio=0.5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the \"word_count\" parameter, we specify the maximum amount of words we\nwant in the summary. Below we have specified that we want no more than 50\nwords.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(summarize(text, word_count=50))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned earlier, this module also supports **keyword** extraction.\nKeyword extraction works in the same way as summary generation (i.e. sentence\nextraction), in that the algorithm tries to find words that are important or\nseem representative of the entire text. They keywords are not always single\nwords; in the case of multi-word keywords, they are typically all nouns.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.summarization import keywords\nprint(keywords(text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Larger example\n--------------\n\nLet us try an example with a larger piece of text. We will be using a\nsynopsis of the movie \"The Matrix\", which we have taken from `this\n`_ IMDb page.\n\nIn the code below, we read the text file directly from a web-page using\n\"requests\". Then we produce a summary and some keywords.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n\ntext = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text\nprint(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, the summary\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(summarize(text, ratio=0.01))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now, the keywords:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(keywords(text, ratio=0.01))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you know this movie, you see that this summary is actually quite good. We\nalso see that some of the most important characters (Neo, Morpheus, Trinity)\nwere extracted as keywords.\n\nAnother example\n---------------\n\nLet's try an example similar to the one above. This time, we will use the IMDb synopsis\n`The Big Lebowski `_.\n\nAgain, we download the text and produce a summary and some keywords.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"text = requests.get('http://rare-technologies.com/the_big_lebowski_synopsis.txt').text\nprint(text)\nprint(summarize(text, ratio=0.01))\nprint(keywords(text, ratio=0.01))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time around, the summary is not of high quality, as it does not tell us\nmuch about the movie. In a way, this might not be the algorithms fault,\nrather this text simply doesn't contain one or two sentences that capture the\nessence of the text as in \"The Matrix\" synopsis.\n\nThe keywords, however, managed to find some of the main characters.\n\nPerformance\n-----------\n\nWe will test how the speed of the summarizer scales with the size of the\ndataset. These tests were run on an Intel Core i5 4210U CPU @ 1.70 GHz x 4\nprocessor. Note that the summarizer does **not** support multithreading\n(parallel processing).\n\nThe tests were run on the book \"Honest Abe\" by Alonzo Rothschild. Download\nthe book in plain-text `here `__.\n\nIn the **plot below** , we see the running times together with the sizes of\nthe datasets. To create datasets of different sizes, we have simply taken\nprefixes of text; in other words we take the first **n** characters of the\nbook. The algorithm seems to be **quadratic in time** , so one needs to be\ncareful before plugging a large dataset into the summarizer.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('summarization_tutorial_plot.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Text-content dependent running times\n------------------------------------\n\nThe running time is not only dependent on the size of the dataset. For\nexample, summarizing \"The Matrix\" synopsis (about 36,000 characters) takes\nabout 3.1 seconds, while summarizing 35,000 characters of this book takes\nabout 8.5 seconds. So the former is **more than twice as fast**.\n\nOne reason for this difference in running times is the data structure that is\nused. The algorithm represents the data using a graph, where vertices (nodes)\nare sentences, and then constructs weighted edges between the vertices that\nrepresent how the sentences relate to each other. This means that every piece\nof text will have a different graph, thus making the running times different.\nThe size of this data structure is **quadratic in the worst case** (the worst\ncase is when each vertex has an edge to every other vertex).\n\nAnother possible reason for the difference in running times is that the\nproblems converge at different rates, meaning that the error drops slower for\nsome datasets than for others.\n\nMontemurro and Zanette's entropy based keyword extraction algorithm\n-------------------------------------------------------------------\n\n`This paper `__ describes a technique to\nidentify words that play a significant role in the large-scale structure of a\ntext. These typically correspond to the major themes of the text. The text is\ndivided into blocks of ~1000 words, and the entropy of each word's\ndistribution amongst the blocks is caclulated and compared with the expected\nentropy if the word were distributed randomly.\n\n\n"
]
},
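{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration of the idea (my own sketch, not the library's implementation), we can compare the entropy of a word concentrated in a few blocks with that of a word spread evenly: the concentrated word has lower entropy, which is the kind of deviation from a random distribution that the algorithm rewards.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import math\n\ndef block_entropy(counts):\n    \"\"\"Shannon entropy (in bits) of a word's distribution over text blocks.\"\"\"\n    total = sum(counts)\n    probs = [c / total for c in counts if c]\n    return -sum(p * math.log2(p) for p in probs)\n\nclustered = [9, 1, 0, 0]  # word concentrated in one block\nuniform = [2, 3, 2, 3]    # word spread evenly across blocks\nprint(block_entropy(clustered) < block_entropy(uniform))"
]
},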
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\nfrom gensim.summarization import mz_keywords\n\ntext=requests.get(\"http://www.gutenberg.org/files/49679/49679-0.txt\").text\nprint(mz_keywords(text,scores=True,threshold=0.001))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the algorithm weights the entropy by the overall frequency of the\nword in the document. We can remove this weighting by setting weighted=False\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(mz_keywords(text,scores=True,weighted=False,threshold=1.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When this option is used, it is possible to calculate a threshold\nautomatically from the number of blocks\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(mz_keywords(text,scores=True,weighted=False,threshold=\"auto\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The complexity of the algorithm is **O**\\ (\\ *Nw*\\ ), where *N* is the number\nof words in the document and *w* is the number of unique words.\n\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
tutorials/run_fasttext.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nFastText Model\n==============\n\nIntroduces Gensim's fastText model and demonstrates its use on the Lee Corpus.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we'll learn to work with fastText library for training word-embedding\nmodels, saving & loading them and performing similarity operations & vector\nlookups analogous to Word2Vec.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When to use FastText?\n---------------------\n\nThe main principle behind `fastText `_ is that the morphological structure of a word carries important information about the meaning of the word, which is not taken into account by traditional word embeddings, which train a unique word embedding for every individual word. This is especially significant for morphologically rich languages (German, Turkish) in which a single word can have a large number of morphological forms, each of which might occur rarely, thus making it hard to train good word embeddings.\n\n\nfastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.\n\n\nAccording to a detailed comparison of Word2Vec and FastText in `this notebook `__, fastText does significantly better on syntactic tasks as compared to the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms FastText on semantic tasks though. The differences grow smaller as the size of training corpus increases.\n\n\nTraining time for fastText is significantly higher than the Gensim version of Word2Vec (\\ ``15min 42s`` vs ``6min 42s`` on text8, 17 mil tokens, 5 epochs, and a vector size of 100).\n\n\nfastText can be used to obtain vectors for out-of-vocabulary (OOV) words, by summing up vectors for its component char-ngrams, provided at least one of the char-ngrams was present in the training data.\n\n\n"
]
},
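{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the subword idea concrete, here is a small sketch (a hypothetical helper, not part of Gensim's API) that enumerates the character ngrams considered for a word under the default ``min_n=3`` and ``max_n=6``. fastText wraps each word in angle brackets before extracting ngrams.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def char_ngrams(word, min_n=3, max_n=6):\n    \"\"\"Enumerate the character ngrams of a word, fastText-style.\"\"\"\n    extended = '<' + word + '>'  # pad the word with angle brackets\n    return [\n        extended[i:i + n]\n        for n in range(min_n, max_n + 1)\n        for i in range(len(extended) - n + 1)\n    ]\n\nprint(char_ngrams('night'))"
]
},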
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training models\n---------------\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model.\n\n\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pprint import pprint as print\nfrom gensim.models.fasttext import FastText as FT_gensim\nfrom gensim.test.utils import datapath\n\n# Set file names for train and test data\ncorpus_file = datapath('lee_background.cor')\n\nmodel = FT_gensim(size=100)\n\n# build the vocabulary\nmodel.build_vocab(corpus_file=corpus_file)\n\n# train the model\nmodel.train(\n corpus_file=corpus_file, epochs=model.epochs,\n total_examples=model.corpus_count, total_words=model.corpus_total_words\n)\n\nprint(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training hyperparameters\n^^^^^^^^^^^^^^^^^^^^^^^^\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec:\n\n- model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)\n- size: Size of embeddings to be learnt (Default 100)\n- alpha: Initial learning rate (Default 0.025)\n- window: Context window size (Default 5)\n- min_count: Ignore words with number of occurrences below this (Default 5)\n- loss: Training objective. Allowed values: `ns`, `hs`, `softmax` (Default `ns`)\n- sample: Threshold for downsampling higher-frequency words (Default 0.001)\n- negative: Number of negative words to sample, for `ns` (Default 5)\n- iter: Number of epochs (Default 5)\n- sorted_vocab: Sort vocab by descending frequency (Default 1)\n- threads: Number of threads to use (Default 12)\n\n\nIn addition, FastText has three additional parameters:\n\n- min_n: min length of char ngrams (Default 3)\n- max_n: max length of char ngrams (Default 6)\n- bucket: number of buckets used for hashing ngrams (Default 2000000)\n\n\nParameters ``min_n`` and ``max_n`` control the lengths of character ngrams that each word is broken down into while training and looking up embeddings. If ``max_n`` is set to 0, or to be lesser than ``min_n``\\ , no character ngrams are used, and the model effectively reduces to Word2Vec.\n\n\n\nTo bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers in 1 to K. For hashing these character sequences, the `Fowler-Noll-Vo hashing function `_ (FNV-1a variant) is employed.\n\n\n"
]
},
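{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of the bucketing step, here is a sketch of the standard 32-bit FNV-1a hash (the exact details inside fastText's implementation may differ): each ngram is hashed as a byte string, and the hash is reduced modulo the number of buckets.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def fnv1a_32(data):\n    \"\"\"Standard 32-bit FNV-1a hash of a byte string.\"\"\"\n    h = 0x811C9DC5  # FNV offset basis\n    for byte in data:\n        h ^= byte\n        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits\n    return h\n\n# map an example ngram into one of the default 2,000,000 buckets\nbucket = fnv1a_32('<ni'.encode('utf-8')) % 2000000\nprint(bucket)"
]
},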
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** As in the case of Word2Vec, you can continue to train your model while using Gensim's native implementation of fastText.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saving/loading models\n---------------------\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Models can be saved and loaded via the ``load`` and ``save`` methods.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# saving a model trained via Gensim's fastText implementation\nimport tempfile\nimport os\nwith tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:\n model.save(tmp.name, separately=[])\n\nloaded_model = FT_gensim.load(tmp.name)\nprint(loaded_model)\n\nos.unlink(tmp.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``save_word2vec_method`` causes the vectors for ngrams to be lost. As a result, a model loaded in this way will behave as a regular word2vec model.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word vector lookup\n------------------\n\n\n**Note:** Operations like word vector lookups and similarity queries can be performed in exactly the same manner for both the implementations of fastText so they have been demonstrated using only the native fastText implementation here.\n\n\n\nFastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('night' in model.wv.vocab)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('nights' in model.wv.vocab)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model['night'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model['nights'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``in`` operation works slightly differently from the original word2vec. It tests whether a vector for the given word exists or not, not whether the word is present in the word vocabulary. To test whether a word is present in the training word vocabulary -\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tests if word present in vocab\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"word\" in model.wv.vocab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tests if vector present for word\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"word\" in model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarity operations\n---------------------\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarity operations work the same way as word2vec. **Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.**\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"nights\" in model.wv.vocab)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"night\" in model.wv.vocab)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.similarity(\"night\", \"nights\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided `here `_.\n\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other similarity operations\n^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nThe example training corpus is a toy corpus, results are not expected to be good, for proof-of-concept only\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.most_similar(\"nights\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.doesnt_match(\"breakfast cereal dinner lunch\".split()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.most_similar(positive=['baghdad', 'england'], negative=['london']))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.accuracy(questions=datapath('questions-words.txt')))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word Movers distance\n^^^^^^^^^^^^^^^^^^^^\n\nLet's start with two sentences:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()\nsentence_president = 'The president greets the press in Chicago'.lower().split()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove their stopwords.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.corpus import stopwords\nstopwords = stopwords.words('english')\nsentence_obama = [w for w in sentence_obama if w not in stopwords]\nsentence_president = [w for w in sentence_president if w not in stopwords]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute WMD.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"distance = model.wmdistance(sentence_obama, sentence_president)\nprint(distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's all! You've made it to the end of this tutorial.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\nimg = mpimg.imread('fasttext-logo-color-web.png')\nimgplot = plt.imshow(img)\nplt.axis('off')\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}