topic_coherence.text_analysis – Analyzing the texts of a corpus to accumulate statistical information about word occurrences¶
This module contains classes for analyzing the texts of a corpus to accumulate statistical information about word occurrences.
- class gensim.topic_coherence.text_analysis.AccumulatingWorker(input_q, output_q, accumulator, window_size)¶
Bases:
Process
Accumulate stats from texts fed in from queue.
- property authkey¶
- close()¶
Close the Process object.
This method releases resources held by the Process object. It is an error to call this method if the child process is still running.
- property daemon¶
Return whether process is a daemon
- property exitcode¶
Return exit code of process or None if it has yet to stop
- property ident¶
Return identifier (PID) of process or None if it has yet to start
- is_alive()¶
Return whether process is alive
- join(timeout=None)¶
Wait until child process terminates
- kill()¶
Terminate process; sends SIGKILL signal or uses TerminateProcess()
- property name¶
- property pid¶
Return identifier (PID) of process or None if it has yet to start
- reply_to_master()¶
- run()¶
Method to be run in sub-process; can be overridden in sub-class
- property sentinel¶
Return a file descriptor (Unix) or handle (Windows) suitable for waiting for process termination.
- start()¶
Start child process
- terminate()¶
Terminate process; sends SIGTERM signal or uses TerminateProcess()
- class gensim.topic_coherence.text_analysis.BaseAnalyzer(relevant_ids)¶
Bases:
object
Base class for corpus and text analyzers.
- relevant_ids¶
Mapping whose keys are the ids of the relevant words.
- Type
dict
- _vocab_size¶
Size of vocabulary.
- Type
int
- id2contiguous¶
Mapping from word id to a contiguous index (word_id -> number).
- Type
dict
- log_every¶
Interval for logging.
- Type
int
- _num_docs¶
Number of documents.
- Type
int
- Parameters
relevant_ids (dict) – Mapping whose keys are the ids of the relevant words.
Examples
>>> from gensim.topic_coherence import text_analysis
>>> ids = {1: 'fake', 4: 'cats'}
>>> base = text_analysis.BaseAnalyzer(ids)
>>> # should return {1: 'fake', 4: 'cats'} 2 {1: 0, 4: 1} 1000 0
>>> print(base.relevant_ids, base._vocab_size, base.id2contiguous, base.log_every, base._num_docs)
{1: 'fake', 4: 'cats'} 2 {1: 0, 4: 1} 1000 0
- __getitem__(word_or_words)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word_id1, word_id2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word_id)¶
Return number of docs the word occurs in, once accumulate has been called.
- property num_docs¶
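The id2contiguous attribute shown in the example above can be pictured with a minimal sketch. It assumes the mapping is built by enumerating the relevant ids in iteration order; `build_contiguous_mapping` is a hypothetical helper name, not part of gensim.

```python
def build_contiguous_mapping(relevant_ids):
    """Map each relevant word id to a dense index 0..n-1 (hypothetical helper)."""
    return {word_id: n for n, word_id in enumerate(relevant_ids)}


relevant_ids = {1: 'fake', 4: 'cats'}
id2contiguous = build_contiguous_mapping(relevant_ids)

# The vocabulary size is simply the number of relevant ids.
vocab_size = len(id2contiguous)
print(id2contiguous, vocab_size)
```

A dense index like this lets an analyzer store per-word statistics in a plain list or array instead of a sparse structure keyed by the original ids.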
- class gensim.topic_coherence.text_analysis.CorpusAccumulator(*args)¶
Bases:
InvertedIndexBased
Gather word occurrence stats from a corpus by iterating over its BoW representation.
- Parameters
args (dict) – See BaseAnalyzer.
Examples
>>> from gensim.topic_coherence import text_analysis
>>>
>>> ids = {1: 'fake', 4: 'cats'}
>>> ininb = text_analysis.InvertedIndexBased(ids)
>>>
>>> print(ininb._inverted_index)
[set() set()]
- __getitem__(word_or_words)¶
- accumulate(corpus)¶
- analyze_text(text, doc_num=None)¶
Build an inverted index from a sequence of corpus texts.
- get_co_occurrences(word_id1, word_id2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word_id)¶
Return number of docs the word occurs in, once accumulate has been called.
- index_to_dict()¶
- property num_docs¶
- class gensim.topic_coherence.text_analysis.InvertedIndexAccumulator(relevant_ids, dictionary)¶
Bases:
WindowedTextsAnalyzer, InvertedIndexBased
Build an inverted index from a sequence of corpus texts.
- Parameters
relevant_ids (set of int) – Relevant word ids.
dictionary (Dictionary) – Dictionary instance with mappings for the relevant_ids.
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(window, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- index_to_dict()¶
- property num_docs¶
- class gensim.topic_coherence.text_analysis.InvertedIndexBased(*args)¶
Bases:
BaseAnalyzer
Analyzer that builds up an inverted index to accumulate stats.
- Parameters
args (dict) – See BaseAnalyzer.
Examples
>>> from gensim.topic_coherence import text_analysis
>>>
>>> ids = {1: 'fake', 4: 'cats'}
>>> ininb = text_analysis.InvertedIndexBased(ids)
>>>
>>> print(ininb._inverted_index)
[set() set()]
- __getitem__(word_or_words)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word_id1, word_id2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word_id)¶
Return number of docs the word occurs in, once accumulate has been called.
- index_to_dict()¶
- property num_docs¶
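As a rough illustration of how an inverted index can answer the occurrence and co-occurrence queries listed above, here is a minimal standalone sketch. `TinyInvertedIndex` is a hypothetical class written for this example; it is not gensim's InvertedIndexBased and makes no claim about its internals beyond keeping one set of document numbers per relevant word.

```python
class TinyInvertedIndex:
    """Hypothetical stand-in for illustration; not gensim's implementation."""

    def __init__(self, relevant_ids):
        self.id2contiguous = {wid: n for n, wid in enumerate(relevant_ids)}
        # One set of document numbers per relevant word.
        self.index = [set() for _ in self.id2contiguous]
        self.num_docs = 0

    def analyze_text(self, word_ids, doc_num):
        # Record doc_num for every relevant word id seen in this document.
        for wid in word_ids:
            slot = self.id2contiguous.get(wid)
            if slot is not None:
                self.index[slot].add(doc_num)

    def accumulate(self, docs):
        for doc_num, word_ids in enumerate(docs):
            self.analyze_text(word_ids, doc_num)
            self.num_docs += 1
        return self

    def get_occurrences(self, wid):
        # Number of documents containing the word.
        return len(self.index[self.id2contiguous[wid]])

    def get_co_occurrences(self, wid1, wid2):
        # Number of documents containing both words (set intersection).
        s1 = self.index[self.id2contiguous[wid1]]
        s2 = self.index[self.id2contiguous[wid2]]
        return len(s1 & s2)


docs = [[1, 4, 7], [4, 9], [1, 4]]  # documents as lists of word ids
idx = TinyInvertedIndex([1, 4]).accumulate(docs)
print(idx.get_occurrences(1), idx.get_occurrences(4), idx.get_co_occurrences(1, 4))
```

Because each word maps to a set of document numbers, both queries are only meaningful once accumulation has run, which matches the "once accumulate has been called" wording in the method descriptions.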
- class gensim.topic_coherence.text_analysis.ParallelWordOccurrenceAccumulator(processes, *args, **kwargs)¶
Bases:
WindowedTextsAnalyzer
Accumulate word occurrences in parallel.
- processes¶
Number of processes to use; must be at least two.
- Type
int
- args¶
Should include relevant_ids and dictionary (see __init__).
- kwargs¶
Can include batch_size, which is the number of docs to send to a worker at a time. If not included, it defaults to 64.
- Parameters
relevant_ids (set of int) – Relevant word ids.
dictionary (Dictionary) – Dictionary instance with mappings for the relevant_ids.
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- merge_accumulators(accumulators)¶
Merge the list of accumulators into a single WordOccurrenceAccumulator with all occurrence and co-occurrence counts, and a num_docs that reflects the total observed by all the individual accumulators.
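The merge described above can be sketched with plain counters. This is only an illustration under simplifying assumptions: `MiniAccumulator` is a hypothetical class that stores counts in a `collections.Counter`, whereas the real accumulators keep their own internal structures.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MiniAccumulator:
    """Hypothetical accumulator holding counts and a document total."""
    counts: Counter = field(default_factory=Counter)
    num_docs: int = 0

    def merge(self, other):
        # Counter.update adds counts key by key rather than overwriting them.
        self.counts.update(other.counts)
        self.num_docs += other.num_docs
        return self


def merge_accumulators(accumulators):
    """Fold a list of per-worker accumulators into a single one."""
    merged = MiniAccumulator()
    for acc in accumulators:
        merged.merge(acc)
    return merged


a = MiniAccumulator(Counter({('foo', 'foo'): 2, ('foo', 'baz'): 1}), num_docs=2)
b = MiniAccumulator(Counter({('foo', 'foo'): 1, ('baz', 'baz'): 3}), num_docs=3)
merged = merge_accumulators([a, b])
print(merged.counts, merged.num_docs)
```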
- property num_docs¶
- queue_all_texts(q, texts, window_size)¶
Sequentially place batches of texts on the given queue until texts is consumed. The texts are filtered so that only those with at least one relevant token are queued.
- start_workers(window_size)¶
Set up an input and output queue and start processes for each worker.
Notes
The input queue is used to transmit batches of documents to the workers. The output queue is used by workers to transmit the WordOccurrenceAccumulator instances.
- Parameters
window_size (int) – Size of the sliding window, passed on to each worker.
- Returns
Tuple of (list of workers, input queue, output queue).
- Return type
(list of AccumulatingWorker, Queue, Queue)
- terminate_workers(input_q, output_q, workers, interrupted=False)¶
Wait until all workers have transmitted their WordOccurrenceAccumulator instances, then terminate each.
Warning
We do not use join here because it has been shown to have some issues in Python 2.7 (and even in later versions). This method also closes both the input and output queues.
If interrupted is False (normal execution), a None value is placed on the input queue for each worker. The workers look for this sentinel value and interpret it as a signal to terminate themselves.
If interrupted is True, a KeyboardInterrupt occurred. The workers are programmed to recover from this and continue on to transmit their results before terminating, so in this case the sentinel values are not queued, but the rest of the execution continues as usual.
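The sentinel protocol for normal execution can be sketched as follows. To keep the example short and portable it uses threads and `queue.Queue` rather than processes and multiprocessing queues, so it is not the library's implementation; the `worker` and `run` names are hypothetical, and the "result" is just a count of items seen.

```python
import queue
import threading


def worker(input_q, output_q):
    """Consume batches until the None sentinel arrives, then reply once."""
    seen = 0
    while True:
        batch = input_q.get()
        if batch is None:  # sentinel: no more work for this worker
            break
        seen += len(batch)
    output_q.put(seen)  # transmit the accumulated result before exiting


def run(batches, num_workers=2):
    input_q, output_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(input_q, output_q))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for batch in batches:
        input_q.put(batch)
    # One sentinel per worker, so that every worker sees exactly one.
    for _ in workers:
        input_q.put(None)
    # Collect one result per worker before waiting for them to finish.
    results = [output_q.get() for _ in workers]
    for w in workers:
        w.join()
    return results


results = run([[1, 2], [3], [4, 5, 6]])
print(sum(results))
```

Collecting one result per worker from the output queue is what tells the master that every worker has finished; the sketch joins the threads afterwards only as cleanup.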
- yield_batches(texts)¶
Return a generator over the given texts that yields batches of batch_size texts at a time.
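The batching and filtering behaviour described for yield_batches and queue_all_texts can be illustrated with a small standalone sketch. The function names `iter_batches` and `relevant_texts` are hypothetical, and the default of 64 mirrors the batch_size default mentioned above.

```python
from itertools import islice


def iter_batches(texts, batch_size=64):
    """Yield lists of up to batch_size texts until the input is consumed."""
    it = iter(texts)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def relevant_texts(texts, relevant_words):
    """Keep only texts that contain at least one relevant token."""
    return (t for t in texts if any(w in relevant_words for w in t))


texts = [['foo', 'bar'], ['qux'], ['baz'], ['foo'], ['bar', 'baz']]
relevant = {'foo', 'baz'}
batches = list(iter_batches(relevant_texts(texts, relevant), batch_size=3))
print(batches)
```

Filtering before batching means workers never receive documents that cannot contribute to any count, and the last batch may be shorter than batch_size.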
- class gensim.topic_coherence.text_analysis.PatchedWordOccurrenceAccumulator(*args)¶
Bases:
WordOccurrenceAccumulator
Monkey patched for multiprocessing worker usage, to move some of the logic to the master process.
- Parameters
relevant_ids (set of int) – Relevant word ids.
dictionary (Dictionary) – Dictionary instance with mappings for the relevant_ids.
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(window, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- merge(other)¶
- property num_docs¶
- partial_accumulate(texts, window_size)¶
Meant to be called several times to accumulate partial results.
Notes
The final accumulation should be performed with the accumulate method as opposed to this one. This method does not ensure the co-occurrence matrix is in lil format and does not symmetrize it after accumulation.
- class gensim.topic_coherence.text_analysis.UsesDictionary(relevant_ids, dictionary)¶
Bases:
BaseAnalyzer
A BaseAnalyzer that uses a Dictionary, hence can translate tokens to counts. The standard BaseAnalyzer can only deal with token ids since it doesn’t have the token2id mapping.
- relevant_words¶
Set of words that occurrences should be accumulated for.
- Type
set
- dictionary¶
Dictionary based on text
- Type
Dictionary
- token2id¶
Mapping from tokens to ids, taken from the Dictionary.
- Type
dict
- Parameters
relevant_ids (dict) – Mapping whose keys are the ids of the relevant words.
dictionary (Dictionary) – Dictionary based on text
Examples
>>> from gensim.topic_coherence import text_analysis
>>> from gensim.corpora.dictionary import Dictionary
>>>
>>> ids = {1: 'foo', 2: 'bar'}
>>> dictionary = Dictionary([['foo', 'bar', 'baz'], ['foo', 'bar', 'bar', 'baz']])
>>> udict = text_analysis.UsesDictionary(ids, dictionary)
>>>
>>> print(sorted(udict.relevant_words))
['baz', 'foo']
- __getitem__(word_or_words)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- property num_docs¶
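The translation from relevant ids to relevant words can be sketched with plain dictionaries. The `token2id` values below are an assumption made for illustration (ids assigned in sorted token order); they are not read from a real Dictionary instance.

```python
# Assumed token2id mapping for the tokens 'foo', 'bar' and 'baz'.
token2id = {'bar': 0, 'baz': 1, 'foo': 2}

# Invert the mapping so ids can be translated back to tokens.
id2token = {token_id: token for token, token_id in token2id.items()}

relevant_ids = {1, 2}
relevant_words = {id2token[i] for i in relevant_ids}
print(sorted(relevant_words))
```

Under that assumption, ids 1 and 2 resolve to 'baz' and 'foo', which would explain why the example above reports those two words even though the values of the ids dict are 'foo' and 'bar': only the keys are looked up in the dictionary.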
- class gensim.topic_coherence.text_analysis.WindowedTextsAnalyzer(relevant_ids, dictionary)¶
Bases:
UsesDictionary
Gather some stats about relevant terms of a corpus by iterating over windows of texts.
- Parameters
relevant_ids (set of int) – Relevant word ids.
dictionary (Dictionary) – Dictionary instance with mappings for the relevant_ids.
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- property num_docs¶
- class gensim.topic_coherence.text_analysis.WordOccurrenceAccumulator(*args)¶
Bases:
WindowedTextsAnalyzer
Accumulate word occurrences and co-occurrences from a sequence of corpus texts.
- Parameters
relevant_ids (set of int) – Relevant word ids.
dictionary (Dictionary) – Dictionary instance with mappings for the relevant_ids.
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(window, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- merge(other)¶
- property num_docs¶
- partial_accumulate(texts, window_size)¶
Meant to be called several times to accumulate partial results.
Notes
The final accumulation should be performed with the accumulate method as opposed to this one. This method does not ensure the co-occurrence matrix is in lil format and does not symmetrize it after accumulation.
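The distinction between partial and final accumulation can be illustrated with a simplified sketch. It assumes each sliding window is treated as one "document", that partial accumulation fills only one triangle of the co-occurrence table, and that a final step mirrors it. It uses a `Counter` keyed by word pairs instead of a sparse matrix, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def sliding_windows(text, window_size):
    """Yield consecutive windows; a short text is a single window."""
    if len(text) <= window_size:
        yield text
    else:
        for start in range(len(text) - window_size + 1):
            yield text[start:start + window_size]


def partial_accumulate(texts, window_size, relevant, counts=None):
    """Add window-level counts; only pairs (a, b) with a <= b are stored."""
    counts = Counter() if counts is None else counts
    for text in texts:
        for window in sliding_windows(text, window_size):
            present = sorted(set(window) & relevant)
            for w in present:
                counts[(w, w)] += 1  # occurrence count on the diagonal
            for a, b in combinations(present, 2):
                counts[(a, b)] += 1  # upper triangle only
    return counts


def symmetrize(counts):
    """Mirror the upper triangle so that (b, a) equals (a, b)."""
    full = Counter(counts)
    for (a, b), n in counts.items():
        full[(b, a)] = n
    return full


relevant = {'a', 'b'}
counts = partial_accumulate([['a', 'b', 'c', 'a']], 2, relevant)
counts = partial_accumulate([['b', 'a']], 2, relevant, counts)
full = symmetrize(counts)
print(full[('a', 'b')], full[('b', 'a')], full[('a', 'a')])
```

Deferring the mirroring step to the end keeps each partial pass cheap, which is why querying the table before the final step could give asymmetric answers.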
- class gensim.topic_coherence.text_analysis.WordVectorsAccumulator(relevant_ids, dictionary, model=None, **model_kwargs)¶
Bases:
UsesDictionary
Accumulate context vectors for words using word vector embeddings.
- model¶
If None, a new Word2Vec model is trained on the given text corpus. Otherwise, it should be a set of pre-trained Word2Vec context vectors.
- Type
Word2Vec (KeyedVectors)
- model_kwargs¶
If model is None, these keyword arguments will be passed through to the Word2Vec constructor.
- Parameters
relevant_ids (dict) – Mapping whose keys are the ids of the relevant words.
dictionary (Dictionary) – Dictionary based on text
Examples
>>> from gensim.topic_coherence import text_analysis
>>> from gensim.corpora.dictionary import Dictionary
>>>
>>> ids = {1: 'foo', 2: 'bar'}
>>> dictionary = Dictionary([['foo', 'bar', 'baz'], ['foo', 'bar', 'bar', 'baz']])
>>> udict = text_analysis.UsesDictionary(ids, dictionary)
>>>
>>> print(sorted(udict.relevant_words))
['baz', 'foo']
- __getitem__(word_or_words)¶
- accumulate(texts, window_size)¶
- analyze_text(text, doc_num=None)¶
- get_co_occurrences(word1, word2)¶
Return number of docs the words co-occur in, once accumulate has been called.
- get_occurrences(word)¶
Return number of docs the word occurs in, once accumulate has been called.
- ids_similarity(ids1, ids2)¶
- not_in_vocab(words)¶
- property num_docs¶