Download

Get gensim from the Python Package Index or install directly with:

easy_install -U gensim

Gensim – Python Framework for Vector Space Modelling

What’s new?

Version 0.7 is out!

  • Incremental algorithms for Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are now faster and consume less memory.
  • Vocabulary generation has been optimized.
  • The input corpus iterator can read from a compressed file (bzip2, gzip, ...) to save disk space when dealing with very large corpora.
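
As an illustration of the compressed-input idea in the last item, here is a minimal sketch (Python 3, standard library only) of a corpus iterator that streams documents out of a gzip file one at a time. The class name `CompressedTextCorpus`, the helper `write_sample`, and the one-document-per-line file layout are assumptions made for this example; they are not part of gensim.

```python
import gzip


class CompressedTextCorpus(object):
    """Stream one document per line from a gzip-compressed text file.

    Hypothetical helper for illustration only; not a gensim class.
    """

    def __init__(self, path):
        self.path = path
        self.token2id = {}  # vocabulary, grown as new tokens are seen

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        # without ever holding all documents in memory.
        with gzip.open(self.path, "rt", encoding="utf-8") as lines:
            for line in lines:
                counts = {}
                for token in line.lower().split():
                    token_id = self.token2id.setdefault(token, len(self.token2id))
                    counts[token_id] = counts.get(token_id, 0) + 1
                # Sparse bag-of-words vector: sorted (token_id, count) pairs.
                yield sorted(counts.items())


def write_sample(path):
    """Write a tiny two-document sample corpus (hypothetical test data)."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("human computer interaction\n")
        out.write("computer system response time\n")
```

Because the file is decompressed on the fly inside `__iter__`, only one document is held in memory at a time, and the corpus never needs to exist uncompressed on disk.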

gensim now completes LSI of the English Wikipedia (3.2 million documents) in 5 hours 14 minutes on a MacBook Pro laptop, using a one-pass incremental SVD algorithm (see the NIPS workshop paper). Be sure to check out the distributed mode, too.

For an overview of what you can (or cannot) do with gensim, go to the introduction.

For examples of how to use it, see the tutorials.

Quick Reference Example

>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform a similarity query against the whole corpus; `query` is not defined in this
>>> # snippet and is assumed to be a vector already transformed into the same LSI space
>>> sims = index[query]
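
Conceptually, the final query compares one vector against every indexed document using cosine similarity. The following is a minimal pure-Python sketch of that computation on sparse `(id, weight)` vectors. It is not gensim's implementation; the function names `cosine_similarity` and `query_all`, and the choice to return 0.0 for empty vectors, are assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    weights_b = dict(vec_b)
    dot = sum(weight * weights_b.get(idx, 0.0) for idx, weight in vec_a)
    norm_a = math.sqrt(sum(w * w for _, w in vec_a))
    norm_b = math.sqrt(sum(w * w for _, w in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is empty
    return dot / (norm_a * norm_b)


def query_all(query, documents):
    """Return the similarity of `query` to every document, in corpus order."""
    return [cosine_similarity(query, doc) for doc in documents]
```

A query pointing in the same direction as a document scores 1.0 regardless of vector length, while a query sharing no dimensions with a document scores 0.0.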