piskvorky edited this page Nov 3, 2011 · 34 revisions

##Add your useful code snippets and recipes here. You can also post a short question -- please only ask questions that can be fully answered in a sentence or two. No open-ended questions or discussions here.

###Q: How many times does a feature with id 123 appear in a corpus? A: `total_sum = sum(dict(doc).get(123, 0) for doc in corpus)`
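
A minimal sketch of the one-liner above wrapped in a function, assuming the corpus is an iterable of documents in gensim's sparse format (lists of `(feature_id, weight)` 2-tuples). The name `feature_count` and the toy corpus are made up for illustration.

```python
def feature_count(corpus, feature_id):
    """Sum the weights of feature_id over all documents in corpus."""
    # Each document is a list of (feature_id, weight) 2-tuples, so dict(doc)
    # maps ids to weights; a feature absent from a document contributes 0.
    return sum(dict(doc).get(feature_id, 0) for doc in corpus)


# Hypothetical toy corpus in sparse bag-of-words format.
corpus = [
    [(7, 1), (123, 2)],
    [(5, 1)],
    [(123, 3), (42, 1)],
]
total_sum = feature_count(corpus, 123)
```

Because the sum is a generator expression, this makes a single pass over the corpus and should also work for corpora that are streamed from disk rather than held in memory.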


###Q: How do you calculate the vector length of a term? A: (note that "vector length" only makes sense for non-zero vectors):

  1. If the input vector `vec` is in gensim sparse format (a list of 2-tuples): `length = math.sqrt(sum(val**2 for _, val in vec))`, or use `length = gensim.matutils.veclen(vec)`.
  2. If the input vector is a numpy array: `length = gensim.matutils.blas_nrm2(vec)`
  3. If the input vector is in a scipy.sparse format: `length = numpy.sqrt(numpy.sum(vec.tocsr().data**2))`
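
As a quick check of the sparse-format case, here is a small sketch using only the standard library. The helper name `sparse_veclen` and the sample vector are assumptions for illustration; with gensim installed, `gensim.matutils.veclen(vec)` should give the same value.

```python
import math


def sparse_veclen(vec):
    """Euclidean length of a vector given as a list of (id, value) 2-tuples."""
    # Only the stored (non-zero) values contribute to the sum of squares.
    return math.sqrt(sum(val ** 2 for _, val in vec))


# Hypothetical sparse vector with two non-zero entries (a 3-4-5 triangle).
vec = [(0, 3.0), (5, 4.0)]
length = sparse_veclen(vec)
```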

###Q: How do you calculate the vector V in LSI space? A: With the singular value decomposition of your corpus X being `X = U*S*V^T`, doing `lsi[X]` computes `U^T*X`, which equals `S*V^T` because the columns of U are orthonormal. Each document is therefore represented by a row of `V*S`. So if you want V, divide `lsi[X]` by S:

`V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s` gives V as a 2d numpy array.
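
The algebra behind this recipe can be checked with plain numpy, without a trained model. This is only a sketch: the small matrix `X` is made up, and `numpy.linalg.svd` stands in for the decomposition that gensim computes internally, so `U.T @ X` plays the role of `lsi[X]` here.

```python
import numpy as np

# Hypothetical term-document matrix: 3 terms (rows) x 4 documents (columns).
X = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 3.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]])

# Thin SVD: X = U * diag(s) * Vt, where U has orthonormal columns.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Projecting the corpus with U^T gives diag(s) * Vt (topics x documents).
projected = U.T @ X

# Transpose to documents x topics, then divide each column by its
# singular value to recover V.
V = projected.T / s
```

Dividing by `s` relies on numpy broadcasting along the last axis, which is why the transpose comes first, just as in the one-liner above. It also assumes every singular value is non-zero.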
