
Reduce memory consumption of summarizer #2298

Merged · 15 commits · Jan 18, 2019

Conversation

@horpto (Contributor) commented Dec 16, 2018

At the request of @menshikh-iv:

  • I found a couple of questions (this and this) about the memory usage of summarization while I was looking for information related to my previous PR on removing graph nodes.
  • I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10, and Python 3.7. The original version had eaten 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version ate 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full version of War and Peace (~57000 unique docs), during the pagerank phase. (A rough reproduction sketch follows this list.)
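
A minimal sketch of how such a benchmark might be reproduced (not the exact script used above; the file path and ratio are placeholders, and it assumes gensim < 4.0, where gensim.summarization still exists):

```python
# Rough reproduction sketch: summarize a large text and report peak memory.
# NOTE: the resource module is Unix-only; on Windows (as in the test above)
# a tool such as psutil would be used to observe memory instead.
import resource

from gensim.summarization import summarize

with open("war_and_peace_part1.txt", encoding="utf-8") as f:  # hypothetical path
    text = f.read()

summary = summarize(text, ratio=0.01)  # placeholder ratio
print(summary[:300])

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
print("peak memory: %.1f MiB" % (peak_kib / 1024.0))
```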

@horpto changed the title from "[WIP] reduce memory consumption" to "[WIP] reduce memory consumption of summarizer" on Dec 16, 2018
@horpto (Contributor, Author) commented Dec 19, 2018

I need #2263 in order to add some tests for the graph.

@menshikh-iv (Contributor) commented:

#2263 merged, feel free to continue

@horpto changed the title from "[WIP] reduce memory consumption of summarizer" to "reduce memory consumption of summarizer" on Jan 14, 2019
@horpto changed the title from "reduce memory consumption of summarizer" to "Reduce memory consumption of summarizer" on Jan 14, 2019
@menshikh-iv (Contributor) left a review comment:

Thanks for the PR @horpto. Can you please also update your first comment in the PR with:

  • the reason for these changes (links to the complaints about RAM)
  • a little benchmark (time & RAM), 3.6.0 vs summarization-refactoring, to better understand how this improves things

scores = []
for index in range(self.corpus_size):
    score = self.get_score(document, index)
    if score > 0:
        # Keep only non-zero scores as sparse (index, score) pairs;
        # ids absent from the result implicitly have weight 0.
        scores.append((index, score))
@menshikh-iv (Contributor) commented on this snippet:

In that case len(scores) <= self.corpus_size, why?

@horpto (Contributor, Author) replied:

Because it's actually quite a sparse array, isn't it?
In summarizer._set_graph_edge_weights, documents with little weight will be dropped anyway, so there is no reason to waste extra memory. What's more, if we ever need a dense array, we can uncompactify this bow (a sketch follows).
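
A minimal sketch of that idea (the name uncompactify and the pair layout are illustrative, not gensim's actual helper):

```python
# Expand sparse (doc_id, score) pairs back into a dense score list.
# Ids missing from the pairs are implicitly zero, so storing only the
# non-zero entries loses no information.
def uncompactify(pairs, corpus_size):
    dense = [0.0] * corpus_size
    for doc_id, score in pairs:
        dense[doc_id] = score
    return dense
```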

@menshikh-iv (Contributor) replied:

I don't get it: how do we know the ids of the documents that have 0 scores in that case?

@horpto (Contributor, Author) replied:

Easy: just like words with 0 weight in a bow. We have the ids of the docs with non-zero weight; they are saved in a bag-of-docs (I should rename that part of the function name from bow, bag-of-weights, to bod, bag-of-docs). If a doc id isn't in the bag-of-docs, then the weight of that doc is 0 (see the sketch below).
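
A short sketch of that convention (names and values hypothetical): the bag-of-docs stores only non-zero weights, and a missing id means weight 0, exactly like a missing word in a bag-of-words:

```python
# bag_of_docs maps doc_id -> weight, for non-zero weights only.
bag_of_docs = {3: 0.42, 17: 0.08}

def doc_weight(bag_of_docs, doc_id):
    # An absent id means the document's weight is 0, by convention.
    return bag_of_docs.get(doc_id, 0.0)

assert doc_weight(bag_of_docs, 3) == 0.42
assert doc_weight(bag_of_docs, 5) == 0.0
```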

Review threads (resolved):
  • gensim/summarization/graph.py (2 threads)
  • gensim/summarization/pagerank_weighted.py (2 threads)
  • gensim/summarization/summarizer.py (1 thread, outdated)
@menshikh-iv (Contributor) commented:

> I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10, and Python 3.7. The original version had eaten 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version ate 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full version of War and Peace (~57000 unique docs), during the pagerank phase.

Awesome result, a nice improvement for the new release, thanks @horpto 👍
