
Reduce memory consumption of summarizer #2298

Merged · 15 commits · Jan 18, 2019

Conversation

@horpto (Contributor) commented Dec 16, 2018

At the request of @menshikh-iv:

  • I found a couple of questions (this and this) about the memory usage of summarization while I was looking for information related to my previous PR on removing graph nodes.
  • I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10, and Python 3.7. The original version had eaten 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version ate 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full version of War and Peace (~57000 unique docs), during the pagerank phase. (A rough reproduction sketch follows this list.)
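
A minimal sketch of how such a benchmark might be reproduced (not the exact script used above; the file path and ratio are placeholders, and it assumes gensim < 4.0, where gensim.summarization still exists):

```python
# Rough reproduction sketch: summarize a large text and report peak memory.
# NOTE: the resource module is Unix-only; on Windows (as in the test above)
# a tool such as psutil would be used to observe memory instead.
import resource

from gensim.summarization import summarize

with open("war_and_peace_part1.txt", encoding="utf-8") as f:  # hypothetical path
    text = f.read()

summary = summarize(text, ratio=0.01)  # placeholder ratio
print(summary[:300])

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
print("peak memory: %.1f MiB" % (peak_kib / 1024.0))
```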

@horpto changed the title from "[WIP] reduce memory consumption" to "[WIP] reduce memory consumption of summarizer" on Dec 16, 2018
@horpto (Contributor, Author) commented Dec 19, 2018

I need #2263 in order to add some tests for the graph.

@menshikh-iv (Contributor) commented:

#2263 merged, feel free to continue

@horpto changed the title from "[WIP] reduce memory consumption of summarizer" to "reduce memory consumption of summarizer" on Jan 14, 2019
@horpto changed the title from "reduce memory consumption of summarizer" to "Reduce memory consumption of summarizer" on Jan 14, 2019
@menshikh-iv (Contributor) left a review comment:

Thanks for the PR @horpto. Can you please also update your first comment in the PR with:

  • the reason for these changes (links to the complaints about RAM)
  • a little benchmark (time & RAM), 3.6.0 vs summarization-refactoring, to better understand how this improves things

scores = []
for index in range(self.corpus_size):
    score = self.get_score(document, index)
    if score > 0:
        # Keep only non-zero scores as sparse (index, score) pairs;
        # ids absent from the result implicitly have weight 0.
        scores.append((index, score))
@menshikh-iv (Contributor) commented on this snippet:

In that case len(scores) <= self.corpus_size, why?

@horpto (Contributor, Author) replied:

Because it's actually quite a sparse array, isn't it?
In summarizer._set_graph_edge_weights, documents with little weight will be dropped anyway, so there is no reason to waste extra memory. What's more, if we ever need a dense array, we can uncompactify this bow (a sketch follows).
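
A minimal sketch of that idea (the name uncompactify and the pair layout are illustrative, not gensim's actual helper):

```python
# Expand sparse (doc_id, score) pairs back into a dense score list.
# Ids missing from the pairs are implicitly zero, so storing only the
# non-zero entries loses no information.
def uncompactify(pairs, corpus_size):
    dense = [0.0] * corpus_size
    for doc_id, score in pairs:
        dense[doc_id] = score
    return dense
```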

@menshikh-iv (Contributor) replied:

I don't get it: how do we know the ids of the documents that have 0 scores in that case?

@horpto (Contributor, Author) replied:

Easy: just like words with 0 weight in a bow. We have the ids of the docs with non-zero weight; they are saved in a bag-of-docs (I should rename that part of the function name from bow, bag-of-weights, to bod, bag-of-docs). If a doc id isn't in the bag-of-docs, then the weight of that doc is 0 (see the sketch below).
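
A short sketch of that convention (names and values hypothetical): the bag-of-docs stores only non-zero weights, and a missing id means weight 0, exactly like a missing word in a bag-of-words:

```python
# bag_of_docs maps doc_id -> weight, for non-zero weights only.
bag_of_docs = {3: 0.42, 17: 0.08}

def doc_weight(bag_of_docs, doc_id):
    # An absent id means the document's weight is 0, by convention.
    return bag_of_docs.get(doc_id, 0.0)

assert doc_weight(bag_of_docs, 3) == 0.42
assert doc_weight(bag_of_docs, 5) == 0.0
```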

Review threads (resolved):
  • gensim/summarization/graph.py (2 threads)
  • gensim/summarization/pagerank_weighted.py (2 threads)
  • gensim/summarization/summarizer.py (1 thread, outdated)
@menshikh-iv (Contributor) commented:

> I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10, and Python 3.7. The original version had eaten 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version ate 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full version of War and Peace (~57000 unique docs), during the pagerank phase.

Awesome result, a nice improvement for the new release, thanks @horpto 👍
