StarDict: improve memory usage #409

Closed
windwerfer opened this issue Jan 14, 2023 · 4 comments

@windwerfer

windwerfer commented Jan 14, 2023

Dear Saeed Rasooli,

Thank you very much for writing this converter. It works very reliably and has helped me a lot. I would like to help improve it further.

When processing StarDict files with a large number of synonyms (> 500,000), PyGlossary runs out of memory on my system.

Reproduce:
https://github.com/digitalpalidictionary/digitalpalidictionary/releases/download/2023-01-06/dpd-goldendict.zip
Download this file (about 6.8 million synonyms) and run

$ pyglossary dpd-goldendict.ifo dpd-new.ifo

On my system it crashes after about 4 million synonyms (the conversion of the .dict file runs without problems).

Possible solution:

diff --git a/plugins/stardict.py b/plugins/stardict.py
@@ -494,7 +506,7 @@ class Writer(object):
 	def byteSortKey(self, b_word: bytes) -> "Tuple[bytes, bytes]":
 		return (
 			b_word.lower(),
-			b_word,
 		)
 
 	def finish(self) -> None:

As far as I understand, the synonyms are sorted in memory by (line 738)

altIndexList.sort( key=lambda x: self.byteSortKey(x[0]) )

I don't know much about sorting (especially when it's on bytes and not str), so I can't say for sure, but I think the function is sorting the list twice (or by two criteria): first by b_word.lower() and then by b_word, which might explain the out-of-memory error.

When I remove that line (the patch proposed above), the conversion runs without issues.

What is your opinion on this?

@ilius
Owner

ilius commented Jan 14, 2023

This will not always sort them correctly. There might be several entries with the same lowercase headword, in which case it might produce broken output. It also won't always fix the memory issue.
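
To illustrate the point about entries that share a lowercase headword, here is a minimal sketch. `byte_sort_key` mirrors the `byteSortKey` method from the diff above; `lower_only_key` and the sample words are hypothetical and only stand in for the proposed patch. A tuple key does not sort the list twice: Python compares the first element and only consults the second on a tie. Without that second element, the order of tied entries depends on the input order.

```python
def byte_sort_key(b_word: bytes) -> "Tuple[bytes, bytes]":
    # Same shape as byteSortKey in the diff: lowercase first, raw bytes as tie-breaker.
    return (b_word.lower(), b_word)


def lower_only_key(b_word: bytes) -> bytes:
    # Hypothetical stand-in for the patched key (tie-breaker removed).
    return b_word.lower()


# Two different input orders of the same (made-up) headwords.
words = [b"pali", b"Pali", b"PALI", b"abc"]
shuffled = [b"PALI", b"abc", b"pali", b"Pali"]

# With the tuple key the result is the same for any input order.
full_a = sorted(words, key=byte_sort_key)
full_b = sorted(shuffled, key=byte_sort_key)

# With the lowercase-only key, ties keep their input order (the sort is stable),
# so the two results differ.
lower_a = sorted(words, key=lower_only_key)
lower_b = sorted(shuffled, key=lower_only_key)

print(full_a == full_b)    # True
print(lower_a == lower_b)  # False
```

The tuple key does cost extra memory, since one tuple is built per entry while the sort runs, which is consistent with the observation that dropping it can let a borderline conversion finish.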

This is for sorting .idx and .syn entries, and I might be able to fix the memory issue by using SQLite. Please stay tuned.

Meanwhile, please try using a swap file (or increasing its size) to extend your memory.

@ilius changed the title from "Bugfix: Stardict" to "StarDict: improve memory usage" on Jan 14, 2023
@ilius
Owner

ilius commented Jan 16, 2023

I added a new option to use SQLite to reduce memory.

Please check out or download the branch named stardict-sqlite and try again, adding the flag --write-options=sqlite=True to your command.
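
For readers curious how SQLite can help here, the following is a rough sketch of the general idea, not PyGlossary's actual implementation. The function name `sorted_on_disk`, the table layout, and the sample data are all assumptions made for illustration. Entries are written to a temporary database file and read back with `ORDER BY`, so the full list and its sort keys never have to live in a Python list at once.

```python
import os
import sqlite3
import tempfile
from typing import Iterable, Iterator, Tuple


def sorted_on_disk(
    entries: Iterable[Tuple[bytes, bytes]],
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (word, data) pairs ordered by (word.lower(), word).

    Hypothetical helper: rows are stored in a temporary SQLite file
    and sorted by SQLite rather than by list.sort().
    """
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    con = sqlite3.connect(path)
    try:
        # BLOB columns compare byte by byte, like Python bytes objects.
        con.execute(
            "CREATE TABLE entry (word_lower BLOB, word BLOB, data BLOB)"
        )
        con.executemany(
            "INSERT INTO entry VALUES (?, ?, ?)",
            ((word.lower(), word, data) for word, data in entries),
        )
        con.commit()
        query = "SELECT word, data FROM entry ORDER BY word_lower, word"
        for word, data in con.execute(query):
            yield word, data
    finally:
        con.close()
        os.remove(path)


# Made-up sample entries: (headword, payload).
sample = [(b"pali", b"1"), (b"Pali", b"2"), (b"abc", b"3"), (b"PALI", b"4")]
result = list(sorted_on_disk(sample))
print(result)
```

The trade-off is extra disk I/O in exchange for a much smaller peak memory footprint, which matters most on machines with little RAM.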

@windwerfer
Author

I can confirm that your patch works flawlessly (I tried it on a tablet with 3 GB of RAM).
Way better than I expected.
Thank you very much for the work. Awesome.

@ilius
Owner

ilius commented Jan 16, 2023

Great.
I pushed it to master, so I'm deleting that branch.
