a tricky mdx file #384

Closed
ghost opened this issue Aug 15, 2022 · 15 comments

ghost commented Aug 15, 2022

Thank you for your wonderful software.
I managed to come across an MDict file which fails to convert to an ifo file.
Here is the file in question.
Whenever I run pyglossary, it gets stuck at [INFO] extracting links...

ilius (Owner) commented Aug 18, 2022

Looks like the file is problematic.
I get this error:

[INFO] extracting links...
unhandled exception:
Traceback (most recent call last):
  File ".../pyglossary/pyglossary/glossary.py", line 613, in _read
    reader.open(filename)
  File ".../pyglossary/pyglossary/plugins/octopus_mdict_new/__init__.py", line 131, in open
    self.loadLinks()
  File ".../pyglossary/pyglossary/plugins/octopus_mdict_new/__init__.py", line 139, in loadLinks
    for b_word, b_defi in self._mdx.items():
  File ".../pyglossary/pyglossary/plugin_lib/readmdict.py", line 586, in _decode_record_block
    record_block = zlib.decompress(record_block_compressed[8:])
zlib.error: Error -3 while decompressing data: incorrect data check

[CRITICAL] Reading file 'CambridgeOnline.mdx' failed.
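
For context, "incorrect data check" is zlib's way of saying that the Adler-32 checksum stored at the end of the stream does not match the decompressed data, i.e. the block content itself is corrupt rather than mis-framed. A minimal sketch that reproduces the same error (illustrative only, not taken from this MDX file):

import zlib

good = zlib.compress(b"hello world")
# A zlib stream ends with a 4-byte Adler-32 checksum of the
# uncompressed data; zeroing it out corrupts the check.
bad = good[:-4] + b"\x00\x00\x00\x00"
try:
    zlib.decompress(bad)
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect data check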

Can you test it with a dictionary app that supports MDX, like BlueDict?

ilius (Owner) commented Aug 18, 2022

I tested it with the MDict app on Android and it works fine!

@xiaoqiangwang @csarron Any ideas?

xiaoqiangwang (Contributor) commented
There are two blocks in total on which zlib decompression fails. They are not truncated and look like valid zlib-compressed data. Without the original source text, it is hard to say whether the MDict app silently ignores the error.

Maybe it is worth regenerating the mdx file and checking again. Or modifying readmdict.py to skip the failed blocks:

diff --git a/pyglossary/plugin_lib/readmdict.py b/pyglossary/plugin_lib/readmdict.py
index 05b75d1a..3d6efb18 100644
--- a/pyglossary/plugin_lib/readmdict.py
+++ b/pyglossary/plugin_lib/readmdict.py
@@ -583,8 +583,11 @@ class MDX(MDict):
                        # zlib compression
                        elif record_block_type == b'\x02\x00\x00\x00':
                                # decompress
-                               record_block = zlib.decompress(record_block_compressed[8:])
-
+                               try:
+                                       record_block = zlib.decompress(record_block_compressed[8:])
+                               except zlib.error:
+                                       print("zlib decompress error")
+                                       continue
                        # notice that adler32 return signed value
                        assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
 
@@ -611,7 +614,7 @@ class MDX(MDict):
                                yield key_text, record
                        offset += len(record_block)
                        size_counter += compressed_size
-               assert(size_counter == record_block_size)
+               #assert(size_counter == record_block_size)
 
                f.close()

ilius (Owner) commented Aug 18, 2022

Thank you @xiaoqiangwang

@florentinovame
I pushed a commit that skips these few blocks and completes the conversion.
Please try again.

ghost (Author) commented Aug 19, 2022 via email

ghost (Author) commented Aug 19, 2022

For some unknown reason, using dictzip makes the dictionary useless. I know that you are not to blame, as the conversion process is flawless. My comment is a kind of warning to those who run up against this problem (dictzip).

ilius (Owner) commented Aug 19, 2022

MDX has internal compression (zlib / lzo), so running dictzip (which is basically gzip) on top of that won't change the size much.
I also haven't seen anyone do that.
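
(As a rough illustration of why: compressed data is statistically close to random bytes, and gzip cannot shrink random input. A minimal sketch, using os.urandom as a stand-in for an already-compressed MDX record block:)

import gzip
import os

# Stand-in for already-compressed data: random bytes are incompressible.
payload = os.urandom(1 << 20)
repacked = gzip.compress(payload)
print(len(payload), len(repacked))
# The "compressed" output comes out slightly *larger*: deflate falls
# back to stored blocks, plus the gzip header and trailer.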

ghost (Author) commented Aug 19, 2022

Sorry for being ambiguous... I meant compressing a DSL file, which is 2.4 GB, with the help of dictzip. It turned out that dictzip has a size limit of 1.8 GB.

ilius (Owner) commented Aug 19, 2022

> Sorry for being ambiguous... I meant compressing a DSL file, which is 2.4 GB, with the help of dictzip. It turned out that dictzip has a size limit of 1.8 GB.

Interesting. What error do you get when the file is too big?

ghost (Author) commented Aug 19, 2022

It silently creates a .dz file, which makes GoldenDict crash, and xarchiver cannot open the archive. If I try dictzip -d I get:

dictzip (dict_read_header): Internal error File position (86575) != header length + 1 (21039) Aborting dictzip...

ghost (Author) commented Aug 19, 2022

The manpage for dictzip says:

XLEN (which is specified earlier in the header) is a two byte integer, so the extra field can be 0xffff bytes long, 2 bytes of which are used for the subfield ID (SI1 and SI2), and 2 bytes of which are used for the subfield length (LEN). This leaves 0xfffb bytes (0x7ffd 2-byte entries or 0x3ffe 4-byte entries). Given that the zip output buffer must be 10% + 12 bytes larger than the input buffer, we can store 58969 bytes per entry, or about 1.8GB if the 2-byte entries are used. If this becomes a limiting factor, another format version can be selected and defined for 4-byte entries.

It all sounds Greek to me, but possibly you know how to implement the other format version with 4-byte entries.
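
(Working through the manpage's arithmetic, the "about 1.8GB" figure checks out, assuming the numbers quoted above:)

# Back-of-the-envelope from the dictzip(1) manpage figures:
entries = 0x7FFD        # 2-byte chunk entries that fit in the extra field
chunk = 58969           # uncompressed bytes stored per entry
print(entries * chunk)            # 1932120285 bytes
print(entries * chunk / 2 ** 30)  # ~1.80 GiB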

ilius (Owner) commented Aug 26, 2022

Thanks.
We don't support compressing with dictzip internally, though; we just run the dictzip program (if installed) in some cases.
So the most I can do is show a warning for large files.
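
(Something like this, perhaps; a minimal sketch, not actual PyGlossary code, using the ~1.93e9-byte estimate computed above from the manpage figures:)

import logging
import os
import subprocess

# Estimated format limit: 0x7ffd 2-byte entries * 58969 bytes per chunk.
DICTZIP_MAX_BYTES = 0x7FFD * 58969  # 1_932_120_285 bytes, ~1.8 GiB

def run_dictzip(path: str) -> None:
    if os.path.getsize(path) > DICTZIP_MAX_BYTES:
        logging.warning(
            "%s exceeds dictzip's ~1.8 GiB format limit; "
            "the resulting .dz file would likely be broken",
            path,
        )
        return
    subprocess.run(["dictzip", path], check=True)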

ilius (Owner) commented Aug 26, 2022

Can you find the exact number of bytes for the "about 1.8GB" limit?

ghost (Author) commented Aug 26, 2022

I am sorry, but I am not able to perform this calculation. As for the dictionary mentioned before, it turned out not to be a standalone dictionary but a combination of several reference sources. dictzip crashed long before the stated limit of 1.8 GB.

ilius closed this as completed Sep 7, 2022
ghost (Author) commented Oct 11, 2022 via email
