a tricky mdx file #384

Closed
ghost opened this issue Aug 15, 2022 · 15 comments

ghost commented Aug 15, 2022

Thank you for your wonderful software.
I managed to come across an MDict file which fails to convert to an ifo file.
Here is the file in question.
Whenever I run pyglossary, it gets stuck at [INFO] extracting links...

ilius (Owner) commented Aug 18, 2022

Looks like the file is problematic.
I get this error:

[INFO] extracting links...
unhandled exception:
Traceback (most recent call last):
  File ".../pyglossary/pyglossary/glossary.py", line 613, in _read
    reader.open(filename)
  File ".../pyglossary/pyglossary/plugins/octopus_mdict_new/__init__.py", line 131, in open
    self.loadLinks()
  File ".../pyglossary/pyglossary/plugins/octopus_mdict_new/__init__.py", line 139, in loadLinks
    for b_word, b_defi in self._mdx.items():
  File ".../pyglossary/pyglossary/plugin_lib/readmdict.py", line 586, in _decode_record_block
    record_block = zlib.decompress(record_block_compressed[8:])
zlib.error: Error -3 while decompressing data: incorrect data check

[CRITICAL] Reading file 'CambridgeOnline.mdx' failed.
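
For context, "incorrect data check" is zlib's way of saying that the Adler-32 checksum stored at the end of the stream does not match the decompressed data, i.e. the block content itself is corrupt rather than mis-framed. A minimal sketch that reproduces the same error (illustrative only, not taken from this MDX file):

import zlib

good = zlib.compress(b"hello world")
# A zlib stream ends with a 4-byte Adler-32 checksum of the
# uncompressed data; zeroing it out corrupts the check.
bad = good[:-4] + b"\x00\x00\x00\x00"
try:
    zlib.decompress(bad)
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect data check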

Can you test it with a dictionary app that supports MDX, like BlueDict?

ilius (Owner) commented Aug 18, 2022

I tested it with the MDict app on Android and it works fine!

@xiaoqiangwang @csarron Any ideas?

xiaoqiangwang (Contributor) commented
There are two blocks in total on which zlib decompression fails. They are not truncated and look like valid zlib-compressed data. Without the original source text, it is hard to say whether the MDict app silently ignores the error.

Maybe it is worth regenerating the mdx file and checking again. Or modifying readmdict.py to skip the failed blocks:

diff --git a/pyglossary/plugin_lib/readmdict.py b/pyglossary/plugin_lib/readmdict.py
index 05b75d1a..3d6efb18 100644
--- a/pyglossary/plugin_lib/readmdict.py
+++ b/pyglossary/plugin_lib/readmdict.py
@@ -583,8 +583,11 @@ class MDX(MDict):
                        # zlib compression
                        elif record_block_type == b'\x02\x00\x00\x00':
                                # decompress
-                               record_block = zlib.decompress(record_block_compressed[8:])
-
+                               try:
+                                       record_block = zlib.decompress(record_block_compressed[8:])
+                               except zlib.error:
+                                       print("zlib decompress error")
+                                       continue
                        # notice that adler32 return signed value
                        assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
 
@@ -611,7 +614,7 @@ class MDX(MDict):
                                yield key_text, record
                        offset += len(record_block)
                        size_counter += compressed_size
-               assert(size_counter == record_block_size)
+               #assert(size_counter == record_block_size)
 
                f.close()

ilius (Owner) commented Aug 18, 2022

Thank you @xiaoqiangwang

@florentinovame
I pushed a commit that skips these few blocks and completes the conversion.
Please try again.

ghost (Author) commented Aug 19, 2022 via email

ghost (Author) commented Aug 19, 2022

For some unknown reason, using dictzip makes the dictionary useless. I know that you are not to blame, as the conversion process is flawless. My comment is a kind of warning to those who run up against this problem (dictzip).

ilius (Owner) commented Aug 19, 2022

MDX has internal compression (zlib / lzo), so running dictzip (which is basically gzip) on top of that won't change the size much.
I also haven't seen anyone do that.
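
(As a rough illustration of why: compressed data is statistically close to random bytes, and gzip cannot shrink random input. A minimal sketch, using os.urandom as a stand-in for an already-compressed MDX record block:)

import gzip
import os

# Stand-in for already-compressed data: random bytes are incompressible.
payload = os.urandom(1 << 20)
repacked = gzip.compress(payload)
print(len(payload), len(repacked))
# The "compressed" output comes out slightly *larger*: deflate falls
# back to stored blocks, plus the gzip header and trailer.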

ghost (Author) commented Aug 19, 2022

Sorry for being ambiguous... I meant compressing a DSL file, which is 2.4 GB, with the help of dictzip. It turned out that dictzip has a size limit of 1.8 GB.

ilius (Owner) commented Aug 19, 2022

> Sorry for being ambiguous... I meant compressing a DSL file, which is 2.4 GB, with the help of dictzip. It turned out that dictzip has a size limit of 1.8 GB.

Interesting. What error do you get when the file is too big?

ghost (Author) commented Aug 19, 2022

It silently creates a .dz file, which makes GoldenDict crash, and xarchiver cannot open the archive. If I try dictzip -d I get:

dictzip (dict_read_header): Internal error File position (86575) != header length + 1 (21039) Aborting dictzip...

ghost (Author) commented Aug 19, 2022

The manpage for dictzip says:

XLEN (which is specified earlier in the header) is a two byte integer, so the extra field can be 0xffff bytes long, 2 bytes of which are used for the subfield ID (SI1 and SI2), and 2 bytes of which are used for the subfield length (LEN). This leaves 0xfffb bytes (0x7ffd 2-byte entries or 0x3ffe 4-byte entries). Given that the zip output buffer must be 10% + 12 bytes larger than the input buffer, we can store 58969 bytes per entry, or about 1.8GB if the 2-byte entries are used. If this becomes a limiting factor, another format version can be selected and defined for 4-byte entries.

It all sounds Greek to me, but possibly you know how to implement the other format version with 4-byte entries.
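
(Working through the manpage's arithmetic, the "about 1.8GB" figure checks out, assuming the numbers quoted above:)

# Back-of-the-envelope from the dictzip(1) manpage figures:
entries = 0x7FFD        # 2-byte chunk entries that fit in the extra field
chunk = 58969           # uncompressed bytes stored per entry
print(entries * chunk)            # 1932120285 bytes
print(entries * chunk / 2 ** 30)  # ~1.80 GiB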

ilius (Owner) commented Aug 26, 2022

Thanks.
We don't support compressing with dictzip internally, though; we just run the dictzip program (if installed) in some cases.
So the most I can do is show a warning for large files.
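
(Something like this, perhaps; a minimal sketch, not actual PyGlossary code, using the ~1.93e9-byte estimate computed above from the manpage figures:)

import logging
import os
import subprocess

# Estimated format limit: 0x7ffd 2-byte entries * 58969 bytes per chunk.
DICTZIP_MAX_BYTES = 0x7FFD * 58969  # 1_932_120_285 bytes, ~1.8 GiB

def run_dictzip(path: str) -> None:
    if os.path.getsize(path) > DICTZIP_MAX_BYTES:
        logging.warning(
            "%s exceeds dictzip's ~1.8 GiB format limit; "
            "the resulting .dz file would likely be broken",
            path,
        )
        return
    subprocess.run(["dictzip", path], check=True)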

ilius (Owner) commented Aug 26, 2022

Can you find the exact number of bytes for the "about 1.8GB" limit?

ghost (Author) commented Aug 26, 2022

I am sorry, but I am not able to perform this calculation. As for the dictionary mentioned before, it turned out not to be a standalone dictionary but a combination of several reference sources. dictzip crashed long before the stated limit of 1.8 GB.

ilius closed this as completed Sep 7, 2022
ghost (Author) commented Oct 11, 2022 via email
