Doesn't support non-ASCII characters in resource paths/names #110

rkwright · 2014-09-04T17:12:44Z

A-Klimashevsky posted this issue on readium-sdk. I am closing that one as it is NOT specific to the SDK.
He wrote:
I have an EPUB book with Russian file names (e.g. assets/images/без названия.jpg). Readium doesn't read these files from the EPUB container.

rkwright · 2014-09-04T17:13:01Z

Can you share this EPUB with us? If the encoding is correct, it should just work, AFAIK. You can send it to me (rkwright@readium.org) or post it somewhere.

rkwright · 2014-09-04T17:13:22Z

I have verified this with both a Russian EPUB and a simple test file. If an image resource has non-ASCII characters in its name, we won't load it. But it's more complex than that. We DO load non-ASCII resources, both XHTML and images. See for example Kusamakura:
https://epub-samples.googlecode.com/files/kusamakura-japanese-vertical-writing-20121124.epub
We render it fine even though it has both XHTML and images with non-ASCII names. But the same images in my little test file don't work. That file is https://readiumfoundation.box.com/s/uf16y8x7f9jgp4ygxun9
There are two images in Chapter 2, one with Russian characters in its name, the other with Japanese. Neither works in the current build of the Chrome Extension, but both work perfectly in iBooks.

There must be some metadata or similar that we are looking for.

danielweck · 2014-09-04T17:17:46Z

FYI:
Tiny3.epub displays both the Japanese and Russian images in the cloud reader (tested Chrome + Safari on OS X).
2.15.1
readium-js-viewer@50be03dbc48a4746d21fe68b43ce5b1386933989 (with local changes)
readium-js@490a442ab8e0e3ca5f12f434448ac11aefdd9fcb
readium-shared-js@5b53aea6b5c3ebd4b4763d6ad4f34d1359cda6a6

danielweck · 2014-09-04T17:22:32Z

Follow-up: confirmed, broken in Chrome extension

rkwright · 2014-09-04T17:23:20Z

Ha! Just verified that it also works in the OSX-Launcher. So it looks like the failure is specific to the Chrome Extension.

danielweck · 2014-09-04T17:33:27Z

The Unicode filenames are messed up in Chrome's storage:

rkwright · 2014-09-04T17:53:45Z

I have also confirmed that it was not working correctly in 2.14.2 and 0.9.1. So it's not a regression. It's no doubt treating the filenames as though they are ASCII. As long as they are, we're OK...

ryanackley · 2014-09-04T18:48:15Z

So this appears to be an issue with the zip file. There is a flag that indicates the filenames should be decoded using UTF-8. It is missing in Tiny3.epub. I'm assuming this is a common issue for zip files since it works when we uncompress it using platform tools and the libraries of other clients.

The library we're using for the chrome extension is zip.js and after stepping through the code, it appears to be looking for the correct flag in the correct place. I could try and add a workaround for this. The question is: is this an issue for many books? Is this for v1? What did you use to compress tiny3.epub?

FYI, CloudReader has the same problem if you try to load directly from an epub file (instead of an exploded directory).

ryanackley · 2014-09-04T19:17:18Z

Tried unzipping on Windows using 7-zip and the Windows Explorer unzipper. They both mangle the names. I'm starting to think that tiny3.epub is just a bad zip file.

rkwright · 2014-09-04T21:26:52Z

I think it is a little more complicated than that, though you might be right. As I mentioned above, iBooks opens the files correctly without problems (as does OSX Launcher and the CloudReader). In addition, EPUBCheck opens the files and validates them without problems. It DOES warn that there are non-ASCII characters in the filenames and suggests that they be changed. But it does so (apparently) not because it couldn't read them, but because there is concern that other tools may not handle the non-ASCII filenames correctly. How is it that the CloudReader and the Launcher correctly handle the names? It suggests that they correctly detect the non-ASCII nature of the filenames and then interpret them correctly.

Bear in mind that this thread started because a user had a file that had non-ASCII filenames that our Chrome Extension wouldn't load. I think we need to think about that as opposed to saying that EPUBs must have this flag. Not saying that it wouldn't be best to author the EPUB with the flag on, but telling authors to create their EPUBs with certain flags has not worked well for me in the past.

ryanackley · 2014-09-04T22:48:26Z

I'm referring to a flag in the zip file format not the epub format. Epubs (files with .epub extensions) are packaged as zip files.

In the chrome extension, we use a third party library to unzip epubs. When that library attempts to unzip tiny3.epub, it looks for a bit to be set at the zip file level for each file it extracts. If this bit is set, it treats the filename as UTF-8 encoded; if it is not, it treats it as ASCII encoded. This behavior conforms to the zip file format specification linked above. In Tiny3.epub, this bit is not set for any of the compressed files. Therefore, as it extracts each file, it decodes the filename as ASCII and stores the file in the chrome filesystem under that name. The file ends up stored with the wrong name because the wrong encoding was used, so when it's referenced from a book, it can't be found.
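
To make the flag check concrete, here is a minimal Python sketch (not the zip.js code the extension uses) that reports, per entry, whether the filename-encoding bit is set. It assumes the bit in question is general purpose bit 11 (0x800) from the ZIP specification. The function name `utf8_flag_report` is made up for this illustration.

```python
import io
import zipfile
from typing import Dict

# General purpose bit 11: when set, the entry's filename is UTF-8 encoded.
UTF8_NAME_FLAG = 0x800


def utf8_flag_report(zip_bytes: bytes) -> Dict[str, bool]:
    """Map each entry name to True if its UTF-8 filename flag is set.

    Hypothetical helper for illustration only. Names are reported as
    Python's zipfile module decodes them, so entries without the flag
    may already appear mangled in the keys.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in zf.infolist()
        }
```

Running this over an EPUB read into memory (for example `utf8_flag_report(open(path, "rb").read())`) would show at a glance which entries an unzipper that follows the specification will decode as UTF-8.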

As to the reason it works in iBooks and OSX Launcher, I believe that OS X just handles this differently. It does not work in the Cloud Reader unless you have an already unzipped epub. Presumably unzipped on OS X. As I mentioned earlier, unzipping Tiny3.epub on Windows 7 mangled the file names that contain Unicode characters just like the chrome extension. This is the case using the built-in unzip tool and 7-zip. 7-zip is probably the most popular unzipper for Windows out there.

This is fixable, I just want you to understand the problem and what is involved in fixing this. Seems like it could be easy but it could also be a rabbit hole.
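
As a rough idea of what such a workaround might look like, here is a Python sketch under stated assumptions, not a patch for zip.js. It assumes the unzipper fell back to the ZIP format's legacy encoding, IBM Code Page 437 (a superset of ASCII), and that the original bytes were really UTF-8. `repair_name` is a hypothetical helper name.

```python
def repair_name(name: str) -> str:
    """Undo a CP437 mis-decoding of UTF-8 filename bytes, when possible.

    Heuristic only: if the name cannot be mapped back to bytes via CP437,
    or those bytes are not valid UTF-8, the name is returned unchanged.
    """
    if name.isascii():
        # Pure ASCII reads the same in both encodings; nothing to repair.
        return name
    try:
        # Recover the raw bytes, then decode them as UTF-8.
        return name.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


# Simulate what a flag-respecting unzipper produces for an unflagged entry.
mangled = "без названия.jpg".encode("utf-8").decode("cp437")
restored = repair_name(mangled)
```

The rabbit hole is that this is a guess: a name genuinely written in CP437 could happen to form valid UTF-8 and be "repaired" incorrectly, so a real fix would have to decide when to trust the heuristic.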

rkwright · 2014-09-05T12:32:03Z

OK. My files were compressed using epubcrude, which uses the standard Java ZipOutputStream. As a stream, it apparently doesn't pay any attention to encodings. You have to use the *Writer form of Zip support. I am in the midst of rewriting epubcrude to use the e4 framework. I'll fix this bug in the process. Otherwise, let's just document that the flag needs to be set. I'll leave the issue open to flag that.
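
For documentation purposes, here is a small Python sketch of the authoring side. It is not epubcrude's Java code. It relies on Python's standard `zipfile` module, which sets the UTF-8 filename flag (general purpose bit 11, 0x800) when an entry name is not pure ASCII, and then reads the flag back from the first local file header. The helper names are hypothetical.

```python
import io
import struct
import zipfile

UTF8_NAME_FLAG = 0x800            # general purpose bit 11
LOCAL_HEADER_SIGNATURE = 0x04034B50


def build_archive(entries: dict) -> bytes:
    """Write {name: bytes} into an in-memory zip and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def first_entry_flags(zip_bytes: bytes) -> int:
    """Return the general purpose flags of the first local file header."""
    # Local header layout: signature (4 bytes), version needed (2), flags (2).
    signature, _version, flags = struct.unpack_from("<IHH", zip_bytes, 0)
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError("data does not start with a local file header")
    return flags


flags = first_entry_flags(build_archive({"без названия.jpg": b"image bytes"}))
has_utf8_flag = bool(flags & UTF8_NAME_FLAG)
```

A check like `first_entry_flags` could be run against any packaging tool's output to confirm the flag is present before publishing an EPUB with non-ASCII filenames.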

In any case, it is strongly recommended by IDPF/EPUB that filenames be ASCII. If you run any of these files with non-ASCII names through EPUBCheck, it warns you that you have non-ASCII characters in the filename and recommends that you fix it.

rkwright · 2014-10-24T17:07:48Z

I have logged the issue against epubcrude, so I am closing this one:

rkwright/epubcrude#3

rkwright added bug labels Sep 4, 2014

rkwright added this to the v1 milestone Sep 4, 2014

rkwright assigned ryanackley Sep 4, 2014

danielweck modified the milestones: v1+, v1 Sep 9, 2014

rkwright added the ReadiumJS label Sep 18, 2014

rkwright closed this as completed Oct 24, 2014

rkwright removed priority high [critical] labels Oct 24, 2014
