Doesn't support non-ASCII characters in resource paths/names #110

Closed
rkwright opened this issue Sep 4, 2014 · 13 comments

@rkwright
Contributor

rkwright commented Sep 4, 2014

A-Klimashevsky posted this issue on readium-sdk. I am closing that one as it is NOT specific to the SDK
He wrote:
I have an EPUB book with Russian file names (e.g. assets/images/без названия.jpg). Readium doesn't read these files from the EPUB container.

@rkwright
Contributor Author

rkwright commented Sep 4, 2014

Can you share this EPUB with us? If the encoding is correct, it should just work, AFAIK. You can send it to me (rkwright@readium.org) or post it somewhere.

@rkwright
Contributor Author

rkwright commented Sep 4, 2014

I have verified this with both a Russian EPUB and a simple test file. If an image resource has a non-ASCII filename, we won't load it. But it's more complex than that: we DO load some non-ASCII resources, both XHTML and images. See, for example, Kusamakura:
https://epub-samples.googlecode.com/files/kusamakura-japanese-vertical-writing-20121124.epub
We render it fine even though it has both XHTML and images with non-ASCII names. But the same images in my little test file don't work. That file is https://readiumfoundation.box.com/s/uf16y8x7f9jgp4ygxun9
There are two images in Chapter 2, one with Russian characters in its name, the other with Japanese. Neither works in the current build of the Chrome Extension, but both work perfectly in iBooks.

There must be some metadata or similar that we are looking for.

@rkwright rkwright added this to the v1 milestone Sep 4, 2014
@danielweck
Member

FYI:
Tiny3.epub displays both the Japanese and Russian images in the cloud reader (tested Chrome + Safari on OS X).
2.15.1
readium-js-viewer@50be03dbc48a4746d21fe68b43ce5b1386933989 (with local changes)
readium-js@490a442ab8e0e3ca5f12f434448ac11aefdd9fcb
readium-shared-js@5b53aea6b5c3ebd4b4763d6ad4f34d1359cda6a6

@danielweck
Member

Follow-up: confirmed, broken in Chrome extension

@rkwright
Contributor Author

rkwright commented Sep 4, 2014

Ha! Just verified that it also works in the OSX-Launcher. So it looks like the failure is specific to the Chrome Extension.

@danielweck
Member

The Unicode filenames are messed up in Chrome's storage:

[Screenshot: Unicode file paths in the Chrome extension's storage]

@rkwright
Contributor Author

rkwright commented Sep 4, 2014

I have also confirmed that it was not working correctly in 2.14.2 and 0.9.1, so it's not a regression. It's no doubt treating the filenames as though they were ASCII. As long as they are, we're OK...

@ryanackley
Contributor

So this appears to be an issue with the zip file. There is a flag that indicates the filenames should be decoded using UTF-8. It is missing in Tiny3.epub. I'm assuming this is a common issue for zip files since it works when we uncompress it using platform tools and the libraries of other clients.

The library we're using for the chrome extension is zip.js and after stepping through the code, it appears to be looking for the correct flag in the correct place. I could try and add a workaround for this. The question is: is this an issue for many books? Is this for v1? What did you use to compress tiny3.epub?

FYI, CloudReader has the same problem if you try to load directly from an epub file (instead of an exploded directory).
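
For anyone who wants to check their own files, here is a minimal sketch in Python (not the zip.js code path the extension uses). It lists entries whose names are not plain ASCII but which lack the UTF-8 flag. It assumes the flag in question is bit 11 (0x800) of the zip general purpose bit flag; the function name `entries_missing_utf8_flag` is made up for this illustration.

```python
import io
import zipfile

# Assumption: the "filenames are UTF-8" flag is bit 11 of the
# general purpose bit flag stored with each zip entry.
UTF8_FLAG = 0x800


def entries_missing_utf8_flag(zip_bytes):
    """Return entry names that are non-ASCII but do not carry the UTF-8 flag.

    The names come back as Python decoded them (a legacy code page when the
    flag is absent), so they will usually look mangled. That is expected:
    the point is only to spot which entries are affected.
    """
    suspicious = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            flag_set = bool(info.flag_bits & UTF8_FLAG)
            if not flag_set and not info.filename.isascii():
                suspicious.append(info.filename)
    return suspicious
```

Usage would be something like `entries_missing_utf8_flag(open("Tiny3.epub", "rb").read())`. An empty list means every non-ASCII name is flagged as UTF-8.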

@ryanackley
Contributor

Tried unzipping on Windows using 7-zip and the Windows Explorer unzipper. They both mangle the names. I'm starting to think that tiny3.epub is just a bad zip file.

@rkwright
Contributor Author

rkwright commented Sep 4, 2014

I think it is a little more complicated than that, though you might be right. As I mentioned above, iBooks opens the files correctly without problems (as does OSX Launcher and the CloudReader). In addition, EPUBCheck opens the files and validates them without problems. It DOES warn that there are non-ASCII characters in the filenames and suggests that they be changed. But it does so (apparently) not because it couldn't read them, but because there is concern that other tools may not handle the non-ASCII filenames correctly. How is it that the CloudReader and the Launcher correctly handle the names? It suggests that they correctly detect the non-ASCII nature of the filenames and then interpret them correctly.

Bear in mind that this thread started because a user had a file that had non-ASCII filenames that our Chrome Extension wouldn't load. I think we need to think about that as opposed to saying that EPUBs must have this flag. Not saying that it wouldn't be best to author the EPUB with the flag on, but telling authors to create their EPUBs with certain flags has not worked well for me in the past.

@ryanackley
Contributor

I'm referring to a flag in the zip file format not the epub format. Epubs (files with .epub extensions) are packaged as zip files.

In the Chrome extension, we use a third-party library to unzip EPUBs. When that library unzips Tiny3.epub, it checks a bit at the zip file level for each file it extracts. If this bit is set, it treats the filename as UTF-8 encoded; if not, it treats it as ASCII encoded. This behavior conforms to the zip file format specification linked above. In Tiny3.epub, this bit is not set for any of the compressed files. Therefore, as it extracts each file, it decodes the filename as ASCII and stores the file in the Chrome filesystem under that name. At this point, the file is stored in the Chrome extension with the wrong name because the wrong encoding was used, so when the book references it, it can't be found.
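
To make the mismatch concrete, here is a small Python sketch of the decoding step described above. It is an illustration, not the zip.js implementation. One assumption to flag: it uses IBM code page 437 for the "flag not set" case, which is the legacy encoding the zip application note describes; for pure-ASCII names that gives the same result as ASCII. The helper name `stored_name` is hypothetical.

```python
def stored_name(raw_name: bytes, utf8_flag: bool) -> str:
    """Decode a zip entry name the way a spec-following reader would.

    Assumption: without the UTF-8 flag the legacy code page is CP437.
    """
    return raw_name.decode("utf-8" if utf8_flag else "cp437")


# The bytes actually written into the archive by a tool that emits UTF-8
# names but never sets the flag.
href = "assets/images/без названия.jpg"
raw = href.encode("utf-8")

with_flag = stored_name(raw, utf8_flag=True)      # matches the book's reference
without_flag = stored_name(raw, utf8_flag=False)  # mangled, so the lookup misses
```

The book still refers to `href`, but the file was saved under `without_flag`, which is why the image cannot be found.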

As to the reason it works in iBooks and OSX Launcher, I believe that OS X just handles this differently. It does not work in the Cloud Reader unless you have an already unzipped epub. Presumably unzipped on OS X. As I mentioned earlier, unzipping Tiny3.epub on Windows 7 mangled the file names that contain Unicode characters just like the chrome extension. This is the case using the built-in unzip tool and 7-zip. 7-zip is probably the most popular unzipper for Windows out there.

This is fixable, I just want you to understand the problem and what is involved in fixing this. Seems like it could be easy but it could also be a rabbit hole.
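
One possible shape for such a workaround, sketched in Python rather than in zip.js, and only as an assumption about what a fix could look like: when the flag is missing, try a strict UTF-8 decode first and fall back to the legacy code page only if that fails. The function name `decode_entry_name` and the choice of CP437 as the fallback are illustrative.

```python
def decode_entry_name(raw_name: bytes, utf8_flag: bool) -> str:
    """Heuristic filename decoding for archives that omit the UTF-8 flag."""
    if utf8_flag:
        return raw_name.decode("utf-8")
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8,
        # which most genuine legacy-encoded non-ASCII names are not.
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed legacy encoding when the name is not valid UTF-8.
        return raw_name.decode("cp437")
```

This is where the rabbit hole starts: a legacy-encoded name can occasionally also be valid UTF-8, and archives made on other systems may use code pages other than CP437, so the heuristic can guess wrong.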

@rkwright
Contributor Author

rkwright commented Sep 5, 2014

OK. My files were compressed using epubcrude, which uses the standard Java ZipOutputStream. As a stream, it apparently doesn't pay any attention to encodings. You have to use the *Writer form of Zip support. I am in the midst of rewriting epubcrude to use the e4 framework. I'll fix this bug in the process. Otherwise, let's just document that the flag needs to be set. I'll leave the issue open to flag that.

In any case, it is strongly recommended by IDPF/EPUB that filenames be ASCII. If you run any of these files with non-ASCII names through EPUBCheck, it warns you that you have non-ASCII characters in the filename and recommends that you fix it.
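
As a way to verify that a packaging tool sets the flag, here is a short Python sketch. Python is used purely for illustration, since epubcrude itself is Java. It writes an archive with one non-ASCII entry name and reads the flag back. It relies on Python's `zipfile` module setting bit 11 (0x800) automatically for names that are not pure ASCII; the helper names `build_archive` and `utf8_flags` are made up, and the output is not a complete, valid EPUB.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def build_archive(entries):
    """Write the given {name: bytes} entries into an in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def utf8_flags(zip_bytes):
    """Map each entry name to whether its UTF-8 flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {i.filename: bool(i.flag_bits & UTF8_FLAG) for i in zf.infolist()}


flags = utf8_flags(build_archive({
    "mimetype": b"application/epub+zip",
    "assets/images/без названия.jpg": b"placeholder image bytes",
}))
```

Here `flags` should report the flag as set for the Russian-named image and unset for `mimetype`, whose name is plain ASCII and needs no flag.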

@danielweck danielweck modified the milestones: v1+, v1 Sep 9, 2014
@rkwright
Contributor Author

I have logged the issue against epubcrude so am closing this one

rkwright/epubcrude#3
