Doesn't support non-ASCII characters in resource paths/names #110
Comments
Can you share this EPUB with us? If the encoding is correct, it should just work, AFAIK. You can send it to me (rkwright@readium.org) or post it somewhere. |
I have verified this with both a Russian EPUB and a simple test file. If the image resource's filename contains non-ASCII characters, we won't load it. But it's more complex than that. We DO load non-ASCII resources, both XHTML and images. See, for example, Kusamakura. There must be some metadata or similar that we are looking for. |
FYI: |
Follow-up: confirmed, broken in Chrome extension |
Ha! Just verified that it also works in the OSX-Launcher. So it looks like the failure is specific to the Chrome Extension. |
I have also confirmed that it was not working correctly in 2.14.2 and 0.9.1, so it's not a regression. It's no doubt treating the filenames as though they are ASCII. As long as they are, we're OK... |
So this appears to be an issue with the zip file. There is a flag that indicates the filenames should be decoded using UTF-8. It is missing in Tiny3.epub. I'm assuming this is a common issue for zip files, since it works when we uncompress it using platform tools and the libraries of other clients. The library we're using for the Chrome extension is zip.js and, after stepping through the code, it appears to be looking for the correct flag in the correct place. I could try to add a workaround for this. The question is: is this an issue for many books? Is this for v1? What did you use to compress Tiny3.epub? FYI, CloudReader has the same problem if you try to load directly from an EPUB file (instead of an exploded directory). |
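For anyone who wants to check their own files for the missing flag, here is a minimal sketch in Python (not the zip.js code path used by the extension). It assumes the flag in question is general purpose bit 11 (0x800) of each entry, and that Python's `zipfile` decodes unflagged names as CP437 by default; the helper name `entries_missing_utf8_flag` is made up for illustration.

```python
import zipfile

# General purpose bit 11: the entry's filename is declared to be UTF-8.
UTF8_FLAG = 0x800


def entries_missing_utf8_flag(path_or_file):
    """List entries whose names contain non-ASCII bytes but lack the UTF-8 flag.

    Without the flag, zipfile decodes the raw name bytes as CP437, so any
    byte >= 0x80 shows up as a non-ASCII character in info.filename.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if not (info.flag_bits & UTF8_FLAG) and not info.filename.isascii()
        ]
```

An empty list means every non-ASCII name is properly flagged; any names returned will look mangled, because they were decoded with the wrong encoding.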
Tried unzipping on Windows using 7-zip and the Windows Explorer unzipper. They both mangle the names. I'm starting to think that tiny3.epub is just a bad zip file. |
I think it is a little more complicated than that, though you might be right. As I mentioned above, iBooks opens the files correctly without problems (as does OSX Launcher and the CloudReader). In addition, EPUBCheck opens the files and validates them without problems. It DOES warn that there are non-ASCII characters in the filenames and suggests that they be changed. But it does so (apparently) not because it couldn't read them, but because there is concern that other tools may not handle the non-ASCII filenames correctly. How is it that the CloudReader and the Launcher correctly handle the names? It suggests that they correctly detect the non-ASCII nature of the filenames and then interpret them correctly. Bear in mind that this thread started because a user had a file that had non-ASCII filenames that our Chrome Extension wouldn't load. I think we need to think about that as opposed to saying that EPUBs must have this flag. Not saying that it wouldn't be best to author the EPUB with the flag on, but telling authors to create their EPUBs with certain flags has not worked well for me in the past. |
I'm referring to a flag in the zip file format, not the EPUB format. EPUBs (files with .epub extensions) are packaged as zip files. In the Chrome extension, we use a third-party library to unzip EPUBs. When that library attempts to unzip Tiny3.epub, it looks for a bit to be set at the zip file level for each file it extracts. If this bit is set, it will treat the filename as UTF-8 encoded; if it is not, it treats it as ASCII encoded. This behavior conforms to the zip file format specification linked above. In Tiny3.epub, this bit is not set for any of the compressed files. Therefore, as it extracts each file, it decodes the filenames as ASCII. Each file is then stored in the Chrome filesystem under the resulting name, which is wrong because the wrong encoding was used. Therefore, when the file is referenced from a book, it can't be found. As to the reason it works in iBooks and OSX Launcher, I believe that OS X just handles this differently. It does not work in the Cloud Reader unless you have an already unzipped EPUB, presumably unzipped on OS X. As I mentioned earlier, unzipping Tiny3.epub on Windows 7 mangled the filenames that contain Unicode characters, just like the Chrome extension. This is the case using both the built-in unzip tool and 7-zip, which is probably the most popular unzipper for Windows out there. This is fixable; I just want you to understand the problem and what is involved in fixing it. Seems like it could be easy, but it could also be a rabbit hole. |
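To make the possible workaround concrete, here is a rough sketch in Python rather than zip.js, so it illustrates the idea only and is not a patch for the extension. It assumes Python's `zipfile` default of decoding unflagged names as CP437, then tries the raw bytes as UTF-8 and falls back to the original name if that fails. The helpers `recover_name` and `read_by_logical_name` are hypothetical names.

```python
import zipfile

# General purpose bit 11: the entry's filename is declared to be UTF-8.
UTF8_FLAG = 0x800


def recover_name(info):
    """Best-effort guess at the intended filename of a zip entry."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded as UTF-8 by zipfile
    # Undo zipfile's default CP437 decoding to get the raw name bytes back.
    raw = info.filename.encode("cp437")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return info.filename  # not valid UTF-8, keep the CP437 reading


def read_by_logical_name(zf, wanted):
    """Read the entry whose recovered name matches the name used in the book."""
    for info in zf.infolist():
        if recover_name(info) == wanted:
            return zf.read(info)
    raise KeyError(wanted)
```

This is a heuristic: a name that really was written in a legacy code page could, by coincidence, also be valid UTF-8 and be misread. That is part of why this could turn into a rabbit hole.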
OK. My files were compressed using epubcrude, which uses the standard Java ZipOutputStream. As a stream, it apparently doesn't pay any attention to encodings. You have to use the *Writer form of Zip support. I am in the midst of rewriting epubcrude to use the e4 framework. I'll fix this bug in the process. Otherwise, let's just document that the flag needs to be set. I'll leave the issue open to flag that. In any case, it is strongly recommended by IDPF/EPUB that filenames be ASCII. If you run any of these files with non-ASCII names through EPUBCheck, it warns you that you have non-ASCII characters in the filename and recommends that you fix it. |
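As a point of comparison for the authoring side, here is a small sketch in Python (not the Java used by epubcrude) of writing entries so that the UTF-8 flag ends up set. It relies on `zipfile` encoding non-ASCII names as UTF-8 and setting bit 11 (0x800) on its own when it writes them. `pack_entries` and `utf8_flags` are illustrative helpers, and the output is a plain zip, not a complete, valid EPUB container.

```python
import io
import zipfile

# General purpose bit 11: the entry's filename is declared to be UTF-8.
UTF8_FLAG = 0x800


def pack_entries(entries):
    """Write a {name: bytes} mapping into an in-memory zip and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def utf8_flags(zip_bytes):
    """Map each entry name to whether its UTF-8 filename flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }
```

Running `utf8_flags` over a packaged file is a quick way to confirm that the flag made it into the archive before shipping it.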
I have logged the issue against epubcrude, so I am closing this one. |
A-Klimashevsky posted this issue on readium-sdk. I am closing that one, as it is NOT specific to the SDK.
He wrote:
I have an EPUB book with Russian file names (e.g. assets/images/без названия.jpg). Readium doesn't read these files from the EPUB container.