Broken internal links when converting epub #10207
I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier |
You're right. I didn't upload it myself, I just looked for one where I was getting similar errors. Turns out this one had broken links even in the epub. I'll try the nightly release later. |
Unfortunately, the issue is not solved in the original file I was initially interested in. There are 237 missing-label errors, even after trying to add auto_identifiers. The parts after Edit: Version 3.4-nightly-2024-09-23 |
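One way to see which labels are missing from a generated .typ file is to compare the labels it references with the labels it defines. A rough sketch in Python, assuming references look like label("...") and definitions look like <...>, as in the output quoted in this issue. The function name dangling_labels and the sample text are made up for illustration; this is not part of pandoc or Typst.

```python
import re

# Assumption: link targets are written as label("name") ...
REF_RE = re.compile(r'label\("([^"]+)"\)')
# ... and label definitions as <name>. Other label syntaxes are not handled.
DEF_RE = re.compile(r'<([A-Za-z0-9_\-.:#]+)>')


def dangling_labels(typst_source: str) -> list[str]:
    """Return the sorted names that are referenced but never defined."""
    defined = set(DEF_RE.findall(typst_source))
    referenced = REF_RE.findall(typst_source)
    return sorted({name for name in referenced if name not in defined})


# Illustrative sample modelled on the snippets in this issue.
sample = '''
= Installation <ch01-01-installation.html>
See #link(label("ch01-01-installation.html#troubleshooting"))[Troubleshooting]
and #link(label("ch01-01-installation.html"))[the chapter].
'''

print(dangling_labels(sample))
# prints ['ch01-01-installation.html#troubleshooting']
```

Running this over the real output (read the file with open(path, encoding="utf-8").read()) gives a list of anchors to look up in the EPUB. The regular expressions are deliberately simple and may over-match angle-bracketed text inside raw blocks.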
What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id |
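Instead of unzipping by hand, the lookup can be scripted. A minimal sketch using only the Python standard library, assuming the content files inside the EPUB end in .html, .xhtml, or .htm; find_id is a hypothetical helper name, and the demo archive is a one-file stand-in, not a real EPUB.

```python
import io
import zipfile
from html.parser import HTMLParser


class _IdFinder(HTMLParser):
    """Collects the tag names of elements whose id equals wanted_id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.tags.append(tag)


def find_id(epub, wanted_id):
    """Return (file name, tag name) pairs for every element carrying wanted_id.

    `epub` may be a path or a binary file-like object.
    """
    hits = []
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".html", ".xhtml", ".htm")):
                continue
            finder = _IdFinder(wanted_id)
            finder.feed(archive.read(name).decode("utf-8", errors="replace"))
            hits.extend((name, tag) for tag in finder.tags)
    return hits


# Demo on an in-memory archive (illustrative file name and markup).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "part0111.html",
        '<html><body><p class="toc1" id="pre">'
        '<a href="part0007.html#pre">Prelude</a></p></body></html>',
    )
demo.seek(0)

print(find_id(demo, "pre"))
# prints [('part0111.html', 'p')]
```

For a real book, call find_id("book.epub", "some-id") with the path and the anchor name from the error message. The reported tag name shows what kind of element the anchor sits on.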
If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better). |
There is code that should be changing these ids. At least the |
I don't mind sending it to you for troubleshooting purposes, but I can't post it on GitHub for obvious reasons. Edit: Sent via email. Hopefully the large attachment doesn't cause issues. |
I just tested with LaTeX output instead, and that has the same issue, so it's not only the Typst writer. |
Most of the writers won't pay any attention to an identifier attribute on a Link element. |
Converting to HTML does work, but that's not that surprising (since EPUB is HTML-based). |
Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute. |
That makes sense. |
If you want to email me the epub, I can look into it further. At least the identifier on the heading should work. |
I did email you the epub, as I described above. Though it may have vanished into the void because of the large (11 MB) attachment. |
ok, found it in junk folder. |
OK, here's one example.
So I look in part0111.html in the epub, and here's where the anchor is:
<p class="toc1" id="pre"><a href="part0007.html#pre" class="calibre1">Prelude to the Stormlight Archive</a></p>
Pandoc doesn't put attributes on Para elements, so this identifier was lost in the parsing stage. The other cases I've looked at are like this. Links to headings, tables, figures, divs, and spans should work fine. Anything else pandoc is going to drop, but those are the lion's share of real uses. Probably this can be closed. |
For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst. |
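The suggestion above is a Lua filter; the same idea can be sketched as a Python script over pandoc's JSON AST, using only the standard library. It assumes internal links are exactly the Link nodes whose target starts with "#" (inferred from the Typst output quoted in this issue, not verified against every reader), and it keeps the link text while dropping the link itself. The function names and the sample document are illustrative.

```python
import json
import sys


def is_internal_link(node):
    """True for a pandoc Link node whose target starts with '#' (assumption)."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Link"
        and node["c"][2][0].startswith("#")
    )


def unwrap_internal_links(node):
    """Return a copy of the AST with internal links replaced by their contents."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_internal_link(child):
                # A Link's "c" is [attr, inlines, [url, title]]; splice in the inlines.
                result.extend(unwrap_internal_links(child["c"][1]))
            else:
                result.append(unwrap_internal_links(child))
        return result
    if isinstance(node, dict):
        return {key: unwrap_internal_links(value) for key, value in node.items()}
    return node


def main():
    """Read a pandoc JSON document from stdin and write the filtered one to stdout."""
    json.dump(unwrap_internal_links(json.load(sys.stdin)), sys.stdout)


# Small hand-written document: one internal link, one external link.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Link", "c": [["", [], []],
                            [{"t": "Str", "c": "Prelude"}],
                            ["#part0007.html#pre", ""]]},
        {"t": "Space"},
        {"t": "Link", "c": [["", [], []],
                            [{"t": "Str", "c": "pandoc"}],
                            ["https://pandoc.org", ""]]},
    ]}],
}

cleaned = unwrap_internal_links(doc)
print([inline["t"] for inline in cleaned["blocks"][0]["c"]])
# prints ['Str', 'Space', 'Link']
```

To use it on a real book, drop the demo lines, call main() at the end of the script, and place it between two pandoc runs: the first with -f epub -t json, the second with -f json -t typst. This removes working links as well as broken ones, so it is a stopgap rather than a fix.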
Is there a particular reason why it can't keep them? In this case it completely breaks the TOC.
That's probably what I'm going to end up doing short term. (Having working links is useful for navigation though.) |
A sensible TOC has links to identifiers on headings (e.g. h2 in HTML). These should work fine in a pandoc conversion. This particular document has links all over the place -- to Pandoc has no place to put an id attribute on |
Explain the problem.
Take an epub file that uses internal links, e.g.
https://dieterplex.github.io/rust-ebookshelf/The%20Rust%20Programming%20Language.epub
Run
pandoc -f epub -t typst '.\The Rust Programming Language.epub' --standalone -o 'trpl.typ'
The exact options or even the output file type are not very important.
The resulting file includes links like
#link(label("ch01-01-installation.html#troubleshooting"))
(there are some other flavors too), which will not work, because the label they refer to does not exist in the document. The closest is <ch01-01-installation.html>, which refers to the entire chapter.
Pandoc version?
What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue.)
pandoc 3.4, Windows 11
A separate issue is that in order for images to work, the files have to be manually extracted from the epub and then placed correctly in relation to the resulting typ file. (Not sure if I should create a separate issue for this.)
(This particular issue was fixed by adding the --extract-media . option. Not entirely sure why the . is required. Without it I get: Couldn't extract ePub file: Did not find end of central directory signature)