Broken internal links when converting epub #10207

Open
Enivex opened this issue Sep 23, 2024 · 21 comments

Enivex commented Sep 23, 2024

Explain the problem.
Take an epub file that uses internal links, e.g.
https://dieterplex.github.io/rust-ebookshelf/The%20Rust%20Programming%20Language.epub

Run pandoc -f epub -t typst '.\The Rust Programming Language.epub' --standalone -o 'trpl.typ'. The exact options or even output file type are not very important.

The resulting file includes links like #link(label("ch01-01-installation.html#troubleshooting")) (there are some other flavors too), which do not work because the label they refer to does not exist in the document. The closest label is <ch01-01-installation.html>, which refers to the entire chapter.
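
One way to list the dangling targets without compiling the Typst file is to dump the document as pandoc JSON (`-t json`) and compare internal link targets against the identifiers present in the AST. This is only a minimal sketch. It assumes internal links show up as `Link` targets starting with `#`, and that the element types in `ATTR_INDEX` are the ones that carry identifiers. The names `ATTR_INDEX`, `walk`, and `dangling_targets`, and the file name `book.json`, are made up for illustration and are not part of pandoc.

```python
import json
import sys

# Position of the Attr triple [id, classes, key-values] inside each
# element's "c" list in pandoc's JSON AST (assumption: table rows and
# cells, which also have attributes, are ignored here).
ATTR_INDEX = {"Header": 1, "Div": 0, "Span": 0, "Link": 0, "Image": 0,
              "Code": 0, "CodeBlock": 0, "Table": 0, "Figure": 0}


def walk(node):
    """Yield every tagged element ({"t": ...}) found anywhere in the AST."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def dangling_targets(doc):
    """Return sorted internal link targets that match no identifier."""
    ids, targets = set(), set()
    for el in walk(doc):
        tag = el["t"]
        if tag in ATTR_INDEX:
            ident = el["c"][ATTR_INDEX[tag]][0]
            if ident:
                ids.add(ident)
        if tag == "Link":
            url = el["c"][2][0]
            if url.startswith("#"):
                targets.add(url[1:])
    return sorted(targets - ids)


# Only runs when a JSON file is given, e.g. `python dangling.py book.json`.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as fh:
        for target in dangling_targets(json.load(fh)):
            print(target)
```

The JSON input would come from something like `pandoc -f epub -t json 'The Rust Programming Language.epub' -o book.json`.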

Pandoc version?
What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue.)

pandoc 3.4, Windows 11


A separate issue is that in order for images to work, the files have to be manually extracted from the epub and then placed correctly in relation to the resulting typ file. (Not sure if I should create a separate issue for this)
(This particular issue was fixed by adding the --extract-media . option. Not entirely sure why the . is required. Without it I get Couldn't extract ePub file: Did not find end of central directory signature)

@Enivex Enivex added the bug label Sep 23, 2024
Owner

jgm commented Sep 23, 2024

--extract-media requires an argument (file path). That is why.

@jgm jgm closed this as completed in f3fff87 Sep 23, 2024
Owner

jgm commented Sep 23, 2024

I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

Author

Enivex commented Sep 23, 2024

I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

You're right. I didn't upload it myself, I just looked for one where I was getting similar errors. Turns out this one had broken links even in the epub.

I'll try the nightly release later.

Author

Enivex commented Sep 24, 2024

Unfortunately, the issue is not solved for the original file I was interested in. There are 237 missing label errors, even after adding auto_identifiers.

image

The parts after _ correspond to ids in the HTML files, but no corresponding labels are being created in the typ file.

e.g. in part0111.html:
image

Edit: Version 3.4-nightly-2024-09-23

Owner

jgm commented Sep 24, 2024

What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.
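
An EPUB is a zip archive, so this check can also be done without unpacking it by hand. The following is a rough sketch that uses only the Python standard library. The helper names `IdFinder` and `find_ids` and the file name `book.epub` are hypothetical, and matching the member by file-name suffix is an assumption about how the archive is laid out.

```python
import zipfile
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Record which tag carries each of the wanted id attributes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.hits = []  # (id, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("id")
        if ident in self.wanted:
            self.hits.append((ident, tag))


def find_ids(epub, member_suffix, wanted):
    """Search the first archive member ending in member_suffix."""
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if name.endswith(member_suffix):
                finder = IdFinder(wanted)
                finder.feed(zf.read(name).decode("utf-8", errors="replace"))
                return finder.hits
    return []
```

For example, `find_ids("book.epub", "part0007.html", ["pre", "ack1"])` would return the tag name each id is attached to, if any.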

Author

Enivex commented Sep 24, 2024

What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.

ack1 corresponds to another link back to the other one
image

pre corresponds to a heading
image

Owner

jgm commented Sep 24, 2024

If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

Owner

jgm commented Sep 24, 2024

There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

@jgm jgm reopened this Sep 24, 2024
Author

Enivex commented Sep 24, 2024

If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

I don't mind sending it to you for troubleshooting purposes, but I can't post it on github for obvious reasons.

Edit: Sent via email. Hopefully the large attachment doesn't cause issues.

Author

Enivex commented Sep 24, 2024

There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

I just tested with latex output instead, and that has the same issue, so it's not only typst writer.

Owner

jgm commented Sep 24, 2024

Most of the writers won't pay any attention to an identifier attribute on a Link element.
(Try HTML.)

Author

Enivex commented Sep 24, 2024

Most of the writers won't pay any attention to an identifier attribute on a Link element. (Try HTML.)

Converting to html does work, but that's not that surprising (since epub is html based)

Owner

jgm commented Sep 24, 2024

that's not that surprising (since epub is html based)

Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

Author

Enivex commented Sep 24, 2024

that's not that surprising (since epub is html based)

Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

That makes sense.

Owner

jgm commented Sep 24, 2024

If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

Author

Enivex commented Sep 24, 2024

If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

I did email you the epub, as I described above, though it may have vanished into the void because of the large (11 MB) attachment.

Owner

jgm commented Sep 24, 2024

ok, found it in junk folder.

Owner

jgm commented Sep 24, 2024

OK, here's one example.


error: label `<part0111.html#pre>` does not exist in the document
    ┌─ twok.typ:328:47
    │  
328 │   = <part0007.html_page14><part0007.html_page15>#link(label("part0111.html#pre"))[#strong[PRELUDE TO \

So I look in part0111.html in the epub, and here's where the anchor is:

<p class="toc1" id="pre"><a href="part0007.html#pre" class="calibre1">Prelude to the Stormlight Archive</a></p>

Pandoc doesn't put attributes on Para elements, so this identifier was lost in the parsing stage.

The other cases I've looked at are like this. Links to headings, tables, figures, divs, and spans should work fine. Anything else pandoc is going to drop, but those are the lion's share of real uses.
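
To get a rough idea of how many anchors in a given XHTML file fall into the "dropped" category, something like the following audit could be run over each file. It is only a sketch: `KEPT_TAGS` is my approximation of the list above (headings, tables, figures, divs, spans), not a statement of exactly what pandoc's reader does, and the names `DroppedIdAudit` and `dropped_ids` are made up.

```python
from html.parser import HTMLParser

# Assumption: ids on these tags survive parsing. Everything else is
# reported, including <a>, whose id is parsed but ignored by most writers.
KEPT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
             "table", "figure", "div", "span"}


class DroppedIdAudit(HTMLParser):
    """Collect (tag, id) pairs for ids that sit on tags outside KEPT_TAGS."""

    def __init__(self):
        super().__init__()
        self.dropped = []

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("id")
        if ident and tag not in KEPT_TAGS:
            self.dropped.append((tag, ident))


def dropped_ids(html_text):
    """Return the ids in html_text that are likely lost or unused."""
    audit = DroppedIdAudit()
    audit.feed(html_text)
    return audit.dropped
```

Fed the `<p class="toc1" id="pre">` line quoted above, this reports the `pre` id as sitting on a `p` element.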

Probably this can be closed.

Owner

jgm commented Sep 24, 2024

For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.
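
A Lua filter is the natural tool for this. The same idea is sketched here as a JSON filter in Python instead, since that needs nothing beyond the standard library. It assumes internal links are `Link` elements whose target starts with `#`, and it replaces each one with its link text. The function name `unwrap_internal_links` and the script name `unwrap_links.py` are hypothetical.

```python
import json
import sys


def unwrap_internal_links(node):
    """Return a copy of the AST with internal links replaced by their text."""
    if isinstance(node, list):
        out = []
        for item in node:
            if (isinstance(item, dict) and item.get("t") == "Link"
                    and item["c"][2][0].startswith("#")):
                # Splice the link's inline content into the parent list.
                out.extend(unwrap_internal_links(item["c"][1]))
            else:
                out.append(unwrap_internal_links(item))
        return out
    if isinstance(node, dict):
        return {key: unwrap_internal_links(value) for key, value in node.items()}
    return node


# pandoc passes the target format as the first argument when it runs a
# JSON filter, so this branch is only taken when invoked as a filter.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(unwrap_internal_links(json.load(sys.stdin)), sys.stdout)
```

It could then be applied with something like `pandoc -f epub -t typst book.epub --filter unwrap_links.py -o book.typ`. External links are left untouched.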

Author

Enivex commented Sep 25, 2024

Anything else pandoc is going to drop, but those are the lion's share of real uses.

Probably this can be closed.

Is there a particular reason why it can't keep them? In this case it completely breaks the TOC.

For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.

That's probably what I'm going to end up doing short term. (Having working links is useful for navigation though.)

Owner

jgm commented Sep 25, 2024

A sensible TOC has links to identifiers on headings (e.g. h2 in HTML). These should work fine in a pandoc conversion. This particular document has links all over the place -- to p elements, a elements, etc.

Pandoc has no place to put an id attribute on p, because its Para element has no slot for attributes.
