Broken internal links when converting epub #10207

Enivex · 2024-09-23T00:31:20Z

Explain the problem.
Take an epub file that uses internal links, e.g.
https://dieterplex.github.io/rust-ebookshelf/The%20Rust%20Programming%20Language.epub

Run pandoc -f epub -t typst '.\The Rust Programming Language.epub' --standalone -o 'trpl.typ'. The exact options or even output file type are not very important.

The resulting file includes links like #link(label("ch01-01-installation.html#troubleshooting")) (there are some other flavors too), which will not work, because the label they refer to does not exist in the document. The closest label is <ch01-01-installation.html>, which refers to the entire chapter.

Pandoc version?
What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue.)

pandoc 3.4, Windows 11

A separate issue is that in order for images to work, the files have to be manually extracted from the epub and then placed correctly relative to the resulting typ file. (Not sure if I should create a separate issue for this.)
(This particular issue was fixed by adding the --extract-media . option. Not entirely sure why the . is required. Without it I get Couldn't extract ePub file: Did not find end of central directory signature)

jgm · 2024-09-23T01:00:38Z

--extract-media requires an argument (file path). That is why.

jgm · 2024-09-23T05:51:46Z

I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

Enivex · 2024-09-23T17:45:35Z

> I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

You're right. I didn't upload it myself, I just looked for one where I was getting similar errors. Turns out this one had broken links even in the epub.

I'll try the nightly release later.

Enivex · 2024-09-24T04:01:07Z

Unfortunately the issue is not solved in the original file I was initially interested in. There are 237 missing label errors, even after adding auto_identifiers.

The parts after _ correspond to ids in the HTML files, but no corresponding labels are being created in the typ file.

e.g. in part0111.html:

Edit: Version 3.4-nightly-2024-09-23

jgm · 2024-09-24T05:06:52Z

What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.
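The inspection suggested here can also be scripted. The following is a minimal sketch, assuming the EPUB is an ordinary zip archive whose content files end in .html, .xhtml, or .htm; the names IdFinder and find_ids are made up for this illustration and are not part of pandoc.

```python
import zipfile
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Record the tag name of every element whose id is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.hits = []  # (tag, id) pairs in document order

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value in self.wanted:
                self.hits.append((tag, value))


def find_ids(epub, wanted):
    """Return (member, tag, id) for each wanted id found in the archive.

    `epub` is a path or a binary file-like object (hypothetical helper).
    """
    results = []
    with zipfile.ZipFile(epub) as archive:
        for member in archive.namelist():
            # Only look at the (X)HTML content documents.
            if not member.lower().endswith((".html", ".xhtml", ".htm")):
                continue
            finder = IdFinder(wanted)
            finder.feed(archive.read(member).decode("utf-8", errors="replace"))
            results.extend((member, tag, id_) for tag, id_ in finder.hits)
    return results
```

Calling find_ids("book.epub", {"pre", "ack1"}) (with a placeholder file name) lists which file and which kind of element carries each id, which is the information asked for above.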

Enivex · 2024-09-24T15:55:38Z

> What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.

ack1 corresponds to another link back to the other one

pre corresponds to a heading

jgm · 2024-09-24T16:32:12Z

If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

jgm · 2024-09-24T16:35:50Z

There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

Enivex · 2024-09-24T16:39:49Z

> If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

I don't mind sending it to you for troubleshooting purposes, but I can't post it on github for obvious reasons.

Edit: Sent via email. Hopefully the large attachment doesn't cause issues.

Enivex · 2024-09-24T17:06:16Z

> There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

I just tested with LaTeX output instead, and that has the same issue, so it's not only the typst writer.

jgm · 2024-09-24T17:09:19Z

Most of the writers won't pay any attention to an identifier attribute on a Link element.
(Try HTML.)

Enivex · 2024-09-24T18:33:33Z

> Most of the writers won't pay any attention to an identifier attribute on a Link element. (Try HTML.)

Converting to HTML does work, but that's not that surprising (since epub is HTML-based).

jgm · 2024-09-24T18:51:04Z

> that's not that surprising (since epub is html based)

Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

Enivex · 2024-09-24T19:26:39Z

> > that's not that surprising (since epub is html based)
>
> Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

That makes sense.

jgm · 2024-09-24T19:36:22Z

If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

Enivex · 2024-09-24T19:37:38Z

> If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

I did email you the epub, as I described above. Though it may have vanished into the void because of the large (11 MB) attachment.

jgm · 2024-09-24T20:17:37Z

ok, found it in junk folder.

jgm · 2024-09-24T21:51:20Z

OK, here's one example.


error: label `<part0111.html#pre>` does not exist in the document
    ┌─ twok.typ:328:47
    │  
328 │   = <part0007.html_page14><part0007.html_page15>#link(label("part0111.html#pre"))[#strong[PRELUDE TO \

So I look in part0111.html in the epub, and here's where the anchor is:

<p class="toc1" id="pre"><a href="part0007.html#pre" class="calibre1">Prelude to the Stormlight Archive</a></p>

Pandoc doesn't put attributes on Para elements, so this identifier was lost in the parsing stage.

The other cases I've looked at are like this. Links to headings, tables, figures, divs, and spans should work fine. Anything else pandoc is going to drop, but those are the lion's share of real uses.
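To see in advance which internal links will end up dangling, one option is to dump the AST with pandoc -t json and compare link targets against the identifiers that survive parsing. The sketch below is only an illustration: the helper names are invented here, and the set of identifier-carrying elements is limited to the five kinds named in the comment above (headings, tables, figures, divs, spans). Whether a given writer actually emits a label for each of them is a separate question.

```python
import json

# Position of the Attr triple [id, classes, key-values] inside each element's
# "c" list in pandoc's JSON AST. Only the element kinds listed in the comment
# above are treated as valid link destinations (an assumption of this sketch).
ATTR_INDEX = {"Header": 1, "Div": 0, "Span": 0, "Table": 0, "Figure": 0}


def walk(node):
    """Yield every tagged element ({"t": ..., "c": ...}) in the AST."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def dangling_links(doc):
    """Return sorted internal link targets that match no collected identifier."""
    defined, targets = set(), set()
    for element in walk(doc):
        index = ATTR_INDEX.get(element["t"])
        if index is not None and element["c"][index][0]:
            defined.add(element["c"][index][0])
        if element["t"] == "Link":
            url = element["c"][2][0]  # Link content is [attr, inlines, [url, title]]
            if url.startswith("#"):
                targets.add(url[1:])
    return sorted(targets - defined)
```

For example, after running pandoc -f epub -t json on the book and saving the result as book.json (a placeholder name), dangling_links(json.load(open("book.json"))) lists the targets that have no matching identifier in the parsed document.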

Probably this can be closed.

jgm · 2024-09-24T21:52:12Z

For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.
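The suggestion is a Lua filter; the same idea can be sketched as a JSON filter in Python, which pandoc runs via --filter. This is an untested-against-the-book illustration, not code from pandoc: it assumes internal links are exactly those whose target starts with #, and it keeps the link text while dropping the link itself. The function names and the file name used below are hypothetical.

```python
import json
import sys


def is_internal_link(node):
    """True for a Link element whose target starts with '#'."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Link"
        and node["c"][2][0].startswith("#")  # "c" is [attr, inlines, [url, title]]
    )


def unwrap_internal_links(node):
    """Return a copy of the AST with internal links replaced by their text."""
    if isinstance(node, list):
        out = []
        for item in node:
            item = unwrap_internal_links(item)
            if is_internal_link(item):
                out.extend(item["c"][1])  # splice the link's inlines in place
            else:
                out.append(item)
        return out
    if isinstance(node, dict):
        return {key: unwrap_internal_links(value) for key, value in node.items()}
    return node


# pandoc invokes a JSON filter with the output format as its first argument
# and the document's JSON AST on stdin; without that argument, do nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(unwrap_internal_links(json.load(sys.stdin)), sys.stdout)
```

Saved as strip_internal_links.py (a placeholder name), it could be used as pandoc -f epub -t typst book.epub --filter strip_internal_links.py --standalone -o book.typ. External links are left untouched, so only the navigation inside the book is lost.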

Enivex · 2024-09-25T00:26:43Z

> Anything else pandoc is going to drop, but those are the lion's share of real uses.
>
> Probably this can be closed.

Is there a particular reason why it can't keep them? In this case it completely breaks the TOC.

> For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.

That's probably what I'm going to end up doing short term. (Having working links is useful for navigation though.)

jgm · 2024-09-25T01:56:12Z

A sensible TOC has links to identifiers on headings (e.g. h2 in HTML). These should work fine in a pandoc conversion. This particular document has links all over the place -- to p elements, a elements, etc.

Pandoc has no place to put an id attribute on p, because its Para element has no slot for attributes.

Enivex added the bug label Sep 23, 2024

jgm closed this as completed in f3fff87 Sep 23, 2024

jgm reopened this Sep 24, 2024
