Pandoc does not convert an URL from HTML to asciidoc correctly #8070

Open
patrickdung opened this issue May 13, 2022 · 31 comments

@patrickdung

patrickdung commented May 13, 2022

Explain the problem.
Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.

Pandoc version?
What version of pandoc are you using, on what OS?

Fedora 36
pandoc --version
pandoc 2.14.0.3

Problem:
The URL/link is not converted correctly from HTML to adoc by Pandoc (reverse_adoc does not have this problem).
image

I have attached the sample files (the original HTML and the files converted by Pandoc and by reverse_adoc) as a zip file in this ticket.

Pandoc command used:
pandoc -t asciidoctor -f html original.html -o pandoc.adoc

Samples:
test.zip

@jgm
Owner

jgm commented May 14, 2022

Can you say what needs to be changed in the adoc output?
Is the problem the line break after [ on line 30?

@patrickdung
Author

patrickdung commented May 14, 2022

Firstly, to answer your question, it is line 27 and line 28 in the pandoc.adoc file.

I got an update from the Asciidoctor discussion forum (by Dan Allen). Here's more information:

  1. The source HTML is exported from Atlassian Confluence.
  2. Dan points out that there are <u> tags in the HTML that wrap the link (<a>).
    This causes a problem in the HTML to adoc conversion.
    <u>foo</u>
    Produces:
    +++foo+++

image

After conversion to adoc, the URL is not displayed correctly when the adoc is viewed in HTML format.

  3. He suggested that the <u> tags should be removed from the source HTML.
    I tested it and pandoc performed the correct conversion:

image

  4. So Pandoc may consider ignoring the <u> tags when doing the conversion.
    BTW, I just spotted another problem if I manually remove the <u> tags in the HTML:
    (The first link inside the Reference section of the HTML)
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]

becomes
12.1 manual

instead of

12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets

@jgm
Owner

jgm commented May 14, 2022

To be clear, then, is there no way to represent a link inside underline in adoc?
That seems unfortunate if so.

@jgm
Owner

jgm commented May 14, 2022

With the other issue you found, please give the HTML of the link, the asciidoc output produced by pandoc, and the asciidoc you think it should have produced instead.

@patrickdung
Author

patrickdung commented May 15, 2022

I am no expert on AsciiDoc, but I think Confluence may be producing extra/wrong output with the <u> tags. Here's the original layout in Confluence; it is just URL links:
image

OK, the information below refers to point 4 in my last reply (the link for the 12.1 manual).
I have extracted the relevant part of the source HTML to narrow down the problem.
I used the command cat -vet to display the line breaks.

$ cat -vet another-url-with-u-tags.html
<!DOCTYPE html>$
<html>$
  <body>$
    <p>$
      <a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow"><u>12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets</u>$
      </a>$
    </p>$
  </body>$
</html>$

Conversion is performed on the HTML with <u> and </u> removed.

$ cat -vet another-url-without-u-tags.html
<!DOCTYPE html>$
<html>$
  <body>$
    <p>$
      <a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow">12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets$
      </a>$
    </p>$
  </body>$
</html>$

1. Normal conversion by Pandoc
$ pandoc -t asciidoctor -f html -o pandoc.adoc another-url-without-u-tags.html
$ cat -vet pandoc.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1$
manual, Database Backup and Recovery User's Guide:$
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,$
Steps to Transport a Database to a Different Platform Using Backup Sets]$

2. Pandoc conversion and preserve wrap
$ pandoc --wrap=preserve -t asciidoctor -f html -o pandoc-preserve-wrap.adoc another-url-without-u-tags.html
$ cat -vet pandoc-preserve-wrap.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]$

3. Desired output
$ cat -vet desired.adoc
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1 manual, Database Backup and Recovery User's Guide: Chapter 28 Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets]$
$

Here's the preview on the combined output (in order):
image

@jgm
Owner

jgm commented May 15, 2022

This really looks to me like a bug in asciidoctor.
This asciidoc

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]

gets converted by asciidoc to

<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1
manual, Database Backup and Recovery User&#8217;s Guide:
<span class=".enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms,                 
Steps to Transport a Database to a Different Platform Using Backup Sets</a></p>

which is fine. But asciidoctor converts it to

<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual</a></p> 

and just drops the rest. That is a bug in asciidoctor, no? If not, can someone point to something in asciidoctor's documentation that explains why we get this output?

@patrickdung
Author

Looks like it's a bug in AsciiDoctor. Let me check with AsciiDoctor for this part. (the first URL in Reference section)

@mojavelinux

mojavelinux commented May 15, 2022

It's not a bug in Asciidoctor. The AsciiDoc language now supports attributes in a link macro. The parsing rules are clearly described here: https://docs.asciidoctor.org/asciidoc/latest/macros/link-macro-attribute-parsing/#linked-text-alongside-named-attributes (It's the phrase with the role inside the link text that's introducing the = sign).

The only way this can be expressed in modern AsciiDoc is as follows:

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
pass:n[[.enumeration_chapter\]#Chapter 28#] Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]

@mojavelinux

mojavelinux commented May 15, 2022

To be clear, then, is there no way to represent a link inside underline in adoc?

There is. The way to represent underline text is as follows:

[.underline]#text#

<u> is a formatting element, not a semantic element. Therefore the AsciiDoc language does not provide a direct translation for it. Instead, it correctly maps it to a phrase role (as it does for other formatting roles). See https://docs.asciidoctor.org/asciidoc/latest/text/text-span-built-in-roles/#built-in-roles-for-text

@patrickdung
Author

One thing I had noticed is that Confluence produced the HTML code with
<span class="enumeration_chapter">Chapter 28</span> inside the link description

For the screenshot mentioned before
image

The desired output is actually generated by a program called reverse_adoc. They removed the enumeration_chapter class and just place the text "Chapter 28" into the text description. I cannot tell if it is semantically correct.

So based on the discussion in AsciiDoctor forum, I think this might be the desired output?

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724["12.1 manual, Database Backup and Recovery User's Guide: \[.enumeration_chapter\]#Chapter 28# Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets"]

@mojavelinux

mojavelinux commented May 15, 2022

So based on the discussion in AsciiDoctor forum, I think this might be the desired output?

That's not correct. The parsable output would be:

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
pass:n[[.enumeration_chapter\]#Chapter 28#] Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]

However, I seriously question whether pandoc should be producing this output. It would be better to remove the formatting in the link text (or at least phrases with roles). We don't want to encourage this kind of complex markup as the whole point of AsciiDoc is to keep the markup as simple as possible. Thus, I agree with this suggestion:

The desired output is actually generated by a program called reverse_adoc. They removed the enumeration_chapter class and just place the text "Chapter 28" into the text description.

@jgm
Owner

jgm commented May 15, 2022

(It's the phrase with the role inside the link text that's introducing the = sign).

Sorry, you lost me there. In the original text

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[12.1
manual, Database Backup and Recovery User's Guide:
[.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets]

there is no = sign. And if you take out the [.enumeration_chapter]#Chapter 28#, the whole thing becomes part of the link text. How does [.enumeration_chapter]#Chapter 28# introduce an = sign?

In any case, seems like a fragile syntax -- language designers may want to reconsider it, if link attributes can be accidentally triggered so easily.

jgm added a commit that referenced this issue May 15, 2022
We were rendering it as `+++text+++`; this is now changed to
`[.underline]#text#`.  See comment at
<#8070 (comment)>.
@mojavelinux

How does [.enumeration_chapter]#Chapter 28# introduce an = sign?

It's added when the phrase with role is converted. What the parser sees is:

<span class="enumeration_chapter">Chapter 28</span>

That's where you get the equal sign.
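
A rough way to picture the behavior described in this comment, as a Python sketch. This is a deliberately simplified model and not Asciidoctor's actual parser: it assumes the link text is inspected after inline formatting has already been converted, that an = sign switches the text into attribute-list parsing, and that the visible link text is then only the first comma-separated entry. The function name visible_link_text is hypothetical.

```python
def visible_link_text(converted_text: str) -> str:
    """Simplified model of how a link macro's bracketed text is interpreted.

    Assumption (not taken from Asciidoctor's source): if the text contains an
    equals sign it is treated as an attribute list, and the visible link text
    is only the first comma-separated entry.
    """
    if "=" not in converted_text:
        # No attribute-like content: the whole text is the link text.
        return converted_text
    # Attribute-list mode: everything after the first comma is parsed as
    # further attributes and disappears from the visible text.
    return converted_text.split(",", 1)[0].strip()


plain = "12.1 manual, Database Backup and Recovery User's Guide"
with_role = (
    "12.1 manual, Database Backup and Recovery User's Guide: "
    '<span class="enumeration_chapter">Chapter 28</span> Transporting Data'
)

print(visible_link_text(plain))      # the whole text survives
print(visible_link_text(with_role))  # cut off at the first comma
```

Under these assumptions the second call yields only "12.1 manual", which matches the truncated link reported earlier in the thread.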

In any case, seems like a fragile syntax

Perhaps. Lightweight markup languages are not designed to be perfectly robust. They are designed to be concise. And I don't believe link text should have formatting in it. So I consider this to be a reasonable tradeoff. I'm open to discussions about it. That's just where I currently stand.

@jgm
Owner

jgm commented May 16, 2022

With the current situation, I'm really at a loss as to how to handle this better in pandoc. I tried putting the whole link text in quotes, as suggested in the manual when it contains commas, but this results in malformed HTML, I guess because of the quotes introduced by the interpolation of the span...

<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual, Database Backup and Recovery User&#8217;s Guide: <span class=</a>

We could remove all spans, links, images from link text, I suppose. Or we could try to use the complex escaping method you illustrate above (which seems to require some delicate backslash-escaping of ]).

@mojavelinux Here are two suggestions (assuming you have something to do with the language spec).

  1. If the comma is escaped, it should not introduce attributes. That would provide a simple workaround, consistent with the general principle that backslash-escaping special characters defeats their usual special meanings. Unfortunately backslash-escaping in asciidoc is a complete mess, but maybe that can be cleaned up. Looks like currently the comma cannot be backslash-escaped.

  2. Don't treat the comma as introducing attributes if what follows is not of the proper form to be an attribute, e.g. role=....

@mojavelinux

mojavelinux commented May 16, 2022

First, we can continue this discussion without the accusations. I have read "this is a bug in Asciidoctor", "seems like a fragile syntax", and "backslash escaping in AsciiDoc is a complete mess". There's just no need for that kind of attitude and it makes me want to walk away from this situation. If you want my input, please be respectful of the immense time, effort, and dedication I have put into this language.

We recognize that there's room for improvement in the syntax, just as there are with all things in life. That's a key part of why we formed the AsciiDoc Language project to specify and evolve the language. While I lead Asciidoctor and helped launch the effort for the language specification, changes to the language and the parsing rules have to be done through that project.

Until that project starts to move forward, the language is what it is right now. I can't make changes in Asciidoctor that change the parsing rules. Therefore, pandoc should position itself to work with the AsciiDoc language as it currently stands (based on the initial contribution, which is https://docs.asciidoctor.org/asciidoc/latest/).

There are two options I can suggest:

  1. Drop the role on an inline phrase inside of link text; such syntax is essentially forbidden in the AsciiDoc language right now, so you aren't doing the wrong thing
  2. Enclose the inline phrase in an inline passthrough, as I showed above

On a related note, there is no requirement for an AsciiDoc converter to generate <span class="underline">underline me</span> from [.underline]#underline me#. It could just as well produce <u>underline me</u>. The built-in converter just happens to produce the former for the reason I already cited about the use of a <u> tag. But that decision is downstream from what pandoc (or a writer) produces.

@jgm
Owner

jgm commented May 16, 2022

Of course, I respect the amount of time and dedication it takes to work on a light markup language. The comments were intended as constructive suggestions, but their tone probably reflects the frustration I've had over the years trying to get pandoc to do the right thing in its asciidoc output. Let me avoid the negative tone of "complete mess" and just say that I have no understanding of how escaping works in asciidoc. Because of that, I'm hesitant to go with option 2, since it requires escaping things and I'm not sure I'd get it right in full generality. But maybe you can explain it. In this case, the desired output is:

pass:n[[.enumeration_chapter\]#Chapter 28#]

As a general method, would it be sufficient to follow this recipe?

  1. render the element as it would be rendered outside of a link
  2. add a backslash in front of every ] in the result of 1
  3. enclose the result of 2 in pass:n[...]?

Will this work even when the element contains verbatim ] characters, e.g., HTML <span class="foo"><code>]</code></span>?
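
The three-step recipe above can be sketched in Python as follows. The function name wrap_in_passthrough is hypothetical, and the sketch only restates the recipe as written; whether it is sufficient for content that contains literal ] characters is exactly the open question in this comment, and the code does not settle it.

```python
def wrap_in_passthrough(rendered: str) -> str:
    """Apply steps 2 and 3 of the recipe to an already rendered inline element."""
    # Step 2: put a backslash in front of every closing square bracket.
    escaped = rendered.replace("]", "\\]")
    # Step 3: enclose the result in an inline passthrough macro.
    return f"pass:n[{escaped}]"


# Step 1: the element as it would be rendered outside of a link.
span = "[.enumeration_chapter]#Chapter 28#"
print(wrap_in_passthrough(span))
```

For the span from this issue, the result is the same pass:n[...] form quoted above as the desired output.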

@jgm
Owner

jgm commented May 16, 2022

As for option 1: what, exactly, would we need to worry about inside link text? Do we have to avoid anything that renders in HTML with an =? If so, that includes images. Since a lot of people put images in the link text, this could be a big limitation.

@mojavelinux

Thank you for acknowledging my concern. I will now continue to engage in this thread.

I have no understanding of how escaping works in asciidoc.

The rules have been documented the best way we can document them in the following two places:

It's well known that the escaping in AsciiDoc is not universal, nor is it intended to be. And while universal escaping may be (perhaps even likely is) something the language project considers adding, the language tries not to enable the writer to use a heavy amount of formatting because it goes against our tenets. If a writer needs that amount of complexity, then HTML, DocBook, or LaTeX is what the writer should be using.

Having said that, the passthrough macro provides something closer to the universal escaping that you're looking for. It takes everything from the left square bracket to the next right square bracket not preceded by a backslash. It then un-escapes any escaped right square brackets. So you can escape all right square brackets within the enclosed text and it will do the right thing. However, keep in mind that the passthrough macro cannot be nested.

Will this work even when the element contains verbatim ] characters

No, it will not. But these characters could be escaped using &#93; (as we do in Asciidoctor PDF). When trying to neutralize characters which have meaning in the syntax, using character references is often a workable strategy.
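
The passthrough rule described in this comment can be modeled in a few lines of Python. This is only an illustration of the rule as stated here (take everything up to the next ] not preceded by a backslash, then un-escape), not Asciidoctor's implementation; the names PASS_RX, extract_passthrough and neutralize_literal_brackets are hypothetical.

```python
import re
from typing import Optional

# Everything from "pass:n[" up to the next "]" that is not preceded by a backslash.
PASS_RX = re.compile(r"pass:n\[(.*?)(?<!\\)\]", re.DOTALL)


def extract_passthrough(source: str) -> Optional[str]:
    """Return the content of the first pass:n[...] macro, with \\] un-escaped."""
    match = PASS_RX.search(source)
    if match is None:
        return None
    return match.group(1).replace("\\]", "]")


def neutralize_literal_brackets(literal_text: str) -> str:
    """Replace literal ] characters in text content with a character reference."""
    return literal_text.replace("]", "&#93;")


source = "pass:n[[.enumeration_chapter\\]#Chapter 28#] Transporting Data"
print(extract_passthrough(source))
print(neutralize_literal_brackets("a ] b"))
```

The second helper reflects the suggestion to use &#93; for verbatim brackets; it is meant for literal text only, not for brackets that belong to the markup.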

@mojavelinux

Do we have to avoid anything that renders in HTML with an =?

No. Substitution order matters a lot here. Inline images are substituted after the link macro. So it's safe to put an image in the link text (as long as its close square bracket is escaped). The problematic markup lies almost entirely with text formatting (which in AsciiDoc is currently called the "quotes substitution"). In other words, this markup: https://docs.asciidoctor.org/asciidoc/latest/text/#inline-text-and-punctuation-styles.

@jgm
Owner

jgm commented May 16, 2022

So we'd need to remove Strong, Emph, Code, and all other inline formatting for option 1?
And also backslash-escape any closing square brackets?
And also do something with any literal = signs that happen to be there, perhaps using entities?
But we can avoid doing any of this as long as the link text doesn't contain a comma?

@mojavelinux

So we'd need to remove Strong, Emph, Code, and all other inline formatting for option 1?

You'd need to remove any roles (i.e., CSS classes). The formatting itself is fine. It's the introduction of what's indistinguishable from an attribute on an inline macro that's the problem (e.g., key="value").

@jgm
Owner

jgm commented May 16, 2022

You'd need to remove any roles (i.e., CSS classes). The formatting itself is fine.
It's the introduction of what's indistinguishable from an attribute on an inline macro that's the problem (e.g., key="value").

But how do I know which things will get substituted by something with key="value" in your toolchain?
You linked to https://docs.asciidoctor.org/asciidoc/latest/text/#inline-text-and-punctuation-styles , so I was assuming all of those...

@mojavelinux

I can offer what we've written about the language, but I can't do all the work for you. It's necessary to study and understand the language and its processor to know what decisions to make. From my viewpoint, that's part of the work of making a language translator. I'm happy to answer questions as they come up, but that's all that I can offer to do.

@jgm
Owner

jgm commented May 16, 2022

I would have thought that as a promoter of the language, it would be in your interest to have good tools for converting to it from other formats. I'm just not interested enough in asciidoc to spend more time on this, so I'm going to drop this thread. Maybe someone else will be interested enough to figure out how to handle these cases.

I'm happy to answer questions as they come up, but that's all that I can offer to do.

I did ask a question, above. So if you are really happy to answer questions, what is the answer?

@mojavelinux

Again with that attitude. I don't understand why you have to come at me like that when I'm offering my time to help you with your project. It's your project that's offering to translate to AsciiDoc, so I don't see why you are acting put upon that you actually have to learn the rules of the language. As I've said before, I'm very happy to answer your questions (and I go out of my way to do so), but ultimately this is not my project. I don't appreciate you trying to guilt me into making it my responsibility.

@jgm
Owner

jgm commented May 16, 2022

I don't see why you are acting put upon that you actually have to learn the rules of the language.

I have never used asciidoc, nor did I write this part of the code. The writer was contributed long ago by a third party. I'm happy to improve it in response to requests from asciidoc users, but I don't have time to become an expert in this format, so I need to rely on those who are.

jgm closed this as completed in 3df55b4 May 16, 2022

@jgm
Owner

jgm commented May 16, 2022

I found a very simple solution. When there are commas in the link text, I convert them to numeric entities.
That works well for the original case, above, and avoids the complexity of passthrough syntax. It could be that it has other unforeseen consequences; if so, please open a new issue.
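
The idea in this comment, written out as a small Python sketch. It is not pandoc's code (the AsciiDoc writer is in Haskell); the function name escape_link_text_commas is hypothetical, and the sketch assumes the link text is already available as a plain string.

```python
def escape_link_text_commas(link_text: str) -> str:
    """Replace commas with a numeric character reference so that the link
    text cannot be split into an attribute list at a comma."""
    return link_text.replace(",", "&#44;")


url = "https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724"
text = "12.1 manual, Database Backup and Recovery User's Guide"
print(f"{url}[{escape_link_text_commas(text)}]")
```

The reference &#44; renders as a comma in the final output, so the reader sees the original text.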

@mojavelinux

That seems like a very reasonable approach. Nice thinking.

@patrickdung
Author

patrickdung commented May 17, 2022

@jgm

I've got the Pandoc nightly version:

pandoc 2.18-nightly-2022-05-17
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4

Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

  1. For the first URL in the reference section, the adoc created by Pandoc is:
https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[[.underline]#12.1 manual&#44; Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms&#44; Steps to Transport a Database to a Different Platform Using Backup Sets#]

Edit: The link text stops at the closing square bracket of [.enumeration_chapter];
the remaining text is just displayed as plain text.

<p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724"><span class="underline">12.1 manual&#44; Database Backup and Recovery User&#8217;s Guide: [.enumeration_chapter</a>#Chapter 28</span> Transporting Data Across Platforms&#44; Steps to Transport a Database to a Different Platform Using Backup Sets#]</p>

  2. The second URL looks OK now (the <u> tags / [.underline] / 1906ae0)

@jgm
Owner

jgm commented May 17, 2022

The difference between this and the case I tested is that here the whole link is underlined.
So here we have nested spans delimited by # characters:

[.underline]#.... [.enumeration_chapter]#....# ...#

I suspect that's the problem. Asciidoctor is closing the underline span at the third #.
It may be possible to escape the third # or something; perhaps @mojavelinux can illuminate this.
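
One way to see why the span might end at the third #, sketched in Python. The regular expression is a simplified approximation of a constrained formatting rule (the opening # must be followed by a non-space character, the closing # must be preceded by a non-space character and must not be followed by a word character). That rule is an assumption made for illustration, and SPAN_RX is a hypothetical name; neither is copied from Asciidoctor.

```python
import re

# Optional [role] prefix, then #...# where the closing mark follows a
# non-space character and is not directly followed by a word character.
SPAN_RX = re.compile(r"(?:\[([^\]]+)\])?#(\S|\S.*?\S)#(?!\w)")

text = (
    "[.underline]#12.1 manual: "
    "[.enumeration_chapter]#Chapter 28# Transporting Data#"
)

match = SPAN_RX.search(text)
print(match.group(1))  # role of the outer span
print(match.group(2))  # text captured as the span content
```

Under this model the second # cannot close the span because it is followed by the letter C, so the first usable closing mark is the third #, and the span content ends at "Chapter 28" rather than at the final #.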

jgm reopened this May 17, 2022

jgm added a commit that referenced this issue Nov 29, 2022
...with entities when they're in Str elements.  If a link
contains an image, it may have attributes, and the commas
there should not be converted.

See #8437, #8070.
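
The distinction made in this commit message can be illustrated with a toy Python model. The list of (kind, content) pairs stands in for pandoc's inline elements and is a hypothetical simplification; the image markup in the example is made-up sample data, and escape_commas_in_strs is not a pandoc function.

```python
def escape_commas_in_strs(inlines):
    """Render link text, replacing commas only inside plain Str content."""
    rendered = []
    for kind, content in inlines:
        if kind == "Str":
            # Plain text: a comma here could be mistaken for an attribute separator.
            rendered.append(content.replace(",", "&#44;"))
        else:
            # Already rendered markup (e.g. an image): its commas are syntax.
            rendered.append(content)
    return "".join(rendered)


link_text = [
    ("Str", "Logo, large "),
    ("Image", "image:logo.png[Logo,200\\]"),  # hypothetical pre-rendered image
]
print(escape_commas_in_strs(link_text))
```

Only the comma in the Str element is converted; the comma separating the image attributes is left untouched.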
@jgm
Owner

jgm commented Jan 15, 2025

To summarize the current state of play: for

<u><a href="http://refermefororacle.blogspot.hk/2015/10/cross-platform-backup-and-restore-in-12c.html" class="external-link" rel="nofollow">Refer me for Oracle , Cross-Platform Backup and Restore in 12c</a></u>

pandoc 3.6.2 produces:

[.underline]#http://refermefororacle.blogspot.hk/2015/10/cross-platform-backup-and-restore-in-12c.html[Refer me for Oracle &#44; Cross-Platform Backup and Restore in 12c]#

which I believe is correct. For

<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow"><u>12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets</u></a>

pandoc 3.6.2 produces:

https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724[[.underline]#12.1 manual&#44; Database Backup and Recovery User's Guide: [.enumeration_chapter]#Chapter 28# Transporting Data Across Platforms&#44; Steps to Transport a Database to a Different Platform Using Backup Sets#]

This isn't right; asciidoctor renders it as:

<a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724"><span class="underline">12.1 manual&#44; Database Backup and Recovery User&#8217;s Guide: [.enumeration_chapter</a>#Chapter 28</span> Transporting Data Across Platforms&#44; Steps to Transport a Database to a Different Platform Using Backup Sets#]

But I'm still not sure how this kind of nested thing should be done in asciidoc. Without guidance on this we can't improve the output.
