Pandoc does not convert an URL from HTML to asciidoc correctly #8070
Comments
Can you say what needs to be changed in the adoc output? |
Firstly, to answer your question, it is line 27 and line 28 in the pandoc.adoc file. I got an update from Asciidoctor discussion forum (by Dan Allen). Here's more information:
After conversion to adoc, the adoc could not display the URL correctly when viewed in HTML format.
becomes instead of
|
To be clear, then, is there no way to represent a link inside underline in adoc? |
With the other issue you found, please give the HTML of the link, the asciidoc output produced by pandoc, and the asciidoc you think it should have produced instead. |
This really looks to me like a bug in asciidoctor.
gets converted by asciidoc to <p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1
manual, Database Backup and Recovery User’s Guide:
<span class=".enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms,
Steps to Transport a Database to a Different Platform Using Backup Sets</a></p> which is fine. But asciidoctor converts it to <p><a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual</a></p> and just drops the rest. That is a bug in asciidoctor, no? If not, can someone point to something in asciidoctor's documentation that explains why we get this output? |
Looks like it's a bug in AsciiDoctor. Let me check with AsciiDoctor for this part. (the first URL in Reference section) |
It's not a bug in Asciidoctor. The AsciiDoc language now supports attributes in a link macro. The parsing rules are clearly described here: https://docs.asciidoctor.org/asciidoc/latest/macros/link-macro-attribute-parsing/#linked-text-alongside-named-attributes (It's the phrase with the role inside the link text that's introducing the The only way this can be expressed in modern AsciiDoc is as follows:
|
There is. The way to represent underline text is as follows:
|
One thing I had noticed is that Confluence produced the HTML code with For the screenshot mentioned before The desired output is actually generated by a program called reverse_adoc. They removed the enumeration_chapter class and just place the text "Chapter 28" into the text description. I cannot tell if it is semantically correct. So based on the discussion in AsciiDoctor forum, I think this might be the desired output?
|
That's not correct. The parsable output would be:
However, I seriously question whether pandoc should be producing this output. It would be better to remove the formatting in the link text (or at least phrases with roles). We don't want to encourage this kind of complex markup as the whole point of AsciiDoc is to keep the markup as simple as possible. Thus, I agree with this suggestion:
|
Sorry, you lost me there. In the original text
there is no In any case, seems like a fragile syntax -- language designers may want to reconsider it, if link attributes can be accidentally triggered so easily. |
We were rendering it as `+++text+++`; this is now changed to `[.underline]#text#`. See comment at <#8070 (comment)>.
It's added when the phrase with role is converted. What the parser sees is:
That's where you get the equal sign.
Perhaps. Lightweight markup languages are not designed to be perfectly robust. They are designed to be concise. And I don't believe link text should have formatting in it. So I consider this to be a reasonable tradeoff. I'm open to discussions about it. That's just where I currently stand. |
With the current situation, I'm really at a loss as to how to handle this better in pandoc. I tried putting the whole link text in quotes, as suggested in the manual when it contains commas, but this results in malformed HTML, I guess because of the quotes introduced by the interpolation of the span... <a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724">12.1 manual, Database Backup and Recovery User’s Guide: <span class=</a> We could remove all spans, links, images from link text, I suppose. Or we could try to use the complex escaping method you illustrate above (which seems to require some delicate backslash-escaping of @mojavelinux Here are two suggestions (assuming you have something to do with the language spec).
|
First, we can continue this discussion without the accusations. I have read "this is a bug in Asciidoctor", "seems like a fragile syntax", and "backslash escaping in AsciiDoc is a complete mess". There's just no need for that kind of attitude and it makes me want to walk away from this situation. If you want my input, please be respectful of the immense time, effort, and dedication I have put into this language. We recognize that there's room for improvement in the syntax, just as there are with all things in life. That's a key part of why we formed the AsciiDoc Language project to specify and evolve the language. While I lead Asciidoctor and helped launch the effort for the language specification, changes to the language and the parsing rules have to be done through that project. Until that project starts to move forward, the language is what it is right now. I can't make changes in Asciidoctor that change the parsing rules. Therefore, pandoc should position itself to work with the AsciiDoc language as it currently stands (based on the initial contribution, which is https://docs.asciidoctor.org/asciidoc/latest/). There are two options I can suggest:
On a related note, there is no requirement for an AsciiDoc converter to generate |
Of course, I respect the amount of time and dedication it takes to work on a light markup language. The comments were intended as constructive suggestions, but their tone probably reflects the frustration I've had over the years trying to get pandoc to do the right thing in its asciidoc output. Let me avoid the negative tone of "complete mess" and just say that I have no understanding of how escaping works in asciidoc. Because of that, I'm hesitant to go with option 2, since it requires escaping things and I'm not sure I'd get it right in full generality. But maybe you can explain it. In this case, the desired output is:
As a general method, would it be sufficient to follow this recipe?
Will this work even when the element contains verbatim |
As for option 1: what, exactly, would we need to worry about inside link text? Do we have to avoid anything that renders in HTML with an |
Thank you for acknowledging my concern. I will now continue to engage in this thread.
The rules have been documented the best way we can document them in the following two places:
It's well known that the escaping in AsciiDoc is not universal; nor is it intended to be. And while it may (perhaps even likely) be something the language project considers adding, the language tries not to enable the writer to use a heavy amount of formatting because it goes against our tenets. If a writer needs that amount of complexity, then HTML, DocBook, or LaTeX is what the writer should be using. Having said that, the passthrough macro provides closer to the universal escaping that you're looking for. It takes everything from the left square bracket to the next right square bracket not preceded by a backslash. It then un-escapes any escaped right square brackets. So you can escape all right square brackets within the enclosed text and it will do the right thing. However, keep in mind that the passthrough macro cannot be nested.
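As a rough illustration of the bracket rule just described, here is a minimal Python sketch. It is not Asciidoctor's implementation: the function names `wrap_in_pass_macro` and `read_pass_macro` are made up for this example, and the reader function only models the rule as stated (stop at the first right square bracket not preceded by a backslash, then un-escape the rest).

```python
import re


def wrap_in_pass_macro(text: str) -> str:
    """Escape every right square bracket, then wrap the text in pass:[...]."""
    return "pass:[" + text.replace("]", "\\]") + "]"


def read_pass_macro(source: str) -> str:
    """Model of the rule described above (hypothetical helper, not the real parser).

    Take everything after 'pass:[' up to the first ']' that is not preceded
    by a backslash, then turn each escaped '\\]' back into ']'.
    """
    match = re.match(r"pass:\[(.*?)(?<!\\)\]", source, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a pass macro")
    return match.group(1).replace("\\]", "]")


sample = '<span class="x">a[1]</span>'
wrapped = wrap_in_pass_macro(sample)
restored = read_pass_macro(wrapped)
```

Under this model the wrapped text round-trips back to the original. Text that ends in a backslash is not handled by this sketch, and, as noted above, a passthrough macro cannot be nested inside another one.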
No, it will not. But these characters could be escaped using |
No. Substitution order matters a lot here. Inline images are substituted after the link macro. So it's safe to put an image in the link text (as long as its close square bracket is escaped). The problematic markup lies almost entirely with text formatting (which in AsciiDoc is currently called the "quotes substitution"). In other words, this markup: https://docs.asciidoctor.org/asciidoc/latest/text/#inline-text-and-punctuation-styles. |
So we'd need to remove Strong, Emph, Code, and all other inline formatting for option 1? |
You'd need to remove any roles (i.e., CSS classes). The formatting itself is fine. It's the introduction of what's indistinguishable from an attribute on an inline macro that's the problem (e.g., key="value"). |
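To make option 1 concrete, here is a minimal Python sketch that drops roles from link text while keeping the text itself. It is only an illustration, not what pandoc does: `strip_roles` and `ROLE_PHRASE` are hypothetical names, and the pattern only covers the simple constrained form `[.role]#text#`, with no nesting and no unconstrained `##` phrases.

```python
import re

# Matches a constrained phrase carrying one or more roles,
# e.g. [.enumeration_chapter]#Chapter 28# (assumed form; nesting is not handled).
ROLE_PHRASE = re.compile(r"\[(?:\.[\w-]+)+\]#(.+?)#")


def strip_roles(link_text: str) -> str:
    """Replace each role-bearing phrase with its bare text."""
    return ROLE_PHRASE.sub(r"\1", link_text)


cleaned = strip_roles("Guide: [.enumeration_chapter]#Chapter 28# Transporting Data")
```

Plain formatting such as `*bold*` is left untouched, which matches the point above that the formatting itself is fine and only the roles are a problem.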
But how do I know which things will get substituted by something with key="value" in your toolchain? |
I can offer what we've written about the language, but I can't do all the work for you. It's necessary to study and understand the language and its processor to know what decisions to make. From my viewpoint, that's part of the work of making a language translator. I'm happy to answer questions as they come up, but that's all that I can offer to do. |
I would have thought that as a promoter of the language, it would be in your interest to have good tools for converting to it from other formats. I'm just not interested enough in asciidoc to spend more time on this, so I'm going to drop this thread. Maybe someone else will be interested enough to figure out how to handle these cases.
I did ask a question, above. So if you are really happy to answer questions, what is the answer? |
Again with that attitude. I don't understand why you have to come at me like that when I'm offering my time to help you with your project. It's your project that's offering to translate to AsciiDoc, so I don't see why you are acting put upon that you actually have to learn the rules of the language. As I've said before, I'm very happy to answer your questions (and I go out of my way to do so), but ultimately this is not my project. I don't appreciate you trying to guilt me into making it my responsibility. |
I have never used asciidoc, nor did I write this part of the code. The writer was contributed long ago by a third party. I'm happy to improve it in response to requests from asciidoc users, but I don't have time to become an expert in this format, so I need to rely on those who are. |
I found a very simple solution. When there are commas in the link text, I convert them to numeric entities. |
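The idea can be sketched in a few lines of Python. This is an illustration only, not pandoc's actual code: the helper names are hypothetical, and the decimal character reference `&#44;` is assumed as the numeric entity for a comma.

```python
def escape_link_text_commas(link_text: str) -> str:
    """Replace each comma with a numeric character reference (assumed: &#44;)."""
    return link_text.replace(",", "&#44;")


def asciidoc_link(url: str, link_text: str) -> str:
    """Build a URL macro of the form url[text] with commas in the text escaped."""
    return f"{url}[{escape_link_text_commas(link_text)}]"


link = asciidoc_link("https://example.com/doc", "12.1 manual, User's Guide")
```

Because the link text then contains no literal comma, it should no longer be split into separate attributes, while the entity still renders as a comma in the HTML output.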
That seems like a very reasonable approach. Nice thinking. |
I've got the Pandoc nightly version:
Edit: For the link text, it stopped at the closing square bracket of .enumeration_chapter]
|
The difference between this and the case I tested is that here the whole link is underlined.
I suspect that's the problem. Asciidoctor is closing the underline span at the third #. |
To summarize the current state of play. For <u><a href="http://refermefororacle.blogspot.hk/2015/10/cross-platform-backup-and-restore-in-12c.html" class="external-link" rel="nofollow">Refer me for Oracle , Cross-Platform Backup and Restore in 12c</a></u> pandoc 3.6.2 produces:
which I believe is correct. For <a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724" class="external-link" rel="nofollow"><u>12.1 manual, Database Backup and Recovery User's Guide: <span class="enumeration_chapter">Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets</u></a> pandoc 3.6.2 produces:
This isn't right; asciidoctor renders it as: <a href="https://docs.oracle.com/database/121/BRADV/rcmxplat.htm#BRADV724"><span class="underline">12.1 manual, Database Backup and Recovery User’s Guide: [.enumeration_chapter</a>#Chapter 28</span> Transporting Data Across Platforms, Steps to Transport a Database to a Different Platform Using Backup Sets#] But I'm still not sure how this kind of nested thing should be done in asciidoc. Without guidance on this we can't improve the output. |
Explain the problem.
Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.
Pandoc version?
What version of pandoc are you using, on what OS?
Fedora 36
pandoc --version
pandoc 2.14.0.3
Problem:
The URL/link is not converted correctly from HTML to adoc by Pandoc (reverse_adoc does not have this problem).
I have provided the sample files (the HTML and the files converted by Pandoc and reverse_adoc) as a zip file in this ticket.
Pandoc command used:
pandoc -t asciidoctor -f html original.html -o pandoc.adoc
Samples:
test.zip