<br> inside <code> is not turned into \n correctly #355

Closed
guojuntech opened this issue Oct 16, 2020 · 4 comments

@guojuntech

guojuntech commented Oct 16, 2020

Sample Html:

<code>
xxxx
<br>
yyyy
</code>

However, in the resulting Markdown, <br> is simply dropped and not turned into \n.
I checked rules.fencedCodeBlock and have a question about commonmark-rules.js#L114. In the content param, <br> is turned into \n. But on L114, node.firstChild.textContent is used instead of content. Is there a specific reason for this logic? Is it safe to use code = content?

Thanks

@martincizek
Collaborator

martincizek commented Dec 2, 2020

The default code span rule is supposed to be as unopinionated as possible while still producing valid Markdown. Please see CommonMark spec 6.3. Let's walk through the assumptions first:

In content param, <br> is turned into \n.

Strictly speaking, it is not. It is turned into \x20\x20\n (two spaces followed by a newline, i.e. a Markdown hard line break). In this case, the generated code span would contain made-up spaces.

Is it safe to use code = content.

Update: I first wrote the answer for code spans, but the key things are the same for code blocks.

Depends on your use case:

  1. (applies to both code spans and blocks) The code span would contain markup originating from any nested HTML elements. E.g. <code>function <i>foo_bar</i>()</code> becomes function _foo_bar_(). To me, the least opinionated and most reasonable interpretation is that nested elements are some sort of automatic syntax highlighting, which should be dropped.
  2. (applies to code spans) You'd have to ensure that the Markdown generated from the <code> contents is valid as a code span. E.g. a <p> within the <code> would break the code span. Two consecutive <br>s would break it too. See the specs mentioned above.
  3. (applies to code spans) There are some discussions regarding context-dependent escaping. Currently, the content is just unescaped, but this might change, although it would probably be done in a backward-compatible way if it happened.

So interpreting any HTML within code is definitely something that should be left to users' custom rules. Your rule can always choose to output an embedded HTML <code> element if you really need to represent line breaks within inline code. Then the nested elements would work too.
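
To make the points above concrete, here is a minimal sketch of such a custom rule. It is not Turndown's built-in rule: the names inlineText, escapeHtml and inlineCodeRule are made up for illustration, the filter is simplified, and collapsing whitespace in text nodes is an assumption about how inline <code> should be read. The rule drops nested markup, picks a backtick delimiter that keeps the span valid, and falls back to an embedded HTML <code> element when a <br> is present.

```javascript
// Hypothetical helper: plain text of a node, ignoring nested markup.
// Whitespace in text nodes is collapsed (assumption: inline <code> renders
// like normal inline HTML), and <br> is reported as "\n".
function inlineText(node) {
  if (node.nodeType === 3) return node.nodeValue.replace(/\s+/g, " ");
  if (node.nodeName === "BR") return "\n";
  return Array.from(node.childNodes || [], inlineText).join("");
}

// Hypothetical helper: escape text for use inside an HTML element.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Rule object in the { filter, replacement } shape that addRule(key, rule) accepts.
const inlineCodeRule = {
  // Simplified filter: any <code> whose parent is not <pre>.
  filter: (node) =>
    node.nodeName === "CODE" &&
    !(node.parentNode && node.parentNode.nodeName === "PRE"),

  replacement: (content, node) => {
    const text = inlineText(node).replace(/ *\n */g, "\n").trim();
    if (!text) return "";
    if (text.includes("\n")) {
      // A code span cannot represent line breaks, so emit embedded HTML instead.
      return "<code>" + escapeHtml(text).replace(/\n/g, "<br>") + "</code>";
    }
    // Delimiter is one backtick longer than the longest backtick run in the text.
    const runs = text.match(/`+/g) || [];
    const fence = "`".repeat(Math.max(0, ...runs.map((r) => r.length)) + 1);
    // Pad with a space when the text itself starts or ends with a backtick.
    const pad = /^`|`$/.test(text) ? " " : "";
    return fence + pad + text + pad + fence;
  },
};
```

It could be registered with turndownService.addRule("inlineCodeWithBr", inlineCodeRule), where the key is an arbitrary name. One caveat of the HTML fallback: text between the <code> tags is still parsed as Markdown inline content, so real-world use would likely need Markdown escaping as well.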

Now specifically for <br>:

  • It won't have much value for a code span (which I first thought you were asking about) - as the CommonMark spec says: First, line endings are converted to spaces.
  • I admit that <br> might be legit and quite common in code blocks. E.g. some WYSIWYG editors produce <br> instead of newlines in the <pre> element. But you're not asking about code blocks. :) This is still something to consider in the future.
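
To illustrate why a newline inside a code span buys nothing, here is a small sketch of how a CommonMark renderer normalizes the text between the backticks. The function name normalizeCodeSpan is made up for illustration; it only models the two normalization steps from the spec's code span rules, not delimiter matching.

```javascript
// Sketch of CommonMark's code span content normalization.
function normalizeCodeSpan(raw) {
  // Step 1: line endings are converted to spaces.
  const text = raw.replace(/\r\n|\r|\n/g, " ");
  // Step 2: if the text both begins and ends with a space and is not
  // made up of spaces only, one space is stripped from each side.
  if (text.startsWith(" ") && text.endsWith(" ") && /[^ ]/.test(text)) {
    return text.slice(1, -1);
  }
  return text;
}

// A newline emitted for <br> would be rendered as a plain space.
console.log(JSON.stringify(normalizeCodeSpan("xxxx\nyyyy")));
```

So even if the rule emitted \n for each <br>, the rendered inline code would show the two lines joined by a space.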

Two more remarks:

  • Besides custom rules, you might also choose to do some DOM preprocessing, so that the document better matches what it represents. It's not uncommon, see for example the source code of the codesample plugin from TinyMCE - it converts <br>s to newlines to make the document consistent.
  • There is an improvement in code span whitespace handling described in Preformatted inline code #318. Not related to newlines, but you can get more inspiration there.
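
For code blocks, the custom-rule route could look roughly like the sketch below. It is an illustration under assumptions, not Turndown's fencedCodeBlock rule: codeText and preWithBrRule are made-up names, the language annotation from the class attribute is left out, and backticks are assumed as the fence character. The rule walks the <code> subtree, keeps text as-is, drops nested markup, and turns each <br> into a newline.

```javascript
// Hypothetical helper: text of a code block, preserving whitespace,
// ignoring nested markup and treating <br> as a newline.
function codeText(node) {
  if (node.nodeType === 3) return node.nodeValue;
  if (node.nodeName === "BR") return "\n";
  return Array.from(node.childNodes || [], codeText).join("");
}

// Rule object in the { filter, replacement } shape that addRule(key, rule) accepts.
const preWithBrRule = {
  // Match <pre> whose first child is <code>.
  filter: (node) =>
    node.nodeName === "PRE" &&
    Boolean(node.firstChild) &&
    node.firstChild.nodeName === "CODE",

  replacement: (content, node) => {
    const code = codeText(node.firstChild).replace(/\n$/, "");
    // Keep the fence longer than any backtick run that starts a line of the code.
    const runs = code.match(/^`{3,}/gm) || [];
    const fence = "`".repeat(Math.max(2, ...runs.map((r) => r.length)) + 1);
    return "\n\n" + fence + "\n" + code + "\n" + fence + "\n\n";
  },
};
```

It could be registered with turndownService.addRule("preWithBr", preWithBrRule), where the key is an arbitrary name. Whether to do this in a rule or in a DOM preprocessing step is mostly a matter of taste; preprocessing has the advantage that the built-in rules keep working unchanged.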

Does this answer your question?

@guojuntech
Author

@martincizek Thanks a lot for your very detailed answer. It helps a lot. 👍

TechQuery added a commit to freeCodeCamp-China/article-webpage-to-markdown-action that referenced this issue Oct 10, 2023
@ggorlen

ggorlen commented Mar 5, 2024

Here's what I wound up doing to handle cases like this--I'd be curious to hear if there's a better, more robust approach.

For starters, I've found that multiline <code> is typically inside <pre>, but if it's not, Cheerio can be used to wrap it. Once it's in a <pre>, then newlines should be output in multiline code such as OP's example. But if the <code> content is on a single line and uses <br> for rendering newlines, $("pre code br").replaceWith("\n"); can be used to produce the expected result (or something hopefully close to it).

const cheerio = require("cheerio"); // ^1.0.0-rc.12
const TurndownService = require("turndown"); // ^7.1.2

const html = `<code>
xxxx
<br>
yyyy
</code>`;
const $ = cheerio.load(html);

// wrap with `<pre>` unless the <code> is already inside one
$("code")
  .filter((_, e) => $(e).closest("pre").length === 0)
  .replaceWith((_, e) => `<pre><code>${$(e).html().trim()}</code></pre>`);

// if <br> in your source code is on one line, e.g.
// `<code>xxxx<br>yyyy</code>`, you can use:
//$("pre code br").replaceWith("\n");

const turndownService = new TurndownService({
  codeBlockStyle: "fenced"
});
console.log($.html());
console.log("_".repeat(40));
console.log(turndownService.turndown($.html()));

@martincizek
Collaborator

But if the <code> content is on a single line and uses <br> for rendering newlines

If the <code> content is on a single line and uses <br> for rendering newlines, it can't be the result of rendering Markdown's inline code - you can try it yourself. Markdown can respect multiple spaces in inline code (e.g. GitLab does so with appropriate CSS), which is what the preformattedCode option reflects, but not newlines.

Only a subset of what can be expressed in HTML can be converted to Markdown. If you are outside of what can be expressed in Markdown, you need to write some opinionated preprocessor like you did when you wrapped the code in the pre tag.
