Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupted code block after HTML block comment #200

Closed
step- opened this issue Dec 9, 2023 · 3 comments
Closed

Corrupted code block after HTML block comment #200

step- opened this issue Dec 9, 2023 · 3 comments
Labels

Comments

@step-
Copy link
Contributor

step- commented Dec 9, 2023

Example https://spec.commonmark.org/0.30/#example-134
Running md2html with just the example and no other text included meets the spec. However, when I add an HTML block before the example the test result changes unexpectedly and fails the spec.
Failing test subject:

<!-- x -->
    ```
    aaa
    ```

Expected output:

<!-- x -->
<pre><code>```
aaa
```
</code></pre>

Actual output (fail):

<!-- x -->
<pre><code>aaa
```
</code></pre>

Here's another example of the same problem. This time the HTML comment spans two lines. As in the first example, md2html is missing some text in the rendered code block.

<!-- x
-->
    ```
    aaa
    ```

Expected output:

<!-- x
-->
<pre><code>```
aaa
```
</code></pre>

Actual output (fail):

<!-- x
-->
<pre><code>aaa
```
</code></pre>
@N-R-K
Copy link

N-R-K commented Dec 10, 2023

Indented code blocks need to be separated by black lines: https://spec.commonmark.org/0.30/#indented-code-blocks

@step-
Copy link
Contributor Author

step- commented Dec 10, 2023

Indented code blocks need to be separated by black lines: https://spec.commonmark.org/0.30/#indented-code-blocks

I read that section differently. It says that there must be a blank line before the block to interrupt a preceding paragraph. The failing example concerns a type 2 HTML block, https://spec.commonmark.org/0.30/#html-blocks, before an indented code block. In the failing example, the HTML block is properly ended. Then comes an indented code block, which is rendered incorrectly. Running the failing example under cmark yields the reference output shown in my report.

@step-
Copy link
Contributor Author

step- commented Jan 8, 2024

edit: this is now issue #201

As a follow-up to @N-R-K's comment I verified md2html vs cmark for the following failing test subject:

start
    ```
    aaa
    ```

According to the spec linked above, the whole subject constitutes a paragraph that starts with "start" and ends with the second "```". If there was a blank line after the start line then the subject would consists of a one-line paragraph followed by a code block. Below we can see that cmark renders the test subject as a whole paragraph, stripping white space before lines 2, 3 and 4, which then amount to a code span.

cmark

<p>p
<code>aaa</code></p>

Below we can see that md2html renders the test subject incorrectly, much the same way it rendered the original subject above incorrectly:

md2html:

<p>p</p>
<pre><code>aaa
```
</code></pre>

Conclusion: non-conformance is broader than I originally observed.

I spent time tracing md4c.c and I'm pretty confident that the problem originates in the parser code and not in md2html application code.

@step- step- changed the title md2html fails example 134 from commonmark 0.30 spec Corrupted code block after HTML block comment Jan 8, 2024
step- added a commit to step-/md4c that referenced this issue Jan 8, 2024
For the specific case of a one-line block comment (HTML block type 2)
not followed by a blank line[1], MD4C fails to call the leave callback.
Consequently, if another HTML comment follows the first comment, both
end up being combined into a single block instead of staying as two
separate blocks, each with its own callback.

[1] In my comments to issue mity#200
I remarked that the spec doesn't require a blank line to close a
type 2 HTML block.

DETAILS

I can't think of a simple way to demonstrate this issue using md2html
alone, precisely because the lack of something can't be shown. Feeding
md2html the following markdown:

```
<!-- C1 -->
<!-- C2 -->
```

outputs the input text so one would be inclined to think that everything
is correct. However, what can't be seen is that there is no callback
between lines C1 and C2, while there should be one. Instead, a callback
after line C2 wrongly combines the two single-line blocks into a
two-line block. Trace the code or add printf statements as needed to
see the issue at work.

This commit fixes the problem by setting `ctx->html_block_type` to
a negative value as a special way to flag this end-of-block case.
step- added a commit to step-/md4c that referenced this issue Jan 8, 2024
@mity mity closed this as completed in 132c29d Jan 8, 2024
@mity mity added the bug label Jan 13, 2024
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jan 25, 2024
## Version 0.5.1

Changes:

 * LaTeX math extension (`MD_FLAG_LATEXMATHSPANS`) now requires that opener
   mark is not immediately preceded with alpha-numeric character and similarly
   that closer mark is not immediately followed with alpha-numeric character.

   So for example `foo$ x + y = z $` is not recognized as LaTeX equation
   anymore because there is no space between `foo` and the opening `$`.

 * Table extension (`MD_FLAG_TABLES`) now recognizes only tables with no more
   than 128 columns. This limit has been imposed to prevent a pathological
   case of quadratic output size explosion which could be used as DoS attack
   vector.

 * We are now more strict with `MD_FLAG_PERMISSIVExxxAUTOLINKS` family of
   extensions with respect to non-alphanumeric characters, with the aim to
   mitigate false positive detections.

   Only relatively few selected non-alphanumeric are now allowed in permissive
   e-mail auto-links (`MD_FLAG_PERMISSIVEEMAILAUTOLINKS`):
     - `.`, `-`, `_`, `+` in user name part of e-mail address; and
     - `.`, `-`, `_` in host part of the e-mail address.

   Similarly for URL and e-mail auto-links (`MD_FLAG_PERMISSIVEURLAUTOLINKS` and
   `MD_FLAG_PERMISSIVEWWWAUTOLINKS`):
     - `.`, `-`, `_` in host part of the URL;
     - `/`, `.`, `-`, `_` in path part of the URL;
     - `&`, `.`, `-`, `+`, `_`, `=`, `(`, `)` in the query part of the URL
       (additionally, if present, `(` and `)` must form balanced pairs); and
     - `.`, `-`, `+`, `_` in the fragment part of the URL.

   Furthermore these characters (with some exceptions like where they serve as
   delimiter characters, e.g. `/` for paths) are generally accepted only when
   an alphanumeric character both precedes and follows them (i.e. these cannot
   be "stacked" together).

Fixes:

 * Fix several bugs where we haven't properly respected already resolved spans
   of higher precedence level in handling of permissive auto-links extensions
   (family of `MD_FLAG_PERMISSIVExxxAUTOLINKS` flags), LaTeX math extension
   (`MD_FLAG_LATEXMATHSPANS`) and wiki-links extension (`MD_FLAG_WIKILINKS`)
   of the form `[[label|text]]` (with pipe `|`). In some complex cases this
   could lead to invalid internal parser state and memory corruption.

   Identified with [OSS-Fuzz](https://github.com/google/oss-fuzz).

 * [#222](mity/md4c#222):
   Fix strike-through extension (`MD_FLAG_STRIKETHROUGH`) which did not respect
   same rules for pairing opener and closer marks as other emphasis spans.

 * [#223](mity/md4c#223):
   Fix incorrect handling of new-line character just at the beginning and/or
   end of a code span where we were not following CommonMark specification
   requirements correctly.


## Version 0.5.0

Changes:

 * Changes mandated by CommonMark specification 0.30.

   Actually there are only very minor changes to recognition of HTML blocks:

   - The tag `<textarea>` now triggers HTML block (of type 1 as per the
     specification).

   - HTML declaration (HTML block type 4) is not required to begin with an
     upper-case ASCII character after the `<!`. Any ASCII character is now
     allowed. Also it now doesn't require a whitespace before the closing `>`.

   Other than that, the newest specification mainly improves test coverage and
   clarifies its wording in some cases, without affecting the implementation.

   Refer to [CommonMark
   0.30 notes](https://github.com/commonmark/commonmark-spec/releases/tag/0.30)
   for more info.

 * Make Unicode-specific code compliant to Unicode 15.1.

 * Update list of entities known to the HTML renderer from
   https://html.spec.whatwg.org/entities.json.

New Features:

 * Add extension allowing to treat all soft break as hard ones. It has to be
   explicitly enabled with `MD_FLAG_HARD_SOFT_BREAKS`.

   Contributed by [l-m](https://github.com/l1mey112).

 * Structure `MD_SPAN_A_DETAIL` now has a new member `is_autolink`.

   Contributed by [Jens Alfke](https://github.com/snej).

 * `md2html` utility now supports command line options `--html-title` and
   `--html-css`.

   Contributed by [Andreas Baumann](https://github.com/andreasbaumann).

Fixes:

 * [#163](mity/md4c#163):
   Make HTML renderer to emit `'\n'` after the root tag when in the XHTML mode.

 * [#165](mity/md4c#165):
   Make HTML renderer not to percent-encode `'~'` in URLs. Although it does
   work, it's not needed, and it can actually be confusing with URLs such as
   `http://www.example.com/~johndoe/`.

 * [#167](mity/md4c#167),
   [#168](mity/md4c#168):
   Fix multiple instances of various buffer overflow bugs, found mostly using
   a fuzz testing. Contributed by [dtldarek](https://github.com/dtldarek) and
   [Thierry Coppey](https://github.com/TCKnet).

 * [#169](mity/md4c#169):
   Table underline now does not require 3 characters per table column anymore.
   One dash (optionally with a leading or tailing `:` appended or prepended)
   is now sufficient. This improves compatibility with the GFM.

 * [#172](mity/md4c#172):
   Fix quadratic time behavior caused by unnecessary lookup for link reference
   definition even if the potential label contains nested brackets.

 * [#173](mity/md4c#173),
   [#174](mity/md4c#174),
   [#212](mity/md4c#212),
   [#213](mity/md4c#213):
   Multiple bugs identified with [OSS-Fuzz](https://github.com/google/oss-fuzz)
   were fixed.

 * [#190](mity/md4c#190),
   [#200](mity/md4c#200),
   [#201](mity/md4c#201):
   Multiple fixes of incorrect interactions of indented code block with a
   preceding block.

 * [#202](mity/md4c#202):
   We were not correctly calling `enter_block()` and `leave_block()` callbacks
   if multiple HTML blocks followed one after another; instead previously
   such blocks were merged into one.

   (This may likely impact only applications interested in Markdown's AST,
   and not just converting Markdown to other formats like HTML.)

 * [#210](mity/md4c#210):
   The `md2html` utility now handles nested images with optional titles
   correctly.

 * [#214](mity/md4c#214):
   Tags `<h2>` ... `<h6>` incorrectly did not trigger HTML block.

 * [#215](mity/md4c#215):
   The parser incorrectly did not accept optional tabs after setext header
   underline.

 * [#217](mity/md4c#217):
   The parser incorrectly resolved emphasis in some situations, if the emphasis
   marks were enclosed by punctuation characters.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants