Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Full Commonmark compliance for Lists #2112

Merged
merged 23 commits into from
Aug 10, 2021

Conversation

calculuschild
Copy link
Contributor

@calculuschild calculuschild commented Jun 18, 2021

Marked version: 2.1.1

Markdown flavor: CommonMark

Description

A reworking of the Block Lists tokenizer. Also adjusting some of the New unit tests because they are not accurate to the actual Commonmark spec and dingus results.

  • Fixes Commonmark Examples 232, 234, 243, 244, 248, 250, 254, 276, 277, 287, 288, 289. This now passes all Lists and all List Items Commonmark tests!

  • Fixes some gitihub issues probably. Need to dig through and see.

This seems to be about the same speed as Master or slightly slower, but it's always hard to tell.

Side note, should we move the New tests that feature the pedantic option into the Original folder to be with all the other pedantic tests?

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel
Copy link

vercel bot commented Jun 18, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/4HWneXd69SCqp9wDXN2PgRHkLKHJ
✅ Preview: https://markedjs-git-fork-calculuschild-cleanuplists-markedjs.vercel.app

@UziTech
Copy link
Member

UziTech commented Jun 20, 2021

I believe original are actual specs for the original markdown.pl and new are basically any tests that are not actually part of any spec but should (for the most part) continue passing.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

Ok that makes sense. I could see that the New specs for Pedantic do seem to match the Daring Fireball dingus so that's fine.

I could use some help figuring out the rule for New: list_align_pedantic though. There doesn't seem to be any logical pattern here:

- one
 - two
  - three
    - four
     - five
      - six
       - seven

becomes

<ul>
	<li>one
		<ul>
			<li>two</li>
			<li>three</li>
			<li>four
				<ul>
					<li>five</li>
					<li>six</li>
					<li>seven</li>
				</ul>
			</li>
		</ul>
	</li>
</ul>

Same with New : list_item_text :

  * item1

    * item2

  text

becomes

<ul><li><p>item1</p>  <ul><li>item2 </li></ul> <p>text</p> </li></ul>

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

For New : main, the output HTML does not match the commonmark dingus. How do we want to correct this test? Change the input Markdown so it matches the test HTML, or change the expected HTML so it matches the test Markdown?

Or... after making either of those changes, this test might be redundant since I think it overlaps with existing tests everywhere...

@UziTech
Copy link
Member

UziTech commented Jun 21, 2021

New: list_align_pedantic fixes #1923

New: list_item_text fixes #1947

New: main should probably be marked as pedantic?

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

Ok, thanks for tracking that down. I have the logic from those issues working, but I still am having trouble with the alignment/indentation rules for those two test cases. Can you help me understand when a pedantic list considers something a sublist versus not? The daring fireball spec does not say anything about when to nest sublists that I can see.

list_align_pedantic

Is it something like... Every bullet with an indent between 1 and 4 spaces is nested at the same "level", and then 5-8 spaces is the next level?

Because the dingus gives REALLY strange results with stuff like this:

- one
     - teo
 - three

Becomes

<ul>
<li>one
<ul><li>teo
<ul><li>three</li></ul></li></ul></li>
</ul>

list_item_text

I'm not sure why "text" is part of the outer list because it is not indented enough relative to the top bullet point. The spec says

List items may consist of multiple paragraphs. Each subsequent paragraph in a list item must be indented by... 4 spaces

But maybe it's actually anywhere between 1-4 spaces and the indentation first bullet doesn't matter?

@UziTech
Copy link
Member

UziTech commented Jun 21, 2021

the dingus gives REALLY strange results with stuff

The original spec was very lax on details because it assumed well formatted markdown, so bad markdown should give unspecified results (garbage in, garbage out). Unfortunately people who write bad markdown still want consistency.

The easiest way to get around it is to have if (options.pedantic) blocks that correct for the daringfireball dingus results.

@UziTech
Copy link
Member

UziTech commented Jun 21, 2021

Every bullet with an indent between 1 and 4 spaces is nested at the same "level", and then 5-8 spaces is the next level

That seems to be the case.

But maybe it's actually anywhere between 1-4 spaces and the indentation first bullet doesn't matter?

yes

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

Right right. So I think you saying we should match the dingus as close as possible even if it's garbage, right?

My question then was an ask for help in reverse engineering the dingus logic so I can get the same garbage in garbage out for consistency, as you say. Then I can work out where and what to put into the options.pedsntic sections.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

New: main should probably be marked as pedantic?

@UziTech Unfortunately it doesn't follow Pedantic rules either. The problem area is this:

* List Item 2
  * New List Item 1
    Hi, this is a list item.
  * New List Item 2
    Another item
        Code goes here.
        Lots of it...      
  * New List Item 3
    The last item

The test wants it to become

<ul>
<li>New List Item 1 Hi, this is a list item.</li>
<li>New List Item 2 Another item 
<pre><code>Code goes here.
Lots of it...</code></pre>
</li>
<li>New List Item 3 The last item</li>
</ul>

However, an indented code block cannot interrupt a paragraph without a blank line before, both in CommonMark and in Pedantic.

So both Commonmark and Pedantic dingus give this:

<ul>
<li>New List Item 1 Hi, this is a list item.</li>
<li>New List Item 2 Another item Code goes here. Lots of it...</li>
<li>New List Item 3 The last item</li>
</ul>

My vote is to just remove this test altogether since it is an unwieldy, large, test that don't seem to cover anything that isn't already covered by smaller tests.

@UziTech
Copy link
Member

UziTech commented Jun 21, 2021

Sounds good it looks like it was created well before my time so most likely it is out of date.

I would be better to keep the tests specific anyway.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

Tadaa! Passing all spec tests now, plus 3 more Commonmark examples.

Unfortunately there are a bunch of Unit tests that are expecting tokens to look different now so that needs to be cleaned up. Essentially lists no longer consume blank newlines at the end, so you end up with space tokens instead (makes checking whether a list is loose or not more accurate as well). But the resulting HTML is the same.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 21, 2021

And now unit tests are all passing. I'm going to see if I can knock out the last commonmark specs, but I think this is in a good spot to review.

I also have a couple ideas to speed this up slightly but I want to see if I can get the last examples working first.

@calculuschild calculuschild mentioned this pull request Jul 3, 2021
5 tasks
src/Lexer.js Outdated
lastToken.raw += '\n' + token.raw;
lastToken.text += '\n' + token.raw;
} else {
if (!this.tokens.links[token.tag]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit pick: this could be changed to else if instead of nested else and if

Copy link
Member

@styfle styfle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks!

@calculuschild
Copy link
Contributor Author

Reminder to not merge this quite yet. Now that #2124 is approved I want to look over this again because it needs to be converted over to Lex its own child tokens in that new format.

@UziTech
Copy link
Member

UziTech commented Aug 2, 2021

I merged #2124 so you should be able to rebase this and make the changes for lexing the child tokens in the tokenizer.

@calculuschild calculuschild changed the base branch from master to dependabot/npm_and_yarn/babel/preset-env-7.14.9 August 6, 2021 04:27
@calculuschild calculuschild changed the base branch from dependabot/npm_and_yarn/babel/preset-env-7.14.9 to master August 6, 2021 04:27
@calculuschild
Copy link
Contributor Author

@UziTech .....sigh.... It is done.... :finnadie:

I don't know why but that was a nightmare.

@UziTech UziTech changed the title Full Commonmark compliance for Lists fix: Full Commonmark compliance for Lists Aug 10, 2021
@UziTech UziTech merged commit eb33d3b into markedjs:master Aug 10, 2021
github-actions bot pushed a commit that referenced this pull request Aug 16, 2021
# [3.0.0](v2.1.3...v3.0.0) (2021-08-16)

### Bug Fixes

* Add module field to package.json ([#2143](#2143)) ([edc2e6d](edc2e6d))
* drop node 10 support ([#2157](#2157)) ([433b16f](433b16f))
* Full Commonmark compliance for Lists ([#2112](#2112)) ([eb33d3b](eb33d3b))
* Refactor table tokens ([#2166](#2166)) ([bc400ac](bc400ac))

### BREAKING CHANGES

* - `table` tokens `header` property changed to contain an array of objects for each header cell with `text` and `tokens` properties.
- `table` tokens `cells` property changed to `rows` and is an array of rows where each row contains an array of objects for each cell with `text` and `tokens` properties.

v2:

```json
{
  "type": "table",
  "align": [null, null],
  "raw": "| a | b |\n|---|---|\n| 1 | 2 |\n",
  "header": ["a", "b"],
  "cells": [["1", "2"]],
  "tokens": {
    "header": [
      [{ "type": "text", "raw": "a", "text": "a" }],
      [{ "type": "text", "raw": "b", "text": "b" }]
    ],
    "cells": [[
      [{ "type": "text", "raw": "1", "text": "1" }],
      [{ "type": "text", "raw": "2", "text": "2" }]
    ]]
  }
}
```

v3:

```json
{
  "type": "table",
  "align": [null, null],
  "raw": "| a | b |\n|---|---|\n| 1 | 2 |\n",
  "header": [
    {
      "text": "a",
      "tokens": [{ "type": "text", "raw": "a", "text": "a" }]
    },
    {
      "text": "b",
      "tokens": [{ "type": "text", "raw": "b", "text": "b" }]
    }
  ],
  "rows": [
    {
      "text": "1",
      "tokens": [{ "type": "text", "raw": "1", "text": "1" }]
    },
    {
      "text": "2",
      "tokens": [{ "type": "text", "raw": "2", "text": "2" }]
    }
  ]
}
```
* Add module field to package.json
* drop node 10 support
@github-actions
Copy link

🎉 This PR is included in version 3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants