Wiki links #92

niblo · 2019-09-09T17:20:28Z

This does not feature link labels, only the link target. Comments appreciated.

codecov · 2019-09-09T17:23:27Z

Codecov Report

Merging #92 into master will increase coverage by 0.24%.
The diff coverage is 98.86%.

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   94.01%   94.26%   +0.24%     
==========================================
  Files           1        1              
  Lines        2642     2722      +80     
==========================================
+ Hits         2484     2566      +82     
+ Misses        158      156       -2

Impacted Files	Coverage Δ
md4c/md4c.c	`94.26% <98.86%> (+0.24%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef85cfc...02b773c. Read the comment docs.

mity

So far I have only few comments from reading the code.
I haven't tried it yet in run-time at all.

md2html/md2html.c

md4c/md4c.c

md4c/md4c.h

md4c/md4c.c

mity · 2019-09-09T20:23:34Z

Seems good to me now so far. Do you plan to continue to work in the foreseeable future in this PR?

(It would be simpler for me, as such a feature deserves proper testing, including some fuzzing over-night so I would wait with that until you consider it complete to do it just once.)

niblo · 2019-09-09T20:29:52Z

Seems good to me now so far. Do you plan to continue to work in the foreseeable future in this PR?

I will continue with adding support for the label part of the wiki link during the week. The goal is to complete the wiki link feature, with proper support for labels.

I plan to get back to this in about 20 hours or so. In the meantime, thank you for the support.

test/wiki-links.txt

mity · 2019-09-09T20:37:21Z

Ok. I would then fuzz-test it together when it is done.

It takes AFL on my machine many hours to cover all the code paths and it is almost unusable for anything else in the mean time. Hopefully will qualify to https://github.com/google/oss-fuzz one day but until then...

mity · 2019-09-10T16:17:26Z

md4c/md4c.c

+
+        /* The wiki link is ignored if the `|` delimiter is used, but without a
+         * label. TODO Unsure if this is the desired behavior. How about
+         * removing the `|` from the output instead, and keeping the link? */


In general, I would follow what other important wiki implementations (wikipedia?) usually do.

mity · 2019-09-10T16:19:38Z

md4c/md4c.c

+            break;
+        }
+
+        delim_index++;


This scanning for the | likely opens the gates to quadratic behavior for [[[[[[[[foo|bar]]]]]]]].....

EDIT: Or maybe rather [[[[[[[[|||||||||||||]]]]]]]].....

mity · 2019-09-10T16:26:16Z

md4c/md4c.c

@@ -3497,6 +3535,74 @@ md_resolve_links(MD_CTX* ctx, const MD_LINE* lines, int n_lines)
    return 0;
 }

+static int


I have some trouble understanding why this big function is needed. Even if it turns out we need it, it has some issues I commented in its body.

Generally, I assumed we can still scan only backwards from the ]] as before. The only novelty would be that when we reach a |, we use it as a beginning of the ID from that point (the closing MD_MARK would be expanded to hide it from normal text flow), and that's it.

Also note we then do not have to collect | marks beforehand for wiki-links: We reach that char by the scanning over the ID anyway, so it would simply the logic quite a lot.

Generally, I assumed we can still scan only backwards from the ]] as before. The only novelty would be that when we reach a |, we use it as a beginning of the ID from that point (the closing MD_MARK would be expanded to hide it from normal text flow), and that's it.

The closing mark? Don't you mean the opening mark?

Also note we then do not have to collect | marks beforehand for wiki-links: We reach that char by the scanning over the ID anyway, so it would simply the logic quite a lot.

That would be neat.

The closing mark? Don't you mean the opening mark?

[[text|id]]... I.e. the closer mark would be expanded back to cover not only ]] but whole |id]].

Aha, then it would be so, if not for the fact that the wiki link structure is the other way around: [[id|text]].

Oops. Mea culpa. So I got it whole wrong and we also need scanning forward from the opener, not backward. Whatever. Does not change the principle.

possible quadratic behavior (although I cant find this)

I spent some thinking about it.

Maybe it is not there, at least if the iterating over MD_MARKs is done very carefully: If the ID length has a limit (say 100 characters), then it also limits the number of potential MD_MARK we have to iterate in order to find the pipe: If we reach any (non-dummy) MD_MARK which has MD_MARK::beg > opener.end + 100, we know there is no such pipe mark.

We must be careful to now allow iteration over arbitrary (unlimited) stuff. But if we can guarantee it has a (reasonable) limit, it should be ok.

(But sorry, I don't have time today to dig in the code today to verify whether it works that way already. Busy with other things.)

I didn't mean to make a call to md_analyze_line. What I meant was that total_indent could be saved in the MD_LINE_ANALYSIS struct, and then handed/copied over to the MD_LINE struct in md_add_line_into_current_block. That way, in md_resolve_links, because we have access to the line(s), we would have access to the total_indent value.

(See c48a709 for an implementation of this idea.)

The quadratic runtime behavior is definitely there (I did not test with a large enough link content before to notice it):

$ python -c "print '[[' * 4000 + 'x' * 400 + ']]' * 4000" | md2html/md2html --fwiki-links -s > /dev/null Time spent on parsing: 90.94 ms. $ python -c "print '[[' * 8000 + 'x' * 400 + ']]' * 8000" | md2html/md2html --fwiki-links -s > /dev/null Time spent on parsing: 356.97 ms. $ python -c "print '[[' * 16000 + 'x' * 400 + ']]' * 16000" | md2html/md2html --fwiki-links -s > /dev/null Time spent on parsing: 1.497 s.

I set a limit in commit 2bcd0bc.

This is the most critical structure for memory overhead of MD4C. We construct a vector of blocks and all their MD_LINEs per whole document before the inline processing. So 3 integers instead of 2 in this struct is a big change. I certainly won't do anything like that with a light heart.

If you do not want the new member there, it would be really simple to put the length restriction in the big if-statement leading into the wiki link handling:

/* Detect and resolve wiki links. */ if ((ctx->parser.flags & MD_FLAG_WIKILINKS) && next_opener != NULL && next_closer != NULL && (opener->end - opener->beg == 1) && (next_opener->beg == opener->beg - 1) && (next_closer->beg == closer->beg + 1) && (next_opener->end - next_opener->beg == 1) && (next_closer->end - next_closer->beg == 1) && (next_opener->ch == '[' && next_closer->ch == ']')) {

But it could of course not account for potential leading block quote marks.

Or, if we want a consistent limit on the wiki link content that users can count on, counting the block quote markers "from scratch" is another alternative.

mity · 2019-09-10T16:33:35Z

md4c/md4c.c

+    int newlines = 0;
+    int has_label = (opener->end - opener->beg > 2);
+
+    /* Check for newlines. */


I do not see a point in this at all. The block parsing (the1 st parsing big step) already guarantees we do not try to match [[ with ]] across blocks (i.e. there is no blank line for example), and I do not see a principal reason why

[[a long text over multiple lines|ID]]

should be rejected.

So imho we can still happily only scan over the ID part.

Also your approach would not be reliable:

"\r\n" is considered one new line digraph in Markdown.

Once you do not simply fail when reaching a new line, you can happily scan over contents which does not belong to the block contents like block decorations or indentation, breaking the logic with the text length limit.

Consider e.g. a multi-line wiki-link used in a nested blockquote:

> > > > [[foo > > > > bar|id]

Also: - Permit multiple lines in a label - Prioritize links over tables cell boundaries - Add test cases, test with --ftables The limiting of content to 100 characters is still lacking in that it does not exclude block quote markers that appear after the linebreak of a label that spans multiple lines.

This does not affect anything in practice, because the limit of at most 100 characters is consistent between the first and second while-loop.

mity · 2019-09-19T10:21:48Z

@niblo Sorry for the long delay. Solving some real-life stuff.
I hope to get to it again during the week-end.

niblo · 2019-09-19T10:34:16Z

Sounds good.

…

On 19 Sep 2019, at 12:21, Martin Mitáš wrote: @niblo Sorry for the long delay. Solving some real-life stuff. I hope to get to it again during the week-end. -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #92 (comment)

mity · 2019-10-27T14:01:10Z

Fuzz testing has found many crash(es) in this branch, but likely they are all exhibitions of the same bug (multiple randomly chosen crash with the same backtrace).

Looks to me that there its some interference of code for tables and for wiki-links as it crashes only with --ftables. It even does not need --fwiki-links but the crash does not happen on master branch.

I'll try to minimize and textize the crashing input (it has tens of kB of ugly binary stuff) before posting it here.

mity · 2019-10-27T15:03:50Z

Minimized one of the crash inputs into this:

[f]: &#ffff;

x|x
---|---
[foo|x]

x|x
---|---
[foo|x]

mity · 2019-10-27T15:30:12Z

md4c/md4c.c

-            /* A potential table cell boundary. */
-            if(table_mode  &&  ch == _T('|')) {
+            /* A potential table cell boundary or wiki link label delimiter. */
+            if((table_mode || ctx->parser.flags & MD_FLAG_WIKILINKS) && ch == _T('|')) {


Noticed this issue in the code: Bit operator & has higher priority then || so the part of the condition in the brackets is equivalent to (table_mode | flags) & WIKILINKS. It should be table_mode | (flags & WIKILINKS).

Sorry. Not true, it's == which usually makes troubles. But I would prefer if you add the parenthesis there anyway, please.

niblo · 2019-10-30T07:05:29Z

Thanks for the feedback. I will try to have a look at it this weekend.

niblo · 2019-11-03T10:56:26Z

md4c/md4c.c

-    /* (3) Links. */
-    md_analyze_marks(ctx, lines, n_lines, 0, ctx->n_marks, _T("[]!"));
-    MD_CHECK(md_resolve_links(ctx, lines, n_lines));
-    BRACKET_OPENERS.head = -1;
-    BRACKET_OPENERS.tail = -1;
-    ctx->unresolved_link_head = -1;
-    ctx->unresolved_link_tail = -1;


It seems like the issue goes away if I revert this, so that links are analyzed before tables. Will dig a bit deeper into this...

My mistake. It's still there, but the crash/bad output does not happen all the time, and I did not run md2html with the bad input enough times.

No, wait. It actually fixes the issue. I was jumping back and forth between branches. What I meant was: If the links analysis is moved to after the table analysis, it fixes the issue. So I will attempt to dig a bit deeper into this. Sorry for the confusion.

Update: Analyzing links before tables produces the same crash on the master branch too.

@mity: Can you tell if I did something stupid when moving the link analysis to occur before the table analysis? That would be in md_analyze_inlines.

As a sidenote, I seem to have found a related issue. This is on the master branch with no changes on top.

[a]: xyz [abc]

<a href="xyz">abc</a>

[abc]: xyz [a]

<a href="xyz">a</a>

[abc]: ijk [a]: xyz [a]

<a href="xyz">a</a>

[abb]: ijk [a]: xyz [a]

<a href="ijk">a</a>

[a]: ijk [abc]: xyz [abc]

<a href="xyz">abc</a>

[a]: ijk [abb]: xyz [abb]

<a href="ijk">abb</a>

I see, maybe the culprit is elsewhere in md_is_table_row(). But it indeed should return FALSE. The pipe is inside the link body, links have higher priority and therefore it is not usable as a table cell boundary.

md_is_link_reference is returning false when it should be true in md_resolve_links. It is the call that is commented with /* Might be collapsed reference link */.

Ouch. Ouch. Ouch. That makes it all clear to me. And it points to an aspect which may make all the idea of links having higher priority then tables very problematic. We have a very ugly catch-22 here.

The problem is that md_is_table_row() is called when we are still determining whether the line belongs to a table block. To detect where the table ends and another block starts. I.e. during the very 1st pass for block analysis. Yes, md_is_table_row() calls some inline analysis of the given line to resolve the higher precedence of some stuff like code spans, escaped pipes over the pipes used so the code spans in the table are possible and it also elegantly solves the case of escaped pipes.

The issue with links is that they can only be really resolved when complete block analysis is over and we know all link reference definitions in the document. (Because of this, the hash-table is currently only populated after we know all of them and that's why we hit the issue even if the ref. def. is before.)

Interesting, I have just tried to do some experiments with github's Markdown. They clearly do not stop the table when there is a line without pipe.

A | B --- | --- no-pipe-here-but-still-a-table-row

renders into

A B

no-pipe-here-but-still-a-table-row

Hence, they likely do not need any line analysis at that early time. In that case I think we can change table parsing in a similar way. Improving the compatibility with them is a good thing on its own.

If I am right (and I believe I am) it should then fix the crash as a side effect. Actually it might even simplify quite a lot some things around the table parsing, maybe even remove the md_is_table_row() completely.

I'll try to cook something (in another branch because it is quite independent from this PR.)

That sounds like a clever workaround. Looking forward to seeing if it works.

md_is_table_row did some crazy inline parsing to detect whether the line contains at least one pipe which is not inside a code span or other high-priority inline element. This was very complicated under the hood and to was actually breaking the clean design which separates block analysis parse and inline analysis of each block contents. We now just use the table underline for determining the block is table and its properties. This means a paragraph now cannot interrupt a table. This is a change in a behavior but likely acceptable one as it actually brings the behavior closer to behavior of tables in cmark-gfm in this regard. Last but not least, it seems to prevent adoption of other useful features, for about that, see the discussion in PR #92.

md_is_table_row() did some crazy inline parsing to detect whether the line contains at least one pipe which is not inside a code span or other high-priority inline element. This was very complicated under the hood and to was actually breaking the clean design which separates block analysis parse and inline analysis of each block contents. We now just use the table underline for determining the block is table and its properties. This means a paragraph now cannot interrupt a table. This is a change in a behavior but likely acceptable one as it actually brings the behavior closer to behavior of tables in cmark-gfm in this regard. Last but not least, it seems to prevent adoption of other useful features, for about that, see the discussion in PR #92.

mity · 2019-11-03T20:01:23Z

I have just experimentally merged this PR and PR #97 and indeed, it seems to fix the crash. So, I'll fuzz-test that over the night and if there is no other surprise I'll then merge both tomorrow.

We do so by removing the function md_is_table_row(). md_is_table_row() did some crazy inline parsing to detect whether the line contains at least one pipe which is not inside a code span or other high-priority inline element. This was very complicated under the hood and to was actually breaking the clean design which separates block analysis parse and inline analysis of each block contents. We now just use the table underline for determining the block is table and its properties like e.g. the column count. This means a paragraph now cannot interrupt a table. This is a change in a behavior but likely acceptable one as it actually brings the behavior closer to behavior of tables in cmark-gfm in this regard. Last but not least, it seems to prevent adoption of other useful features, for about that, see the discussion in PR #92.

mity · 2019-11-04T14:22:21Z

Finally merged.

@niblo, thanks for all the work and also for your patience.

niblo · 2019-11-04T14:28:09Z

It was my pleasure. I would like to thank you too for the very same reasons. Your support made it possible.

niblo added 2 commits September 9, 2019 18:11

Add draft for wiki links.

01a8533

Remove accidental tabs.

420fe53

mity requested changes Sep 9, 2019

View reviewed changes

md2html/md2html.c Outdated Show resolved Hide resolved

md4c/md4c.c Outdated Show resolved Hide resolved

md4c/md4c.h Outdated Show resolved Hide resolved

mity reviewed Sep 9, 2019

View reviewed changes

md4c/md4c.c Outdated Show resolved Hide resolved

niblo added 3 commits September 9, 2019 20:29

Remove unnecessary text attribute in MD_SPAN_WIKILINK.

b50f2fb

Use 'K' as short for flag MD_FLAG_WIKILINKS.

9cb33b8

If no wiki link, allow for normal link.

fd47645

niblo commented Sep 9, 2019

View reviewed changes

md4c/md4c.c Outdated Show resolved Hide resolved

niblo added 2 commits September 9, 2019 22:01

Re-implement md_is_inline_wikilink_spec.

570c55a

Add more test cases.

b7493cd

niblo commented Sep 9, 2019

View reviewed changes

test/wiki-links.txt Show resolved Hide resolved

Fix off by one errors in wiki link id/target length test.

fb5a96b

niblo added 5 commits September 10, 2019 10:29

Fix bad test.

96d2192

Add test for image priority over wiki links.

1ba3d55

Remove commented code from wiki link resolution.

9cdc8be

Add draft for extending wiki links to handle labels.

31f8ca8

Fix miswording.

8342e43

mity reviewed Sep 10, 2019

View reviewed changes

niblo added 8 commits September 11, 2019 10:55

Count wiki link content length properly.

c48a709

Simplify wiki link content length limit check.

64bc561

Remove redundant if-statement.

7586d88

Accept wiki links with trailing delimiter but no label.

676f8f7

Prevent runaway O(n^2) performance.

2bcd0bc

Disable wiki link if image in target.

11e84e7

Fix limit for preventing O(n^2) runaway runtime.

8101541

Prevent accidentally resolving character other than '|'.

4e89c3c

This does not affect anything in practice, because the limit of at most 100 characters is consistent between the first and second while-loop.

mity reviewed Oct 27, 2019

View reviewed changes

niblo commented Nov 3, 2019

View reviewed changes

mity mentioned this pull request Nov 3, 2019

Link labels comparison issues #96

Closed

mity mentioned this pull request Nov 3, 2019

md_is_table_row: Remove the function. #97

Merged

mity added 2 commits November 4, 2019 15:07

Merge branch 'master' into wiki-links

2f2c7e5

CHANGELOG.md: Update.

02b773c

mity merged commit e336e64 into mity:master Nov 4, 2019

niblo mentioned this pull request Nov 4, 2019

Feature suggestion: Wiki style links #90

Closed

bachmeil mentioned this pull request Jan 29, 2023

[Feature request] Update the port to include wikilinks AuburnSounds/commonmark-d#2

Closed

Wiki links #92

Wiki links #92

Conversation

niblo commented Sep 9, 2019

codecov bot commented Sep 9, 2019 • edited Loading

Codecov Report

mity left a comment

Choose a reason for hiding this comment

mity commented Sep 9, 2019

niblo commented Sep 9, 2019 • edited Loading

mity commented Sep 9, 2019

Choose a reason for hiding this comment

mity Sep 10, 2019 • edited Loading

Choose a reason for hiding this comment

mity Sep 10, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niblo Sep 12, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mity commented Sep 19, 2019

niblo commented Sep 19, 2019 via email

mity commented Oct 27, 2019

mity commented Oct 27, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niblo commented Oct 30, 2019

niblo Nov 3, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mity Nov 3, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niblo Nov 3, 2019 • edited Loading

Choose a reason for hiding this comment

mity commented Nov 3, 2019 • edited Loading

mity commented Nov 4, 2019

niblo commented Nov 4, 2019

codecov bot commented Sep 9, 2019 •

edited

Loading

niblo commented Sep 9, 2019 •

edited

Loading

mity Sep 10, 2019 •

edited

Loading

mity Sep 10, 2019 •

edited

Loading

niblo Sep 12, 2019 •

edited

Loading

niblo Nov 3, 2019 •

edited

Loading

mity Nov 3, 2019 •

edited

Loading

niblo Nov 3, 2019 •

edited

Loading

mity commented Nov 3, 2019 •

edited

Loading