Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wiki links #92

Merged
merged 24 commits into from
Nov 4, 2019
Merged

Wiki links #92

merged 24 commits into from
Nov 4, 2019

Conversation

niblo
Copy link
Contributor

@niblo niblo commented Sep 9, 2019

This does not feature link labels, only the link target. Comments appreciated.

@codecov
Copy link

codecov bot commented Sep 9, 2019

Codecov Report

Merging #92 into master will increase coverage by 0.24%.
The diff coverage is 98.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   94.01%   94.26%   +0.24%     
==========================================
  Files           1        1              
  Lines        2642     2722      +80     
==========================================
+ Hits         2484     2566      +82     
+ Misses        158      156       -2
Impacted Files Coverage Δ
md4c/md4c.c 94.26% <98.86%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef85cfc...02b773c. Read the comment docs.

Copy link
Owner

@mity mity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far I have only few comments from reading the code.
I haven't tried it yet in run-time at all.

md2html/md2html.c Outdated Show resolved Hide resolved
md4c/md4c.c Outdated Show resolved Hide resolved
md4c/md4c.h Outdated Show resolved Hide resolved
md4c/md4c.c Outdated Show resolved Hide resolved
md4c/md4c.c Outdated Show resolved Hide resolved
@mity
Copy link
Owner

mity commented Sep 9, 2019

Seems good to me now so far. Do you plan to continue to work in the foreseeable future in this PR?

(It would be simpler for me, as such a feature deserves proper testing, including some fuzzing over-night so I would wait with that until you consider it complete to do it just once.)

@niblo
Copy link
Contributor Author

niblo commented Sep 9, 2019

Seems good to me now so far. Do you plan to continue to work in the foreseeable future in this PR?

I will continue with adding support for the label part of the wiki link during the week. The goal is to complete the wiki link feature, with proper support for labels.

I plan to get back to this in about 20 hours or so. In the meantime, thank you for the support.

@mity
Copy link
Owner

mity commented Sep 9, 2019

Ok. I would then fuzz-test it together when it is done.

It takes AFL on my machine many hours to cover all the code paths and it is almost unusable for anything else in the mean time. Hopefully will qualify to https://github.com/google/oss-fuzz one day but until then...

md4c/md4c.c Outdated

/* The wiki link is ignored if the `|` delimiter is used, but without a
* label. TODO Unsure if this is the desired behavior. How about
* removing the `|` from the output instead, and keeping the link? */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would follow what other important wiki implementations (wikipedia?) usually do.

md4c/md4c.c Outdated
break;
}

delim_index++;
Copy link
Owner

@mity mity Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scanning for the | likely opens the gates to quadratic behavior for [[[[[[[[foo|bar]]]]]]]].....

EDIT: Or maybe rather [[[[[[[[|||||||||||||]]]]]]]].....

md4c/md4c.c Outdated
@@ -3497,6 +3535,74 @@ md_resolve_links(MD_CTX* ctx, const MD_LINE* lines, int n_lines)
return 0;
}

static int
Copy link
Owner

@mity mity Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some trouble understanding why this big function is needed. Even if it turns out we need it, it has some issues I commented in its body.

Generally, I assumed we can still scan only backwards from the ]] as before. The only novelty would be that when we reach a |, we use it as a beginning of the ID from that point (the closing MD_MARK would be expanded to hide it from normal text flow), and that's it.

Also note we then do not have to collect | marks beforehand for wiki-links: We reach that char by the scanning over the ID anyway, so it would simply the logic quite a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I assumed we can still scan only backwards from the ]] as before. The only novelty would be that when we reach a |, we use it as a beginning of the ID from that point (the closing MD_MARK would be expanded to hide it from normal text flow), and that's it.

The closing mark? Don't you mean the opening mark?

Also note we then do not have to collect | marks beforehand for wiki-links: We reach that char by the scanning over the ID anyway, so it would simply the logic quite a lot.

That would be neat.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The closing mark? Don't you mean the opening mark?

[[text|id]]... I.e. the closer mark would be expanded back to cover not only ]] but whole |id]].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, then it would be so, if not for the fact that the wiki link structure is the other way around: [[id|text]].

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Mea culpa. So I got it whole wrong and we also need scanning forward from the opener, not backward. Whatever. Does not change the principle.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • possible quadratic behavior (although I cant find this)

I spent some thinking about it.

Maybe it is not there, at least if the iterating over MD_MARKs is done very carefully: If the ID length has a limit (say 100 characters), then it also limits the number of potential MD_MARK we have to iterate in order to find the pipe: If we reach any (non-dummy) MD_MARK which has MD_MARK::beg > opener.end + 100, we know there is no such pipe mark.

We must be careful to now allow iteration over arbitrary (unlimited) stuff. But if we can guarantee it has a (reasonable) limit, it should be ok.

(But sorry, I don't have time today to dig in the code today to verify whether it works that way already. Busy with other things.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to make a call to md_analyze_line. What I meant was that total_indent could be saved in the MD_LINE_ANALYSIS struct, and then handed/copied over to the MD_LINE struct in md_add_line_into_current_block. That way, in md_resolve_links, because we have access to the line(s), we would have access to the total_indent value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(See c48a709 for an implementation of this idea.)

Copy link
Contributor Author

@niblo niblo Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quadratic runtime behavior is definitely there (I did not test with a large enough link content before to notice it):

$ python -c "print '[[' * 4000 + 'x' * 400 + ']]' * 4000" | md2html/md2html  --fwiki-links -s > /dev/null
Time spent on parsing:   90.94 ms.

$ python -c "print '[[' * 8000 + 'x' * 400 + ']]' * 8000" | md2html/md2html  --fwiki-links -s > /dev/null
Time spent on parsing:  356.97 ms.

$ python -c "print '[[' * 16000 + 'x' * 400 + ']]' * 16000" | md2html/md2html  --fwiki-links -s > /dev/null
Time spent on parsing:  1.497 s.

I set a limit in commit 2bcd0bc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is the most critical structure for memory overhead of MD4C. We construct a vector of blocks and all their MD_LINEs per whole document before the inline processing. So 3 integers instead of 2 in this struct is a big change. I certainly won't do anything like that with a light heart.

If you do not want the new member there, it would be really simple to put the length restriction in the big if-statement leading into the wiki link handling:

        /* Detect and resolve wiki links. */
        if ((ctx->parser.flags & MD_FLAG_WIKILINKS) &&
            next_opener != NULL && next_closer != NULL &&
            (opener->end - opener->beg == 1) &&
            (next_opener->beg == opener->beg - 1) &&
            (next_closer->beg == closer->beg + 1) &&
            (next_opener->end - next_opener->beg == 1) &&
            (next_closer->end - next_closer->beg == 1) &&
            (next_opener->ch == '[' && next_closer->ch == ']'))
        {

But it could of course not account for potential leading block quote marks.

Or, if we want a consistent limit on the wiki link content that users can count on, counting the block quote markers "from scratch" is another alternative.

md4c/md4c.c Outdated
int newlines = 0;
int has_label = (opener->end - opener->beg > 2);

/* Check for newlines. */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see a point in this at all. The block parsing (the1 st parsing big step) already guarantees we do not try to match [[ with ]] across blocks (i.e. there is no blank line for example), and I do not see a principal reason why

[[a long text
over multiple
lines|ID]]

should be rejected.

So imho we can still happily only scan over the ID part.

Also your approach would not be reliable:

  1. "\r\n" is considered one new line digraph in Markdown.

  2. Once you do not simply fail when reaching a new line, you can happily scan over contents which does not belong to the block contents like block decorations or indentation, breaking the logic with the text length limit.

    Consider e.g. a multi-line wiki-link used in a nested blockquote:

    > > > > [[foo
    > > > > bar|id]
    

Also:
- Permit multiple lines in a label
- Prioritize links over tables cell boundaries
- Add test cases, test with --ftables

The limiting of content to 100 characters is still lacking in that it
does not exclude block quote markers that appear after the linebreak of
a label that spans multiple lines.
This does not affect anything in practice, because the limit of at most
100 characters is consistent between the first and second while-loop.
@mity
Copy link
Owner

mity commented Sep 19, 2019

@niblo Sorry for the long delay. Solving some real-life stuff.
I hope to get to it again during the week-end.

@niblo
Copy link
Contributor Author

niblo commented Sep 19, 2019 via email

@mity
Copy link
Owner

mity commented Oct 27, 2019

Fuzz testing has found many crash(es) in this branch, but likely they are all exhibitions of the same bug (multiple randomly chosen crash with the same backtrace).

Looks to me that there its some interference of code for tables and for wiki-links as it crashes only with --ftables. It even does not need --fwiki-links but the crash does not happen on master branch.

I'll try to minimize and textize the crashing input (it has tens of kB of ugly binary stuff) before posting it here.

@mity
Copy link
Owner

mity commented Oct 27, 2019

Minimized one of the crash inputs into this:

[f]: &#ffff;

x|x
---|---
[foo|x]

x|x
---|---
[foo|x]

/* A potential table cell boundary. */
if(table_mode && ch == _T('|')) {
/* A potential table cell boundary or wiki link label delimiter. */
if((table_mode || ctx->parser.flags & MD_FLAG_WIKILINKS) && ch == _T('|')) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed this issue in the code: Bit operator & has higher priority then || so the part of the condition in the brackets is equivalent to (table_mode | flags) & WIKILINKS. It should be table_mode | (flags & WIKILINKS).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. Not true, it's == which usually makes troubles. But I would prefer if you add the parenthesis there anyway, please.

@niblo
Copy link
Contributor Author

niblo commented Oct 30, 2019

Thanks for the feedback. I will try to have a look at it this weekend.

Comment on lines -3875 to -3881
/* (3) Links. */
md_analyze_marks(ctx, lines, n_lines, 0, ctx->n_marks, _T("[]!"));
MD_CHECK(md_resolve_links(ctx, lines, n_lines));
BRACKET_OPENERS.head = -1;
BRACKET_OPENERS.tail = -1;
ctx->unresolved_link_head = -1;
ctx->unresolved_link_tail = -1;
Copy link
Contributor Author

@niblo niblo Nov 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the issue goes away if I revert this, so that links are analyzed before tables. Will dig a bit deeper into this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake. It's still there, but the crash/bad output does not happen all the time, and I did not run md2html with the bad input enough times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, wait. It actually fixes the issue. I was jumping back and forth between branches. What I meant was: If the links analysis is moved to after the table analysis, it fixes the issue. So I will attempt to dig a bit deeper into this. Sorry for the confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Analyzing links before tables produces the same crash on the master branch too.

@mity: Can you tell if I did something stupid when moving the link analysis to occur before the table analysis? That would be in md_analyze_inlines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a sidenote, I seem to have found a related issue. This is on the master branch with no changes on top.


[a]: xyz

[abc]
<p><a href="xyz">abc</a></p>

[abc]: xyz

[a]
<p><a href="xyz">a</a></p>

[abc]: ijk
[a]: xyz

[a]
<p><a href="xyz">a</a></p>

[abb]: ijk
[a]: xyz

[a]
<p><a href="ijk">a</a></p>

[a]: ijk
[abc]: xyz

[abc]
<p><a href="xyz">abc</a></p>

[a]: ijk
[abb]: xyz

[abb]
<p><a href="ijk">abb</a></p>

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, maybe the culprit is elsewhere in md_is_table_row(). But it indeed should return FALSE. The pipe is inside the link body, links have higher priority and therefore it is not usable as a table cell boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md_is_link_reference is returning false when it should be true in md_resolve_links. It is the call that is commented with /* Might be collapsed reference link */.

Copy link
Owner

@mity mity Nov 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch. Ouch. Ouch. That makes it all clear to me. And it points to an aspect which may make all the idea of links having higher priority then tables very problematic. We have a very ugly catch-22 here.

The problem is that md_is_table_row() is called when we are still determining whether the line belongs to a table block. To detect where the table ends and another block starts. I.e. during the very 1st pass for block analysis. Yes, md_is_table_row() calls some inline analysis of the given line to resolve the higher precedence of some stuff like code spans, escaped pipes over the pipes used so the code spans in the table are possible and it also elegantly solves the case of escaped pipes.

The issue with links is that they can only be really resolved when complete block analysis is over and we know all link reference definitions in the document. (Because of this, the hash-table is currently only populated after we know all of them and that's why we hit the issue even if the ref. def. is before.)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I have just tried to do some experiments with github's Markdown. They clearly do not stop the table when there is a line without pipe.

A | B
--- | ---
no-pipe-here-but-still-a-table-row

renders into

A B
no-pipe-here-but-still-a-table-row

Hence, they likely do not need any line analysis at that early time. In that case I think we can change table parsing in a similar way. Improving the compatibility with them is a good thing on its own.

If I am right (and I believe I am) it should then fix the crash as a side effect. Actually it might even simplify quite a lot some things around the table parsing, maybe even remove the md_is_table_row() completely.

I'll try to cook something (in another branch because it is quite independent from this PR.)

Copy link
Contributor Author

@niblo niblo Nov 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a clever workaround. Looking forward to seeing if it works.

@mity mity mentioned this pull request Nov 3, 2019
mity added a commit that referenced this pull request Nov 3, 2019
md_is_table_row did some crazy inline parsing to detect whether the line
contains at least one pipe which is not inside a code span or other
high-priority inline element.

This was very complicated under the hood and to was actually breaking
the clean design which separates block analysis parse and inline analysis
of each block contents.

We now just use the table underline for determining the block is table
and its properties.

This means a paragraph now cannot interrupt a table. This is a change in
a behavior but likely acceptable one as it actually brings the behavior
closer to behavior of tables in cmark-gfm in this regard.

Last but not least, it seems to prevent adoption of other useful
features, for about that, see the discussion in PR #92.
mity added a commit that referenced this pull request Nov 3, 2019
md_is_table_row() did some crazy inline parsing to detect whether the
line contains at least one pipe which is not inside a code span or other
high-priority inline element.

This was very complicated under the hood and to was actually breaking
the clean design which separates block analysis parse and inline analysis
of each block contents.

We now just use the table underline for determining the block is table
and its properties.

This means a paragraph now cannot interrupt a table. This is a change in
a behavior but likely acceptable one as it actually brings the behavior
closer to behavior of tables in cmark-gfm in this regard.

Last but not least, it seems to prevent adoption of other useful
features, for about that, see the discussion in PR #92.
@mity
Copy link
Owner

mity commented Nov 3, 2019

I have just experimentally merged this PR and PR #97 and indeed, it seems to fix the crash. So, I'll fuzz-test that over the night and if there is no other surprise I'll then merge both tomorrow.

mity added a commit that referenced this pull request Nov 4, 2019
We do so by removing the function md_is_table_row().

md_is_table_row() did some crazy inline parsing to detect whether the
line contains at least one pipe which is not inside a code span or other
high-priority inline element.

This was very complicated under the hood and to was actually breaking
the clean design which separates block analysis parse and inline analysis
of each block contents.

We now just use the table underline for determining the block is table
and its properties like e.g. the column count.

This means a paragraph now cannot interrupt a table. This is a change in
a behavior but likely acceptable one as it actually brings the behavior
closer to behavior of tables in cmark-gfm in this regard.

Last but not least, it seems to prevent adoption of other useful
features, for about that, see the discussion in PR #92.
@mity mity merged commit e336e64 into mity:master Nov 4, 2019
@mity
Copy link
Owner

mity commented Nov 4, 2019

Finally merged.

@niblo, thanks for all the work and also for your patience.

@niblo
Copy link
Contributor Author

niblo commented Nov 4, 2019

It was my pleasure. I would like to thank you too for the very same reasons. Your support made it possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants