Wiki links #92
Codecov Report

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   94.01%   94.26%   +0.24%
==========================================
  Files           1        1
  Lines        2642     2722      +80
==========================================
+ Hits         2484     2566      +82
+ Misses        158      156       -2
So far I have only a few comments from reading the code. I haven't tried it at run-time yet at all.

Seems good to me so far. Do you plan to continue working on this PR in the foreseeable future? (It would be simpler for me, as such a feature deserves proper testing, including some over-night fuzzing, so I would wait until you consider it complete and do it just once.)

I will continue with adding support for the label part of the wiki link during the week. The goal is to complete the wiki link feature, with proper support for labels. I plan to get back to this in about 20 hours or so. In the meantime, thank you for the support.

OK. I would then fuzz-test it all together when it is done. It takes AFL many hours on my machine to cover all the code paths and the machine is almost unusable for anything else in the meantime. Hopefully this will qualify for https://github.com/google/oss-fuzz one day, but until then...
md4c/md4c.c (outdated)

    /* The wiki link is ignored if the `|` delimiter is used, but without a
     * label. TODO Unsure if this is the desired behavior. How about
     * removing the `|` from the output instead, and keeping the link? */
In general, I would follow what other important wiki implementations (Wikipedia?) usually do.
md4c/md4c.c (outdated)

        break;
    }

    delim_index++;
This scanning for the `|` likely opens the gates to quadratic behavior for `[[[[[[[[foo|bar]]]]]]]]...`

EDIT: Or maybe rather `[[[[[[[[|||||||||||||]]]]]]]]...`
md4c/md4c.c (outdated)

    @@ -3497,6 +3535,74 @@ md_resolve_links(MD_CTX* ctx, const MD_LINE* lines, int n_lines)
        return 0;
    }

    static int
I have some trouble understanding why this big function is needed. Even if it turns out we need it, it has some issues I commented on in its body.

Generally, I assumed we can still scan only backwards from the `]]` as before. The only novelty would be that when we reach a `|`, we use it as the beginning of the ID from that point (the closing `MD_MARK` would be expanded to hide it from normal text flow), and that's it.

Also note we then would not have to collect `|` marks beforehand for wiki-links: we reach that character by scanning over the ID anyway, so it would simplify the logic quite a lot.
> Generally, I assumed we can still scan only backwards from the `]]` as before. The only novelty would be that when we reach a `|`, we use it as the beginning of the ID from that point (the closing `MD_MARK` would be expanded to hide it from normal text flow), and that's it.

The closing mark? Don't you mean the opening mark?

> Also note we then would not have to collect `|` marks beforehand for wiki-links: we reach that character by scanning over the ID anyway, so it would simplify the logic quite a lot.

That would be neat.
> The closing mark? Don't you mean the opening mark?

`[[text|id]]` ... I.e. the closer mark would be expanded back to cover not only `]]` but the whole `|id]]`.
Aha, then it would be so, if not for the fact that the wiki link structure is the other way around: `[[id|text]]`.
Oops. Mea culpa. So I got it all wrong and we also need to scan forward from the opener, not backward. Whatever. It does not change the principle.
> possible quadratic behavior (although I can't find this)

I spent some time thinking about it.

Maybe it is not there, at least if the iterating over `MD_MARK`s is done very carefully: if the ID length has a limit (say 100 characters), then it also limits the number of potential `MD_MARK`s we have to iterate over in order to find the pipe: if we reach any (non-dummy) `MD_MARK` which has `MD_MARK::beg > opener.end + 100`, we know there is no such pipe mark.

We must be careful to not allow iteration over arbitrary (unlimited) stuff. But if we can guarantee it has a (reasonable) limit, it should be OK.

(But sorry, I don't have time today to dig into the code to verify whether it works that way already. Busy with other things.)
I didn't mean to make a call to `md_analyze_line`. What I meant was that `total_indent` could be saved in the `MD_LINE_ANALYSIS` struct, and then handed/copied over to the `MD_LINE` struct in `md_add_line_into_current_block`. That way, in `md_resolve_links`, because we have access to the line(s), we would have access to the `total_indent` value.
(See c48a709 for an implementation of this idea.)
The quadratic runtime behavior is definitely there (I did not test with large enough link content before to notice it). The time roughly quadruples each time the input doubles:

$ python -c "print '[[' * 4000 + 'x' * 400 + ']]' * 4000" | md2html/md2html --fwiki-links -s > /dev/null
Time spent on parsing: 90.94 ms.
$ python -c "print '[[' * 8000 + 'x' * 400 + ']]' * 8000" | md2html/md2html --fwiki-links -s > /dev/null
Time spent on parsing: 356.97 ms.
$ python -c "print '[[' * 16000 + 'x' * 400 + ']]' * 16000" | md2html/md2html --fwiki-links -s > /dev/null
Time spent on parsing: 1.497 s.

I set a limit in commit 2bcd0bc.
> This is the most critical structure for memory overhead of MD4C. We construct a vector of blocks and all their `MD_LINE`s per whole document before the inline processing. So 3 integers instead of 2 in this struct is a big change. I certainly won't do anything like that with a light heart.
If you do not want the new member there, it would be really simple to put the length restriction in the big if-statement leading into the wiki link handling:
/* Detect and resolve wiki links. */
if ((ctx->parser.flags & MD_FLAG_WIKILINKS) &&
next_opener != NULL && next_closer != NULL &&
(opener->end - opener->beg == 1) &&
(next_opener->beg == opener->beg - 1) &&
(next_closer->beg == closer->beg + 1) &&
(next_opener->end - next_opener->beg == 1) &&
(next_closer->end - next_closer->beg == 1) &&
(next_opener->ch == '[' && next_closer->ch == ']'))
{
But of course it could not account for potential leading block quote marks.
Or, if we want a consistent limit on the wiki link content that users can count on, counting the block quote markers "from scratch" is another alternative.
md4c/md4c.c (outdated)

    int newlines = 0;
    int has_label = (opener->end - opener->beg > 2);

    /* Check for newlines. */
I do not see a point in this at all. The block parsing (the 1st big parsing step) already guarantees we do not try to match `[[` with `]]` across blocks (i.e. there is no blank line, for example), and I do not see a principal reason why

[[a long text
over multiple
lines|ID]]

should be rejected. So imho we can still happily scan only over the ID part.

Also, your approach would not be reliable:

- `"\r\n"` is considered one new-line digraph in Markdown.
- Once you do not simply fail when reaching a new line, you can happily scan over contents which do not belong to the block contents, like block decorations or indentation, breaking the logic with the text length limit. Consider e.g. a multi-line wiki-link used in a nested blockquote:

> > > > [[foo
> > > > bar|id]]
Also:

- Permit multiple lines in a label
- Prioritize links over table cell boundaries
- Add test cases, test with --ftables

The limiting of content to 100 characters is still lacking in that it does not exclude block quote markers that appear after the linebreak of a label that spans multiple lines. This does not affect anything in practice, because the limit of at most 100 characters is consistent between the first and second while-loop.
@niblo Sorry for the long delay. Solving some real-life stuff.
Sounds good.

> On 19 Sep 2019, at 12:21, Martin Mitáš wrote:
> @niblo Sorry for the long delay. Solving some real-life stuff.
> I hope to get to it again during the week-end.
Fuzz testing has found many crashes in this branch, but likely they are all manifestations of the same bug (multiple randomly chosen crashes have the same backtrace). It looks to me that there is some interference between the code for tables and for wiki-links, as it crashes only with

I'll try to minimize and textize the crashing input (it has tens of kB of ugly binary stuff) before posting it here.

Minimized one of the crash inputs into this:
md4c/md4c.c

    -    /* A potential table cell boundary. */
    -    if(table_mode && ch == _T('|')) {
    +    /* A potential table cell boundary or wiki link label delimiter. */
    +    if((table_mode || ctx->parser.flags & MD_FLAG_WIKILINKS) && ch == _T('|')) {
Noticed this issue in the code: bit operator `&` has higher priority than `||`, so the part of the condition in the brackets is equivalent to `(table_mode | flags) & WIKILINKS`. It should be `table_mode | (flags & WIKILINKS)`.
Sorry. Not true, it's `==` which usually makes troubles. But I would prefer if you add the parentheses there anyway, please.
Thanks for the feedback. I will try to have a look at it this weekend.
    /* (3) Links. */
    md_analyze_marks(ctx, lines, n_lines, 0, ctx->n_marks, _T("[]!"));
    MD_CHECK(md_resolve_links(ctx, lines, n_lines));
    BRACKET_OPENERS.head = -1;
    BRACKET_OPENERS.tail = -1;
    ctx->unresolved_link_head = -1;
    ctx->unresolved_link_tail = -1;
It seems like the issue goes away if I revert this, so that links are analyzed before tables. Will dig a bit deeper into this...
My mistake. It's still there, but the crash/bad output does not happen all the time, and I did not run md2html with the bad input enough times.
No, wait. It actually fixes the issue. I was jumping back and forth between branches. What I meant was: If the links analysis is moved to after the table analysis, it fixes the issue. So I will attempt to dig a bit deeper into this. Sorry for the confusion.
Update: Analyzing links before tables produces the same crash on the master branch too.
@mity: Can you tell if I did something stupid when moving the link analysis to occur before the table analysis? That would be in `md_analyze_inlines`.
As a sidenote, I seem to have found a related issue. This is on the master branch with no changes on top.
[a]: xyz
[abc]

<p><a href="xyz">abc</a></p>

[abc]: xyz
[a]

<p><a href="xyz">a</a></p>

[abc]: ijk
[a]: xyz
[a]

<p><a href="xyz">a</a></p>

[abb]: ijk
[a]: xyz
[a]

<p><a href="ijk">a</a></p>

[a]: ijk
[abc]: xyz
[abc]

<p><a href="xyz">abc</a></p>

[a]: ijk
[abb]: xyz
[abb]

<p><a href="ijk">abb</a></p>
I see, maybe the culprit is elsewhere, in `md_is_table_row()`. But it indeed should return `FALSE`. The pipe is inside the link body, links have higher priority, and therefore it is not usable as a table cell boundary.
`md_is_link_reference` is returning false when it should be true in `md_resolve_links`. It is the call that is commented with `/* Might be collapsed reference link */`.
Ouch. Ouch. Ouch. That makes it all clear to me. And it points to an aspect which may make the whole idea of links having higher priority than tables very problematic. We have a very ugly catch-22 here.

The problem is that `md_is_table_row()` is called when we are still determining whether the line belongs to a table block, to detect where the table ends and another block starts, i.e. during the very 1st pass for block analysis. `md_is_table_row()` calls some inline analysis of the given line to resolve the higher precedence of some stuff like code spans and escaped pipes over the pipes used as cell boundaries, so that code spans in a table are possible, and it also elegantly solves the case of escaped pipes.

The issue with links is that they can only be really resolved when the complete block analysis is over and we know all link reference definitions in the document. (Because of this, the hash table is currently only populated after we know all of them, and that's why we hit the issue even if the ref. def. comes before.)
Interesting, I have just tried to do some experiments with GitHub's Markdown. They clearly do not stop the table when there is a line without a pipe.

A | B
--- | ---
no-pipe-here-but-still-a-table-row

renders into

A | B |
---|---|
no-pipe-here-but-still-a-table-row |

Hence, they likely do not need any line analysis at that early time. In that case I think we can change table parsing in a similar way. Improving compatibility with them is a good thing on its own.

If I am right (and I believe I am), it should then fix the crash as a side effect. Actually it might even simplify quite a lot of things around the table parsing, maybe even remove `md_is_table_row()` completely.

I'll try to cook something (in another branch, because it is quite independent from this PR).
That sounds like a clever workaround. Looking forward to seeing if it works.
md_is_table_row() did some crazy inline parsing to detect whether the line contains at least one pipe which is not inside a code span or other high-priority inline element. This was very complicated under the hood and it was actually breaking the clean design which separates the block analysis pass from the inline analysis of each block's contents. We now just use the table underline for determining that the block is a table and its properties. This means a paragraph now cannot interrupt a table. This is a change in behavior, but likely an acceptable one, as it actually brings the behavior closer to that of tables in cmark-gfm in this regard. Last but not least, the old approach seemed to prevent adoption of other useful features; for more about that, see the discussion in PR #92.
I have just experimentally merged this PR and PR #97 and indeed, it seems to fix the crash. So, I'll fuzz-test that over the night and if there is no other surprise I'll then merge both tomorrow.
We do so by removing the function md_is_table_row(). md_is_table_row() did some crazy inline parsing to detect whether the line contains at least one pipe which is not inside a code span or other high-priority inline element. This was very complicated under the hood and it was actually breaking the clean design which separates the block analysis pass from the inline analysis of each block's contents. We now just use the table underline for determining that the block is a table and its properties, like e.g. the column count. This means a paragraph now cannot interrupt a table. This is a change in behavior, but likely an acceptable one, as it actually brings the behavior closer to that of tables in cmark-gfm in this regard. Last but not least, the old approach seemed to prevent adoption of other useful features; for more about that, see the discussion in PR #92.
Finally merged. @niblo, thanks for all the work and also for your patience.

It was my pleasure. I would like to thank you too, for the very same reasons. Your support made it possible.
This does not feature link labels, only the link target. Comments appreciated.