Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown reader: support GitHub wiki's internal links (#2923) #6458

Merged
merged 4 commits into from
Jan 17, 2021

Conversation

blackheaven
Copy link
Contributor

@blackheaven blackheaven commented Jun 14, 2020

closes #2923

Changes overview:

  • Add a Ext_markdown_github_wikilink extension
    • Added in the githubMarkdownExtensions list
    • Added in the getAllExtensions list
  • Add the parser githubWikiLink in Text.Pandoc.Readers.Markdown
    • Added it high in the inline parsers list
  • Add two tests, one for each form

Notes:

  • It has been done as part as ZuriHac 2020
  • It is my first contribution
  • I am available to pair on Discord if needed

Thanks by advance for your feedback.

src/Text/Pandoc/Extensions.hs Outdated Show resolved Hide resolved
@blackheaven
Copy link
Contributor Author

I have tried to take care of comments:

  • Change the extension name
  • Added documentation
  • Imported the MediaWiki's way to read wikilinks (drop 'Category' part and limit support to italics and strongs)
  • Added tests

Sadly, my tests are failing and I do not see what I am doing wrong, if someone can have a look and see something obvious, I would be grateful.

@blackheaven
Copy link
Contributor Author

Any blocking points?

src/Text/Pandoc/Extensions.hs Outdated Show resolved Hide resolved
, test markdownGH "bad link with title" $
"[[title|random string]]" =?>
para (link "random string" "wikilink" (text "title"))
]
, testGroup "Headers"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does GitHub (or the markdown parser they use in wikis, which I think may be kramdown) have a suite of tests for this syntax? If so, it may be worth finding it and seeing if any relevant tests can be added.

Comment on lines 1747 to 1754
if T.any (== '|') contents
then do
st <- getState
setState $ st{ stateAllowLinks = False }
label <- mconcat <$> manyTill inline (char '|')
setState $ st{ stateAllowLinks = True }
url <- manyTillChar anyChar $ try $ string "]]"
return $ B.link url title <$> label
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems potentially unreliable. Example:

[[hello `|` there]]

IN this case contents will be hello `|` there, but when we do manyTill inline (char '|') we will keep parsing past the ]], because the | character occurs within an inline (a code span).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky, and I'm not sure how it should behave, because we don't know whether to treat the content as plain text or inlines until we know whether there's a | character, but that | character might be part of regular inline content. I'm really not sure what to do here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I have actually borrowed it from the VimWiki parser, we may have the same issue.

@jgm
Copy link
Owner

jgm commented Jun 29, 2020

Worth figuring out whether the GitHub implementation of this allows inline formatting in the description part. If it's treated as a plain string, maybe we could do the same, and it would solve the problem noted above. Finding their test suite for this would be nice!

@blackheaven
Copy link
Contributor Author

If there is a test suite, it is well hidden.

I have looked for hours, tried babelmark2, no other implementation supports it, no documentation... I begin to doubt of it.

I have tried to make it recursive but it causes issues with autolinks.

@blackheaven
Copy link
Contributor Author

Great news! wikilinks are actually way simpler than we expected: https://github.com/blackheaven/pandoc/wiki

@blackheaven
Copy link
Contributor Author

it should be good now

githubWikiLink = try $ do
guardEnabled Ext_wikilinks
string "[["
inlinedContents <- option (return $ B.fromList [])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that B.fromList [] is just mempty which would be clearer and faster.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to get both the inline contents and the raw contents, you can use the withRaw combinator (T.P.Parsing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I do not know if I did correctly, but it broke the tests

let allowedInline x = case x of
B.Str _ -> True
B.Space -> True
B.Link {} -> True
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all markdown links allowed? Or just bare URIs?
If just bare URIs, then maybe a better approach would be to use a parser that just targets what we need.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub's documentation says: [[Link Text|nameofwikipage]]

So there are two questions. 1) What can go in "Link Text"? Your experiments appear to reveal that only plain text is allowed, no special formatting. 2) What can go in "nameofwikipage"? Your experiments show that a fully qualified URL is allowed, but clearly that's not all -- names of wiki pages are also allowed. I imagine that just about any plain text could be the name of a wiki page, so presumably we should allow e.g. [[Link Text|This is my page]]. For this we should emit a link with URL This is my page but also a special attribute indicating that it's a wikilink, so that users can do transformations on the path.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OK, looks like that's what you're doing, from the tests!)

Copy link
Contributor

@hftf hftf Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not pertinent to GitHub-flavored Markdown, but it may be informative to note that wikilinks have been growing in popularity due to the explosion of note-taking and (personal) knowledge management tools, in addition to static-site generators and digital gardens, which are often based on Markdown or Markdown-ish syntaxes. Some (nonstandard) implementations allow link text to contain inline formatting, and some even allow page titles (and thus link targets) to contain both formatting and nested links.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, seems to me we should try to simplify the parser. Instead of parsing inlines and raw, and then trying to piece things together, why not something like

wikilink = try $ do
  string "[["
  firstPart <- wikiText
  (char '|' *> complexWikilink firstPart) <|> (string "]]" *> return (wikilink with firstPart as link text and (stringify firstPart) as url)

complexWikilink firstPart = do
  url <- stringify <$> wikiText
  string "]]"
  return $ wikilink with firstPart as link text and url as URL

wikiText = many ( whitespace <|> bareURL <|> str <|> endline <|> escapedChar )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried you suggestion, but it seems that I miss the point

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay, let me merge your commit, and then I'll see if there's a way to simplify it.

@jgm
Copy link
Owner

jgm commented Jan 5, 2021

Sorry I left this sit for so long! I've made some comments in the code.

Changes overview:

 * Add a `Ext_markdown_github_wikilink` extension
   * Added in the `githubMarkdownExtensions` list
   * Added in the `getAllExtensions` list
 * Add the parser `githubWikiLink` in `Text.Pandoc.Readers.Markdown`
   * Added it high in the `inline` parsers list
 * Add two tests, one for each form
@blackheaven
Copy link
Contributor Author

Sorry I left this sit for so long! I've made some comments in the code.

No worries, my sparse time has drastically decreased in the mean time.

@jgm jgm merged commit 6efd346 into jgm:master Jan 17, 2021
@jgm
Copy link
Owner

jgm commented Jan 17, 2021

Thanks for this contribution!

jgm added a commit that referenced this pull request Jan 17, 2021
…#6458)"

This reverts commit 6efd346.

Since this extension is designed to be used with
GitHub markdown (gfm), we need to implement the parser
as a commonmark extension (commonmark-extensions),
rather than in pandoc's markdown reader.  When that is
done, we can add it here.
@jgm
Copy link
Owner

jgm commented Jan 17, 2021

Sorry! After merging this I realized something that has caused me to revert the commit.

Since this is designed to be used with gfm, we should implement the parser as a commonmark extension (commonmark-extensions), rather than here. The code in pandoc's markdown reader is NOT used when parsing gfm. I presume that most users who will want this feature will want to use it with gfm, so I think we should wait.

I'm sorry I didn't think of this earlier, before you put the time into this PR.

Writing a commonmark-hs extension is an entirely different thing, and I need to think more about how it would be done with this syntax.

@jgm
Copy link
Owner

jgm commented Jan 17, 2021

The work you've done in constructing the tests and in the extensions module can be re-used once we have the new commonmark-hs module in place.

@jgm
Copy link
Owner

jgm commented Jan 17, 2021

I've started to add a wikilinks extension in commonmark-hs.

@blackheaven
Copy link
Contributor Author

Sure, no problem

@marcelotto
Copy link

This is a really exciting extension. I was already about to submit a feature request when I came across this PR. I just want to undermine that this extension isn't only useful for Github Wiki pages. Using Wikilinks in Markdown is getting fairly popular recently and presumably even more in the future with the recent raise of local-first, Markdown-based knowledge management tools like Obsidian, Logseq, Dendron to name just a few.

@jgm
Copy link
Owner

jgm commented Feb 13, 2021

I've run into a problem implementing this.
There are two incompatible syntaxes (different orders) out there: see
jgm/commonmark-hs#69
Not quite sure what to do about it.

@hftf
Copy link
Contributor

hftf commented Feb 13, 2021

There are now many incompatible implementations and interpretations of "wikilinks" in existence, which is partly due to the recent growth of note-taking tools that I and others have mentioned above.

I foresee that many users of pandoc are going to want to operate on documents containing wikilinks of all kinds. This is a challenge.

Perhaps an encompassing solution would be "DIY." Instead of favoring one implementation out of many for a single extension called wikilinks, Pandoc could implement various small pieces of functionality commonly found in wikilink implementations, such as a pair of mutually exclusive options to handle either [[target|link text]] or [[link text|target]] (wikilink-piped-link-text-on-right and -left), or to allow or disallow nested inline formatting. Particularly well-known implementations could be aliased to combinations of options (e.g. wikilinks-gfm is an alias of wikilinks-option1+wikilinks-option3 etc.).

@marcelotto
Copy link

marcelotto commented Feb 14, 2021

Sorry @hftf, didn't notice your similar remarks. I guess you're right, it's a challenging problem which ideally would be solved in the most flexible way you've outlined, but I can image that this can lead to a maintainability nightmare. From what I can see two alternative extensions should be sufficient if the extensions are limited on capturing the pure links: One for classical Wikilinks ([[link]] and [[link|label]]) and one for Github-flavoured Wikilinks ([[link]] and [[label|link]]). Regarding the additional system-specific features on top of that (inline formatting etc.), either people with portability in mind are hesitant on their usage in the first place or have with these extensions at least the option to capture the links with a filter and do the additional interpretation on their own.

One remark regarding the understandably frustrated comment

This is how it's done in GitHub wikis. I don't know why everyone has to reinvent the wheel in incompatible ways!

by @jgm in the commonmark-hs issue. It should be noted that GitHub were the ones breaking with the established way. The other tools are using the classical Wikilink-style introduced by UseModWiki and used in almost all Wiki systems [1] and in all the new systems from the new generation of knowledge management tools (as far as I can see).

[1]: https://en.wikipedia.org/wiki/History_of_wikis#CamelCase_and_internal_links

@jgm
Copy link
Owner

jgm commented Feb 15, 2021

Fair point: GitHub was never great about looking at prior art in making markdown extensions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Markdown reader: add extension for GitHub wiki's internal link syntax [[page name]]
5 participants