Skip to content
This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Consider rewriting the markdown checker in rust #5405

Open
jodh-intel opened this issue Jan 24, 2023 · 6 comments
Open

Consider rewriting the markdown checker in rust #5405

jodh-intel opened this issue Jan 24, 2023 · 6 comments
Assignees
Labels
feature New functionality lang/rust Rust code outreach Outreach / internship related size/large Task of significant size

Comments

@jodh-intel
Copy link
Contributor

The check-markdown tool is used by the static-checks.sh script called by the CI.

Although the current golang implementation of check-markdown is ok, it isn't perfect as the blackfriday package it uses seems fragile (see https://github.com/kata-containers/tests/blob/main/cmd/check-markdown/hack.go).

Since we're moving everything else to rust, it might be worth considering rewriting check-markdown in rust too.

If anyone is interested, please add a comment here before starting work on it! 😄

@jodh-intel jodh-intel added feature New functionality needs-decision Requires input from the Architecture Committee lang/rust Rust code labels Jan 24, 2023
@jodh-intel jodh-intel added outreach Outreach / internship related and removed needs-decision Requires input from the Architecture Committee labels Feb 6, 2023
@Christopher-C-Robinson
Copy link

Christopher-C-Robinson commented Mar 1, 2023

@jodh-intel, I have some interest in this one.

@jodh-intel jodh-intel added the size/large Task of significant size label Mar 3, 2023
@Toreseen Toreseen assigned Toreseen and unassigned Toreseen Mar 10, 2023
@jodh-intel
Copy link
Contributor Author

Thanks @Christopher-C-Robinson! The first step is going to be to identify a good markdown parsing crate to use. I quick look suggests these are two popular ones:

If you look at the golang markdown checker, it uses the blackfriday golang package to handle the parsing of the doc into an AST. Markdown is a pretty permissive language so the parsers tend to "always work" and generate a tree of nodes. The rust tool (like the go version) will need to walk through the tree and perform checks on the different nodes and groups of nodes to determine whether we believe the markdown file is "valid". This regex shows most of the error scenarios the golang tool checks for:

$ git clone https://github.com/kata-containers/tests
$ cd tests/cmd/check-markdown
$ grep -Er 'errors.new|fmt.Errorf'                                                                                                          main [0]
utils.go:               return "", "", fmt.Errorf("invalid link %s: expected %d fields, found %d", linkName, expectedFields, foundFields)
utils.go:               return "", fmt.Errorf("need heading name")
utils.go:               return fmt.Errorf("expected %v node, found %v", expectedType, node.Type)
add.go:                 return "", fmt.Errorf("document root %q does not exist", docRoot)
heading.go:             return Heading{}, fmt.Errorf("heading name cannot be blank")
heading.go:             return Heading{}, fmt.Errorf("heading markdown name cannot be blank")
heading.go:             return Heading{}, fmt.Errorf("level needs to be atleast 1")
main.go:                return fmt.Errorf("no handler for format %q", format)
doc.go: return fmt.Errorf("file=%q: %s", d.Name, s)
display.go:             return fmt.Errorf("unknown show option: %v", what)
parse.go:               return fmt.Errorf("found %d parse error%s:\n%s",

A lot of what the checker does is checking markdown links between documents as we want to guarantee that all markdown links are valid. That's harder than it sounds as checking a single markdown file may link to every other markdown file in the repo (and then back to itself potentially!)

Separate from markdown links are URLs in markdown docs. Validating URLs isn't the focus of the current golang tool as we actually handle that in the CI here using:

  • The xurls command to extract URLs from markdown files.
  • the check_url() function in the statick-checks.sh script (that calls the curl command to actually connect to the URL).

However, it would be good if eventually the rust markdown checker could handle checking URLs itself (extracting the URLs from the markdown, verifying that they are syntactically valid addresses, and then optionally connecting to the URL to determine if those URLs are valid/alive still. But that doesn't need to be done for this issue ;)

One last comment: it would be a good idea to be familiar with the GitHub Flavoured Markdown spec for this piece of work:

@jodh-intel
Copy link
Contributor Author

@Christopher-C-Robinson - Here's how I'd break this task down:

We use clap everywhere so looking at some of the other tools should help too:

Also, I believe @gabevenberg is using it for kata-containers/kata-containers#5350.

As shown, there are quite a few steps so you might want to consider pairing up with someone else to work on this, or limiting the scope of what you plan to implement. We can more about this in the meeting on Friday or on Slack of course ;)

@Toreseen Toreseen self-assigned this Mar 27, 2023
@Toreseen
Copy link
Contributor

Toreseen commented Mar 28, 2023

@jodh-intel I have joined @Christopher-C-Robinson and we are teaming up to attempt to tackle this issue.

@jodh-intel
Copy link
Contributor Author

@Toreseen - great! ;)

@jodh-intel
Copy link
Contributor Author

As discussed with @Christopher-C-Robinson and @Toreseen today, once you've got a basic program that can iterate through all the markdown node types, you'll need to ensure it can access "heading" links and markdown links (not URLs).

You might need to save these nodes to a hash or a list (a Vec in rust). Once that's possible, you can write the first check which will validate the all markdown links that link to other parts of the current document (not URLs) are valid.
To do this, you'll need a function to map markdown heading text (such as # I am a heading) to the equivalent link name (#i-am-a-heading). I'd recommend looking at the existing golang markdown checker to see more of what is needed to validate headings. Once that's working, you can expand the check to consider links to other docs.

Checking markdown link in current file

## Section 1

foo

## Section 2

- This is a [valid link](#section-1).
- This is an [invalid link](#invalid-section).

Checking markdown link in another file

Note: This example shows a relative link to another document so the checker needs to ensure:

  1. That the file ../docs/design/README.md exists.
  2. That the heading "section-wibble" exists in that file.
This is a link to a [different file](../docs/design/README.md#section-wibble).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature New functionality lang/rust Rust code outreach Outreach / internship related size/large Task of significant size
Projects
None yet
Development

No branches or pull requests

3 participants