Apply `ruff` to `markdown` code blocks #3792

paddyroddy · 2023-03-29T09:26:30Z

It would be cool to have the functionality to run ruff over python codeblocks like this does for black https://github.com/adamchainz/blacken-docs

The text was updated successfully, but these errors were encountered:

JonathanPlasse · 2023-03-29T11:15:21Z

https://github.com/pydantic/pytest-examples
Does something similar.

evanrittenhouse · 2023-04-26T02:51:09Z

I'm interested in trying to implement this. Looking at the projects listed above, it seems like code blocks are determined via Regex. Do we want to copy that approach or use a more robust Markdown parser (pulldown-cmark seems like a popular one)?

JonathanPlasse · 2023-04-27T16:56:42Z

I would vote to use a markdown parser for robustness as you suggested.

MichaReiser · 2023-04-29T15:23:15Z

Using a markdown parser makes sense from my view. It may be worth to also consider tree-sitter because it supports many languages.

The part that's unclear to me how we want to solve it in the short term is the mapping of column and line numbers in the Messageemitters. Ruff supports Jupyter-notebooks today, but the line-mapping is somewhat "hacked-in" to make it work, and I'm a bit reluctant to add more exceptions to the line-number mapping.

https://github.com/charliermarsh/ruff/blob/483f4799995d1aea7987e6f7ee403f0411060023/crates/ruff/src/message/text.rs#L70-L87

I assume a similar mapping will be necessary for Markdown files because we only pass the code block's source to Ruff, but we should show users the absolute line number from the start of the Markdown document.

Long-term

Long-term, we'll need to support the following two features.

Multi-language support

Ruff should support linting different file types. For example, Ruff should be able to lint SQL, python, and Markdown files. The filetype decides which specific linter Ruff uses to lint the file. This includes that the LSP is in sync with the file-types supported by the CLI and what extensions map to which languages. Rome supports this today.

Embedded Languages

The idea is that a document written in one language can contain code written in another language. Examples are:

Code blocks in markdown documents
SQL in Python code
Python code in Jupyter Notebooks

The markdown file handler would recognize code blocks and delegate to Ruff to decide how to parse, lint, and format the code block's content. Ruff's infrastructure would correctly map the line-numbers between the "virtual" documents and the "physical" documents on disk.

evanrittenhouse · 2023-04-29T15:46:00Z

Yup, agree with all of your points @MichaReiser. As part of the design phase I've been thinking about how to generalize to different filetypes. It's an interesting problem - I will take a look at Treesitter parsers as well. Going down that route, we could support different languages relatively easily. I am going to submit PRs for this incrementally so that we can iterate over the design. The first one will be a parser that pulls code blocks out of Markdown - I'm leaning towards using Treesitter since that architecture could make it easier to provide "plugins" for other languages in the future.

As far as the line mappings, I was planning on making a generic LineMap struct (or something along those lines) which would contain line numbers mapped to their offset in the containing document. That struct could then be reused for embedded languages, like you mentioned, mapping relative line numbers in the chunk of interest to the absolute line number of the document. It should provide a flexible way to treat code blocks differently, regardless of if it's an embedded language, a markdown code block, a Jupyter cell, etc.

I'll have to look into Messages and how they work before finalizing the code though. I've been reading through the Jupyter implementation as part of this.

MichaReiser · 2023-04-29T17:25:18Z

As far as the line mappings, I was planning on making a generic LineMap struct (or something along those lines) which would contain line numbers mapped to their offset in the containing document.

Mapping line numbers should suffice for markdown documents but wouldn't be sufficient for e.g. SQL in python. I think it may actually be sufficient to simply add the byte offset of the markdown block to the range of every message/diagnostic.

evanrittenhouse · 2023-04-29T17:30:32Z

Yup, that's basically what I'm doing - adding a document_offset field to Messages and Diagnostics.

As far as the SQL in Python example, I was thinking the approach could work because the document_offset would be where the SQL code starts relative to the Python file. Line 1 of the SQL code could be on line 50 of the overarching .py file, so the document_offset for all diagnostics/messages in that block would be 50. Do you think there are flaws in that approach?

E: sorry, not mapping line numbers. Mapping byte offsets between SQL/Python lines. I still need to figure out a concrete design though, if you couldn't tell 😅

evanrittenhouse · 2023-04-29T22:29:32Z

Going with a treesitter integration requires installing the treesitter grammar for the language in question. While we could gate this behind a workspace feature flag, I'd like some other folk's thoughts on it. I don't want to needlessly increase the binary size. I need to figure out how it all works to give exact an estimate of the increase, though.

Advantage of pulldown-cmark as mentioned above is it's quite fast and won't affect the binary size. Disadvantage is that it obviously can't be used for anything except Markdown.

@MichaReiser @JonathanPlasse do you guys have any thoughts?

JonathanPlasse · 2023-04-29T23:30:46Z

It seems easier to first handle only markdown and in a second time expand with the tree-sitter grammar.

MichaReiser · 2023-05-01T09:25:05Z

Yup, that's basically what I'm doing - adding a document_offset field to Messages and Diagnostics.

Which use cases require a diagnostic or message to know its document offset? Would it be sufficient to mutate the diagnostic.range or message.range directly by adding the offset? I'm asking because increasing the size of Message and Diagnostic decrease performance because:

More data that must be written to and read from the cache
Overall more data to write when creating/copying diagnostics and read when generating the output (even for diagnostics not using embedded languages)

Advantage of pulldown-cmark as mentioned above is it's quite fast and won't affect the binary size.

My expectation is that pulling in pulldown-cmark increases the binary size because we include a new crate in our binary (that is 116 kB in size), or is pulldown-cmark already a (runtime) dependency ?

Going with a treesitter integration requires installing the treesitter grammar for the language in question.

Yeah, using treesitter is more complicated because it requires a custom build step. I'm not too opinionated on if we should use treesitter or not. I think we can start prototyping with either and defer the decision to later.

evanrittenhouse · 2023-05-01T12:58:50Z

Would it be sufficient to mutate the diagnostic.range or message.range directly by adding the offset?

Yep! I was working on it yesterday and that's also the route I've chosen to go down. Sorry for the confusion, I should've waiting until I had a fully fleshed out design before weighing in.

My expectation is that pulling in pulldown-cmark increases the binary size because we include a new crate in our binary (that is 116 kB in size), or is pulldown-cmark already a (runtime) dependency ?

It's not a runtime dependency, so it'll increase the crate because of its size, yes. I meant that it'd be less than the size delta from installing treesitter and associated grammar(s) which aren't totally necessary at this stage.

I think we can start prototyping with either and defer the decision to later.

Cool!

evanrittenhouse · 2023-06-29T23:50:17Z

Listing down some of the issues I see after hacking around locally:

Markdown, unlike Jupyter, is composed of discrete code blocks that don't necessarily affect each other. If we want to treat Markdown code blocks as discrete units, we'll have to rewrite most of the hot path since it's tightly coupled to a Path. As such, we have to rewrite lint_fix, lint_only, etc. to accept some sort of CodeBlock struct. Additionally, we have to figure out how to aggregate the diagnostics and display them/their fixes. If we don't want to treat them as discrete units, the code will be much simpler (almost akin to the Jupyter implementation).
The CodeBlock structure, at minimum, should contain the block's content and its offset from the top of the file (what I'm calling its "global offset"). Each diagnostic created while linting the block should have the code block's global offset added to it, meaning that when we display the line to users, they see the actual Markdown line number.
I didn't look into this, but we may need to expand UniversalNewLines to take Markdown line breaks as well, though again, I'm not sure how that works.
Unifying the Jupyter/Markdown parsers as Sources. I keep going back and forth on what the trait should entail, or if there should even be one.

Pinging @MichaReiser @charliermarsh just in case you guys have time to weigh in. I think the bare minimum question we need to answer is if we want to treat the code blocks as discrete units - I think we should, but that will require quite a bit of work.

P.S. sorry for the poor issue hygiene with the multiple closed PRs. I've been hacking around and didn't realize they're all on here.

MichaReiser · 2023-06-30T08:42:44Z

That's helpful knowledge that you gained with your prototype. Thanks for working on it.

I want to throw in two more use cases:

We could also lint the markdown, and not just the python code in markdown documents
We could lint SQL inside of Python files. This is similar to Markdown in that statements are independent from each other but it becomes necessary to call from the python linting into the SQL linting, inverting the direction

The conclusion I'm coming too is that it's probably worth abstracting over the file types instead where Ruff provides a lint, fix, format etc. functions per file type. I don't know what the exact signature would be. These file type specific functions can then pre-process the input and e.g. call into other linters. For example, the markdown linter can call out to the python linter if it finds a python code block and then post-process the diagnostics, to patch up the line numbers.

The way this would work for SQL is that the python linter calls the SQL linter if it finds a SQL expression when traversing the AST. This is inspired by Prettier's approach where Prettier detects template-literals that are tagged with graphql (this is probably not valid syntax, I don't remember all the details):

let result = useQuery(graphql`
	query myQuery() {
	}
`);

Rome implements something very close to this design. Each language supports different capabilities (we may support linting SQL but not formatting). How these capabilities are implemented is transparent to the CLI. All the CLI cares about is that there's a lint function to call and it knows how to lint the files content.

Definition of the JavaScript file type:

https://github.com/rome/tools/blob/38104b339da8ff688f469799e1fc3f4a46f3d2ec/crates/rome_service/src/file_handlers/javascript.rs#L86-L124

File agnostic API:

https://github.com/rome/tools/blob/38104b339da8ff688f469799e1fc3f4a46f3d2ec/crates/rome_service/src/workspace/server.rs#L446-L457

paddyroddy · 2023-06-30T09:58:16Z

@evanrittenhouse I was definitely considering them as discrete units when I raised this. For example here https://github.com/astro-informatics/sleplet/blob/c510795e7ebd91000b8afe53276e522b75cdfbb6/.github/workflows/examples.yml#L33 I use pytest-codeblocks to run some example code blocks (essentially acting as another type of test).

evanrittenhouse · 2023-06-30T12:27:54Z

@MichaReiser I agree with the file-agnostic hot path that calls into adapters for different languages. I think the approach you propose sort of mimics LSPs in that each language can have different capabilities, implemented differently. I'd love to work on this, but I think that there will be a lot of design decisions that you guys probably want to weigh in on. Is it even worth continuing work on this before we can have those talks?

MichaReiser · 2023-07-03T09:03:56Z

I think the approach you propose sort of mimics LSPs in that each language can have different capabilities, implemented differently.

That's a neat comparison!

Is it even worth continuing work on this before we can have those talks?

I'm not sure. What I outlined above is also only what I considered doing. There isn't any alignment on the team. Let's maybe first finish the fix refactor. That would allow me to focus on one side project only.

mjkanji · 2023-07-22T18:55:58Z

Chiming in here to ask if the development of this feature can also take Quarto (.qmd) into consideration. It's a great project aimed at providing a plaintext equivalent of Jupyter notebooks.

Hopefully, adding support shouldn't be too complicated because, like (vanilla) Markdown, .qmd files denote Python blocks explicitly, albeit with a slightly different syntax:

# This is a Markdown heading

This is regular text.

```{python}
#| echo: false

# This is a Python comment
print("Hello, Ruff!")
```

I don't imagine Ruff necessarily cares about specially formatted comments (though maybe it'll become relevant if you add formatting as a core functionality, #1904), but the #| echo: false at the top is a Quarto-specific comment that controls its behaviour. See here.

Unlike Markdown, Quarto documents are equivalent to Jupyter notebooks so you can (hopefully?) remix the work being done to add Jupyter and Markdown support.

evanrittenhouse · 2023-07-22T19:07:25Z

FWIW from my prototypes the Markdown parser will involve large changes to the hot path, so it's a ways down the road. This is because each code block creates a "context" independent of every other code block (vs. a normal Python/Jupyter file where the context is carried throughout the file). The current backend is tied to the assumption that one file contains one context.

If Quarto involves carrying the context throughout the file, it's probably closer to a Jupyter implementation than a (theoretical) Markdown one. May be worth raising a separate feature request, depending on the work, but sounds pretty cool!

cc @dhruvmanila (who implemented Jupyter support and is on the core team)

RonnyPfannschmidt · 2023-07-29T09:51:04Z

for sanity it might make absolute sense to consider each markdown fragment a "own file" with a offset starting point

this capability might actually make it easier to make IDE/language server helpers, where one might want to reformat a function/class alone while editing a file
(and the reformatting would basically work off a initial line/column/Position offset, plus the plain text of the lines intended for reformatting)

(im coming from entangled/entangled.py#5)

konstin · 2023-12-26T14:51:53Z

Formatting markdown code blocks is supported in v0.1.8: #9030

ddelange · 2023-12-26T18:22:34Z

tnx for the ping 👍

[tool.ruff.format]
docstring-code-format = true

paddyroddy · 2023-12-26T19:02:09Z

Formatting markdown code blocks is supported in v0.1.8: #9030

I believe that it is in the docstrings of functions as opposed to true code blocks in markdown that this issue was about

tvatter · 2024-01-26T09:07:45Z

@evanrittenhouse Regarding the "descrete units", for both Quarto notebooks (#6140) and Jupyter notebooks exported to markdown with Jupytext (#8800), all the code blocks are considered a single context, similar to the "normal Python/Jupyter file where the context is carried throughout the file" that you mentioned.

Avasam · 2024-08-18T05:23:08Z

Embedded Languages

The idea is that a document written in one language can contain code written in another language. Examples are:

Code blocks in markdown documents

SQL in Python code

Python code in Jupyter Notebooks

I'd like to mention HTML as well. Although that would require configuring a selector to find the node element responsible for containing Python code. That's because there is no established standards. The lang attribute exists, but it is somewhat out of spec to use it for non human languages (or more specifically, non-valid BCP 47 language tag). Most common solution is to use a custom class for styling anyway (<div class="language-python">), or a custom element entirely (<myapp-code-viewer language="python">).

Anyway I don't think that should come before Markdown is completely figured out. I just wanted to weigh in on the aspect of supporting Python code blocks in other languages in a generic way.

adamtheturtle · 2024-09-20T14:13:21Z

I have created a tool which enables this use case: doccmd.

For example, you might run:

$ doccmd --language=python --no-pad-file --command="ruff format" README.md CHANGELOG.rst
$ doccmd --language=python --command="ruff check" README.md CHANGELOG.rst

It is new and I'm very open to feedback on it.

By default, doccmd basically adds newlines to the beginning of the extracted code block to make line numbers in errors match the documentation file. This doesn't work for ruff format which expects to not have a bunch of newlines at the start, so there is the --no-pad-file option.

I'm also using it to run mypy, pyright, vulture, interrogate ... on my documentation.

charliermarsh added the wish Not on the current roadmap; maybe in the future label Mar 29, 2023

evanrittenhouse mentioned this issue May 11, 2023

Add support for linting Markdown code blocks #4375

Closed

evanrittenhouse mentioned this issue Jun 15, 2023

[Markdown Parser] Add CodeExtractor trait for extracting source code from a file #5123

Closed

evanrittenhouse mentioned this issue Jun 25, 2023

Run Ruff on Markdown code blocks #5356

Closed

mjkanji mentioned this issue Jul 28, 2023

Add support for Quarto notebooks #6140

Open

zanieb mentioned this issue Oct 24, 2023

Format Python code in documentation files astral-sh/ruff-pre-commit#55

Closed

StevenMaude mentioned this issue Oct 25, 2023

Ensure documentation examples are linted and formatted opensafely-core/ehrql#1346

Open

This was referenced Oct 26, 2023

Format python code in docstrings #8236

Closed

☂️ Formatter: Support formatting of embedded code #8237

Open

henryiii mentioned this issue Nov 17, 2023

Start using Scientific Python's repo-review astropy/astropy#15367

Merged

1 task

dhruvmanila mentioned this issue Nov 21, 2023

Support text Jupyter notebooks created with Jupytext #8800

Open

JacobCoffee mentioned this issue Dec 13, 2023

format doctests in docstrings #8811

Merged

3 tasks

Saransh-cpp mentioned this issue Dec 23, 2023

Migrate to ruff format pybamm-team/PyBaMM#3656

Merged

8 tasks

vdusek mentioned this issue Mar 1, 2024

Lint code examples apify/apify-docs#864

Open

paddyroddy mentioned this issue Sep 25, 2024

gh-170: add scientific python recommended blacken-docs glass-dev/glass#250

Merged

beroda mentioned this issue Dec 17, 2024

Format docstring with ruff pyvista/pyvista#6991

Merged

MichaReiser mentioned this issue Jan 21, 2025

Ruff support for Databricks notebook formatting - Ignore lines starting with ! or % #15650

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply `ruff` to `markdown` code blocks #3792

Apply `ruff` to `markdown` code blocks #3792

paddyroddy commented Mar 29, 2023

JonathanPlasse commented Mar 29, 2023

evanrittenhouse commented Apr 26, 2023 •

edited

Loading

JonathanPlasse commented Apr 27, 2023

MichaReiser commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

MichaReiser commented Apr 29, 2023

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

JonathanPlasse commented Apr 29, 2023 •

edited

Loading

MichaReiser commented May 1, 2023

evanrittenhouse commented May 1, 2023 •

edited

Loading

evanrittenhouse commented Jun 29, 2023 •

edited

Loading

MichaReiser commented Jun 30, 2023 •

edited

Loading

paddyroddy commented Jun 30, 2023

evanrittenhouse commented Jun 30, 2023

MichaReiser commented Jul 3, 2023

mjkanji commented Jul 22, 2023 •

edited

Loading

evanrittenhouse commented Jul 22, 2023 •

edited

Loading

RonnyPfannschmidt commented Jul 29, 2023

konstin commented Dec 26, 2023 •

edited

Loading

ddelange commented Dec 26, 2023

paddyroddy commented Dec 26, 2023

tvatter commented Jan 26, 2024 •

edited

Loading

Avasam commented Aug 18, 2024 •

edited

Loading

Embedded Languages

adamtheturtle commented Sep 20, 2024

Apply ruff to markdown code blocks #3792

Apply ruff to markdown code blocks #3792

Comments

paddyroddy commented Mar 29, 2023

JonathanPlasse commented Mar 29, 2023

evanrittenhouse commented Apr 26, 2023 • edited Loading

JonathanPlasse commented Apr 27, 2023

MichaReiser commented Apr 29, 2023 • edited Loading

Long-term

Multi-language support

Embedded Languages

evanrittenhouse commented Apr 29, 2023 • edited Loading

MichaReiser commented Apr 29, 2023

evanrittenhouse commented Apr 29, 2023 • edited Loading

evanrittenhouse commented Apr 29, 2023 • edited Loading

JonathanPlasse commented Apr 29, 2023 • edited Loading

MichaReiser commented May 1, 2023

evanrittenhouse commented May 1, 2023 • edited Loading

evanrittenhouse commented Jun 29, 2023 • edited Loading

MichaReiser commented Jun 30, 2023 • edited Loading

paddyroddy commented Jun 30, 2023

evanrittenhouse commented Jun 30, 2023

MichaReiser commented Jul 3, 2023

mjkanji commented Jul 22, 2023 • edited Loading

evanrittenhouse commented Jul 22, 2023 • edited Loading

RonnyPfannschmidt commented Jul 29, 2023

konstin commented Dec 26, 2023 • edited Loading

ddelange commented Dec 26, 2023

paddyroddy commented Dec 26, 2023

tvatter commented Jan 26, 2024 • edited Loading

Avasam commented Aug 18, 2024 • edited Loading

Embedded Languages

adamtheturtle commented Sep 20, 2024

Apply `ruff` to `markdown` code blocks #3792

Apply `ruff` to `markdown` code blocks #3792

evanrittenhouse commented Apr 26, 2023 •

edited

Loading

MichaReiser commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented Apr 29, 2023 •

edited

Loading

JonathanPlasse commented Apr 29, 2023 •

edited

Loading

evanrittenhouse commented May 1, 2023 •

edited

Loading

evanrittenhouse commented Jun 29, 2023 •

edited

Loading

MichaReiser commented Jun 30, 2023 •

edited

Loading

mjkanji commented Jul 22, 2023 •

edited

Loading

evanrittenhouse commented Jul 22, 2023 •

edited

Loading

konstin commented Dec 26, 2023 •

edited

Loading

tvatter commented Jan 26, 2024 •

edited

Loading

Avasam commented Aug 18, 2024 •

edited

Loading