Apply `ruff` to markdown code blocks
#3792
https://github.com/pydantic/pytest-examples
I'm interested in trying to implement this. Looking at the projects listed above, it seems like code blocks are determined via regex. Do we want to copy that approach or use a more robust Markdown parser (pulldown-cmark seems like a popular one)?
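To make the extraction problem concrete, here is a minimal stdlib sketch (a hypothetical helper, not Ruff's code, and exactly the kind of line scanning a real Markdown parser such as pulldown-cmark would replace) that pulls fenced Python blocks and their starting line numbers out of a Markdown string:

```python
import re

# Toy scanner: a real implementation would use a Markdown parser
# to handle nested fences, indentation, and info strings correctly.
FENCE = re.compile(r"^```(\w+)?\s*$")

def extract_python_blocks(markdown: str) -> list[tuple[int, str]]:
    """Return (start_line, source) pairs for each fenced python block.

    `start_line` is the 1-based host-document line of the block's first
    code line, so diagnostics on the extracted source can be mapped back.
    """
    blocks: list[tuple[int, str]] = []
    in_block = False
    buf: list[str] = []
    start = 0
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not in_block and match and (match.group(1) or "") == "python":
            # Opening fence: the code itself starts on the next line.
            in_block, buf, start = True, [], lineno + 1
        elif in_block and line.strip() == "```":
            # Closing fence: emit the accumulated block.
            in_block = False
            blocks.append((start, "\n".join(buf)))
        elif in_block:
            buf.append(line)
    return blocks
```

Tracking the start line alongside each block's source is what later makes it possible to report diagnostics at their absolute position in the host document.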
I would vote to use a markdown parser for robustness, as you suggested.
Using a markdown parser makes sense from my view. It may also be worth considering tree-sitter because it supports many languages. The part that's unclear to me, in the short term, is how we want to solve the mapping of column and line numbers in the Jupyter notebook support. I assume a similar mapping will be necessary for Markdown files because we only pass the code block's source to Ruff, but we should show users the absolute line number from the start of the Markdown document.

**Long-term**

Long-term, we'll need to support the following two features.

**Multi-language support**

Ruff should support linting different file types. For example, Ruff should be able to lint SQL, Python, and Markdown files. The file type decides which specific linter Ruff uses to lint the file. This includes keeping the LSP in sync with the file types supported by the CLI, and which extensions map to which languages. Rome supports this today.

**Embedded Languages**

The idea is that a document written in one language can contain code written in another language. Examples are SQL queries embedded in Python strings, or Python code blocks in Markdown documents.
The markdown file handler would recognize code blocks and delegate to Ruff to decide how to parse, lint, and format the code block's content. Ruff's infrastructure would correctly map the line numbers between the "virtual" documents and the "physical" documents on disk.
Yup, agree with all of your points @MichaReiser. As part of the design phase I've been thinking about how to generalize to different filetypes. It's an interesting problem - I will take a look at Treesitter parsers as well. Going down that route, we could support different languages relatively easily. I am going to submit PRs for this incrementally so that we can iterate over the design. The first one will be a parser that pulls code blocks out of Markdown - I'm leaning towards using Treesitter since that architecture could make it easier to provide "plugins" for other languages in the future. As far as the line mappings, I was planning on making a generic LineMap struct (or something along those lines) which would contain line numbers mapped to their offset in the containing document. That struct could then be reused for embedded languages, like you mentioned, mapping relative line numbers in the chunk of interest to the absolute line number of the document. It should provide a flexible way to treat code blocks differently, regardless of whether it's an embedded language, a markdown code block, a Jupyter cell, etc. I'll have to look into Messages and how they work before finalizing the code though. I've been reading through the Jupyter implementation as part of this.
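As a sketch of the generic LineMap idea proposed above (the name echoes the proposal; it is not an actual Ruff type):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineMap:
    """Maps snippet-relative line numbers back to the containing document."""
    start_line: int  # 1-based line where the snippet begins in the host file

    def to_absolute(self, relative_line: int) -> int:
        # Line 1 of the snippet corresponds to `start_line` in the host file.
        return self.start_line + relative_line - 1
```

For example, a diagnostic on line 3 of a code block that begins at line 42 of the Markdown file would be reported at line 44.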
Mapping line numbers should suffice for markdown documents but wouldn't be sufficient for e.g. SQL in Python. I think it may actually be sufficient to simply add the byte offset of the markdown block to the range of every message/diagnostic.
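The byte-offset variant suggested here can be sketched the same way (a toy `Diagnostic` type for illustration, not Ruff's real one):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Diagnostic:
    message: str
    start: int  # byte offset, relative to the extracted block
    end: int

def into_document_coords(diag: Diagnostic, block_offset: int) -> Diagnostic:
    """Shift a block-relative range by the block's byte offset in the host file."""
    return replace(diag, start=diag.start + block_offset,
                   end=diag.end + block_offset)
```

The appeal of this approach is that the linter itself stays unchanged: only the reported ranges are shifted after the fact.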
Yup, that's basically what I'm doing - adding a byte offset. As far as the SQL in Python example, I was thinking the approach could work because of the offset mapping. E: sorry, not mapping line numbers - mapping byte offsets between SQL/Python lines. I still need to figure out a concrete design though, if you couldn't tell 😅
Going with a treesitter integration requires installing the treesitter grammar for the language in question, though we could gate this behind a workspace feature. @MichaReiser @JonathanPlasse do you guys have any thoughts?
It seems easier to handle only markdown first, and expand with the tree-sitter grammar in a second step.
Which use cases require a diagnostic or message to know its document offset? Would it be sufficient to mutate the
My expectation is that pulling in
Yeah, using treesitter is more complicated because it requires a custom build step. I'm not too opinionated on whether we should use treesitter or not. I think we can start prototyping with either and defer the decision until later.
Yep! I was working on it yesterday and that's also the route I've chosen to go down. Sorry for the confusion, I should've waited until I had a fully fleshed out design before weighing in.
It's not a runtime dependency, so it'll increase the crate's size, yes. I meant that it'd be less than the size delta from installing treesitter and associated grammar(s), which aren't totally necessary at this stage.
Cool!
Listing some of the issues I see after hacking around locally:
Pinging @MichaReiser @charliermarsh just in case you guys have time to weigh in. I think the bare minimum question we need to answer is whether we want to treat the code blocks as discrete units - I think we should, but that will require quite a bit of work. P.S. sorry for the poor issue hygiene with the multiple closed PRs. I've been hacking around and didn't realize they're all on here.
That's helpful knowledge that you gained with your prototype. Thanks for working on it. I want to throw in two more use cases:
The conclusion I'm coming to is that it's probably worth abstracting over the file types instead, where Ruff provides a file-agnostic API. The way this would work for SQL is that the Python linter calls the SQL linter if it finds a SQL expression when traversing the AST. This is inspired by Prettier's approach, where Prettier detects template literals that are tagged with `graphql`:

```js
let result = useQuery(graphql`
  query myQuery() {
  }
`);
```

Rome implements something very close to this design. Each language supports different capabilities (we may support linting SQL but not formatting). How these capabilities are implemented is transparent to the CLI. All the CLI cares about is that there's a common entry point.

Definition of the JavaScript file type:
File agnostic API:
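A toy sketch of that capability-based dispatch, with all names hypothetical (this is not Rome's or Ruff's actual design, just an illustration of per-language capabilities being opaque to the CLI):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Capabilities:
    """Per-language entry points; None means the capability is unsupported."""
    lint: Optional[Callable[[str], list[str]]] = None
    fmt: Optional[Callable[[str], str]] = None

# The CLI only consults this table; how each capability works is opaque to it.
HANDLERS: dict[str, Capabilities] = {
    ".py": Capabilities(lint=lambda src: [], fmt=lambda src: src.rstrip() + "\n"),
    ".sql": Capabilities(lint=lambda src: []),  # lint only, no formatter
}

def format_file(path: str, source: str) -> str:
    caps = HANDLERS.get(path[path.rfind("."):])
    if caps is None or caps.fmt is None:
        return source  # capability not supported: pass through unchanged
    return caps.fmt(source)
```

Here formatting a `.sql` file is a no-op while a `.py` file gets the (stubbed) formatter, without the caller knowing the difference.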
@evanrittenhouse I was definitely considering them as discrete units when I raised this. For example, here https://github.com/astro-informatics/sleplet/blob/c510795e7ebd91000b8afe53276e522b75cdfbb6/.github/workflows/examples.yml#L33 I use pytest-codeblocks to run some example code blocks (essentially acting as another type of test).
@MichaReiser I agree with the file-agnostic hot path that calls into adapters for different languages. I think the approach you propose sort of mimics LSPs, in that each language can have different capabilities, implemented differently. I'd love to work on this, but I think that there will be a lot of design decisions that you guys probably want to weigh in on. Is it even worth continuing work on this before we can have those talks?
That's a neat comparison!
I'm not sure. What I outlined above is only what I've considered doing; there isn't any alignment on the team. Let's maybe first finish the fix refactor. That would allow me to focus on one side project only.
Chiming in here to ask if the development of this feature can also take Quarto into account. Hopefully, adding support shouldn't be too complicated because, like (vanilla) Markdown, Quarto embeds Python in fenced code blocks:

````markdown
# This is a Markdown heading

This is regular text.

```{python}
#| echo: false
# This is a Python comment
print("Hello, Ruff!")
```
````

I don't imagine Ruff necessarily cares about specially formatted comments (though maybe it'll become relevant if you add formatting as a core functionality, #1904). Unlike Markdown, Quarto documents are equivalent to Jupyter notebooks, so you can (hopefully?) remix the work being done to add Jupyter and Markdown support.
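Since Quarto marks cells with `{python}` where vanilla Markdown uses `python`, a shared handler could normalize the fence info string before deciding which linter to invoke. A hypothetical helper sketching this:

```python
def fence_language(info: str) -> str:
    """Normalize a fence info string.

    '{python}' (Quarto) and 'python' (vanilla Markdown) both map to
    'python'; extra attributes after the language are ignored.
    """
    info = info.strip()
    if info.startswith("{") and info.endswith("}"):
        info = info[1:-1]  # unwrap Quarto-style braces
    return info.split()[0].lower() if info else ""
```

With this, the rest of the extraction pipeline never needs to know which flavour of Markdown the fence came from.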
FWIW from my prototypes the Markdown parser will involve large changes to the hot path, so it's a ways down the road. This is because each code block creates a "context" independent of every other code block (vs. a normal Python/Jupyter file where the context is carried throughout the file). The current backend is tied to the assumption that one file contains one context. If Quarto involves carrying the context throughout the file, it's probably closer to a Jupyter implementation than a (theoretical) Markdown one. May be worth raising a separate feature request, depending on the work, but sounds pretty cool! cc @dhruvmanila (who implemented Jupyter support and is on the core team)
For sanity, it might make sense to consider each markdown fragment its own "file" with an offset starting point. This capability might actually make it easier to build IDE/language server helpers, where one might want to reformat a function/class alone while editing a file (I'm coming from entangled/entangled.py#5).
tnx for the ping 👍

```toml
[tool.ruff.format]
docstring-code-format = true
```
@evanrittenhouse Regarding the "discrete units": for both Quarto notebooks (#6140) and Jupyter notebooks exported to markdown with Jupytext (#8800), all the code blocks are considered a single context, similar to the "normal Python/Jupyter file where the context is carried throughout the file" that you mentioned.
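The shared-context behaviour described here could be modelled by concatenating the blocks into one virtual document while keeping a per-line map back to the host file (a sketch with hypothetical names, not actual Ruff code):

```python
def build_virtual_module(blocks: list[tuple[int, str]]) -> tuple[str, list[int]]:
    """Join (start_line, source) blocks into one virtual source string.

    Returns the joined source plus, for each virtual line, the absolute
    host-document line it came from, so diagnostics can be mapped back.
    """
    lines: list[str] = []
    line_map: list[int] = []
    for start_line, source in blocks:
        for i, line in enumerate(source.splitlines()):
            lines.append(line)
            line_map.append(start_line + i)
    return "\n".join(lines), line_map
```

Linting the joined source lets names defined in one block be visible in later blocks, which is exactly the single-context behaviour Quarto and Jupytext documents need.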
I'd like to mention HTML as well, although that would require configuring a selector to find the node element responsible for containing Python code, because there are no established standards. Anyway, I don't think that should come before Markdown is completely figured out. I just wanted to weigh in on the aspect of supporting Python code blocks in other languages in a generic way.
I have created a tool which enables this use case: `doccmd`. For example, you might run:

```console
$ doccmd --language=python --no-pad-file --command="ruff format" README.md CHANGELOG.rst
$ doccmd --language=python --command="ruff check" README.md CHANGELOG.rst
```

It is new and I'm very open to feedback on it. By default, I'm also using it to run
It would be cool to have the functionality to run `ruff` over `python` code blocks like https://github.com/adamchainz/blacken-docs does for `black`.