format doctests in docstrings #8811

BurntSushi · 2023-11-21T20:14:21Z

Summary

This PR adds opt-in support for formatting doctests in docstrings. This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future. But I believe this PR lays the groundwork, and future additions for Markdown and reST should be less costly to add.

It's strongly recommended to review this PR commit-by-commit. The last few commits in particular implement the bulk of the work here and represent the denser portions.

Some things worth mentioning:

The formatter is itself not perfect, and it is possible for it to produce invalid Python code. Because of this, reformatted code snippets are checked for Python validity. If they aren't valid, then we (unfortunately silently) bail on formatting that code snippet.
There are a couple places where it would be nice to at least warn the user that doctest formatting failed, but it wasn't clear to me what the best way to do that is.
I haven't yet run this in anger on a real world code base. I think that should happen before merging.
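
To illustrate the "check validity, otherwise bail" behavior described in the first point, here is a minimal Python sketch. It is not the PR's Rust implementation: `format_snippet_or_bail`, `toy_formatter` and `broken_formatter` are hypothetical names, and the standard library's `ast.parse` stands in for the formatter's own parser.

```python
import ast
from typing import Callable


def format_snippet_or_bail(code: str, reformat: Callable[[str], str]) -> str:
    """Return the reformatted snippet, or the original if the result is invalid."""
    candidate = reformat(code)
    try:
        # Validity check: the reformatted snippet must still parse as Python.
        ast.parse(candidate)
    except SyntaxError:
        # Silent bail: keep the snippet exactly as the user wrote it.
        return code
    return candidate


def toy_formatter(code: str) -> str:
    # Hypothetical stand-in formatter: collapses runs of whitespace.
    return " ".join(code.split())


def broken_formatter(code: str) -> str:
    # Hypothetical buggy formatter: emits an unclosed parenthesis.
    return code + " ("


print(format_snippet_or_bail("x   =  1", toy_formatter))
print(format_snippet_or_bail("x   =  1", broken_formatter))
```

A warning could be emitted in the `except` branch, which is where the second point above would apply.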

Closes #7146

Test Plan

Pass the local test suite.
Scrutinize ecosystem changes.
Run this formatter on extant code and scrutinize the results. (e.g., CPython, numpy.)

github-actions · 2023-11-21T20:30:56Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

charliermarsh

Overall, this looks great, and I appreciate the way you broke it down by commit. I ended up reading the commits one-by-one and found the flow very clear.

BurntSushi · 2023-11-22T18:06:50Z

I ran the docstring code formatter on CPython (after formatting the code without docstring code formatting enabled) and put the results in one commit: BurntSushi/cpython@19d1c85

I did a quick pass through them and everything looks like it passes the eyeball test. My next step is to make sure the doctests still run and pass.

BurntSushi · 2023-11-22T19:00:50Z

My next step is to make sure the doctests still run and pass.

I checked out the main branch of CPython and built it:

$ export CFLAGS="${CFLAGS/-O2/-O3} -ffat-lto-objects"
$ ./configure --prefix=/usr \
              --enable-shared \
              --with-computed-gotos \
              --enable-optimizations \
              --with-lto \
              --enable-ipv6 \
              --with-system-expat \
              --with-dbmliborder=gdbm:ndbm \
              --with-system-libmpdec \
              --enable-loadable-sqlite-extensions \
              --without-ensurepip \
              --with-tzpath=/usr/share/zoneinfo
$ LC_CTYPE=en_US.UTF-8 make -j8 EXTRA_CFLAGS="$CFLAGS"

Then confirmed that I could run some doctests:

$ ./python -m doctest Lib/ipaddress.py -v

And some of that failed, but the summarize_address_range doctest passed.

Then I checked out my branch with the reformatted code snippets and re-ran the above. The results remain unchanged.

So at least in that one specific case that looked potentially odd, reformatting didn't break the doctest. (Specifically, the doctest directive in this example is still picked up.)

I haven't been able to figure out how to run all of the doctests yet though.
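
Short of running every doctest, a cheaper check is to confirm that reformatting leaves each example's expected output and doctest directives untouched. The sketch below does that with the standard `doctest.DocTestParser`; the helper names and the `BEFORE`/`AFTER` docstrings are made up for illustration and this was not part of the testing described above.

```python
import doctest
from typing import Dict, List, Tuple


def example_contracts(docstring: str) -> List[Tuple[str, Dict[int, bool]]]:
    """Return (expected output, directive options) for each doctest example."""
    parser = doctest.DocTestParser()
    return [(ex.want, ex.options) for ex in parser.get_examples(docstring)]


def reformat_preserved_contracts(before: str, after: str) -> bool:
    # Only the source of each example may differ between the two docstrings.
    return example_contracts(before) == example_contracts(after)


BEFORE = """
>>> add( 1,2 )  # doctest: +ELLIPSIS
3
"""

AFTER = """
>>> add(1, 2)  # doctest: +ELLIPSIS
3
"""

DIRECTIVE_LOST = """
>>> add(1, 2)
3
"""

print(reformat_preserved_contracts(BEFORE, AFTER))
print(reformat_preserved_contracts(BEFORE, DIRECTIVE_LOST))
```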

konstin · 2023-11-23T11:23:41Z

crates/ruff_python_formatter/tests/normalizer.rs

@@ -12,6 +16,10 @@ use ruff_python_ast::{self as ast, Expr, Stmt};
 ///   between `class C: ...` and `class C(): ...`, which is part of our AST but not `CPython`'s.
 /// - Normalize strings. The formatter can re-indent docstrings, so we need to compare string
 ///   contents ignoring whitespace. (Black does the same.)
+/// - The formatter can also reformat code snippets when they're Python code, which can of


This changes the AST normalizer to remove anything that looks like a
code snippet.

Could we instead remove docstrings from the equality check?

I think we could, but I think that would potentially reduce the effectiveness of the test. Today, we still test that the substance of the docstring remains the same (modulo whitespace and code snippets). But if we just removed docstrings entirely, we wouldn't be checking anything.

I don't feel too strongly personally.

This changes the AST normalizer to remove anything that looks like a code snippet. The idea here is that, at least in tests, we want to check that reformatting Python code does *not* change its AST. Well, it is supposed to change its AST in some ways, and this normalizer tries to erase those changes. It turns out that reformatting code examples in docstrings will change the AST in even more profound ways. Namely, it can arbitrarily rewrite docstring contents. Because of this, it doesn't really seem feasible to normalize the strings in any way other than to remove anything that looks like a code snippet.
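
As a rough picture of what such a normalizer does, here is a Python sketch built on the standard `ast` module. It is an approximation, not the Rust normalizer from this PR: it treats any line starting with `>>>` or `...` as "looks like a code snippet", strips whitespace from the remaining lines, and applies this to every string constant rather than only docstrings.

```python
import ast


def strip_snippet_lines(text: str) -> str:
    """Drop lines that look like doctest code and trim whitespace on the rest."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">>>") or stripped.startswith("..."):
            continue
        kept.append(stripped)
    return "\n".join(kept)


class Normalizer(ast.NodeTransformer):
    def visit_Constant(self, node: ast.Constant) -> ast.Constant:
        # Assumption for this sketch: normalize all string constants.
        if isinstance(node.value, str):
            node.value = strip_snippet_lines(node.value)
        return node


def same_ast_modulo_snippets(before: str, after: str) -> bool:
    lhs = Normalizer().visit(ast.parse(before))
    rhs = Normalizer().visit(ast.parse(after))
    # ast.dump omits line/column attributes by default.
    return ast.dump(lhs) == ast.dump(rhs)


BEFORE = 'def f():\n    """Return one.\n\n    >>> f( )\n    1\n    """\n    return 1\n'
AFTER = 'def f():\n    """Return one.\n\n    >>> f()\n    1\n    """\n    return 1\n'
CHANGED = 'def f():\n    """Return one.\n\n    >>> f()\n    1\n    """\n    return 2\n'

print(same_ast_modulo_snippets(BEFORE, AFTER))
print(same_ast_modulo_snippets(BEFORE, CHANGED))
```

Note that the expected-output line (`1`) is still compared, which matches the point made in the review thread: the substance of the docstring outside the code is still checked.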

This adds a new internal-only knob to enable formatting of code snippets inside of docstrings.

And also update any extant tests that this new option impacts.

This adds some additional information to test failures that occur when formatting is not idempotent. Specifically, we want to see what formatting options were used.
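
The kind of check this refers to can be sketched as follows. The `Options` fields and the formatter callables are hypothetical; the point is only that the failure message carries the options that were in effect.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Options:
    # Hypothetical option names, for illustration only.
    line_width: int = 88
    docstring_code: bool = False


def check_idempotent(
    source: str, options: Options, fmt: Callable[[str, Options], str]
) -> None:
    """Raise if formatting the formatted output changes it again."""
    once = fmt(source, options)
    twice = fmt(once, options)
    if once != twice:
        raise AssertionError(
            f"formatting is not idempotent with {options!r}\n"
            f"--- first pass ---\n{once}\n--- second pass ---\n{twice}"
        )


def stable(source: str, options: Options) -> str:
    return source.strip() + "\n"


def unstable(source: str, options: Options) -> str:
    # Keeps growing on every pass, so it can never be idempotent.
    return source + " "


check_idempotent("x = 1", Options(), stable)
```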

This augments the Python formatter's context to include state about a docstring's quote style. Namely, the quote style refers to the kind of quotes used for a docstring that contains a code example that is currently being formatted. This state will allow us to choose the correct quote style to use while reformatting code snippets and will avoid writing invalid Python (in most cases).

This splits the docstring function for printing individual lines into a bulkier abstraction, but doesn't otherwise change anything. The idea here is to give the line printing a little more space to breathe for supporting code snippet reformatting. Namely, we will want line printing to have some kind of state for aggregating code snippets to reformat before printing.

This adds a few types for describing some data that helps facilitate formatting code snippets in docstrings. Basically, we have lines, code example lines, and different types of code examples. A passing familiarity with these types will help grok subsequent commits.

This commit adds a small but central component to code snippet formatting in docstrings: it specifically implements the state transitions needed to recognize and collect code snippets from doctests. This means looking for PS1 and PS2 prompts and extracting the code portion of each line. This also introduces a "code example add action" which we will use in a subsequent commit to control the higher level docstring line printer.
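
The state transitions can be pictured with a small Python sketch. It is a simplified model rather than the Rust code in this commit: a PS1 line (`>>>`) starts a snippet, PS2 lines (`...`) with the same indentation continue it, and anything else ends it. All function names are made up for the example.

```python
from typing import List, Optional, Tuple

PS1 = ">>>"
PS2 = "..."


def split_indent(line: str) -> Tuple[str, str]:
    rest = line.lstrip()
    return line[: len(line) - len(rest)], rest


def code_after(prompt: str, rest: str) -> Optional[str]:
    """Return the code following `prompt`, or None if `rest` lacks the prompt."""
    if rest == prompt:
        return ""
    if rest.startswith(prompt + " "):
        return rest[len(prompt) + 1 :]
    return None


def collect_doctests(docstring: str) -> List[str]:
    snippets: List[str] = []
    current: List[str] = []
    indent: Optional[str] = None
    for line in docstring.splitlines():
        line_indent, rest = split_indent(line)
        # Inside a snippet: a PS2 line with identical indentation continues it.
        if indent is not None and line_indent == indent:
            code = code_after(PS2, rest)
            if code is not None:
                current.append(code)
                continue
        # Otherwise the current snippet (if any) is complete.
        if current:
            snippets.append("\n".join(current))
        current, indent = [], None
        # A PS1 line starts a new snippet.
        code = code_after(PS1, rest)
        if code is not None:
            current, indent = [code], line_indent
    if current:
        snippets.append("\n".join(current))
    return snippets


DOC = "Example:\n    >>> def f(x):\n    ...     return x\n    >>> f(1)\n    1\n"
print(collect_doctests(DOC))
```

Each collected snippet is what would then be handed to the formatter, with the prompts re-attached when the lines are printed.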

This connects all of the pieces together from the previous commit and makes the docstring line printer reformat doctest code snippets. This also includes a new (and possibly first?) recursive call to the formatter, so extra scrutiny there is most appreciated.

It looks like no previous snapshot tests have configured the line ending to anything other than the default, so this wasn't included in the test output. But we would actually like to try and test that line endings are correctly preserved when reformatting code snippets, so we make the option visible in the snapshot.

This test ensures that CRLF line endings are set correctly within a reformatted code snippet.

BurntSushi · 2023-11-27T15:39:05Z

I've split the commit that adds a user facing config option out into #8854 so that we should be able to merge this PR without any user facing changes.

BurntSushi · 2023-11-27T16:14:36Z

All righty, I've filed issues for problems that probably aren't blocking this PR being merged:

I'm going to go ahead and merge this PR. I'd be happy to address any other feedback folks might have!

Suggested here: #8811 (comment)

@charliermarsh

This turns `string` into a parent module with a `docstring` sub-module. I arranged things this way because there are parts of the `string` module that the `docstring` module wants to know about (such as a `NormalizedString`). The alternative I think would be to make `docstring` a sibling module and expose more of `string`'s internals. I think I overall like this change because it gives docstring handling a bit more room to breathe. It has grown quite a bit with the addition of code snippet formatting. [This was suggested by @charliermarsh.](#8811 (comment))

MichaReiser

Thanks for writing the excellent documentation and organizing your commits so thoughtfully. It helped a lot when reviewing the PR (also to find savepoints when being interrupted)

Really impressive work. Well done. I've a few smaller comments. The most important one from a correctness point of view is the handling of newlines inside of multiline strings.

I think I'll learn many new English words from your commit messages 😆. So far:

rejigger
grok

crates/ruff_python_formatter/tests/normalizer.rs

crates/ruff_python_formatter/src/expression/string.rs

MichaReiser · 2023-11-28T00:25:32Z

crates/ruff_python_formatter/src/expression/string.rs

+            let indent = indent.to_string();
+            let code = code.to_string();


What's the reason for converting these to String rather than storing borrowed &str?

In addition to my other answer, at least with respect to code, I think when I was writing the types it wasn't obvious to me that the code portion of a line would necessarily be a proper substring of it. But I can't think of any contrary case.

MichaReiser · 2023-11-28T00:28:57Z

crates/ruff_python_formatter/src/expression/string.rs

+    // non-doctest line.
+    //
+    // [1]: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L733
+    if ps1_indent != ps2_indent {


I admit I haven't read the code in full, but would it be sufficient to check if the indents have the same column/character length as done in the Lexer or is it necessary that the strings match exactly?

See Indentation in our lexer.

The Python doctest module actually requires that the indent be made purely of ASCII space characters. Then it somewhat circuitously takes the length of the indentation in characters, reconstitutes the indentation as a string and then checks that all PS2 prompt lines have the same indentation prefix: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L727-L733

Technically we are a little more relaxed here because we allow any kind of whitespace in the indentation. But I preserved the "indentation should be byte-for-byte equivalent" check.

So bottom line is that it's hard for me to say whether this is necessary or not. Probably not. But I do think it is pretty simple? Are you concerned about the costs of carrying the indentation string around? I think for code snippet formatting it is probably a lot less of a concern than when parsing arbitrary Python. The size scales are different.

I was mainly trying to understand the semantics to ensure we are as strict as Python when it comes to handling indentation. Thanks for the explanation.
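
The difference between the two comparisons discussed in this thread shows up with mixed whitespace. The Python sketch below uses hypothetical helper names and is only meant to make the distinction concrete: a tab and a single space have the same character length but are not byte-for-byte equal, and only the space would satisfy an "ASCII spaces only" rule like the one described for the `doctest` module.

```python
def leading_whitespace(line: str) -> str:
    return line[: len(line) - len(line.lstrip())]


def indent_matches_exactly(ps1_line: str, ps2_line: str) -> bool:
    # The check kept in this PR, as described above: identical indentation.
    return leading_whitespace(ps1_line) == leading_whitespace(ps2_line)


def indent_matches_by_length(ps1_line: str, ps2_line: str) -> bool:
    # The looser alternative raised in review: same number of characters.
    return len(leading_whitespace(ps1_line)) == len(leading_whitespace(ps2_line))


def indent_is_ascii_spaces(line: str) -> bool:
    # The stricter rule attributed to the doctest module in this thread.
    return set(leading_whitespace(line)) <= {" "}


ps1 = "\t>>> x = ("
ps2 = " ... 1)"
print(indent_matches_by_length(ps1, ps2), indent_matches_exactly(ps1, ps2))
```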

MichaReiser · 2023-11-28T00:36:43Z

crates/ruff_python_formatter/src/expression/string.rs

+            .iter()
+            .map(|line| &*line.code)
+            .collect::<Vec<&str>>()
+            .join("\n");


How about line endings inside of multiline strings? Ideally, we would preserve these. Although we may deliberately decide not to and encourage users to instead use explicit line endings. But changing the line endings inside of strings is theoretically a semantic change.

I think Charlie had a similar comment above. The thinking here is that while I'm using \n to assemble a code snippet, that snippet is then fed into the formatter and the line endings used correspond to the user's setting. There is a test for example that this writes out CRLF line endings in code snippets when the user has configured CRLF line endings.

Did I understand your concern correctly here? I feel like I might be missing it.
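
The flow described here (assemble with `\n`, format, then emit with the configured line ending) can be sketched in a few lines of Python. The function names and the identity "formatter" are placeholders, not the actual implementation.

```python
from typing import List


def assemble_snippet(code_lines: List[str]) -> str:
    # Snippet lines are always joined with "\n" before formatting.
    return "\n".join(code_lines)


def format_code(code: str) -> str:
    # Placeholder for the recursive formatter call; leaves the code unchanged.
    return code


def emit(formatted: str, line_ending: str) -> str:
    # The configured line ending is applied only when writing the result out.
    return line_ending.join(formatted.split("\n"))


snippet = assemble_snippet(["x = 1", "y = 2"])
print(repr(emit(format_code(snippet), "\r\n")))
```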

Also, there are places in the docstring handling (that existed before my changes) that seem to assume line feed terminators? For example:

ruff/crates/ruff_python_formatter/src/expression/string/docstring.rs

Lines 237 to 238 in 2ade84a

// We know that the normalized string has \n line endings.

self.offset += line.line.text_len() + "\n".text_len();

Yeah, I think the problem is not new due to your changes. It only became evident that this might now be a problem when thinking about code snippets in docstrings (it shouldn't be a problem for non-code text).

a = """A multiline string
with windows line endings
"""
b = c

My concern is (was) that changing the newlines inside of the multiline string could be a semantic change. I tried to read through the Python documentation and only found:

In triple-quoted literals, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal. (A “quote” is the character used to open the literal, i.e. either ' or ".) source

What I understand from the spec is that newlines are retained. Although running the above in a print on my mac normalizes the newlines to \n.

I then tested what happens for

a = b"""A multiline string
with windows line endings
"""
from binascii import hexlify
print(hexlify(a))

and it seems Python normalizes the newlines even for binary strings to \n. So "retained" only means that the newlines are preserved, but not in the form they're present in the source code.

So I guess this is not a problem after all (and ruff and Black both normalize newlines inside multiline strings)
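
The experiment above can be reproduced without creating files by compiling source text that contains CRLF line endings inside a triple-quoted literal. This is a sketch using only built-ins; `literal_value` is a made-up helper, and it assumes `compile()` applies the same newline translation to string input as it does to source files.

```python
def literal_value(source: str):
    """Execute `source` and return the value bound to the name `a`."""
    namespace = {}
    exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace["a"]


CRLF_STR = 'a = """first\r\nsecond\r\n"""\n'
LF_STR = 'a = """first\nsecond\n"""\n'
CRLF_BYTES = 'a = b"""first\r\nsecond\r\n"""\n'

print(repr(literal_value(CRLF_STR)))
print(repr(literal_value(CRLF_BYTES)))
print(literal_value(CRLF_STR) == literal_value(LF_STR))
```

If the values compare equal, rewriting the line endings inside a multiline literal does not change the program's behavior, which is the conclusion reached above.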

MichaReiser · 2023-11-28T00:37:32Z

crates/ruff_python_formatter/src/expression/string.rs

+        // a docstring. As we fix corner cases over time, we can perhaps
+        // remove this check. See the `doctest_invalid_skipped` tests in
+        // `docstring_code_examples.py` for when this check is relevant.
+        let wrapped = match self.quote_style {


Could we short-circuit if the source string doesn't contain any triple quotes (or escaped triple quotes)?

Yeah I think you mentioned this in the tracking issue for this. I think you're correct. And if so, yes, I believe we could short circuit. I'm not sure though if we want to keep this check as-is here for now as a conservative posture.

Understood. My only concern is performance because our parser isn't very fast today (I believe it's about 30% or more of the overall formatting time). An alternative could be to only lex the code and see if there are any lexer errors. But I don't know if that's sufficient to detect the kind of errors that could be introduced.

But agree, I don't think it's a high priority. Just trying to brainstorm a few ideas with the hope that there might be a low hanging perf improvement.
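
A sketch of the short-circuit idea, in Python for brevity: wrap the reformatted code in a docstring that uses the enclosing quote style and parse it, but skip the parse when the code contains no triple quote of that style. Whether that fast path is sufficient is exactly the open question in this thread, so treat it as an assumption. `survives_wrapping` is a hypothetical name and `ast.parse` stands in for the formatter's parser.

```python
import ast


def survives_wrapping(code: str, quote: str = '"') -> bool:
    """Check that `code` can sit inside a docstring delimited by `quote * 3`."""
    triple = quote * 3
    if triple not in code:
        # Fast path (assumed sufficient here): nothing can end the docstring early.
        return True
    wrapped = f"{triple}\n{code}\n{triple}\n"
    try:
        tree = ast.parse(wrapped)
    except SyntaxError:
        return False
    # The wrapped text must still be exactly one string literal.
    return (
        len(tree.body) == 1
        and isinstance(tree.body[0], ast.Expr)
        and isinstance(tree.body[0].value, ast.Constant)
        and isinstance(tree.body[0].value.value, str)
    )


print(survives_wrapping('x = "a"'))
print(survives_wrapping('x = """a"""'))
print(survives_wrapping("x = '''a'''"))
```

The final structural check matters because some inputs parse successfully without being a single string, for example code consisting of `""" + """`.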

If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)

Ref: #8811 (comment)

@MichaReiser

This PR contains a few small clean-ups that are responses to @MichaReiser's review of my #8811 PR.

This PR does the plumbing to make a new formatting option, `docstring-code-format`, available in the configuration for end users. It is disabled by default (opt-in). It is opt-in at least initially to reflect a conservative posture. The intent is to make it opt-out at some point in the future.

This was split out from #8811 in order to make #8811 easier to merge. Namely, once this is merged, docstring code snippet formatting will become available to end users. (See comments below for how we arrived at the name.)

Closes #7146

Test Plan

Other than the standard test suite, I ran the formatter over the CPython and polars projects to ensure both that the result looked sensible and that tests still passed. At time of writing, one issue that currently appears is that reformatting code snippets trips the long line lint: https://github.com/BurntSushi/polars/actions/runs/7006619426/job/19058868021

JacobCoffee · 2023-12-13T20:37:35Z

~~Can #3792, #8237 be closed then?~~
Oh i see,

This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future.

🔜 ™️

BurntSushi · 2023-12-14T12:49:57Z

@JacobCoffee Yeah I think those issues are about applying ruff to Python code in contexts other than Python source files. Like .md or .rst files. And I think it also applies to ruff generally and not just ruff format.

BurntSushi requested review from charliermarsh and konstin November 21, 2023 20:14

charliermarsh reviewed Nov 21, 2023

BurntSushi force-pushed the ag/fmt/docstrings branch from cb98df2 to 9d466b9 Compare November 22, 2023 17:27

konstin approved these changes Nov 23, 2023

BurntSushi mentioned this pull request Nov 27, 2023

Using ruff to format code examples in docstrings #7146

Closed

BurntSushi requested a review from MichaReiser November 27, 2023 15:16

BurntSushi added 10 commits November 27, 2023 10:20

ruff_python_formatter: add new internal 'docstring-code' knob

4a6f113

This adds a new internal-only knob to enable formatting of code snippets inside of docstrings.

ruff_python_formatter: add docstring option to test failure output

172e4c4

And also update any extant tests that this new option impacts.

ruff_python_formatter: tweak test failure output

6d3e446

This adds some additional information to test failures that occur when formatting is not idempotent. Specifically, we want to see what formatting options were used.

ruff_python_formatter: add CRLF test

47e6e76

This test ensures that CRLF line endings are set correctly within a reformatted code snippet.

BurntSushi force-pushed the ag/fmt/docstrings branch from 9d466b9 to 47e6e76 Compare November 27, 2023 15:30

BurntSushi mentioned this pull request Nov 27, 2023

config: add new docstring-code-format knob #8854

Merged

This was referenced Nov 27, 2023

docstring code formatter: figure out how to handle line width #8855

Closed

docstring code formatter: emit warning messages when code snippets fail to format #8856

Open

docstring code formatter: remove "invalid Python" check #8857

Open

BurntSushi merged commit d9845a2 into main Nov 27, 2023
17 checks passed

BurntSushi deleted the ag/fmt/docstrings branch November 27, 2023 16:14

This was referenced Nov 27, 2023

docstring code formatter: add support for reStructuredText Python code snippets #8859

Closed

docstring code formatter: add support for Markdown Python code snippets #8860

Closed

BurntSushi added a commit that referenced this pull request Nov 27, 2023

ruff_python_formatter: move docstring handling to sub-module

a43afa7

Suggested here: #8811 (comment)

BurntSushi mentioned this pull request Nov 27, 2023

ruff_python_formatter: move docstring handling to a submodule #8861

Merged

MichaReiser reviewed Nov 28, 2023

BurntSushi added a commit that referenced this pull request Nov 28, 2023

ruff_python_formatter: make code snippet elision more explicit

0f604fa

If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Nov 28, 2023

ruff_python_formatter: improve docs on when a line is owned

50093e6

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Nov 28, 2023

ruff_python_formatter: add link to doctest regex internals

c7db543

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Nov 28, 2023

ruff_python_formatter: update stale comment

2ade84a

Ref: #8811 (comment)

BurntSushi mentioned this pull request Nov 28, 2023

ruff_python_formatter: small cleanups in doctest formatting #8871

Merged

BurntSushi added a commit that referenced this pull request Nov 28, 2023

ruff_python_formatter: small cleanups in doctest formatting (#8871)

4957d94

This PR contains a few small clean-ups that are responses to @MichaReiser's review of my #8811 PR.

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: make code snippet elision more explicit

313a073

If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: improve docs on when a line is owned

cf96ed8

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: add link to doctest regex internals

61c73c5

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: update stale comment

65b7ba5

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: make code snippet elision more explicit

0cbfce6

If a test failure occurs, this at least makes it a little clearer that a code snippet has been removed. Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: improve docs on when a line is owned

0b3c38a

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: add link to doctest regex internals

6361b07

Ref: #8811 (comment)

BurntSushi added a commit that referenced this pull request Dec 1, 2023

ruff_python_formatter: update stale comment

8bed0d9

Ref: #8811 (comment)
