Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

format doctests in docstrings #8811

Merged
merged 11 commits into from
Nov 27, 2023
Merged

format doctests in docstrings #8811

merged 11 commits into from
Nov 27, 2023

Conversation

BurntSushi
Copy link
Member

@BurntSushi BurntSushi commented Nov 21, 2023

Summary

This PR adds opt-in support for formatting doctests in docstrings. This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future. But I believe this PR lays the groundwork, and future additions for Markdown and reST should be less costly to add.

It's strongly recommended to review this PR commit-by-commit. The last few commits in particular implement the bulk of the work here and represent the denser portions.

Some things worth mentioning:

  • The formatter is itself not perfect, and it is possible for it to produce invalid Python code. Because of this, reformatted code snippets are checked for Python validity. If they aren't valid, then we (unfortunately silently) bail on formatting that code snippet.
  • There are a couple places where it would be nice to at least warn the user that doctest formatting failed, but it wasn't clear to me what the best way to do that is.
  • I haven't yet run this in anger on a real world code base. I think that should happen before merging.

Closes #7146

Test Plan

  • Pass the local test suite.
  • Scrutinize ecosystem changes.
  • Run this formatter on extant code and scrutinize the results. (e.g., CPython, numpy.)

Copy link
Contributor

github-actions bot commented Nov 21, 2023

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks great, and I appreciate the way you broke it down by commit. I ended up reading the commits one-by-one and found the flow very clear.

@BurntSushi
Copy link
Member Author

I ran the docstring code formatter on CPython (after formatting the code without docstring code formatting enabled) and put the results in one commit: BurntSushi/cpython@19d1c85

I did a quick pass through them and everything looks like it passes the eyeball test. My next step is to make sure the doctests still run and pass.

@BurntSushi
Copy link
Member Author

My next step is to make sure the doctests still run and pass.

I checked out the main branch of CPython and built it:

$ export CFLAGS="${CFLAGS/-O2/-O3} -ffat-lto-objects"
$ ./configure --prefix=/usr \
              --enable-shared \
              --with-computed-gotos \
              --enable-optimizations \
              --with-lto \
              --enable-ipv6 \
              --with-system-expat \
              --with-dbmliborder=gdbm:ndbm \
              --with-system-libmpdec \
              --enable-loadable-sqlite-extensions \
              --without-ensurepip \
              --with-tzpath=/usr/share/zoneinfo
$ LC_CTYPE=en_US.UTF-8 make -j8 EXTRA_CFLAGS="$CFLAGS"

Then confirmed that I could run some doctests:

$ ./python -m doctest Lib/ipaddress.py -v

And some of that failed, but the summarize_address_range doctest passed.

Then I checked out my branch with the reformatted code snippets and re-ran the above. The results remain unchanged.

So at least in that one specific case that looked potentially odd, reformatting didn't break the doctest. (Specifically, the doctest directive in this example is still picked up.)

I haven't been able to figure out how to run all of the doctests yet though.

@@ -12,6 +16,10 @@ use ruff_python_ast::{self as ast, Expr, Stmt};
/// between `class C: ...` and `class C(): ...`, which is part of our AST but not `CPython`'s.
/// - Normalize strings. The formatter can re-indent docstrings, so we need to compare string
/// contents ignoring whitespace. (Black does the same.)
/// - The formatter can also reformat code snippets when they're Python code, which can of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the AST normalizer to remove anything that looks like a
code snippet.

Could we instead remove docstrings from the equality check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could, but I think that would potentially reduce the effectiveness of the test. Today, we still test that the substance of the docstring remains the same (modulo whitespace and code snippets). But if we just removed docstrings entirely, we wouldn't be checking anything.

I don't feel too strongly personally.

crates/ruff_workspace/src/options.rs Outdated Show resolved Hide resolved
crates/ruff_workspace/src/settings.rs Outdated Show resolved Hide resolved
crates/ruff_python_formatter/src/expression/string.rs Outdated Show resolved Hide resolved
This changes the AST normalizer to remove anything that looks like a
code snippet. The idea here is that, at least in tests, we want to check
that reformatting Python code does *not* change its AST. Well, it is
supposed to change its AST in some ways, and this normalizer tries to
erase those changes.

It turns out that reformatting code examples in docstrings will change
the AST in even more profound ways. Namely, it can arbitrarily rewrite
docstring contents. Because of this, it doesn't really seem feasible to
normalize the strings in any way other than to remove anything that
looks like a code snippet.
This adds a new internal-only knob to enable formatting of code
snippets inside of docstrings.
And also update any extant tests that this new option impacts.
This adds some additional information to test failures that
occur when formatting is not idempotent. Specifically, we want
to see what formatting options were used.
This augments the Python's formatter context to include
state about a docstring's quote style. Namely, the quote
style refers to the kind of quotes used for a docstring
that contains a code example that is currently being
formatted. This state will allow us to choose the correct
quote style to use while reformatting code snippets and
will avoid writing invalid Python (in most cases).
This splits the docstring function for printing individual lines into a
bulkier abstraction, but doesn't otherwise change anything. The idea
here is to give the line printing a little more space to breathe for
supporting code snippet reformatting. Namely, we will want line printing
to have some kind of state for aggregate code snippets to reformat
before printing.
This adds a few types for describing some data that helps
facilitate formatting code snippets in docstrings.

Basically, we have lines, code example lines, and different types
of code examples.

A passing familiarity with these types will help grok subsequent
commits.
This commit adds a small but central component to code snippet
formatting in docstrings: it specifically implements the state
transitions needed to recognize and collect code snippets from
doctests. This means looking for PS1 and PS2 prompts and extracting
the code portion of each line.

This also introduces a "code example add action" which we will
use in a subsequent commit to control the higher level docstring
line printer.
This connects all of the pieces together from the previous commit
and makes the docstring line printer reformat doctest code snippets.

This also includes a new (and possibly first?) recursive call to the
formatter, so extra scrutiny there is most appreciated.
It looks like no previous snapshot tests have configured the line ending
to anything other than the default, so this wasn't included in the test
output. But we would actually like to try and test that line endings are
correctly preserved when reformatting code snippets, so we make the
option visible in the snapshot.
This test ensures that CRLF line endings are set correctly within a
reformatted code snippet.
@BurntSushi
Copy link
Member Author

I've split the commit that adds a user facing config option out into #8854 so that we should be able to merge this PR without any user facing changes.

@BurntSushi
Copy link
Member Author

BurntSushi commented Nov 27, 2023

@BurntSushi BurntSushi merged commit d9845a2 into main Nov 27, 2023
17 checks passed
@BurntSushi BurntSushi deleted the ag/fmt/docstrings branch November 27, 2023 16:14
BurntSushi added a commit that referenced this pull request Nov 27, 2023
BurntSushi added a commit that referenced this pull request Nov 27, 2023
This turns `string` into a parent module with a `docstring` sub-module.
I arranged things this way because there are parts of the `string`
module that the `docstring` module wants to know about (such as a
`NormalizedString`). The alternative I think would be to make
`docstring` a sibling module and expose more of `string`'s internals.

I think I overall like this change because it gives docstring handling a
bit more room to breath. It has grown quite a bit with the addition of
code snippet formatting.

[This was suggested by
@charliermarsh.](#8811 (comment))
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing the excellent documentation and organizing your commits so thoughtfully. It helped a lot when reviewing the PR (also to find savepoints when being interrupted)

Really impressive work. Well done. I've a few smaller comments. The most important one from a correctness point of view is the handling of newlines inside of multiline strings.

I think I'll learn many new English words from your commit messages 😆. So far:

  • rejigger
  • grok

crates/ruff_python_formatter/src/expression/string.rs Outdated Show resolved Hide resolved
Comment on lines +1141 to +1142
let indent = indent.to_string();
let code = code.to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for converting these to String rather than storing borrowed &str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to my other answer, at least with respect to code, I think when I was writing the types it wasn't obvious to me that the code portion of a line would necessarily be a proper substring of it. But I can't think of any contrary case.

// non-doctest line.
//
// [1]: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L733
if ps1_indent != ps2_indent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addmit I haven't read the code in full, but would it be sufficient to check if the indents have the same column/character length as done in the Lexer or is it necessary that the strings match exactly?

See Indentation in our lexer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python doctest module actually requires that the indent be made purely of ASCII space characters. Then it somewhat circuitously takes the length of the indentation in characters, reconstitutes the indentation as a string and then checks that all PS2 prompt lines have the same indentation prefix: https://github.com/python/cpython/blob/0ff6368519ed7542ad8b443de01108690102420a/Lib/doctest.py#L727-L733

Technically we are a little more relaxed here because we allow any kind of whitespace in the indentation. But I preserved the "indentation should be byte-for-byte equivalent" check.

So bottom line is that it's hard for me to say whether this is necessary or not. Probably not. But I do think it is pretty simple? Are you concerned about the costs of carrying the indentation string around? I think for code snippet formatting it is probably a lot less of a concern than when parsing arbitrary Python. The size scales are different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mainly trying to understand the semantics to ensure we are as strict as Python when it comes to handling indentation. Thanks for the explanation.

.iter()
.map(|line| &*line.code)
.collect::<Vec<&str>>()
.join("\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about line endings inside of multiline strings? Ideally, we would preserve these. Although we may deliberately decide not to and encourage users to instead use explicit line endings. But changing the line endings inside of strings is theoretically a semantic change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Charlie had a similar comment above. The thinking here is that while I'm using \n to assemble a code snippet, that snippet is then fed into the formatter and the line endings used correspond to the user's setting. There is a test for example that this writes out CRLF line endings in code snippets when the user has configured CRLF line endings.

Did I understand you concern correctly here? I feel like I might be missing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are places in the docstring handling (that existed before my changes) that seem to assume line feed terminators? For example:

// We know that the normalized string has \n line endings.
self.offset += line.line.text_len() + "\n".text_len();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the problem is not new due to your changes. It only became evident that this might now be a problem when thinking about code snippets in docstrings (it shouldn't be a problem for non-code text).

a = """A multiline string
with windows line endings
"""

b = c
Screenshot 2023-11-29 at 07 23 49

My concern is (was) that changing the newlines inside of the multiline string could be a semantic change. I tried to read through the Python documentation and only found:

In triple-quoted literals, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the literal. (A “quote” is the character used to open the literal, i.e. either ' or ".) source

What I understand from the spec is that newlines are retained. Although running the above in a print on my mac normalizes the newlines to \n.

I then tested what the about for

a = b"""A multiline string
with windows line endings
"""

from binascii import hexlify

print(hexlify(a))

and it seems python normalizes the newlines even for binary strings to \n. So the retain only means that the newlines are preserved but not in the form they're present in the source code.

So I guess this is not a problem after all (and ruff and Black both normalize newlines inside multiline strings)

// a docstring. As we fix corner cases over time, we can perhaps
// remove this check. See the `doctest_invalid_skipped` tests in
// `docstring_code_examples.py` for when this check is relevant.
let wrapped = match self.quote_style {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we short-circuit if the source string doesn't contain any triple quotes (or escaped triple quotes)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you mentioned this in the tracking issue for this. I think you're correct. And if so, yes, I believe we could short circuit. I'm not sure though if we want to keep this check as-is here for now as a conservative posture.

Copy link
Member

@MichaReiser MichaReiser Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. My only concern is performance because our parser isn't very fast today (I believe it's about 30% or more of the overall formatting time). An alternative could be to only lex the code and see if there are any lexer errors. But I don't know if that's sufficient to detect the kind of errors that could be introduced.

But agree, I don't think it's a high priority. Just trying to brainstorm a few ideas with the hope that there might be a low hanging perf improvement.

BurntSushi added a commit that referenced this pull request Nov 28, 2023
If a test failure occurs, this at least makes it a little clearer that a
code snippet has been removed.

Ref: #8811 (comment)
BurntSushi added a commit that referenced this pull request Nov 28, 2023
BurntSushi added a commit that referenced this pull request Nov 28, 2023
This PR contains a few small clean-ups that are responses to
@MichaReiser's review of my #8811 PR.
BurntSushi added a commit that referenced this pull request Dec 1, 2023
If a test failure occurs, this at least makes it a little clearer that a
code snippet has been removed.

Ref: #8811 (comment)
BurntSushi added a commit that referenced this pull request Dec 1, 2023
BurntSushi added a commit that referenced this pull request Dec 1, 2023
If a test failure occurs, this at least makes it a little clearer that a
code snippet has been removed.

Ref: #8811 (comment)
BurntSushi added a commit that referenced this pull request Dec 1, 2023
BurntSushi added a commit that referenced this pull request Dec 13, 2023
This PR does the plumbing to make a new formatting option,
`docstring-code-format`, available in the configuration for end users.
It is disabled by default (opt-in). It is opt-in at least initially to
reflect a conservative posture. The intent is to make it opt-out at some
point in the future.

This was split out from #8811 in order to make #8811 easier to merge.
Namely, once this is merged, docstring code snippet formatting will
become available to end users. (See comments below for how we arrived at
the name.)

Closes #7146

## Test Plan

Other than the standard test suite, I ran the formatter over the CPython
and polars projects to ensure both that the result looked sensible and
that tests still passed. At time of writing, one issue that currently
appears is that reformatting code snippets trips the long line lint:
https://github.com/BurntSushi/polars/actions/runs/7006619426/job/19058868021
@JacobCoffee
Copy link
Contributor

JacobCoffee commented Dec 13, 2023

Can #3792, #8237 be closed then?
Oh i see,

This reflects initial support and it is intended to add support for Markdown and reStructuredText Python code blocks in the future.

🔜 ™️

@BurntSushi
Copy link
Member Author

@JacobCoffee Yeah I think those issues are about applying ruff to Python code in contexts other than Python source files. Like .md or .rst files. And I think it also applies to ruff generally and not just ruff format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using ruff to format code examples in docstrings
6 participants