Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docstring code formatter: remove "invalid Python" check #8857

Open
BurntSushi opened this issue Nov 27, 2023 · 4 comments
Open

docstring code formatter: remove "invalid Python" check #8857

BurntSushi opened this issue Nov 27, 2023 · 4 comments
Labels
docstring Related to docstring linting or formatting formatter Related to the formatter performance Potential performance improvement wish Not on the current roadmap; maybe in the future

Comments

@BurntSushi
Copy link
Member

#8811 added a docstring code snippet formatter. As part of the initial implementation, it is actually possible for the reformatter to transform valid Python to invalid Python, usually as a result of corner cases related to triple quoting. Since these are odd cases, for expediency, the initial implementation checks if the reformatted code is valid. If it isn't, then it bails out of reformatting and skips the code snippets entirely.

Ideally, we would be able to have more confidence in our code snippet reformatter to the point that we could remove this check for invalid Python code. Doing this will likely require some refactoring for how nested triple quotes are handled.

Here's a good example from the tests that doesn't work today:

# Test that a doctest containing a triple quoted string at least
# does not result in invalid Python code. Ideally this would format
# correctly, but at time of writing it does not.
def doctest_invalid_skipped_with_triple_double_in_single_quote_string():
"""
Do cool stuff.
>>> x = '\"\"\"'
"""
pass

Namely, the code snippet there ought to be formatted, but today, it is skipped because the reformatting currently generates invalid Python code.

@BurntSushi BurntSushi added performance Potential performance improvement formatter Related to the formatter wish Not on the current roadmap; maybe in the future labels Nov 27, 2023
@MichaReiser
Copy link
Member

Do we have a benchmark that tests the performance of docstring formatting? Adding one could help to understand the performance characteristics better and validate any improvements to it.

Do we always perform the re-parse today or do we use a heuristic when the re-parse is necessary (only when the source text contains triple quoted strings?). A text search should be multiple times faster than a full re-parse.

@MichaReiser MichaReiser added the docstring Related to docstring linting or formatting label Nov 27, 2023
@BurntSushi
Copy link
Member Author

There aren't any benchmarks for code snippet formatting yet. Using a heuristic to avoid a re-parse seems plausible, but only if we buy that the cases where invalid Python can be produced are limited to triple quoting. (I think I buy that, otherwise I don't see how it could break out of the enclosing docstring.)

For an ad hoc benchmark, if I run the formatter with and without docstring-code enabled, then runtime is about the same. I ran it a few times and couldn't notice a difference. (Not that this is a good substitute for a real benchmark, but perhaps suggestive that docstring code formatting doesn't have a huge impact on perf.)

@MichaReiser
Copy link
Member

MichaReiser commented Nov 28, 2023

For an ad hoc benchmark, if I run the formatter with and without docstring-code enabled, then runtime is about the same. I ran it a few times and couldn't notice a difference. (Not that this is a good substitute for a real benchmark, but perhaps suggestive that docstring code formatting doesn't have a huge impact on perf.)

I assume you did that for a project that has docstrings?

If you happen to have a test file with code snippets, then you could add it to this benchmark:

fn create_test_cases() -> Result<Vec<TestCase>, TestFileDownloadError> {
Ok(vec![
TestCase::fast(TestFile::try_download("numpy/globals.py", "https://raw.githubusercontent.com/numpy/numpy/89d64415e349ca75a25250f22b874aa16e5c0973/numpy/_globals.py")?),
TestCase::fast(TestFile::try_download("unicode/pypinyin.py", "https://raw.githubusercontent.com/mozillazg/python-pinyin/9521e47d96e3583a5477f5e43a2e82d513f27a3f/pypinyin/standard.py")?),
TestCase::normal(TestFile::try_download(
"pydantic/types.py",
"https://raw.githubusercontent.com/pydantic/pydantic/83b3c49e99ceb4599d9286a3d793cea44ac36d4b/pydantic/types.py",
)?),
TestCase::normal(TestFile::try_download("numpy/ctypeslib.py", "https://raw.githubusercontent.com/numpy/numpy/e42c9503a14d66adfd41356ef5640c6975c45218/numpy/ctypeslib.py")?),
TestCase::slow(TestFile::try_download(
"large/dataset.py",
"https://raw.githubusercontent.com/DHI/mikeio/b7d26418f4db2909b0aa965253dbe83194d7bb5b/tests/test_dataset.py",
)?),
])
}

And change the options to always enable the docstring code option here

let options = PyFormatOptions::from_extension(Path::new(case.name()));

The benchmark will then run automatically as part of codspeed.

@BurntSushi
Copy link
Member Author

I assume you did that for a project that has docstrings?

Yeah. I ran it on CPython and polars.

I added a new issue tracking adding benchmarks specifically targeted to code snippet reformatting: #8909

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docstring Related to docstring linting or formatting formatter Related to the formatter performance Potential performance improvement wish Not on the current roadmap; maybe in the future
Projects
None yet
Development

No branches or pull requests

2 participants