
Test failures: Python 3.12-dev: pycode.parser.DefinitionFinder line numbering assertion failures #11436

Closed
jayaddison opened this issue May 22, 2023 · 6 comments


@jayaddison
Contributor

Describe the bug

With recent development builds of Python 3.12, a few parsing-related tests have begun failing. These tests are built upon Python's tokenize module and are implemented in the pycode.parser.DefinitionFinder class.

The failures appear to relate to the logic that determines the ending line number for definitions.

Example build logs:

cc @picnixz

How to Reproduce

Minimal repro steps pending; in the meantime, this appears reproducible for most pushes/pull requests that initiate GitHub Actions unit test workflows from this repository.

Environment Information

Python 3.12.0a7+

(sourced from: https://github.com/sphinx-doc/sphinx/actions/runs/4974642742/jobs/8901224179?pr=11423#step:5:10)

Sphinx extensions

N/A

Additional context

Some discussion from today (2023-05-22) in #11435.

@mgmacias95

A consequence of porting the Python-based implementation of the tokenizer to C is that DEDENT tokens are now limited to lines that exist in the file. For example, the following Python code:

if foo:
    if bar:
        pass

Produces the following output before the change:

$ python -m tokenize < test.py 
1,0-1,2:            NAME           'if'           
1,3-1,6:            NAME           'foo'          
1,6-1,7:            OP             ':'            
1,7-1,8:            NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,6:            NAME           'if'           
2,7-2,10:           NAME           'bar'          
2,10-2,11:          OP             ':'            
2,11-2,12:          NEWLINE        '\n'           
3,0-3,8:            INDENT         '        '     
3,8-3,12:           NAME           'pass'         
3,12-3,13:          NEWLINE        '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,0:            DEDENT         ''             
4,0-4,0:            ENDMARKER      ''             

And produces this output after the change:

$ python.exe -m tokenize < lel.py 
1,0-1,2:            NAME           'if'           
1,3-1,6:            NAME           'foo'          
1,6-1,7:            OP             ':'            
1,7-1,8:            NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,6:            NAME           'if'           
2,7-2,10:           NAME           'bar'          
2,10-2,11:          OP             ':'            
2,11-2,12:          NEWLINE        '\n'           
3,0-3,8:            INDENT         '        '     
3,8-3,12:           NAME           'pass'         
3,12-3,13:          NEWLINE        '\n'           
3,13-3,13:          DEDENT         ''             
3,13-3,13:          DEDENT         ''             
4,0-4,0:            ENDMARKER      ''       

Notice the difference in the last two DEDENT tokens: in the first output they are reported on line 4, while in the second they are reported on line 3. Note that the file only has 3 lines.
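The placement difference can also be checked programmatically with the tokenize module (a minimal sketch; which line numbers you see depends on the interpreter version):

```python
import io
import tokenize

source = "if foo:\n    if bar:\n        pass\n"

# Collect the end line of each DEDENT token; before the CPython change
# these point past the end of the 3-line file, afterwards they do not.
dedent_lines = [
    tok.end[0]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.DEDENT
]
print(dedent_lines)  # [4, 4] on older interpreters, [3, 3] on 3.12.0a7+
```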

Checking this code, it seems the -1 is no longer necessary:

end_pos = self.current.end[0] - 1
while emptyline_re.match(self.get_line(end_pos)):
    end_pos -= 1
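Judging from the two outputs above, the pre-change end-of-file DEDENTs end at column 0 while the post-change ones end at the final column of the last real line, so one hedged, version-agnostic adjustment would be to subtract 1 only in the column-0 case. This is only a sketch, not the actual Sphinx patch; `get_line` here is a hypothetical 1-based line accessor standing in for DefinitionFinder.get_line:

```python
import io
import re
import tokenize

def definition_end(dedent_tok, get_line, emptyline_re):
    """Find the last line of the block that a DEDENT token closes.

    Pre-3.12 (and mid-file) DEDENTs end at column 0 of the line *after*
    the block, so we step back one line; 3.12+ end-of-file DEDENTs
    already carry the end position of the last real line.
    """
    row, col = dedent_tok.end
    end_pos = row - 1 if col == 0 else row
    # Skip back over any trailing blank lines.
    while end_pos > 1 and emptyline_re.match(get_line(end_pos)):
        end_pos -= 1
    return end_pos

src = "if foo:\n    if bar:\n        pass\n"
lines = src.splitlines()
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
last_dedent = [t for t in tokens if t.type == tokenize.DEDENT][-1]
print(definition_end(last_dedent, lambda n: lines[n - 1], re.compile(r"^\s*$")))
# → 3 with both the old and the new tokenizer behaviour
```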

@jayaddison
Contributor Author

That is brilliant - thank you very much, @mgmacias95!

jayaddison added a commit to jayaddison/sphinx that referenced this issue May 22, 2023
…e references on py3.12+

This adjustment is no longer required since accurate line-end information is emitted as of Python 3.12.0a7

Refs: sphinx-doc#11436
@jayaddison
Contributor Author

I seem to be thrashing around at potential solutions here. I don't have a local install of 3.12.0a7, so I've been leaning on the continuous-integration results, but three attempts is the limit of my misunderstanding (and GitHub's CPU time) that I'm willing to spend. I should take a rest and re-read the behaviour changes, and/or let someone else step in with a better fix.

@jayaddison
Contributor Author

With the change from 95985e3 applied (essentially: a special case for py3.12+ when DEDENT tokens are found on the final line of the file), 8 of the 12 failing tests begin passing.

One of the remaining tests is test_LiteralIncludeReader_pyobject2, where code is tokenized and we attempt to filter the output to include only the Bar object definition.

Again from commit 95985e3, that test fails because some unexpected trailing code is included in the output (diff as found by pytest shown below):

    class Bar:
        def baz():
            pass
  + 
  + # comment after Bar class definition
  + def bar(): pass

As an experiment, I ran python -m tokenize on the relevant test code file (tests/roots/test-directive-code/literal.inc) for the matrix of Python versions that Sphinx uses for testing. The output of that tokenization appears to be identical in all cases:

# copied and pasted the output of each 'Tokenize sample file (for issue 11436)' step into local files
$ sha256sum sphinx-11436-py3*
a2fa3da51720b54caef9f539338a7a220aa4c43c8874675698f684d544c9570d  sphinx-11436-py310
a2fa3da51720b54caef9f539338a7a220aa4c43c8874675698f684d544c9570d  sphinx-11436-py311
a2fa3da51720b54caef9f539338a7a220aa4c43c8874675698f684d544c9570d  sphinx-11436-py312
a2fa3da51720b54caef9f539338a7a220aa4c43c8874675698f684d544c9570d  sphinx-11436-py38
a2fa3da51720b54caef9f539338a7a220aa4c43c8874675698f684d544c9570d  sphinx-11436-py39

@jayaddison
Contributor Author

It seems that the 'navigate back one line' logic in the DefinitionFinder (subtracting one from the DEDENT line position) exists because DEDENT tokens often (always?) carry an end line number associated with a non-empty line of code. That sort of makes sense: most indented code blocks in Python are implicitly closed by a subsequent, less-indented line of code.

Since the code intends to find the start/end range of the preceding code block, it should always omit that non-empty line, and then also omit any empty lines between there and the end of the definition that is being processed.

The potential fix in #11440 attempts to handle the situation without relying on Python version number checks.
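For what it's worth, one way to sidestep DEDENT placement entirely (a sketch of the general idea, not the approach taken in #11440; the helper name is made up) is to track the end line of the last substantive token instead:

```python
import io
import tokenize

# Token types that are bookkeeping rather than code; a block's real
# extent is the end line of the last token NOT in this set.
_TRIVIA = {
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.COMMENT,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}

def last_code_line(source):
    """Last line containing actual code, independent of where the
    tokenizer chooses to place trailing DEDENT tokens."""
    last = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in _TRIVIA:
            last = tok.end[0]
    return last

print(last_code_line("if foo:\n    if bar:\n        pass\n\n"))  # → 3
```

This gives the same answer on every interpreter version, because the last non-trivia token (here, `pass`) ends on the same line regardless of the DEDENT change.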

@jayaddison
Contributor Author

I think this has been resolved by an adjustment from python/cpython#104980 (could you double-check me on that @mgmacias95?).

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 28, 2023