Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: report line and column on requirements parser errors #2100

Merged

Conversation

samypr100
Copy link
Contributor

@samypr100 samypr100 commented Mar 1, 2024

Summary

Closes #2012

This changes RequirementsTxtParserError::Parser to take a line, column instead of cursor location to improve reporting of parser errors. A new function was added to compute the line and column based on the content and cursor location when a parser error occurs for simplicity.

Given uv pip compile .\requirements.txt of below

numpy>=1,<2
  --borken
tqdm

Before:

error: Unexpected '-', expected '-c', '-e', '-r' or the start of a requirement in `.\requirements.txt` at position 14

After:

error: Unexpected '-', expected '-c', '-e', '-r' or the start of a requirement in `.\requirements.txt` at position 2:3

Open Question: Do we want to support line:column for other types of errors? I didn't look dig other potential error types where this might be desired.

Test Plan

New test was added to requirements-txt crate with this example.

@samypr100 samypr100 marked this pull request as ready for review March 1, 2024 03:38
break;
}
// This should work fine for both Windows and Linux line endings
if char == '\n' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here, if we see \r\n, I think we technically want to avoid incrementing column after we see the \r. I can't see how this would matter in practice for this use-case, but e.g., you can see an example of how that's done here in Ruff: https://github.com/astral-sh/ruff/blob/c9931a548ff07c031130571f3664343bea224026/crates/ruff_source_file/src/line_index.rs#L29

Copy link
Contributor Author

@samypr100 samypr100 Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we technically want to avoid incrementing column after we see the \r

Agreed, I left this as-is on purpose but I really debated to either keep this for simplicity or track the prev_char reference for these types of checks. Luckily it can be changed easily whichever route we want to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to change it just for completeness, in case this gets reused elsewhere. Are you ok to modify it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if this is what you had in mind d0f7161

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was aiming for something more like: if we see \r, check if the next character is \n; if so, skip it. (That would correctly handle \r, \n, and \r\n.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay (work 😆), hopefully this is closer 223fd99

write!(
f,
"{message} in `{}` at position {location}",
"{message} in `{}` at position {line}:{column}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make that format at <REQUIREMENTS_TXT>:<line>:<col>? This format is support by many IDEs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 223fd99

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, mileage may vary, I noticed that for full-paths it does highlight on my IDE, but not relative paths like ./requirements.txt. We could try to canonicalize always?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using current_dir() and joining the path on top of it is the best way, canonicalize could be a path outside the apparent project root (we used to canonicalize everything but rolled that back since it caused problems where software expected the "virtual" name of the link).

let mut line = 1;
let mut column = 1;

for (index, char) in content.char_indices() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want byte indices rather than char indices here? CC @BurntSushi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char_indices does actually return byte offsets. The index is referring to the index in content of where char starts. For example, this code:

let content = "a💩b";
for (i, ch) in content.char_indices() {
    eprintln!("{}:{}", i, ch);
}

Has this output:

0:a
1:💩
5:b

Since 💩 has a UTF-8 encoding that uses 4 bytes.

let mut line = 1;
let mut column = 1;

for (index, char) in content.char_indices() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char_indices does actually return byte offsets. The index is referring to the index in content of where char starts. For example, this code:

let content = "a💩b";
for (i, ch) in content.char_indices() {
    eprintln!("{}:{}", i, ch);
}

Has this output:

0:a
1:💩
5:b

Since 💩 has a UTF-8 encoding that uses 4 bytes.

line += 1;
column = 1;
} else if char != '\r' {
column += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using the unicode-width crate to compute the visual width of the codepoint via UnicodeWidthChar::width.

Copy link
Member

@MichaReiser MichaReiser Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be using width here. I would need to check again but what I remember is that editors use character offsets for column indices and not their width. Ruff also uses character offsets and not the width for diagnostics. What uv output should match editors or shortcuts like go to position won't work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree that we should do whatever editors are likely to use here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think char_indices is okay here, but I would like to tweak the newline handling slightly in line with the Ruff reference above.

@@ -317,6 +317,27 @@ pub struct RequirementsTxt {
pub no_index: bool,
}

/// Calculates column and line based on the cursor and content.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you use unicode-width per my suggestion below, can you just add a quick note here documenting that we define column in this context as the, "offset according to the visual width of each codepoint."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unicode-width is not the right thing to use, then just adding, "offset according to the number of Unicode codepoints."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it as a comment in 223fd99

});

Ok(())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a couple of tests that include non-ASCII codepoints. More concretely, one test with a non-ASCII codepoint like, say, 💩. And then another test with grapheme cluster made up of more than one codepoint like à̖ (that's a\u{0300}\u{0316} as a string literal in Rust).

Copy link
Contributor Author

@samypr100 samypr100 Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 223fd99, included one with two codepoints and your example one with three.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

}
match char {
'\r' => {
// If the next character is a newline, skip it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tweaked this a little because \r on its own should be considered a newline (it's not used on (m)any modern platforms, but older macOS did use it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Np, I went a bit back and forth on that. Glad it's settled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the back-and-forth, I just have scars from Ruff where we didn't handle all the newline kinds and eventually hit panics in certain projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, makes sense, and feedback is always welcomed 💯

@charliermarsh charliermarsh added the error messages Messaging when something goes wrong label Mar 1, 2024
@charliermarsh charliermarsh enabled auto-merge (squash) March 1, 2024 21:41
@charliermarsh charliermarsh merged commit c7c3aff into astral-sh:main Mar 1, 2024
7 checks passed
@samypr100 samypr100 deleted the requirements-parsing-line-column branch March 11, 2024 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error messages Messaging when something goes wrong
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Report line numbers in requirements.txt parser
5 participants