feat: report line and column on requirements parser errors #2100

Merged
130 changes: 103 additions & 27 deletions crates/requirements-txt/src/lib.rs
@@ -317,6 +317,27 @@ pub struct RequirementsTxt {
pub no_index: bool,
}

/// Calculates column and line based on the cursor and content.
Member: Assuming you use unicode-width per my suggestion below, can you just add a quick note here documenting that we define column in this context as the "offset according to the visual width of each codepoint"?

Member: If unicode-width is not the right thing to use, then just add: "offset according to the number of Unicode codepoints."

Collaborator Author: I left it as a comment in 223fd99

fn calculate_line_column_pair(content: &str, position: usize) -> (usize, usize) {
let mut line = 1;
let mut column = 1;

for (index, char) in content.char_indices() {
Member: I think we want byte indices rather than char indices here? CC @BurntSushi

Member: char_indices does actually return byte offsets. The index refers to the offset in content at which char starts. For example, this code:

let content = "a💩b";
for (i, ch) in content.char_indices() {
    eprintln!("{}:{}", i, ch);
}

Has this output:

0:a
1:💩
5:b

Since 💩 has a UTF-8 encoding that uses 4 bytes.

if index >= position {
break;
}
// This should work fine for both Windows and Linux line endings
if char == '\n' {
Member: So here, if we see \r\n, I think we technically want to avoid incrementing column after we see the \r. I can't see how this would matter in practice for this use case, but you can see an example of how that's done in Ruff: https://github.com/astral-sh/ruff/blob/c9931a548ff07c031130571f3664343bea224026/crates/ruff_source_file/src/line_index.rs#L29

Collaborator Author (@samypr100, Mar 1, 2024):

> I think we technically want to avoid incrementing column after we see the \r

Agreed. I left this as-is on purpose, but I debated whether to keep it this way for simplicity or to track a prev_char reference for these kinds of checks. Luckily it can be changed easily, whichever route we want to go.

Member: I'd prefer to change it just for completeness, in case this gets reused elsewhere. Are you OK to modify it?

Collaborator Author: Let me know if this is what you had in mind: d0f7161

Member: I was aiming for something more like: if we see \r, check whether the next character is \n; if so, skip it. (That would correctly handle \r, \n, and \r\n.)
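
The skip-the-`\n`-after-`\r` approach suggested here could be sketched roughly like this (a hypothetical standalone version for illustration, not the code that landed in the PR):

```rust
// Counts (line, column) up to `position` (a byte offset into `content`),
// treating "\r\n" as a single line terminator. Lines and columns are
// 1-based; columns advance per codepoint. Illustrative sketch only.
fn line_column(content: &str, position: usize) -> (usize, usize) {
    let (mut line, mut column) = (1, 1);
    let mut chars = content.char_indices().peekable();
    while let Some((index, ch)) = chars.next() {
        if index >= position {
            break;
        }
        match ch {
            '\r' => {
                // If this is "\r\n", consume the '\n' too, so that a
                // CRLF pair counts as exactly one newline.
                if let Some(&(_, '\n')) = chars.peek() {
                    chars.next();
                }
                line += 1;
                column = 1;
            }
            '\n' => {
                line += 1;
                column = 1;
            }
            _ => column += 1,
        }
    }
    (line, column)
}

fn main() {
    // The 'b' after a CRLF is line 2, column 1 — same as after a lone LF.
    assert_eq!(line_column("a\r\nb", 3), (2, 1));
    assert_eq!(line_column("a\nb", 2), (2, 1));
    assert_eq!(line_column("abc", 2), (1, 3));
    println!("ok");
}
```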

Collaborator Author: Sorry for the delay (work 😆), hopefully this is closer: 223fd99

line += 1;
column = 1;
} else if char != '\r' {
column += 1;
Member: I would suggest using the unicode-width crate to compute the visual width of the codepoint via UnicodeWidthChar::width.

Member (@MichaReiser, Mar 1, 2024): I don't think we should be using width here. I would need to check again, but as I remember, editors use character offsets for column indices, not their visual width. Ruff also uses character offsets rather than width for diagnostics. uv's output should match what editors use, or shortcuts like go-to-position won't work.

Member: Yeah, I agree that we should do whatever editors are likely to use here.

Member: Yeah, I think char_indices is okay here, but I would like to tweak the newline handling slightly, in line with the Ruff reference above.

}
}

(line, column)
}

impl RequirementsTxt {
/// See module level documentation
#[instrument(skip_all, fields(requirements_txt = requirements_txt.as_ref().as_os_str().to_str()))]
@@ -412,9 +433,11 @@ impl RequirementsTxt {
}
RequirementsTxtStatement::IndexUrl(url) => {
if data.index_url.is_some() {
let (line, column) = calculate_line_column_pair(content, s.cursor());
return Err(RequirementsTxtParserError::Parser {
message: "Multiple `--index-url` values provided".to_string(),
location: s.cursor(),
line,
column,
});
}
data.index_url = Some(url);
@@ -453,36 +476,36 @@ fn parse_entry(
eat_wrappable_whitespace(s);
while s.at(['\n', '\r', '#']) {
// skip comments
eat_trailing_line(s)?;
eat_trailing_line(content, s)?;
eat_wrappable_whitespace(s);
}

let start = s.cursor();
Ok(Some(if s.eat_if("-r") || s.eat_if("--requirement") {
let requirements_file = parse_value(s, |c: char| !['\n', '\r', '#'].contains(&c))?;
let requirements_file = parse_value(content, s, |c: char| !['\n', '\r', '#'].contains(&c))?;
let end = s.cursor();
eat_trailing_line(s)?;
eat_trailing_line(content, s)?;
RequirementsTxtStatement::Requirements {
filename: requirements_file.to_string(),
start,
end,
}
} else if s.eat_if("-c") || s.eat_if("--constraint") {
let constraints_file = parse_value(s, |c: char| !['\n', '\r', '#'].contains(&c))?;
let constraints_file = parse_value(content, s, |c: char| !['\n', '\r', '#'].contains(&c))?;
let end = s.cursor();
eat_trailing_line(s)?;
eat_trailing_line(content, s)?;
RequirementsTxtStatement::Constraint {
filename: constraints_file.to_string(),
start,
end,
}
} else if s.eat_if("-e") || s.eat_if("--editable") {
let path_or_url = parse_value(s, |c: char| !['\n', '\r'].contains(&c))?;
let path_or_url = parse_value(content, s, |c: char| !['\n', '\r'].contains(&c))?;
let editable_requirement = EditableRequirement::parse(path_or_url, working_dir)
.map_err(|err| err.with_offset(start))?;
RequirementsTxtStatement::EditableRequirement(editable_requirement)
} else if s.eat_if("-i") || s.eat_if("--index-url") {
let given = parse_value(s, |c: char| !['\n', '\r'].contains(&c))?;
let given = parse_value(content, s, |c: char| !['\n', '\r'].contains(&c))?;
let url = VerbatimUrl::parse(given)
.map(|url| url.with_given(given.to_owned()))
.map_err(|err| RequirementsTxtParserError::Url {
@@ -493,7 +516,7 @@
})?;
RequirementsTxtStatement::IndexUrl(url)
} else if s.eat_if("--extra-index-url") {
let given = parse_value(s, |c: char| !['\n', '\r'].contains(&c))?;
let given = parse_value(content, s, |c: char| !['\n', '\r'].contains(&c))?;
let url = VerbatimUrl::parse(given)
.map(|url| url.with_given(given.to_owned()))
.map_err(|err| RequirementsTxtParserError::Url {
@@ -506,7 +529,7 @@
} else if s.eat_if("--no-index") {
RequirementsTxtStatement::NoIndex
} else if s.eat_if("--find-links") || s.eat_if("-f") {
let path_or_url = parse_value(s, |c: char| !['\n', '\r'].contains(&c))?;
let path_or_url = parse_value(content, s, |c: char| !['\n', '\r'].contains(&c))?;
let path_or_url = FindLink::parse(path_or_url, working_dir).map_err(|err| {
RequirementsTxtParserError::Url {
source: err,
@@ -524,11 +547,13 @@
editable: false,
})
} else if let Some(char) = s.peek() {
let (line, column) = calculate_line_column_pair(content, s.cursor());
return Err(RequirementsTxtParserError::Parser {
message: format!(
"Unexpected '{char}', expected '-c', '-e', '-r' or the start of a requirement"
),
location: s.cursor(),
line,
column,
});
} else {
// EOF
@@ -549,7 +574,7 @@ fn eat_wrappable_whitespace<'a>(s: &mut Scanner<'a>) -> &'a str {
}

/// Eats the end of line or a potential trailing comma
fn eat_trailing_line(s: &mut Scanner) -> Result<(), RequirementsTxtParserError> {
fn eat_trailing_line(content: &str, s: &mut Scanner) -> Result<(), RequirementsTxtParserError> {
s.eat_while([' ', '\t']);
match s.eat() {
None | Some('\n') => {} // End of file or end of line, nothing to do
@@ -563,9 +588,11 @@ fn eat_trailing_line(content: &str, s: &mut Scanner) -> Result<(), RequirementsTxtParserError>
}
}
Some(other) => {
let (line, column) = calculate_line_column_pair(content, s.cursor());
return Err(RequirementsTxtParserError::Parser {
message: format!("Expected comment or end-of-line, found '{other}'"),
location: s.cursor(),
line,
column,
});
}
}
@@ -669,8 +696,8 @@ fn parse_requirement_and_hashes(
}
})?;
let hashes = if has_hashes {
let hashes = parse_hashes(s)?;
eat_trailing_line(s)?;
let hashes = parse_hashes(content, s)?;
eat_trailing_line(content, s)?;
hashes
} else {
Vec::new()
@@ -679,32 +706,35 @@
}

/// Parse `--hash=... --hash ...` after a requirement
fn parse_hashes(s: &mut Scanner) -> Result<Vec<String>, RequirementsTxtParserError> {
fn parse_hashes(content: &str, s: &mut Scanner) -> Result<Vec<String>, RequirementsTxtParserError> {
let mut hashes = Vec::new();
if s.eat_while("--hash").is_empty() {
let (line, column) = calculate_line_column_pair(content, s.cursor());
return Err(RequirementsTxtParserError::Parser {
message: format!(
"Expected '--hash', found '{:?}'",
s.eat_while(|c: char| !c.is_whitespace())
),
location: s.cursor(),
line,
column,
});
}
let hash = parse_value(s, |c: char| !c.is_whitespace())?;
let hash = parse_value(content, s, |c: char| !c.is_whitespace())?;
hashes.push(hash.to_string());
loop {
eat_wrappable_whitespace(s);
if !s.eat_if("--hash") {
break;
}
let hash = parse_value(s, |c: char| !c.is_whitespace())?;
let hash = parse_value(content, s, |c: char| !c.is_whitespace())?;
hashes.push(hash.to_string());
}
Ok(hashes)
}

/// In `-<key>=<value>` or `-<key> value`, this parses the part after the key
fn parse_value<'a, T>(
content: &str,
s: &mut Scanner<'a>,
while_pattern: impl Pattern<T>,
) -> Result<&'a str, RequirementsTxtParserError> {
@@ -716,9 +746,11 @@ fn parse_value<'a, T>(
s.eat_whitespace();
Ok(s.eat_while(while_pattern).trim_end())
} else {
let (line, column) = calculate_line_column_pair(content, s.cursor());
Err(RequirementsTxtParserError::Parser {
message: format!("Expected '=' or whitespace, found {:?}", s.peek()),
location: s.cursor(),
line,
column,
})
}
}
@@ -746,7 +778,8 @@ pub enum RequirementsTxtParserError {
MissingEditablePrefix(String),
Parser {
message: String,
location: usize,
line: usize,
column: usize,
},
UnsupportedRequirement {
source: Pep508Error,
@@ -786,9 +819,14 @@ impl RequirementsTxtParserError {
Self::UnsupportedUrl(url) => Self::UnsupportedUrl(url),
Self::MissingRequirementPrefix(given) => Self::MissingRequirementPrefix(given),
Self::MissingEditablePrefix(given) => Self::MissingEditablePrefix(given),
Self::Parser { message, location } => Self::Parser {
Self::Parser {
message,
line,
column,
} => Self::Parser {
message,
location: location + offset,
line,
column,
},
Self::UnsupportedRequirement { source, start, end } => Self::UnsupportedRequirement {
source,
@@ -831,8 +869,12 @@ impl Display for RequirementsTxtParserError {
"Requirement `{given}` looks like a directory but was passed as a package name. Did you mean `-e {given}`?"
)
}
Self::Parser { message, location } => {
write!(f, "{message} at position {location}")
Self::Parser {
message,
line,
column,
} => {
write!(f, "{message} at position {line}:{column}")
}
Self::UnsupportedRequirement { start, end, .. } => {
write!(f, "Unsupported requirement in position {start} to {end}")
@@ -903,10 +945,14 @@ impl Display for RequirementsTxtFileError {
self.file.simplified_display(),
)
}
RequirementsTxtParserError::Parser { message, location } => {
RequirementsTxtParserError::Parser {
message,
line,
column,
} => {
write!(
f,
"{message} in `{}` at position {location}",
"{message} in `{}` at position {line}:{column}",
Member: Could you make that format <REQUIREMENTS_TXT>:<line>:<col>? This format is supported by many IDEs.

Collaborator Author: Changed in 223fd99

Collaborator Author: Note, mileage may vary: I noticed that my IDE does highlight full paths, but not relative paths like ./requirements.txt. We could try to always canonicalize?

Member: I think using current_dir() and joining the path on top of it is the best way; canonicalize could yield a path outside the apparent project root (we used to canonicalize everything, but rolled that back since it caused problems where software expected the "virtual" name of the link).
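
That suggestion could look roughly like the following sketch (the helper name is hypothetical, not code from this PR): anchor a relative path to current_dir() rather than canonicalizing, so symlinked paths keep their apparent name.

```rust
use std::env;
use std::path::{Path, PathBuf};

// Make a possibly-relative path absolute by joining it onto the current
// working directory, without resolving symlinks. Illustrative sketch only;
// a canonicalize() call here could escape the apparent project root.
fn absolutize(path: &Path) -> PathBuf {
    if path.is_absolute() {
        path.to_path_buf()
    } else {
        env::current_dir()
            .map(|cwd| cwd.join(path))
            .unwrap_or_else(|_| path.to_path_buf())
    }
}

fn main() {
    let p = absolutize(Path::new("requirements.txt"));
    // Absolute whenever current_dir() succeeds; symlink components survive.
    assert!(p.is_absolute() || env::current_dir().is_err());
    println!("{}", p.display());
}
```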

self.file.simplified_display(),
)
}
@@ -1279,4 +1325,34 @@ mod test {
Some(("../editable[", "[dev]"))
);
}

#[test]
fn parser_error_line_and_column() -> Result<()> {
let temp_dir = assert_fs::TempDir::new()?;
let requirements_txt = temp_dir.child("requirements.txt");
requirements_txt.write_str(indoc! {"
numpy>=1,<2
--borken
tqdm
"})?;

let error = RequirementsTxt::parse(requirements_txt.path(), temp_dir.path()).unwrap_err();
let errors = anyhow::Error::new(error).chain().join("\n");

let requirement_txt =
regex::escape(&requirements_txt.path().simplified_display().to_string());
let filters = vec![
(requirement_txt.as_str(), "<REQUIREMENTS_TXT>"),
(r"\\", "/"),
];
insta::with_settings!({
filters => filters
}, {
insta::assert_display_snapshot!(errors, @r###"
Unexpected '-', expected '-c', '-e', '-r' or the start of a requirement in `<REQUIREMENTS_TXT>` at position 2:3
"###);
});

Ok(())
}
Member: Could you add a couple of tests that include non-ASCII codepoints? More concretely, one test with a non-ASCII codepoint like, say, 💩, and another test with a grapheme cluster made up of more than one codepoint, like à̖ (that's "a\u{0300}\u{0316}" as a string literal in Rust).
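
For context, those two suggested inputs behave quite differently from ASCII under codepoint-based counting; a quick sketch of what Rust's chars() sees (illustrative only):

```rust
fn main() {
    // 💩 is a single codepoint encoded as four UTF-8 bytes.
    let poo = "💩";
    assert_eq!(poo.chars().count(), 1);
    assert_eq!(poo.len(), 4);

    // "a\u{0300}\u{0316}" renders as one grapheme cluster (à̖), but it
    // consists of three codepoints, so a codepoint-based column counter
    // advances by three.
    let cluster = "a\u{0300}\u{0316}";
    assert_eq!(cluster.chars().count(), 3);
    assert_eq!(cluster.len(), 5); // 1 + 2 + 2 bytes
    println!("ok");
}
```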

Collaborator Author (@samypr100, Mar 1, 2024): Done in 223fd99; included one with two codepoints, and your example with three.

}