
Detect boundaries of parse inputs #522

Merged · 18 commits into main · Sep 17, 2024

Conversation

Contributor
@lionel- lionel- commented Sep 13, 2024

Progress towards posit-dev/positron#1326.

parse_boundaries() uses the R parser to detect the line boundaries of:

  • Complete inputs (zero or more)
  • Incomplete inputs (optional)
  • Invalid inputs (optional)

The boundaries are for lines of inputs rather than expressions. For instance, foo; bar has one input whereas foo\nbar has two inputs.

Invariants:

  • Empty lines and lines containing only whitespace or comments are complete inputs.
  • There is always at least one range since the empty string is a complete input.
  • The ranges are sorted and non-overlapping.
  • Inputs are classified into complete, incomplete, and error sections. The
    sections are sorted in this order (there cannot be a complete input after
    an incomplete or error one, there cannot be an incomplete input after
    an error one, and error inputs are always trailing).
  • There is at most one incomplete input and at most one error input in a
    set of inputs.
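Sketched as plain Rust, the result shape and the ordering invariant look roughly like this (the field names follow the PR's `ParseBoundaries` struct; `check_invariants` is an illustrative helper, not part of the PR):

```rust
// Sketch of the result shape described above. Field names follow the PR;
// `check_invariants` is illustrative only.
#[derive(Debug)]
struct ParseBoundaries {
    complete: Vec<std::ops::Range<usize>>,
    incomplete: Option<std::ops::Range<usize>>,
    error: Option<std::ops::Range<usize>>,
}

impl ParseBoundaries {
    /// Checks that ranges are sorted and non-overlapping, with the
    /// sections ordered as complete < incomplete < error.
    fn check_invariants(&self) -> bool {
        let mut last_end = 0;
        for r in self
            .complete
            .iter()
            .chain(self.incomplete.iter())
            .chain(self.error.iter())
        {
            if r.start < last_end {
                return false;
            }
            last_end = r.end;
        }
        true
    }
}
```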

Approach:

I originally thought I'd use the parse data artifacts created by side effects to detect boundaries of complete expressions in case of incomplete and invalid inputs (see infrastructure implemented in #508). The goal was to avoid non-linear performance as the number of lines increases. I didn't end up doing that because the parse data is actually unusable for more complex cases than what I tried during my exploration.

Instead, I had to resort to parsing line by line. I start with the entire set of lines and back up one line at a time until the parse fully completes. Along the way I keep track of the error and incomplete sections of the input. In the most common cases (valid inputs, short incomplete input at the end), this should only require one or a few iterations. The boundaries of complete expressions are retrieved from the source references of the parsed expressions (using infrastructure from #482) and then transformed to complete inputs.
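A minimal sketch of that back-off loop, with the R parser replaced by an injected closure (names are hypothetical; the real implementation also derives complete-input boundaries from source references and tracks the incomplete and error sections separately):

```rust
// Hypothetical sketch: start with all lines and back up one line at a
// time until the parse fully completes. `parse` stands in for the R
// parser; the empty input always parses as complete, so the loop
// terminates at `end == 0` in the worst case.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status { Complete, Incomplete, Error }

/// Returns the number of leading lines that parse as a complete input,
/// plus the status of the trailing section, if any.
fn back_off(lines: &[&str], parse: impl Fn(&[&str]) -> Status) -> (usize, Option<Status>) {
    let mut trailing = None;
    for end in (0..=lines.len()).rev() {
        match parse(&lines[..end]) {
            Status::Complete => return (end, trailing),
            status => trailing = Some(status),
        }
    }
    (0, trailing)
}
```

In the common case (the whole input is valid) the first iteration succeeds immediately, which is what keeps this linear in practice.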

Supporting infrastructure:

  • New CharacterVector::slice() method that returns a &[SEXP], and a corresponding TryFrom<&[SEXP]> implementation for CharacterVector. (Eventually List should gain those as well.) These methods make it easy to slice character vectors from the Rust side. I use this to shorten the vector of lines without having to start from a &str and reallocate CHARSXPs every time.

  • SrcFile can now be constructed from a CharacterVector with TryFrom.

  • Vector::create() is no longer unsafe.

  • A TryFrom<SrcRef> implementation for ArkRange, although I didn't end up using it.

Contributor Author
lionel- commented Sep 14, 2024

TODO:

  • Error section should contain line number and error message
  • Whitespace/comment inputs should be tagged with a boolean

Comment on lines +251 to +253
pub unsafe fn chr_cbegin(x: SEXP) -> *const SEXP {
    libr::DATAPTR_RO(x) as *const SEXP
}
Contributor

Everything else here is safe. Do you want to eventually make these unsafe again? If so I'd propose keeping it safe for consistency for now and then we move away from it (but to me this feels like it should be safe)

Contributor Author
@lionel- lionel- Sep 17, 2024

hmm I think it's unsafe because it's taking an untyped SEXP and it doesn't return any Result.

We might argue that it's fine because it throws R errors if the type isn't right (but note that it's an implementation detail that it throws instead of crashing the way other accessors like CAR() do).

But I think the criterion for a safe function is that it guarantees a crash or an R error can't happen (either by dynamic checks feeding the Err path of the Result or by taking typed inputs with strong guarantees such that errors or crashes can't happen).

Does that make sense?
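To illustrate the criterion without R internals: a safe accessor performs the dynamic type check itself and feeds failures into the Err path, instead of leaving the caller to uphold an unchecked precondition. The types and names below are made up for the example, not harp API:

```rust
// Illustrative stand-in for an untyped R object.
#[derive(Debug)]
enum RValue {
    Chr(Vec<String>),
    Int(Vec<i32>),
}

#[derive(Debug, PartialEq)]
struct TypeError;

// Safe by the criterion above: the dynamic check feeds the `Err` path,
// so neither a crash nor an uncaught error can happen for a wrong type.
fn chr_elements(x: &RValue) -> Result<&[String], TypeError> {
    match x {
        RValue::Chr(v) => Ok(v),
        _ => Err(TypeError),
    }
}
```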

Comment on lines -101 to +118
-libr::ParseStatus_PARSE_ERROR => Err(crate::Error::ParseSyntaxError {
+libr::ParseStatus_PARSE_ERROR => Ok(ParseResult::SyntaxError {
Contributor

Oh that's a great idea

crates/harp/src/parser/srcref.rs — outdated review thread (resolved)
Comment on lines 88 to 97
// Return type would be clearer if syntax error was integrated in `ParseResult`
pub fn parse_with_parse_data(text: &str) -> crate::Result<(ParseResult, ParseData)> {
    let srcfile = srcref::SrcFile::try_from(text)?;

    // Fill parse data in `srcfile` by side effect
    let status = parse_status(&ParseInput::SrcFile(&srcfile))?;

    let parse_data = ParseData::from_srcfile(&srcfile)?;

    Ok((status, parse_data))
}
Contributor

  • This method does not seem to be used
  • The leading comment about "Return type would be clearer" seems outdated, because the syntax error is integrated into ParseResult now, I think

Contributor Author
@lionel- lionel- Sep 17, 2024

It's no longer used, neither is ParseData, but I've left those for later in case they are needed. E.g. maybe it'll be useful to generate rowan trees from the R parser in a more efficient way than traversing R AST.

I removed the dangling comment, good catch.

Comment on lines +124 to +131
impl ParseDataNode {
    pub fn as_point_range(&self) -> std::ops::Range<(usize, usize)> {
        std::ops::Range {
            start: (self.line.start, self.column.start),
            end: (self.line.end, self.column.end),
        }
    }
}
Contributor

Not used?

Also feels like it should return a typed ParseDataRange to be more self-documenting

(I've noticed that the ruff team uses custom types for almost everything, and I really think it makes the code easier to read)

Contributor Author

In general I think it's a balance because too much vocabulary can also make code opaque instead of transparent.

But I'm willing to try and make more type aliases to see how this feels.

pub fn slice(&self) -> &[SEXP] {
    unsafe {
        let data = harp::chr_cbegin(self.object.sexp);
        std::slice::from_raw_parts(data, self.len())
    }
}
Contributor

Cool that this does len * mem::size_of::<T>() internally automatically
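As an aside, the same pattern over a plain Vec (purely illustrative, not the harp code): std::slice::from_raw_parts takes an element count, and the len * mem::size_of::<T>() byte arithmetic happens inside:

```rust
// Build a slice view over a contiguous buffer from a raw pointer and an
// element count; `from_raw_parts` handles the byte-size scaling itself.
fn view<T>(v: &Vec<T>) -> &[T] {
    unsafe { std::slice::from_raw_parts(v.as_ptr(), v.len()) }
}
```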

crates/harp/src/vector/character_vector.rs — outdated review thread (resolved)
Comment on lines +16 to +18
pub complete: Vec<std::ops::Range<usize>>,
pub incomplete: Option<std::ops::Range<usize>>,
pub error: Option<std::ops::Range<usize>>,
Contributor

I think this is where we'd benefit from something like TextRange
https://github.com/astral-sh/ruff/blob/bb12fe9d0c71fcb36a5000260f62dbf8411b74b4/crates/ruff_text_size/src/range.rs#L15-L19

I know we have ArkRange but that is tree-sitter specific right?

We probably will want our own typed TextRange (or similar) that we use everywhere

Contributor

I'm just now realizing a better name for this is probably really LineRange (TextRange is for a byte range in a document)

The fact that it took me until the test section of the PR to figure this out probably does mean a named wrapper like this would be very helpful

struct LineRange {
  start: usize,
  end: usize,
}

Contributor Author
@lionel- lionel- Sep 17, 2024

You're right, documenting the contents with a type name would go a long way here.

ArkRange is Ark-specific rather than LSP- or TS-specific. It uses the row/col approach instead of byte offsets.

For byte offsets and ranges we may want to use https://github.com/rust-analyzer/text-size/

But that's storing u32 instead of usize, which makes it harder to interface with the rest of Rust, and we might not need such storage performance.

Contributor Author

These types make it harder to provide generic algorithms, e.g. for fn merge_overlapping<T>(ranges: Vec<std::ops::Range<T>>) -> Vec<std::ops::Range<T>>. Not that we really need the genericity now...
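For reference, a sketch of that generic merge_overlapping (hypothetical, not code from the PR):

```rust
// Merge overlapping (or touching) ranges, assuming nothing about `T`
// beyond ordering and cheap copying.
fn merge_overlapping<T: Ord + Copy>(
    mut ranges: Vec<std::ops::Range<T>>,
) -> Vec<std::ops::Range<T>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<std::ops::Range<T>> = Vec::new();
    for r in ranges {
        // Overlapping or adjacent to the previous range: extend it.
        if let Some(last) = out.last_mut() {
            if r.start <= last.end {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

A newtype like LineRange would need to re-expose start/end (or a Deref to Range) for this kind of algorithm to stay generic.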

Contributor Author

ah but the ruff types are directly imported from rust-analyzer's text-size crate!

crates/ark/src/analysis/parse_boundaries.rs — outdated review thread (resolved)
crates/ark/src/analysis/parse_boundaries.rs — outdated review thread (resolved)

// Grab all code up to current line
let subset = &lines_r.slice()[..current_line + 1];
let subset = CharacterVector::try_from(subset)?;
Contributor

I dislike that you have to do a (potentially large) redundant allocation on the first iteration, which most of the time is all you need to do (i.e. no incomplete and no error)

Contributor

Since you only reference &subset below, it does seem like you could add a special case for the first iteration that remapped lines_r directly as subset so you don't need that first allocation

I know that is uglier but it might be worth it

Contributor

How about this, with a let mut first = true; outside the loop?

        // Parse within source file to get source references.
        // Avoid allocation on the first iteration, since often there are no issues.
        let srcfile = if first {
            first = false;
            harp::srcref::SrcFile::try_from(&lines_r)?
        } else {
            // Grab all code up to current line
            let subset = &lines_r.slice()[..current_line + 1];
            let subset = CharacterVector::try_from(subset)?;
            harp::srcref::SrcFile::try_from(&subset)?
        };

Contributor Author
@lionel- lionel- Sep 17, 2024

With the proposed approach we allocate one big string that is then split by R into a vector of lines (see R implementation of srcfilecopy()).

We could avoid the unnecessary slicing though. I made it:

        // Grab all code up to current line. We don't slice the vector in the
        // first iteration as it's not needed.
        let subset = if current_line == n_lines - 1 {
            lines_r.clone()
        } else {
            CharacterVector::try_from(&lines_r.slice()[..=current_line])?
        };

To support this, CharacterVector is now clonable (it does not deep-copy since none of our wrappers make owning guarantees for the wrapped data).

crates/ark/src/analysis/parse_boundaries.rs Outdated Show resolved Hide resolved
crates/ark/src/analysis/parse_boundaries.rs Outdated Show resolved Hide resolved
Comment on lines 208 to 222
// Fill trailing whitespace (between complete and incomplete|error|eof)
let last_boundary = filled.last().map(|r| r.end).unwrap_or(0);
let next_boundary = boundaries
    .incomplete
    .as_ref()
    .or(boundaries.error.as_ref())
    .map(|r| r.start)
    .unwrap_or(n_lines);

for start in last_boundary..next_boundary {
    filled.push(range_from(start))
}
Contributor

I guess there can't be trailing whitespace between incomplete/error and eof?

Like, this adds ranges for lines from [end_last_complete, start_incomplete_or_error], but what about [end_incomplete_or_error, eof]? I guess end_incomplete_or_error == eof?

Contributor Author

I'll make the comment clearer; it's trying to say exactly what you say: the trailing whitespace that needs to be transformed into complete inputs sits between the complete expressions and the rest.
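In isolation, the fill step sketched as plain Rust (variable names follow the snippet above; range_from is replaced by an explicit one-line range):

```rust
// Turn each trailing whitespace/comment line between the last complete
// input and the start of the incomplete/error section (or eof) into its
// own one-line complete input.
fn fill_trailing(
    filled: &mut Vec<std::ops::Range<usize>>,
    next_section_start: Option<usize>,
    n_lines: usize,
) {
    let last_boundary = filled.last().map(|r| r.end).unwrap_or(0);
    let next_boundary = next_section_start.unwrap_or(n_lines);
    for start in last_boundary..next_boundary {
        filled.push(start..start + 1);
    }
}
```

When there is no incomplete or error section, next_boundary is eof (n_lines), so any trailing blank lines become complete inputs, matching the whitespace invariant.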

let boundaries = parse_boundaries("foo").unwrap();
#[rustfmt::skip]
assert_eq!(boundaries.complete, vec![
    std::ops::Range { start: 0, end: 1 },
Contributor

A custom type would also let us add a new() method for something like LineRange::new(0, 1)

Contributor Author

Good point, though that's arguably straying away from the standard style.

crates/ark/src/analysis/parse_boundaries.rs — review thread (resolved)
Base automatically changed from feature/parse-data to main September 17, 2024 15:14
@lionel- lionel- force-pushed the feature/exprs-boundaries branch from dba279a to 77abff8 on September 17, 2024 15:16
@lionel- lionel- merged commit ee03eee into main Sep 17, 2024
1 of 3 checks passed
@lionel- lionel- deleted the feature/exprs-boundaries branch September 17, 2024 15:17
@github-actions github-actions bot locked and limited conversation to collaborators Sep 17, 2024