
Detect boundaries of parse inputs #522

Merged · 18 commits into main · Sep 17, 2024

Conversation

Contributor
@lionel- lionel- commented Sep 13, 2024

Progress towards posit-dev/positron#1326.

parse_boundaries() uses the R parser to detect the line boundaries of:

  • Complete inputs (zero or more)
  • Incomplete inputs (optional)
  • Invalid inputs (optional)

The boundaries are for lines of inputs rather than expressions. For instance, foo; bar has one input whereas foo\nbar has two inputs.

Invariants:

  • Empty lines and lines containing only whitespace or comments are complete inputs.
  • There is always at least one range since the empty string is a complete input.
  • The ranges are sorted and non-overlapping.
  • Inputs are classified into complete, incomplete, and error sections. The
    sections are sorted in this order (there cannot be a complete input after
    an incomplete or error one, there cannot be an incomplete input after
    an error one, and error inputs are always trailing).
  • There is at most one incomplete input and at most one error input in a
    set of inputs.
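Sketched as plain Rust, the result shape and the ordering invariant look roughly like this (the field names follow the PR's `ParseBoundaries` struct; `check_invariants` is an illustrative helper, not part of the PR):

```rust
// Sketch of the result shape described above. Field names follow the PR;
// `check_invariants` is illustrative only.
#[derive(Debug)]
struct ParseBoundaries {
    complete: Vec<std::ops::Range<usize>>,
    incomplete: Option<std::ops::Range<usize>>,
    error: Option<std::ops::Range<usize>>,
}

impl ParseBoundaries {
    /// Checks that ranges are sorted and non-overlapping, with the
    /// sections ordered as complete < incomplete < error.
    fn check_invariants(&self) -> bool {
        let mut last_end = 0;
        for r in self
            .complete
            .iter()
            .chain(self.incomplete.iter())
            .chain(self.error.iter())
        {
            if r.start < last_end {
                return false;
            }
            last_end = r.end;
        }
        true
    }
}
```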

Approach:

I originally thought I'd use the parse data artifacts created by side effects to detect boundaries of complete expressions in case of incomplete and invalid inputs (see infrastructure implemented in #508). The goal was to avoid non-linear performance as the number of lines increases. I didn't end up doing that because the parse data is actually unusable for more complex cases than what I tried during my exploration.

Instead, I had to resort to parsing line by line. I start with the entire set of lines and back up one line at a time until the parse fully completes. Along the way I keep track of the error and incomplete sections of the input. In the most common cases (valid inputs, short incomplete input at the end), this should only require one or a few iterations. The boundaries of complete expressions are retrieved from the source references of the parsed expressions (using infrastructure from #482) and then transformed to complete inputs.
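A minimal sketch of that back-off loop, with the R parser replaced by an injected closure (names are hypothetical; the real implementation also derives complete-input boundaries from source references and tracks the incomplete and error sections separately):

```rust
// Hypothetical sketch: start with all lines and back up one line at a
// time until the parse fully completes. `parse` stands in for the R
// parser; the empty input always parses as complete, so the loop
// terminates at `end == 0` in the worst case.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status { Complete, Incomplete, Error }

/// Returns the number of leading lines that parse as a complete input,
/// plus the status of the trailing section, if any.
fn back_off(lines: &[&str], parse: impl Fn(&[&str]) -> Status) -> (usize, Option<Status>) {
    let mut trailing = None;
    for end in (0..=lines.len()).rev() {
        match parse(&lines[..end]) {
            Status::Complete => return (end, trailing),
            status => trailing = Some(status),
        }
    }
    (0, trailing)
}
```

In the common case (the whole input is valid) the first iteration succeeds immediately, which is what keeps this linear in practice.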

Supporting infrastructure:

  • New CharacterVector::slice() method that returns a &[SEXP], and a corresponding TryFrom<&[SEXP]> implementation for CharacterVector. (Eventually List should gain those as well.) These methods make it easy to slice character vectors from the Rust side. I use this to shorten the vector of lines without having to start from a &str and reallocate CHARSXPs every time.

  • SrcFile can now be constructed from a CharacterVector with TryFrom.

  • Vector::create() is no longer unsafe.

  • A TryFrom<SrcRef> implementation for ArkRange, although I didn't end up using it.

Contributor Author
lionel- commented Sep 14, 2024

TODO:

  • Error section should contain line number and error message
  • Whitespace/comment inputs should be tagged with a boolean

Comment on lines +251 to +253
pub unsafe fn chr_cbegin(x: SEXP) -> *const SEXP {
    libr::DATAPTR_RO(x) as *const SEXP
}
Contributor

Everything else here is safe. Do you want to eventually make these unsafe again? If so I'd propose keeping it safe for consistency for now and then we move away from it (but to me this feels like it should be safe)

Contributor Author
@lionel- lionel- Sep 17, 2024

hmm I think it's unsafe because it's taking an untyped SEXP and it doesn't return any Result.

We might argue that it's fine because it throws R errors if the type isn't right (but note that it's an implementation detail that it throws instead of crashing the way other accessors like CAR() do).

But I think the criterion for a safe function is that it guarantees a crash or an R error can't happen (either by dynamic checks feeding the Err path of the Result or by taking typed inputs with strong guarantees such that errors or crashes can't happen).

Does that make sense?
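To illustrate the criterion without R internals: a safe accessor performs the dynamic type check itself and feeds failures into the Err path, instead of leaving the caller to uphold an unchecked precondition. The types and names below are made up for the example, not harp API:

```rust
// Illustrative stand-in for an untyped R object.
#[derive(Debug)]
enum RValue {
    Chr(Vec<String>),
    Int(Vec<i32>),
}

#[derive(Debug, PartialEq)]
struct TypeError;

// Safe by the criterion above: the dynamic check feeds the `Err` path,
// so neither a crash nor an uncaught error can happen for a wrong type.
fn chr_elements(x: &RValue) -> Result<&[String], TypeError> {
    match x {
        RValue::Chr(v) => Ok(v),
        _ => Err(TypeError),
    }
}
```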

Comment on lines -101 to +118
-libr::ParseStatus_PARSE_ERROR => Err(crate::Error::ParseSyntaxError {
+libr::ParseStatus_PARSE_ERROR => Ok(ParseResult::SyntaxError {
Contributor

Oh that's a great idea

crates/harp/src/parser/srcref.rs — outdated review thread (resolved)
Comment on lines 88 to 97
// Return type would be clearer if syntax error was integrated in `ParseResult`
pub fn parse_with_parse_data(text: &str) -> crate::Result<(ParseResult, ParseData)> {
    let srcfile = srcref::SrcFile::try_from(text)?;

    // Fill parse data in `srcfile` by side effect
    let status = parse_status(&ParseInput::SrcFile(&srcfile))?;

    let parse_data = ParseData::from_srcfile(&srcfile)?;

    Ok((status, parse_data))
}
Contributor

  • This method does not seem to be used
  • The leading comment about "Return type would be clearer" seems outdated, because the syntax error is integrated into ParseResult now, I think

Contributor Author
@lionel- lionel- Sep 17, 2024

It's no longer used, neither is ParseData, but I've left those for later in case they are needed. E.g. maybe it'll be useful to generate rowan trees from the R parser in a more efficient way than traversing R AST.

I removed the dangling comment, good catch.

Comment on lines +124 to +131
impl ParseDataNode {
    pub fn as_point_range(&self) -> std::ops::Range<(usize, usize)> {
        std::ops::Range {
            start: (self.line.start, self.column.start),
            end: (self.line.end, self.column.end),
        }
    }
}
Contributor

Not used?

Also feels like it should return a typed ParseDataRange to be more self-documenting

(I've noticed that the ruff team uses custom types for almost everything, and I really think it makes the code easier to read)

Contributor Author

In general I think it's a balance because too much vocabulary can also make code opaque instead of transparent.

But I'm willing to try and make more type aliases to see how this feels.

pub fn slice(&self) -> &[SEXP] {
    unsafe {
        let data = harp::chr_cbegin(self.object.sexp);
        std::slice::from_raw_parts(data, self.len())
    }
}
Contributor

Cool that this does len * mem::size_of::<T>() internally automatically
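As an aside, the same pattern over a plain Vec (purely illustrative, not the harp code): std::slice::from_raw_parts takes an element count, and the len * mem::size_of::<T>() byte arithmetic happens inside:

```rust
// Build a slice view over a contiguous buffer from a raw pointer and an
// element count; `from_raw_parts` handles the byte-size scaling itself.
fn view<T>(v: &Vec<T>) -> &[T] {
    unsafe { std::slice::from_raw_parts(v.as_ptr(), v.len()) }
}
```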

crates/harp/src/vector/character_vector.rs — outdated review thread (resolved)
Comment on lines +16 to +18
pub complete: Vec<std::ops::Range<usize>>,
pub incomplete: Option<std::ops::Range<usize>>,
pub error: Option<std::ops::Range<usize>>,
Contributor

I think this is where we'd benefit from something like TextRange
https://github.com/astral-sh/ruff/blob/bb12fe9d0c71fcb36a5000260f62dbf8411b74b4/crates/ruff_text_size/src/range.rs#L15-L19

I know we have ArkRange but that is tree-sitter specific right?

We probably will want our own typed TextRange (or similar) that we use everywhere

Contributor

I'm just now realizing a better name for this is probably really LineRange (TextRange is for a byte range in a document)

The fact that it took me until the test section of the PR to figure this out probably does mean a named wrapper like this would be very helpful

struct LineRange {
  start: usize,
  end: usize,
}

Contributor Author
@lionel- lionel- Sep 17, 2024

You're right, documenting the contents with a type name would go a long way here.

ArkRange is Ark-specific rather than LSP- or TS-specific. It uses the row/col approach instead of byte offsets.

For byte offsets and ranges we may want to use https://github.com/rust-analyzer/text-size/

But that's storing u32 instead of usize, which makes it harder to interface with the rest of Rust, and we might not need such storage performance.

Contributor Author

These types make it harder to provide generic algorithms, e.g. for fn merge_overlapping<T>(ranges: Vec<std::ops::Range<T>>) -> Vec<std::ops::Range<T>>. Not that we really need the genericity now...
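For reference, a sketch of that generic merge_overlapping (hypothetical, not code from the PR):

```rust
// Merge overlapping (or touching) ranges, assuming nothing about `T`
// beyond ordering and cheap copying.
fn merge_overlapping<T: Ord + Copy>(
    mut ranges: Vec<std::ops::Range<T>>,
) -> Vec<std::ops::Range<T>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<std::ops::Range<T>> = Vec::new();
    for r in ranges {
        // Overlapping or adjacent to the previous range: extend it.
        if let Some(last) = out.last_mut() {
            if r.start <= last.end {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

A newtype like LineRange would need to re-expose start/end (or a Deref to Range) for this kind of algorithm to stay generic.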

Contributor Author

ah but the ruff types are directly imported from rust-analyzer's text-size crate!

crates/ark/src/analysis/parse_boundaries.rs — outdated review thread (resolved)
crates/ark/src/analysis/parse_boundaries.rs — outdated review thread (resolved)

// Grab all code up to current line
let subset = &lines_r.slice()[..current_line + 1];
let subset = CharacterVector::try_from(subset)?;
Contributor

I dislike that you have to do a (potentially large) redundant allocation on the first iteration, which most of the time is all you need to do (i.e. no incomplete and no error)

Contributor

Since you only reference &subset below, it does seem like you could add a special case for the first iteration that remapped lines_r directly as subset so you don't need that first allocation

I know that is uglier but it might be worth it

Contributor

How about this, with a let mut first = true; outside the loop?

        // Parse within source file to get source references.
        // Avoid allocation on the first iteration, since often there are no issues.
        let srcfile = if first {
            first = false;
            harp::srcref::SrcFile::try_from(&lines_r)?
        } else {
            // Grab all code up to current line
            let subset = &lines_r.slice()[..current_line + 1];
            let subset = CharacterVector::try_from(subset)?;
            harp::srcref::SrcFile::try_from(&subset)?
        };

Contributor Author
@lionel- lionel- Sep 17, 2024

With the proposed approach we allocate one big string that is then split by R into a vector of lines (see R implementation of srcfilecopy()).

We could avoid the unnecessary slicing though. I made it:

        // Grab all code up to current line. We don't slice the vector in the
        // first iteration as it's not needed.
        let subset = if current_line == n_lines - 1 {
            lines_r.clone()
        } else {
            CharacterVector::try_from(&lines_r.slice()[..=current_line])?
        };

To support this, CharacterVector is now clonable (it does not deep-copy since none of our wrappers make owning guarantees for the wrapped data).

crates/ark/src/analysis/parse_boundaries.rs Outdated Show resolved Hide resolved
crates/ark/src/analysis/parse_boundaries.rs Outdated Show resolved Hide resolved
Comment on lines 208 to 222
// Fill trailing whitespace (between complete and incomplete|error|eof)
let last_boundary = filled.last().map(|r| r.end).unwrap_or(0);
let next_boundary = boundaries
    .incomplete
    .as_ref()
    .or(boundaries.error.as_ref())
    .map(|r| r.start)
    .unwrap_or(n_lines);

for start in last_boundary..next_boundary {
    filled.push(range_from(start))
}
Contributor

I guess there can't be trailing whitespace between incomplete/error and eof?

Like, this adds ranges for lines from [end_last_complete, start_incomplete_or_error], but what about [end_incomplete_or_error, eof]? I guess end_incomplete_or_error == eof?

Contributor Author

I'll make the comment clearer; it's trying to say exactly what you say: the trailing whitespace that needs to be transformed into complete inputs sits between the complete expressions and the rest.
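In isolation, the fill step sketched as plain Rust (variable names follow the snippet above; range_from is replaced by an explicit one-line range):

```rust
// Turn each trailing whitespace/comment line between the last complete
// input and the start of the incomplete/error section (or eof) into its
// own one-line complete input.
fn fill_trailing(
    filled: &mut Vec<std::ops::Range<usize>>,
    next_section_start: Option<usize>,
    n_lines: usize,
) {
    let last_boundary = filled.last().map(|r| r.end).unwrap_or(0);
    let next_boundary = next_section_start.unwrap_or(n_lines);
    for start in last_boundary..next_boundary {
        filled.push(start..start + 1);
    }
}
```

When there is no incomplete or error section, next_boundary is eof (n_lines), so any trailing blank lines become complete inputs, matching the whitespace invariant.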

let boundaries = parse_boundaries("foo").unwrap();
#[rustfmt::skip]
assert_eq!(boundaries.complete, vec![
    std::ops::Range { start: 0, end: 1 },
Contributor

A custom type would also let us add a new() method for something like LineRange::new(0, 1)

Contributor Author

Good point, though that's arguably straying away from the standard style.

crates/ark/src/analysis/parse_boundaries.rs — review thread (resolved)
Base automatically changed from feature/parse-data to main September 17, 2024 15:14
@lionel- lionel- force-pushed the feature/exprs-boundaries branch from dba279a to 77abff8 on September 17, 2024 15:16
@lionel- lionel- merged commit ee03eee into main Sep 17, 2024
1 of 3 checks passed
@lionel- lionel- deleted the feature/exprs-boundaries branch September 17, 2024 15:17
@github-actions github-actions bot locked and limited conversation to collaborators Sep 17, 2024