Detect boundaries of parse inputs #522
Conversation
TODO:
- Error section should contain line number and error message
- Whitespace/comment inputs should be tagged with a boolean
pub unsafe fn chr_cbegin(x: SEXP) -> *const SEXP {
    libr::DATAPTR_RO(x) as *const SEXP
}
Everything else here is safe. Do you want to eventually make these unsafe again? If so, I'd propose keeping it safe for consistency for now and then moving away from it later (but to me this feels like it should be safe).
Hmm, I think it's unsafe because it's taking an untyped `SEXP` and it doesn't return any `Result`.
We might argue that it's fine because it throws R errors if the type isn't right (but note that it's an implementation detail that it does so instead of crashing like other accessors like `CAR()`).
But I think the criterion for a safe function is that it guarantees a crash or an R error can't happen (either by dynamic checks feeding the `Err` path of the `Result`, or by taking typed inputs with strong guarantees such that errors or crashes can't happen).
Does that make sense?
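For illustration, a safe variant along those lines might look like the sketch below; `r_is_character` and the `UnexpectedType` error variant are hypothetical placeholders for illustration, not harp's actual API:

// Hedged sketch only: a dynamic check feeds the `Err` path so the function
// can be marked safe. `r_is_character` and `Error::UnexpectedType` are
// hypothetical names.
pub fn chr_cbegin_checked(x: SEXP) -> crate::Result<*const SEXP> {
    if !r_is_character(x) {
        return Err(crate::Error::UnexpectedType);
    }
    // SAFETY: `x` is known to be a character vector, so `DATAPTR_RO()`
    // points at a contiguous array of `SEXP` (CHARSXP) elements.
    Ok(unsafe { libr::DATAPTR_RO(x) as *const SEXP })
}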
- libr::ParseStatus_PARSE_ERROR => Err(crate::Error::ParseSyntaxError {
+ libr::ParseStatus_PARSE_ERROR => Ok(ParseResult::SyntaxError {
Oh that's a great idea
crates/harp/src/parse.rs (outdated)
// Return type would be clearer if syntax error was integrated in `ParseResult`
pub fn parse_with_parse_data(text: &str) -> crate::Result<(ParseResult, ParseData)> {
    let srcfile = srcref::SrcFile::try_from(text)?;

    // Fill parse data in `srcfile` by side effect
    let status = parse_status(&ParseInput::SrcFile(&srcfile))?;

    let parse_data = ParseData::from_srcfile(&srcfile)?;

    Ok((status, parse_data))
}
- This method does not seem to be used
- The leading comment about "Return type would be clearer" seems outdated because syntax error is integrated into `ParseResult` now, I think
It's no longer used, and neither is `ParseData`, but I've left those for later in case they are needed. E.g. maybe it'll be useful to generate rowan trees from the R parser in a more efficient way than traversing the R AST.
I removed the dangling comment, good catch.
impl ParseDataNode {
    pub fn as_point_range(&self) -> std::ops::Range<(usize, usize)> {
        std::ops::Range {
            start: (self.line.start, self.column.start),
            end: (self.line.end, self.column.end),
        }
    }
}
Not used?
Also feels like it should return a typed `ParseDataRange` to be more self-documenting.
(I've noticed that the ruff team uses custom types for almost everything, and I really think it makes the code easier to read.)
In general I think it's a balance because too much vocabulary can also make code opaque instead of transparent.
But I'm willing to try and make more type aliases to see how this feels.
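For concreteness, the kind of wrapper being floated above might look like this; `ParseDataRange` is a hypothetical name used for illustration, not an existing harp type:

// Hypothetical typed range for parse data nodes, shown only to illustrate
// the self-documenting-types idea discussed above.
pub struct ParseDataRange {
    pub start: (usize, usize), // (line, column)
    pub end: (usize, usize),
}

impl ParseDataNode {
    pub fn as_parse_data_range(&self) -> ParseDataRange {
        ParseDataRange {
            start: (self.line.start, self.column.start),
            end: (self.line.end, self.column.end),
        }
    }
}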
pub fn slice(&self) -> &[SEXP] {
    unsafe {
        let data = harp::chr_cbegin(self.object.sexp);
        std::slice::from_raw_parts(data, self.len())
    }
}
Cool that this does `len * mem::size_of::<T>()` internally automatically
pub complete: Vec<std::ops::Range<usize>>,
pub incomplete: Option<std::ops::Range<usize>>,
pub error: Option<std::ops::Range<usize>>,
I think this is where we'd benefit from something like `TextRange`:
https://github.com/astral-sh/ruff/blob/bb12fe9d0c71fcb36a5000260f62dbf8411b74b4/crates/ruff_text_size/src/range.rs#L15-L19
I know we have `ArkRange` but that is tree-sitter specific, right? We probably will want our own typed `TextRange` (or similar) that we use everywhere.
I'm just now realizing a better name for this is probably really `LineRange` (`TextRange` is for a byte range in a document).
The fact that it took me until the test section of the PR to figure this out probably does mean a named wrapper like this would be very helpful:
struct LineRange {
    start: usize,
    end: usize,
}
You're right, documenting the contents with a type name would go a long way here.
`ArkRange` is Ark-specific rather than LSP- or TS-specific. It uses the row/col approach instead of byte offsets.
For byte offsets and ranges we may want to use https://github.com/rust-analyzer/text-size/
But that's storing u32 instead of usize, which makes it harder to interface with the rest of Rust, and we might not need such storage performance.
These types make it harder to provide generic algorithms, e.g. for `fn merge_overlapping<T>(ranges: Vec<std::ops::Range<T>>) -> Vec<std::ops::Range<T>>`. Not that we really need the genericity now...
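For reference, a generic version stays short when written against `std::ops::Range` directly; this is only a sketch of the kind of helper mentioned above, assuming the input is sorted by `start`:

// Merge overlapping (or touching) ranges. Sketch of the generic helper
// mentioned above, not code from this PR; assumes `ranges` is sorted by
// `start`.
fn merge_overlapping<T: PartialOrd>(ranges: Vec<std::ops::Range<T>>) -> Vec<std::ops::Range<T>> {
    let mut out: Vec<std::ops::Range<T>> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            if r.start <= last.end {
                // Extend the previous range instead of pushing a new one
                if r.end > last.end {
                    last.end = r.end;
                }
                continue;
            }
        }
        out.push(r);
    }
    out
}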
ah but the ruff types are directly imported from rust-analyzer's text-size crate!
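For reference, a rough sketch of what using that crate looks like (written from memory, so double-check the `text-size` docs before relying on exact signatures):

// Sketch only: byte-offset ranges with the `text-size` crate, as used by
// rust-analyzer and ruff. Offsets are u32-backed `TextSize` values.
use text_size::{TextRange, TextSize};

fn example() {
    let range = TextRange::new(TextSize::from(0u32), TextSize::from(4u32));
    assert!(range.contains(TextSize::from(2u32)));
    assert_eq!(u32::from(range.len()), 4);
}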
// Grab all code up to current line
let subset = &lines_r.slice()[..current_line + 1];
let subset = CharacterVector::try_from(subset)?;
I dislike that you have to do a (potentially large) redundant allocation on the 1st iteration, which most of the time is all you need to do (i.e. no incomplete and no error)
Since you only reference `&subset` below, it does seem like you could add a special case for the first iteration that remapped `lines_r` directly as `subset` so you don't need that first allocation.
I know that is uglier but it might be worth it.
How about this, with a `let mut first = true;` outside the loop?
// Parse within source file to get source references.
// Avoid allocation on the first iteration, since often there are no issues.
let srcfile = if first {
    first = false;
    harp::srcref::SrcFile::try_from(&lines_r)?
} else {
    // Grab all code up to current line
    let subset = &lines_r.slice()[..current_line + 1];
    let subset = CharacterVector::try_from(subset)?;
    harp::srcref::SrcFile::try_from(&subset)?
};
With the proposed approach we allocate one big string that is then split by R into a vector of lines (see the R implementation of `srcfilecopy()`).
We could avoid the unnecessary slicing though. I made it:
// Grab all code up to current line. We don't slice the vector in the
// first iteration as it's not needed.
let subset = if current_line == n_lines - 1 {
    lines_r.clone()
} else {
    CharacterVector::try_from(&lines_r.slice()[..=current_line])?
};
To support this, `CharacterVector` is now clonable (it does not deep-copy since none of our wrappers make owning guarantees for the wrapped data).
// Fill trailing whitespace (between complete and incomplete|error|eof)
let last_boundary = filled.last().map(|r| r.end).unwrap_or(0);
let next_boundary = boundaries
    .incomplete
    .as_ref()
    .or(boundaries.error.as_ref())
    .map(|r| r.start)
    .unwrap_or(n_lines);

for start in last_boundary..next_boundary {
    filled.push(range_from(start))
}
I guess there can't be trailing whitespace between incomplete/error and eof?
Like, this adds ranges for lines from `[end_last_complete, start_incomplete_or_error]`, but what about `[end_incomplete_or_error, eof]`? I guess `end_incomplete_or_error == eof`?
I'll make the comment clearer; it's trying to say exactly what you say: the trailing whitespace that needs to be transformed into complete inputs is between the complete expressions and the rest.
let boundaries = parse_boundaries("foo").unwrap();
#[rustfmt::skip]
assert_eq!(boundaries.complete, vec![
    std::ops::Range { start: 0, end: 1 },
]);
A custom type would also let us add a `new()` method, for something like `LineRange::new(0, 1)`
Good point, though that's arguably straying away from the standard style.
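For the record, the wrapper plus constructor being discussed could be as small as this (still hypothetical, not part of the merged PR):

// Hypothetical `LineRange` wrapper with a `new()` constructor, so tests can
// read `LineRange::new(0, 1)` instead of spelling out `std::ops::Range`.
#[derive(Debug, PartialEq, Eq)]
struct LineRange {
    start: usize,
    end: usize,
}

impl LineRange {
    fn new(start: usize, end: usize) -> Self {
        Self { start, end }
    }
}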
And a matching conversion method from `&[SEXP]`
Co-authored-by: Davis Vaughan <davis@rstudio.com>
Progress towards posit-dev/positron#1326.
`parse_boundaries()` uses the R parser to detect the line boundaries of complete, incomplete, and error inputs.
The boundaries are for lines of inputs rather than expressions. For instance, `foo; bar` has one input whereas `foo\nbar` has two inputs.
Invariants: sections are sorted in this order (there cannot be complete inputs after an incomplete or error one, there cannot be an incomplete input after an error one, and error inputs are always trailing).
Approach:
I originally thought I'd use the parse data artifacts created by side effects to detect boundaries of complete expressions in case of incomplete and invalid inputs (see infrastructure implemented in #508). The goal was to avoid non-linear performance as the number of lines increases. I didn't end up doing that because the parse data is actually unusable for more complex cases than what I tried during my exploration.
Instead, I had to resort to parsing line by line. I start with the entire set of lines and back up one line at a time until the parse fully completes. Along the way I keep track of the error and incomplete sections of the input. In the most common cases (valid inputs, short incomplete input at the end), this should only require one or a few iterations. The boundaries of complete expressions are retrieved from the source references of the parsed expressions (using infrastructure from #482) and then transformed to complete inputs.
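In pseudocode, the core of that backtracking loop looks roughly like the sketch below; `ParseOutcome` and the `parse_lines` callback are illustrative stand-ins, not the actual harp interface:

// Illustrative sketch of the line-by-line fallback described above: start
// from the full input and back up one line at a time until the parse fully
// completes. In the common case this is a single iteration.
enum ParseOutcome {
    Complete,
    Incomplete,
    SyntaxError,
}

fn last_complete_line(
    lines: &[&str],
    parse_lines: impl Fn(&[&str]) -> ParseOutcome,
) -> Option<usize> {
    (0..lines.len())
        .rev()
        .find(|&i| matches!(parse_lines(&lines[..=i]), ParseOutcome::Complete))
}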
Supporting infrastructure:
- New `CharacterVector::slice()` method that returns a `&[SEXP]`, and a corresponding `CharacterVector::TryFrom<&[SEXP]>` method. (Eventually `List` should gain those as well.) These methods make it easy to slice character vectors from the Rust side. I use this to shorten the vector of lines without having to start from a `&str` and reallocate `CHARSXP`s every time.
- `SrcFile` can now be constructed from a `CharacterVector` with `TryFrom`.
- `Vector::create()` is no longer `unsafe`.
- `ArkRange::TryFrom<SrcRef>` method, although I didn't end up using it.
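Putting those pieces together, a typical use from the Rust side looks roughly like this (return type and module paths are illustrative, not copied from the final implementation):

// Rough usage sketch of the new slicing APIs listed above: reuse the
// existing CHARSXPs instead of rebuilding the vector from a `&str`.
fn srcfile_up_to(lines_r: &CharacterVector, current_line: usize) -> crate::Result<SrcFile> {
    let head: &[SEXP] = &lines_r.slice()[..=current_line];
    let head = CharacterVector::try_from(head)?;
    SrcFile::try_from(&head)
}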