Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lexing and parsing for all WGSL number literal types #1184

Merged
merged 14 commits into from
Aug 24, 2021

Conversation

hcsch
Copy link
Contributor

@hcsch hcsch commented Aug 10, 2021

I tried to implement lexing for all different types of number literals
as described by the current working draft and editor's draft
(they are the same for the literal definitions at the time of writing).
This should bring naga closer to being able to parse all kinds of number literals,
partially resolving #866 for example.

The next step would be to also allow for parsing of these new kinds of number literals,
as the currently used FromStr::parse won't work for hexadecimal literals (float or integer), see rust-lang/rfcs#1098.
The relevant code for this seems to be here:

naga/src/front/wgsl/mod.rs

Lines 1027 to 1060 in e384bce

fn get_constant_inner<'a>(
word: &'a str,
ty: char,
width: &'a str,
token: TokenSpan<'a>,
) -> Result<ConstantInner, Error<'a>> {
let span = token.1;
let value = match ty {
'i' => word
.parse()
.map(crate::ScalarValue::Sint)
.map_err(|e| Error::BadI32(span.clone(), e))?,
'u' => word
.parse()
.map(crate::ScalarValue::Uint)
.map_err(|e| Error::BadU32(span.clone(), e))?,
'f' => word
.parse()
.map(crate::ScalarValue::Float)
.map_err(|e| Error::BadFloat(span.clone(), e))?,
_ => unreachable!(),
};
Ok(crate::ConstantInner::Scalar {
value,
width: if width.is_empty() {
4
} else {
match width.parse::<crate::Bytes>() {
Ok(bits) if (bits % 8) == 0 => Ok(bits / 8),
_ => Err(Error::BadScalarWidth(span, width)),
}?
},
})
}

I have added the test cases I had written up for #866 to tests/in/operators.wgsl since they seem to fit in there
(though the file name might need some bikeshedding, should they stay there).
I have not yet adjusted any of the output files for the test, as I'm somewhat unsure about what they'd look like.

While I have tried my best to make the code easy to understand, it has become somewhat unwieldy due to the state machinery required to (hopefully) correctly lex the literals.
This state machine code also still needs to see some testing / fuzzing, which I haven't done myself so far.
I have tested how naga fails for operators.wgsl and my small tests seem to indicate that the literals are lexed correctly.

For example, with the following WGSL code:

let hex_float_lit_0 : f32 = -0xa.p1;

the modified code reports the following error:

expected floating-point literal, found `-0xa.p1`

wheras the old code reports the following:

the type of `hex_float_lit_0` is expected to be [1]

An alternative to lexing the literals this way, with a manually written state machine, would be to use the regexes given in the WGSL spec directly (after correcting them, since they use + ambiguously for example) with some existing regex engine like https://github.com/rust-lang/regex, this would pull in additional dependencies, but might be worth it, to reduce maintenance efforts, should the spec change it's definitions.

@hcsch hcsch marked this pull request as ready for review August 10, 2021 20:54
Copy link
Member

@kvark kvark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very elaborate!

tests/in/operators.wgsl Outdated Show resolved Hide resolved
src/front/wgsl/lexer.rs Outdated Show resolved Hide resolved
src/front/wgsl/lexer.rs Outdated Show resolved Hide resolved
@hcsch
Copy link
Contributor Author

hcsch commented Aug 10, 2021

I just noticed that I broke the lexing of the width of the integer literal (i.e. the number after a type suffix, e.g. 32 after a u), the WGSL spec doesn't really have this concept though (at least for now).
Should I remove the width attribute from Token::Number, or should I make the lexing code accept that?

@kvark
Copy link
Member

kvark commented Aug 10, 2021

WGSL has types defined as "xW", where W is the number of bits. We definitely need to parse that.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 11, 2021

WGSL has types defined as "xW", where W is the number of bits. We definitely need to parse that.

I think I might be missing something in the spec. I was looking at https://gpuweb.github.io/gpuweb/wgsl/#literals so far. Could you perhaps point me to the right section in the spec, that describes this part of the language?

As far as I can tell (from reading parts of the naga code) the width field describes the width in bytes bits of a number.
Currently the WGSL language only seems to support 4-byte wide numeric scalar types (i32, u32, f32) (edit: and notably no width suffixes for int, uint and float literals).
Am I correct in these assumptions?

Edit2: The width returned in consume_number should probably always be "32" at the moment, right?

Edit3: I think part of the misunderstanding may be, that the suffixes for number literals are not the type names (i.e. i32, u32, f32), but rather a specific thing of their own. And currently only uints seem to have such a suffix (i.e. just u, no 32 after it).

@kvark
Copy link
Member

kvark commented Aug 11, 2021

Ok, yeah, the suffixes are not type names, they just look the same. Parsing a literal can just return the width of 4 (in bytes).

@hcsch
Copy link
Contributor Author

hcsch commented Aug 11, 2021

Currently the width of a Token::Number is a &str (likely due to the assumption of Rust-y suffixes), see

Number {
value: &'a str,
ty: char,
width: &'a str,
},

Should I just return "32" then? Since the width seems to be a string with the decimal digits for the number of bits, see

naga/src/front/wgsl/mod.rs

Lines 1551 to 1553 in 551b711

(Token::Number { value, ty, width }, _) => {
Self::get_constant_inner(value, ty, width, first_token_span)?
}

naga/src/front/wgsl/mod.rs

Lines 1051 to 1058 in 551b711

width: if width.is_empty() {
4
} else {
match width.parse::<crate::Bytes>() {
Ok(bits) if (bits % 8) == 0 => Ok(bits / 8),
_ => Err(Error::BadScalarWidth(span, width)),
}?
},

@kvark
Copy link
Member

kvark commented Aug 11, 2021

Just return an empty string there?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 11, 2021

With 9ff29df the only tests that still fail, fail because of the parsing code currently not being capable of handling hexadecimal integers and floats. And with c5817f0 the code compiles with Rust 1.43 (at least on my machine).
I'd still like some input for whether or not to keep the now unused syntactical verification code. Other than that I'm pretty happy with the code as it now.

@kvark
Copy link
Member

kvark commented Aug 11, 2021

syntactical verification code

which code is this?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 11, 2021

syntactical verification code

which code is this?

The following are only needed if the state machine should also check for validity due to stuff like negative zeros (not allowed), floats with a uint suffix, etc.:
https://github.com/hcsch/naga/blob/c5817f0374f69b8e2243d5b6b88c2a4ad1811b6c/src/front/wgsl/lexer.rs#L41
https://github.com/hcsch/naga/blob/c5817f0374f69b8e2243d5b6b88c2a4ad1811b6c/src/front/wgsl/lexer.rs#L51-L74

Edit: Except for negative zeros (which the spec does allow for hexadecimal integers and all kinds of floats, just not for decimal integers), these issues would all also fail at the parsing stage later on.

@kvark
Copy link
Member

kvark commented Aug 11, 2021

I don't have a clear idea on where to draw the line in responsibilities between the new state machine and the parser. Keeping this code around isn't bad, since it's fairly isolated. So it seems to me that we can proceed and then see if it can be simplified?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 11, 2021

There are still two TODO comments left with regards to the verification code. One where the verification code could be called and one suggesting that the verification code should probably use Results. Should I remove them so as to not pollute greps over the code-base or are they alright?

@kvark
Copy link
Member

kvark commented Aug 11, 2021

No need to remove. Keeping them is fine.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 12, 2021

I just took a quick look at the parsing side of things:
Hexadecimal integers are ok, with the small caveat of i32::from_str_radix and u32::from_str_radix not supporting the case of a 0x being between the minus sign and the digits (requiring a string allocation to remove the 0x while keeping the -, manual parsing, other trickery, or the use of a crate that does account for this).
Hex floats will probably require the use of an existing crate or porting of some parsing algorithm from another language (edit: as the Rust stdlib doesn't provide hex float parsing). The format doesn't seem too complicated to parse and convert into a value, but then again, I'm kinda careful around anything to do with floats (and their wonderful edge-cases).

@hcsch
Copy link
Contributor Author

hcsch commented Aug 18, 2021

I don't have a clear idea on where to draw the line in responsibilities between the new state machine and the parser. Keeping this code around isn't bad, since it's fairly isolated. So it seems to me that we can proceed and then see if it can be simplified?

How would you like the code to be simplified? I'm not sure the state machine part of the code (the giant match that is) can be simplified much further without accepting a broader or a narrower syntactical definition of the literals.

One avenue of simplification could be to parse the number literals into Rust types in the lexer already (since we already have the structure of the literal available here: hex or not?, decimal digits or not?, etc.). Searching around, some lexers seem to do this (i.e. have the token contain the value, not the string), but that would require returning a Result<Token, SomeErrorType> rather than just a Token.
This parsing of literals at the lexing stage would match what the lexer does for booleans.

Another option would be to just not care too much about the syntactical detail in the lexer and rely on the parser to do the validation of the syntax, but this would complicate the parser, as the syntax of WGSL is sufficiently different from that of Rust (and the parsing tools it offers).

@kvark
Copy link
Member

kvark commented Aug 19, 2021

Anything block you on this?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 19, 2021

Anything block you on this?

Well I guess I'd like to hear from you how you'd like to see the code structured (see previous comment from me). Personally I'd just keep the lexing code as-is and add some modifications to the parsing code to allow for hex integers. Hex float parsing could be done with something like hexf-parse, though v0.2.1 requires Rust ≥1.45, v0.1.0 is Rust ≥ 1.15. I haven't yet looked at the changes between these versions.
Then there is the String allocation for parsing of negative hexadecimal numbers with i/u32::from_str_radix, but I guess that is probably acceptable.

Edit: regarding hex floats, I could also just add an error reporting that hex floats are a NYI literal format for now.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 20, 2021

There's also next_uint_literal and the like here in the lexer code:

fn _next_float_literal(&mut self) -> Result<f32, Error<'a>> {
match self.next() {
(Token::Number { value, .. }, span) => {
value.parse().map_err(|e| Error::BadFloat(span, e))
}
other => Err(Error::Unexpected(other, ExpectedToken::Float)),
}
}
pub(super) fn next_uint_literal(&mut self) -> Result<u32, Error<'a>> {
match self.next() {
(Token::Number { value, .. }, span) => {
let v = value.parse();
v.map_err(|e| Error::BadU32(span, e))
}
other => Err(Error::Unexpected(other, ExpectedToken::Uint)),
}
}
pub(super) fn next_sint_literal(&mut self) -> Result<i32, Error<'a>> {
match self.next() {
(Token::Number { value, .. }, span) => {
value.parse().map_err(|e| Error::BadI32(span, e))
}
other => Err(Error::Unexpected(other, ExpectedToken::Sint)),
}
}

These do their own parsing, unaware of hexadecimal representations and should probably replaced with functions in the parsing code. They only seem to be used for integer literals as parameters to attributes like location.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 20, 2021

I think I'll just write up a commit with the changes I think this needs to allow for parsing & lexing of hex integers (hex float parsing can wait a bit). If something is off with that, you can just tell me then ^^

@kvark
Copy link
Member

kvark commented Aug 20, 2021

Yes, that sounds good, thank you!

@hcsch
Copy link
Contributor Author

hcsch commented Aug 20, 2021

Mmm... might take me a bit to make a reasonably correct parsing impl for integer literals because of the fun with non-negative / positive / possibly uint or sint, but no larger than i32::MAX attribute values. Currently naga handles this semi-correctly, parsing with those lexer parsing methods which ignore suffixes (sometimes uint literals are allowed, like for array element counts, for attributes it mostly isn't, with the exception of workgroup_size).
I'm gonna leave value restrictions (with the exception of non-negativity for fields that can be both uint or sint, but no larger than i32::MAX, since I need u32s there) to the validation stage.

(Edit: this is an issue because it complicates replacing the lexer parsing methods that kinda break separation of concerns)

@hcsch hcsch force-pushed the wgsl-number-literals branch from c5817f0 to e6efcfa Compare August 20, 2021 03:23
@hcsch hcsch changed the title Implement lexing for all WGSL number literal types Implement lexing and parsing for all WGSL number literal types Aug 20, 2021
@hcsch
Copy link
Contributor Author

hcsch commented Aug 20, 2021

Ok, with e6efcfa I've implemented parsing for hexadecimal integers. Now there are only two open points for this PR:

  • Hex float parsing. Do we use a crate for this, implement something of our own for this PR or defer it to a future PR and error on hex floats for now (what my code currently does)?
  • Non-hex negative signed zeros, which aren't allowed by the spec (currently there is a test case for this erroring, but not error is implemented for this yet). Do we drop this restriction or do we error on inputs violating it?

@hcsch hcsch force-pushed the wgsl-number-literals branch 2 times, most recently from 35e0360 to 0355553 Compare August 20, 2021 07:03
@kvark
Copy link
Member

kvark commented Aug 20, 2021

For hex parsing, the group hasn't figured out the full story yet - https://github.com/gpuweb/gpuweb/issues?q=is%3Aissue+is%3Aopen+hex+
The main concern here is friction: group decides to parse with one set of restrictions, but the crate disagrees. So we'd either have to duplicate some code, or tightly coordinate with crates authors.
If you know some of the authors, that would help. Do you have a recommendation?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 20, 2021

For hex parsing, the group hasn't figured out the full story yet - https://github.com/gpuweb/gpuweb/issues?q=is%3Aissue+is%3Aopen+hex+
The main concern here is friction: group decides to parse with one set of restrictions, but the crate disagrees. So we'd either have to duplicate some code, or tightly coordinate with crates authors.
If you know some of the authors, that would help. Do you have a recommendation?

First of all, thanks for bringing this to my attention, I was not aware of the fact that there is still bikeshedding happening at the time. I don't know any of the authors of the spec, I am in fact rather new to WGSL in general. I just wanted to use wgpu with WGSL shaders ported from my current GLSL shaders (since WGSL and SPIR-V are the two shader langs with first class support in wgpu).

Does naga already make guarantees about the backwards compatibility with regards to parsing and lexing? Since the spec still seems to be in flux, especially with something important like gpuweb/gpuweb#775, this doesn't really seem possible at them moment.
I understand that you worry about friction between the changing spec and the implementation, especially with regards to number literals, since they are a core part of a language (and stuff like b-5 possibly changing meaning, probably necessitates some breakage of existing shaders).
At the moment it seems to me, that floats are much more a target of friction than hex integers (with the exception of x vs X, which is an easy fix). And I doubt we can just not support (regular, non-hex) floats and wait till the spec has finalized them :P
Hex float however don't seem all that commonly used in shaders at the moment (to me), but that is more a vague feeling than anything close to a quantitative study on my side.

I think one thing to at least make people aware of the friction / possible changes would be to prominently display the current spec version that the implementation is based on in the README.md and similar places. That's not really a solution, but I at least makes it possible for people using naga to see why something might not match the newest spec (until after a future update), or why something changed after an update.

Sadly, I don't think the spec has something like the Redline editions of the IEEE specs (i.e. a page where a fully rendered spec has changed sections marked up accordingly). The next best thing is probably looking at the changes to the sources in the git repo.

I believe my implementation of the lexing is relatively easy to modify in terms of changes similar to accepting the exponent marker case-insensitively. Harder changes would be in the greediness of number parsing or the - changing to not being part of a number in some cases (or in general).

The best scenario for naga would probably be an author or somebody working closely with the authors of the spec writing issues / pinging contributors to naga in case of changes. But that person would likely have to know the naga implementation of the spec in order to know how the changes affect naga.

I can't really offer much valuable advice as I'm still rather inexperienced in the world of specs, spec implementations and especially implementations of specs in the drafting stage, but I hope my thoughts above are at least somewhat helpful (and sorry for the giant block of text).

@hcsch hcsch force-pushed the wgsl-number-literals branch from 0355553 to 1aa0d42 Compare August 22, 2021 13:24
@hcsch
Copy link
Contributor Author

hcsch commented Aug 22, 2021

I've now implemented parsing of hex floats with the hexf-parse crate. Version 0.2.1 of hexf-parse is compatible with 1.43 and the README.md/code/documentation/CI has been updated accordingly in the repo. I also added some fuzzing code that checks for panics and compares with libc::strtod which the seemingly very responsive maintainers / collaborators have rather promptly merged :D.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 22, 2021

I haven't updated the test output files in tests/out, since there are some changes from the master branch that have not yet been applied to the files. It'd probably be good if that was done in a commit separate from this PR, onto which I could then rebase.

@kvark
Copy link
Member

kvark commented Aug 23, 2021

I haven't updated the test output files in tests/out, since there are some changes from the master branch that have not yet been applied to the files. It'd probably be good if that was done in a commit separate from this PR, onto which I could then rebase.

This should not happen. I recently found and fixed a case for GLSL outputs. If you rebase, the test outputs should be all matching.

@hcsch
Copy link
Contributor Author

hcsch commented Aug 23, 2021

Ah yes, I've rebased onto master now and now the test outputs are up to date again. My changes do affect them at the moment in exactly one way:
My code parses the f32 literals as f32s instead of f64s as with the old code. The f32 are converted to f64s afterwards though, since that is what ScalarValue::Float uses. This causes something like 0.1 to become 0.10000000149011612 in the output, since the writer uses the doubles to write out the values and those show the imprecision of the f32s the value was parsed into.
Should I parse the values as f64s again to allow for prettier output or do I keep using f32s?
The benefit of parsing as f32s is having over-/underflows detected by naga, the downside is the above and that usable precision is lost in the case of a target language with 64-bit floats (though I'm pretty sure most if not all shader langs have f32s by default).

There's also the option of casting back down to an f32 on the back-end side for float output, where appropriate, this shouldn't result in any loss of precision from the original f32 value and would cause the value to be printed as intended.

@hcsch hcsch force-pushed the wgsl-number-literals branch from 1aa0d42 to d5820e3 Compare August 23, 2021 06:53
@hcsch hcsch force-pushed the wgsl-number-literals branch from d5820e3 to 1a4021c Compare August 23, 2021 07:26
@kvark
Copy link
Member

kvark commented Aug 23, 2021

Those f64 are internal, and the assumption is that we can always just cast it to a smaller size, like f32. So the example of 0.1 seems fine?

@hcsch
Copy link
Contributor Author

hcsch commented Aug 23, 2021

Yeah, I mean when interpreted as an f32 the value written out by the current wgsl back-end will end up being the same. I just wanted to make sure you are aware of the change in the parsing and output files. 1a4021c has them as the current code will produce them.

Other than that, I'd say the code is fine now. Do you have any points you'd like me to address still?

Copy link
Member

@kvark kvark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you! Just a few organizational things.

src/front/wgsl/lexer.rs Outdated Show resolved Hide resolved
src/front/wgsl/mod.rs Outdated Show resolved Hide resolved
src/front/wgsl/mod.rs Outdated Show resolved Hide resolved
@hcsch
Copy link
Contributor Author

hcsch commented Aug 23, 2021

Should I squash some commits together or even squash them all into one after you give your stamp of approval? Some commits like the removal of Rust 1.43 incompatible code are prime candidates for squashing, including the cleanup commits just now (which I mostly did so you can get a better idea of what I changed).

@kvark kvark merged commit 58a5b7d into gfx-rs:master Aug 24, 2021
@hcsch hcsch deleted the wgsl-number-literals branch August 24, 2021 06:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants