-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decoupling byte-level encoding #535
Comments
Adding a dependency to |
While
I do have to admit that all of these issues are minor and I do not know why anyone would ever need to use succinct errors (other than cool error reporting), but the approach I'm proposing is the properly decoupled Haskell view of things. One thing to note is that I haven't looked deep into the structure of Hoehrmann's C-based decoder, but from what I see the by-the-book decoding is just a chain of up to thirteen comparisons, so I don't yet understand the need for a complex state machine here (other than code shortness of course, but Haskell isn't C). |
For performance reasons two array lookups are much better than up to 13 comparisons. Once |
Isn't this only true if the entire lookup table resides in L1 cache? Sure this will work fine for C parsers, but I don't know if any random Haskell parser interleaved with the algorithm can guarantee this. Also it's 1 comparison for 00..7F and 5 for 80..7FF, so for really simple strings even two array lookups in L1 cache seem like an overkill. Rolling a benchmark to compare the two approaches should be easy, so perhaps I should do that. |
The main blocker for this proposal is going to be performance. I'd be surprised if you can use your API to write a streaming JSON parser whose performance is comparable to using the There is an intentional trade off of a tiny bit of imprecision for a lot of performance. The parser state fits in a single byte (
Making that state unreachable is really the main point of your API, and as you mention it's unclear what the use case would be.
The fields are one word each. The expectation is that they are going to be unpacked in a tight loop that does not allocate. This is much cheaper than allocating a thunk for the partial codepoint to be evaluated only if it is used. If you don't need the partial code point (only doing validation), then you can use |
This is likely to be true irrespective of caches: bear in mind that in your case each comparison is a condition and involves 1 or 2 jump instructions depending on branch chosen. |
So after a week of tinkering I made a benchmark repository (link). The algorithms I wrote include no low-level magic, just inlines and strict copying. The benchmark timings can be found on the
Problems I could not resolve:
Also I wonder why For the record all of my benchmarks have been executed on a laptop CPU, so, as with all things cache-related, YMMV, and extra benchmark subjects are welcome. |
I concede that you can get your data structure to be inlined. But that relies on unrolling the loop yourself so you always start an iteration at a code point boundary. Performance-wise, the extra branching may have a detrimental effect on branch prediction. Your initial comment about IO made me assume you didn't want simdutf but I misunderstood. If the main loop uses simdutf then performance for the error branch is much less of a concern. I'm still not convinced a more fine grained API for UTF-8 is really better. I disagree that in comparison "
That sounds like a bug, right?
That's a feature request for simdutf. I don't know what the current status is, but it would indeed let us simplify UTF-8 parsing further. Also, another API you haven't mentioned is |
Yes, it should probably have its own issue. Can be replicated through the tests here.
If you wish to respect maximal subpart rule, an error encountered on the first byte results in byte consumption and an error on any successive byte does not. As such you need to track byte boundaries, that's four repeats with the array lookup algorithm. The full unroll is just as deep on the 4-byte branch, and every other branch is shallower than that.
I admit my phrasing on this point is incorrect, if anything the fact that it's in an I'm fine with it existing in a separate module with proper documentation, as it may indeed be useful for some highly specific parsers, but so far the benchmarks I linked above show it's not even performance-critical in this library.
I have, it's the third bullet point of this issue. A JSON parser needs to treat |
Having conceded that performance is not an issue, the only remaining difference with The trade-off is that you have to write a big tree of nested cases to effectively use that API, since every byte within a codepoint results in a different type of state. Those 40 lines of code correspond to these 7 lines of code in the text library. So even purely in terms of aesthetics ("the properly decoupled Haskell view of things.") it's a hard sell. The proposed API actually makes things more coupled than In that case, would making |
I don't think "lets" is a correct term here, you can weave any state you want into it, it's just a datatype.
The entire point is that it exposes all the innerworkings while abstracting all the hard parts. "Decoupled" doesn't mean "short" or "convenient", it just means you get the power to write whatever you want with no frills. It's obviously a low-level module, so people using it will be ready to spend five extra minutes and duplicate 30 lines.
It would definitely help with other people using it, but at this point I would rather carry around a 170 line module that does it in a much more streamlined fashion with predictable performance. This applies to the For the record you don't need a strong reason to deny this proposal, a simple "we don't have people for this, sorry" is enough. The reason I'm pushing for it is because I already have two different places I want to use it in and I don't want to toss it onto my personal pile of "should be on Hackage, but uploading is a nuisance, I haven't tested it enough and the previous maintainer is nowhere to be seen" projects. |
I'm just making sure that I'm not completely missing the point of your proposal. Beyond that, we'll indeed have to agree to disagree, unless another @haskell/text maintainer wants to chime in. |
There are three engines for UTF-8 validation in
I'm not sure what the supposed story for JavaScript backend: it might happen that it's better to use the naive engine instead of compiling C decoder from If you want to benchmark Haskell native implementations, pass |
There are ways to embellish I agree that a more fine-grained error reporting has its use cases, but I fell that it's better to iterate on it outside of |
While I was missing the fact that
The performance concern applies specifically to the sidecase of using
My original point was that I wanted to share error handling with Right now the proposal grinds down to the following points:
As the main point of this issue is adding algorithms that are not immediately needed within the library and cannot be abstracted into a separate boot library for management reasons, this issue is indeed dead in the water. If noone else has any strong opinions on this topic, I will close the issue at the end of this week. |
That's extremely strange, could you share a reproducer? Might be worse to raise a bug against
My point above was that there are situations when the naive decoder is the only available, and its performance matters. If one wants to make a statements about performance, this case should be measured by disabling
Makes sense to me.
That's largely true. Unfortunately, it's very difficult to iterate on a better interface without repeatedly breaking clients. There is not much demand although: usually clients treat pretty much any UTF-8 decoding error is just "hey, this data is not UTF-8", and precise offence reason matters less. I appreciate that JSON decoding is somewhat less forgiving. Anyways thanks for your efforts and interest! |
Okay, I was able to run the benchmarks without SIMD by The results are surprisingly bad for the array lookup algorithm.
I'm going to need someone to replicate this on their end and to check my findings for correctness, of course. |
For the sake of benchmark reproducibility I incorporated the changes in a fork. I have inlined everything best I could, the only thing I did not touch is The following list includes every single library benchmark that matches a pattern of 73620de -- HEAD 17bb010 -- HEAD without SIMD validation
|
Thanks for benchmarking @BurningWitness. Sorry, I'm extra busy this week, will take a look later. |
@BurningWitness sorry again, I didn't forget about your work here, but still no time to dive in properly. |
When writing a JSON parser (GaloisInc/json#17) I needed some way to decode UTF-8 and to my dismay I found all existing solutions do not fit my expectations:
GHC.Encoding.UTF8
andGHC.IO.Encoding
areIO
-based and I don't want that in a parser;Data.Text.Internal.Encoding.Utf8
, while pure, appears to both returnReject
as an error and has a rather complex interface;Data.Text.Encoding.*
andData.Text.Lazy.Encoding.*
are already parsers themselves, too high-level for this task;utf-string
'sCodec.Binary.UTF8.String
consumes and returns lists, so it isn't parser-compatible.I decided to handroll the UTF-8 decoding, which allowed me to categorize the errors (see Encoding.Mixed.Error) and resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see Codec.Web.JSON.Parse.String).
However the code I wrote can instead be generalized to:
Parsing then is simply unwrapping
UTF8
. This decouples character validation and conversion, the only part of decoding left is ensuring only the maximal subpart of an ill-formed sequence is consumed, which is the responsibility of the parser.My proposal is creating a separate package with a focus specifically on decoding/encoding UTF-8/UTF-16/UTF-32 on byte-level. Then
text
can drop some internal modules in favor of a simpler common interface.This proposal is however naive: I do not know whether GHC can inline these datatypes reliably or, indeed, at all. Based on my cursory reading of the Secrets of the Glasgow Haskell Compiler inliner paper it should, as each of these expressions is trivial.
This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in
GHC.Encoding.UTF8
) as all other algorithms are inIO
.Other concerns:
text
is a core library, so I assume an extra dependency can't just be added on a whim;utf
already exists and is deprecated. I don't know how hard reclaiming deprecated packages is.The text was updated successfully, but these errors were encountered: