
Revisit UTF-8 validation #136

Open
essen opened this issue Feb 26, 2024 · 2 comments

Comments


essen commented Feb 26, 2024

The code in cowlib/src/cow_ws.erl, lines 581 to 588 at cc04201:

%% Based on the Flexible and Economical UTF-8 Decoder algorithm by
%% Bjoern Hoehrmann <bjoern@hoehrmann.de> (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/).
%%
%% The original algorithm has been unrolled into all combinations of values for C and State,
%% each with its own clause. The common clauses were then grouped together.
%%
%% This function returns 0 on success, 1 on error, and 2..8 on incomplete data.
validate_utf8(<<>>, State) -> State;

was written a decade ago. The VM has changed a lot since then. The JSON PR in OTP validates UTF-8 differently, in a way that may be faster: erlang/otp#8111
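For reference, here is a minimal scalar UTF-8 validator in C that illustrates the rules the DFA above enforces: continuation-byte shape, overlong encodings, surrogates, and the U+10FFFF ceiling. This is a simplified sketch, not Hoehrmann's table-driven decoder (the function name `utf8_valid` is made up here), and unlike the Erlang function above it validates a complete buffer rather than returning a continuation state for incomplete data:

```c
#include <stdint.h>
#include <stddef.h>

/* Return 1 if buf[0..len) is valid UTF-8, 0 otherwise.
   Rejects stray continuation bytes, overlong encodings,
   UTF-16 surrogates, and code points above U+10FFFF. */
int utf8_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        if (b < 0x80) { i += 1; continue; }     /* ASCII fast path */
        size_t n; uint32_t cp, min;
        if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return 0;                          /* invalid lead byte */
        if (i + n > len) return 0;              /* truncated sequence */
        for (size_t j = 1; j < n; j++) {
            if ((buf[i + j] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        if (cp < min) return 0;                 /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        if (cp > 0x10FFFF) return 0;            /* out of range */
        i += n;
    }
    return 1;
}
```

The Erlang version additionally threads the DFA state across fragmented frames, which a buffer-at-a-time sketch like this does not capture.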

@codeadict

For extra info, there has also been discussion about adding a C BIF to the BEAM that uses this algorithm: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/


essen commented Mar 8, 2024

OK, I took a long look at all the discussion about UTF-8 validation that I had missed (erlang/otp#6576 in particular is fairly interesting). Thank you.

As far as SIMD goes, I am open to believing it could be a better alternative, but that remains to be proven for use within Erlang. Note that some strings can be overly long, so the implementation would need to account for that. This might make it not as good as initially hoped.

The Elixir PR adding a fast_ascii option sounds good, but as far as Cowboy is concerned, users who want to skip this validation (because it will be done when decoding JSON, for example) should use a binary frame. Other users who do use text frames are more likely to use more than just ASCII. At least that's what I've experienced.
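For illustration, the kind of fast-ASCII prefilter such an option implies can be sketched in C: test eight bytes at a time for a set high bit, and only fall back to full UTF-8 validation when one is found. This is a generic sketch of the idea, not the Elixir PR's implementation, and the function name `all_ascii` is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if buf[0..len) is pure 7-bit ASCII (and therefore trivially
   valid UTF-8), 0 if any byte has its high bit set. */
int all_ascii(const uint8_t *buf, size_t len) {
    size_t i = 0;
    /* Check 8 bytes per iteration: any high bit trips the mask. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8); /* avoids unaligned-access UB */
        if (w & 0x8080808080808080ULL) return 0;
    }
    /* Remaining tail bytes, one at a time. */
    for (; i < len; i++)
        if (buf[i] & 0x80) return 0;
    return 1;
}
```

A caller would run the full UTF-8 validator only on payloads (or chunks) where this returns 0, which is where the fast path pays off for mostly-ASCII text.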

So for now this ticket is about refreshing the implementation of the algorithm rather than switching to a different one. But it's possible that I missed something; I haven't actually started working on this and it is not yet a priority.
