Conserve BOM and properly support UTF16 #6497

Alexis-Lapierre · 2023-03-31T13:07:43Z

Create an enum EncodingBom in helix-view/src/document.rs that represent the Document encoding with BOM information.

Previous behaviour:

$ printf '\xEF\xBB\xBFtest\n' > utf8
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text
$ hx utf8    # Only type ':x' inside helix
$ file utf8
utf8: ASCII text

New behaviour:

$ printf '\xEF\xBB\xBFtest\n' > utf8
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text
$ hx utf8    # Only type ':x' inside helix
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text

This close #6246.

I'm still new to Rust, so feedback to improve this pull request is appreciated.

Edit

After some work, I removed the enum EncodingBom and replaced it by a boolean value, as per the feedback from @archseer.

I added some more commits on top of it to support UTF16 on top of UTF8. I had to implement an enum Encoder to buffer the writes to UTF16 since encoding_rs does not support writing UTF16 files.

So this pull request now also close #6542 on top of #6246.

I tested my branch against UTF8 with and without BOM, and UTF16 LE and BE.

kirawi · 2023-03-31T15:47:51Z

encoding_rs looks to also handle UTF-16LE and UTF-16BE: https://docs.rs/encoding_rs/latest/src/encoding_rs/lib.rs.html#3757

pascalkuthe · 2023-03-31T15:51:15Z

yeah:

    /// Performs non-incremental BOM sniffing.
    ///
    /// The argument must either be a buffer representing the entire input
    /// stream (non-streaming case) or a buffer representing at least the first
    /// three bytes of the input stream (streaming case).
    ///
    /// Returns `Some((UTF_8, 3))`, `Some((UTF_16LE, 2))` or
    /// `Some((UTF_16BE, 2))` if the argument starts with the UTF-8, UTF-16LE
    /// or UTF-16BE BOM or `None` otherwise.
    ///
    /// Available via the C wrapper.
    #[inline]
    pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)> {
        if buffer.starts_with(b"\xEF\xBB\xBF") {
            Some((UTF_8, 3))
        } else if buffer.starts_with(b"\xFF\xFE") {
            Some((UTF_16LE, 2))
        } else if buffer.starts_with(b"\xFE\xFF") {
            Some((UTF_16BE, 2))
        } else {
            None
        }
    }
    ```

Alexis-Lapierre · 2023-03-31T16:21:49Z

At the start I implemented UTF16 support with this merge request, but I reverted it (8e38adf) because editing UTF16 on master mess up the file.

Here is the behaviour I have on master:

$ hx --version
helix 23.03 (3cf03723)
$ file /tmp/utf16.txt
/tmp/utf16.txt: Unicode text, UTF-16, little-endian text
$ hx /tmp/utf16.txt # Only type ':x' inside the editor
$ file /tmp/utf16.txt
/tmp/utf16.txt: ISO-8859 text

utf16.txt was downloaded from https://github.com/stain/encoding-test-files/blob/master/utf16.txt.

If it is reproducible on more machines, we should probably create an issue for that

Alexis-Lapierre · 2023-04-01T19:51:59Z

At the start I implemented UTF16 support with this merge request, but I reverted it (8e38adf) because editing UTF16 on master mess up the file.

Here is the behaviour I have on master:
$ hx --version
helix 23.03 (3cf03723)
$ file /tmp/utf16.txt
/tmp/utf16.txt: Unicode text, UTF-16, little-endian text
$ hx /tmp/utf16.txt # Only type ':x' inside the editor
$ file /tmp/utf16.txt
/tmp/utf16.txt: ISO-8859 text
utf16.txt was downloaded from https://github.com/stain/encoding-test-files/blob/master/utf16.txt.

If it is reproducible on more machines, we should probably create an issue for that

I created #6542 to explain in more details the problem I encountered while writing this pull request for UTF16

kirawi · 2023-04-01T22:06:18Z

encoding_rs only supports decoding of UTF-16: hsivonen/encoding_rs#31 (comment)

pascalkuthe · 2023-04-01T22:12:48Z

encoding_rs only supports decoding of UTF-16: hsivonen/encoding_rs#31 (comment)

to be fair std supports encoding utf-16 so it shouldn't be too hard to roll our encoder: hsivonen/encoding_rs#31 (comment) (can be done more efficiently and would look different in our case, simply iterating the doc chars and converting each to utf16 should work decently enough)

helix-view/src/document.rs

Alexis-Lapierre · 2023-04-04T12:10:32Z

NB: I'm waiting for #6549 to be merged to remove aacf4e3 from this merge request, so that helix support utf8 and utf16 bom

Alexis-Lapierre · 2023-04-09T20:27:49Z

NB: I'm waiting for #6549 to be merged to remove aacf4e3 from this merge request, so that helix support utf8 and utf16 bom

I implemented the proper way to save files in UTF16 with buffering in this commit: 642d9e7 .

pascalkuthe

This LGTM. The utf16 encoding implementation is exactly what I had in mind and endsup less disruptive to the codebase (while being faster) and feels more inline with what is actually changed (adding an.improved encoder instead of adding special cases to the code that uses the encoder). This could use some integration tests tough both for the bom and encoding handeling. This kind of stuff should ne pretty easy to write tests for. I don't have a lot of utf-16 text on hand to test this manually :D

Alexis-Lapierre · 2023-04-17T10:29:42Z

This LGTM. The utf16 encoding implementation is exactly what I had in mind and endsup less disruptive to the codebase (while being faster) and feels more inline with what is actually changed (adding an.improved encoder instead of adding special cases to the code that uses the encoder). This could use some integration tests tough both for the bom and encoding handeling. This kind of stuff should ne pretty easy to write tests for. I don't have a lot of utf-16 text on hand to test this manually :D

Thank you for your comment, I am waiting on the second reviewer to see if I need to modify anything in my pull request.
I think I'll make a follow up pull request adding the unit tests after this pr.

pascalkuthe · 2023-04-17T13:16:52Z

We usually keep tests in the same PR if they are not too complex so I would prefer if you add them in this PR. The tests should just be simple integration tests (in helix-term/src/tests) that open an utf16 file with the editor and check that the text of the active document is equal to what is expected and that the file remains unchanged after saving (you can do versions with and without BOM) so even if something about the code changes these tests would not interact with your API so the risk of duplicate work is fairly small IMO.

Alexis-Lapierre · 2023-04-18T13:46:33Z

OK thanks for the feedback. Expect an update on this Pull Request during this week.

Alexis-Lapierre · 2023-04-23T19:56:50Z

I've added 1ec6cdd

The commit add a simple integration test that repeat the steps to reproduce #6542 and #6246.

pascalkuthe

thanks for adding those tests! LGTM now altough we will likely wait until after the bugfix release to merge

kekePower · 2023-04-26T12:38:50Z

This looks like it's working as expected. Thanks for you efforts!

% file *txt
utf16-helix-01.txt: Unicode text, UTF-16, little-endian text
utf16-nvim-01.txt:  Unicode text, UTF-16, little-endian text
utf16.txt:          Unicode text, UTF-16, little-endian text

I opened utx16.txt and wrote the file to disk using helix 23.03 with this patch and neovim from git master.

kirawi mentioned this pull request Apr 2, 2023

Support encoding UTF-16 #6549

Closed

archseer reviewed Apr 3, 2023

View reviewed changes

helix-view/src/document.rs Outdated Show resolved Hide resolved

helix-view/src/document.rs Outdated Show resolved Hide resolved

Alexis-Lapierre changed the title ~~UTF8 conserve BOM: No longer overwrite BOM byte if present on save~~ UTF conserve BOM: No longer overwrite BOM byte if present on save Apr 9, 2023

Alexis-Lapierre changed the title ~~UTF conserve BOM: No longer overwrite BOM byte if present on save~~ UTF Conserve BOM bytes and properly support UTF16 Apr 9, 2023

Alexis-Lapierre requested a review from archseer April 9, 2023 20:27

pascalkuthe previously approved these changes Apr 12, 2023

View reviewed changes

pascalkuthe added C-enhancement Category: Improvements E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer. labels Apr 13, 2023

Alexis-Lapierre added 6 commits April 23, 2023 18:53

Add enum EncodingBom to encapsulate encoding_rs Encoding struct

001fd12

Correctly parse and apply bom encoding

e336ce7

Add comments to describe added methods

c244588

Can now write UTF-16 file correctly

a46ffdf

Remove EncodingBom enum, replace information by boolean

25fb071

Properly support UTF 16

3c61d26

Alexis-Lapierre dismissed pascalkuthe’s stale review via cd88d7b April 23, 2023 19:40

Integration test UTF8/16 file save

1ec6cdd

pascalkuthe approved these changes Apr 23, 2023

View reviewed changes

pascalkuthe added this to the 23.03.1 milestone Apr 25, 2023

the-mikedavis approved these changes Apr 30, 2023

View reviewed changes

pascalkuthe merged commit b0b3f45 into helix-editor:master Apr 30, 2023

pascalkuthe changed the title ~~UTF Conserve BOM bytes and properly support UTF16~~ Conserve BOM and properly support UTF16 Apr 30, 2023

Alexis-Lapierre deleted the conserve_bom_byte branch May 1, 2023 16:48

Triton171 pushed a commit to Triton171/helix that referenced this pull request Jun 18, 2023

Conserve BOM and properly support UTF16 (helix-editor#6497)

ea91edf

wes-adams pushed a commit to wes-adams/helix that referenced this pull request Jul 4, 2023

Conserve BOM and properly support UTF16 (helix-editor#6497)

54added

smortime pushed a commit to smortime/helix that referenced this pull request Jul 10, 2024

Conserve BOM and properly support UTF16 (helix-editor#6497)

eb616da

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conserve BOM and properly support UTF16 #6497

Conserve BOM and properly support UTF16 #6497

Alexis-Lapierre commented Mar 31, 2023 •

edited

Loading

kirawi commented Mar 31, 2023

pascalkuthe commented Mar 31, 2023

Alexis-Lapierre commented Mar 31, 2023 •

edited

Loading

Alexis-Lapierre commented Apr 1, 2023 •

edited

Loading

kirawi commented Apr 1, 2023

pascalkuthe commented Apr 1, 2023

Alexis-Lapierre commented Apr 4, 2023

Alexis-Lapierre commented Apr 9, 2023

pascalkuthe left a comment •

edited

Loading

Alexis-Lapierre commented Apr 17, 2023

pascalkuthe commented Apr 17, 2023 •

edited

Loading

Alexis-Lapierre commented Apr 18, 2023

Alexis-Lapierre commented Apr 23, 2023

pascalkuthe left a comment

kekePower commented Apr 26, 2023

Conserve BOM and properly support UTF16 #6497

Conserve BOM and properly support UTF16 #6497

Conversation

Alexis-Lapierre commented Mar 31, 2023 • edited Loading

Edit

kirawi commented Mar 31, 2023

pascalkuthe commented Mar 31, 2023

Alexis-Lapierre commented Mar 31, 2023 • edited Loading

Alexis-Lapierre commented Apr 1, 2023 • edited Loading

kirawi commented Apr 1, 2023

pascalkuthe commented Apr 1, 2023

Alexis-Lapierre commented Apr 4, 2023

Alexis-Lapierre commented Apr 9, 2023

pascalkuthe left a comment • edited Loading

Choose a reason for hiding this comment

Alexis-Lapierre commented Apr 17, 2023

pascalkuthe commented Apr 17, 2023 • edited Loading

Alexis-Lapierre commented Apr 18, 2023

Alexis-Lapierre commented Apr 23, 2023

pascalkuthe left a comment

Choose a reason for hiding this comment

kekePower commented Apr 26, 2023

Alexis-Lapierre commented Mar 31, 2023 •

edited

Loading

Alexis-Lapierre commented Mar 31, 2023 •

edited

Loading

Alexis-Lapierre commented Apr 1, 2023 •

edited

Loading

pascalkuthe left a comment •

edited

Loading

pascalkuthe commented Apr 17, 2023 •

edited

Loading