Conserve BOM and properly support UTF16 #6497
yeah:
```rust
/// Performs non-incremental BOM sniffing.
///
/// The argument must either be a buffer representing the entire input
/// stream (non-streaming case) or a buffer representing at least the first
/// three bytes of the input stream (streaming case).
///
/// Returns `Some((UTF_8, 3))`, `Some((UTF_16LE, 2))` or
/// `Some((UTF_16BE, 2))` if the argument starts with the UTF-8, UTF-16LE
/// or UTF-16BE BOM or `None` otherwise.
///
/// Available via the C wrapper.
#[inline]
pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)> {
    if buffer.starts_with(b"\xEF\xBB\xBF") {
        Some((UTF_8, 3))
    } else if buffer.starts_with(b"\xFF\xFE") {
        Some((UTF_16LE, 2))
    } else if buffer.starts_with(b"\xFE\xFF") {
        Some((UTF_16BE, 2))
    } else {
        None
    }
}
```
At the start I implemented UTF16 support with this merge request, but I reverted it (8e38adf) because editing UTF16 on master messes up the file. Here is the behaviour I have on master:
$ hx --version
helix 23.03 (3cf03723)
$ file /tmp/utf16.txt
/tmp/utf16.txt: Unicode text, UTF-16, little-endian text
$ hx /tmp/utf16.txt # Only type ':x' inside the editor
$ file /tmp/utf16.txt
/tmp/utf16.txt: ISO-8859 text
utf16.txt was downloaded from https://github.com/stain/encoding-test-files/blob/master/utf16.txt. If it is reproducible on more machines, we should probably create an issue for that.
I created #6542 to explain in more detail the problem I encountered with UTF16 while writing this pull request.
to be fair std supports encoding utf-16 so it shouldn't be too hard to roll our own encoder: hsivonen/encoding_rs#31 (comment) (can be done more efficiently and would look different in our case, simply iterating the doc chars and converting each to utf16 should work decently enough)
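A minimal sketch of the idea in the comment above, using only the standard library: iterate over the chars and convert each one to UTF-16 code units with `char::encode_utf16`. The `Endian` enum and the `encode_utf16_bytes` function are hypothetical names made up for this illustration; this is not the encoder that ended up in the pull request.

```rust
/// Byte order of the UTF-16 output (hypothetical helper type for this sketch).
#[derive(Clone, Copy)]
enum Endian {
    Little,
    Big,
}

/// Encode `text` as UTF-16 bytes by walking its chars one at a time.
/// No BOM is written here; the caller decides whether to emit one.
fn encode_utf16_bytes(text: &str, endian: Endian) -> Vec<u8> {
    // Each UTF-8 byte expands to at most two UTF-16 bytes.
    let mut out = Vec::with_capacity(text.len() * 2);
    // A char needs one or two 16-bit code units (two for a surrogate pair).
    let mut units = [0u16; 2];
    for ch in text.chars() {
        for unit in ch.encode_utf16(&mut units).iter() {
            let bytes = match endian {
                Endian::Little => unit.to_le_bytes(),
                Endian::Big => unit.to_be_bytes(),
            };
            out.extend_from_slice(&bytes);
        }
    }
    out
}
```

For example, `encode_utf16_bytes("A", Endian::Little)` yields the bytes `41 00`, and a char outside the Basic Multilingual Plane yields four bytes (a surrogate pair).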
This LGTM. The utf16 encoding implementation is exactly what I had in mind and ends up less disruptive to the codebase (while being faster) and feels more in line with what is actually changed (adding an improved encoder instead of adding special cases to the code that uses the encoder). This could use some integration tests though, both for the BOM and encoding handling. This kind of stuff should be pretty easy to write tests for. I don't have a lot of utf-16 text on hand to test this manually :D
Thank you for your comment, I am waiting on the second reviewer to see if I need to modify anything in my pull request.
We usually keep tests in the same PR if they are not too complex, so I would prefer that you add them in this PR. The tests should just be simple integration tests (in
OK, thanks for the feedback. Expect an update on this pull request this week.
Thanks for adding those tests! LGTM now, although we will likely wait until after the bugfix release to merge.
This looks like it's working as expected. Thanks for your efforts!
I opened utf16.txt and wrote the file to disk using helix 23.03 with this patch and neovim from git master.
Create an enum EncodingBom in helix-view/src/document.rs that represents the Document encoding with BOM information.
Previous behaviour:
New behaviour:
This closes #6246.
I'm still new to Rust, so feedback to improve this pull request is appreciated.
Edit
After some work, I removed the enum EncodingBom and replaced it with a boolean value, as per the feedback from @archseer.
I added some more commits on top of it to support UTF16 in addition to UTF8. I had to implement an enum Encoder to buffer the writes to UTF16 since encoding_rs does not support writing UTF16 files.
So this pull request now also closes #6542 on top of #6246.
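As a rough illustration of the approach described above (an encoding plus a boolean recording whether the file had a BOM), here is a std-only sketch. The names `Enc`, `bom_bytes` and `to_bytes` are hypothetical and chosen for this example; the actual Encoder in this pull request buffers writes and is structured differently.

```rust
/// Encodings covered by this sketch (hypothetical type, not the one in Helix).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Enc {
    Utf8,
    Utf16Le,
    Utf16Be,
}

const UTF8_BOM: &[u8] = b"\xEF\xBB\xBF";
const UTF16LE_BOM: &[u8] = b"\xFF\xFE";
const UTF16BE_BOM: &[u8] = b"\xFE\xFF";

/// Byte order mark for the given encoding.
fn bom_bytes(enc: Enc) -> &'static [u8] {
    match enc {
        Enc::Utf8 => UTF8_BOM,
        Enc::Utf16Le => UTF16LE_BOM,
        Enc::Utf16Be => UTF16BE_BOM,
    }
}

/// Serialize `text`, emitting the BOM only if the file had one when loaded
/// (`has_bom`), so that saving does not add or drop it.
fn to_bytes(text: &str, enc: Enc, has_bom: bool) -> Vec<u8> {
    let mut out = Vec::new();
    if has_bom {
        out.extend_from_slice(bom_bytes(enc));
    }
    match enc {
        // Rust strings are already UTF-8.
        Enc::Utf8 => out.extend_from_slice(text.as_bytes()),
        Enc::Utf16Le => {
            for unit in text.encode_utf16() {
                out.extend_from_slice(&unit.to_le_bytes());
            }
        }
        Enc::Utf16Be => {
            for unit in text.encode_utf16() {
                out.extend_from_slice(&unit.to_be_bytes());
            }
        }
    }
    out
}
```

With `has_bom` set to `true`, a UTF-16 LE file starts with `FF FE` again after saving; with `false`, the output contains only the encoded text.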
I tested my branch against UTF8 with and without BOM, and UTF16 LE and BE.