Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix errors on non-UTF-8 encodings #4730

Merged
merged 1 commit into from
Mar 16, 2020
Merged

Conversation

smola
Copy link
Contributor

@smola smola commented Dec 1, 2019

Description

Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the
error would only be seen in wild usage (see issue #353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.

@smola
Copy link
Contributor Author

smola commented Dec 1, 2019

Note that the pull request removes tests about BlobHelper#lines encoding. These were introduced in #1211:

@alindeman Awkwardly enough, we've made an implicit guarantee that the data exposed through lines is encoded in its original form (in practice, ASCII-8BIT or UTF-8) even if the file is detected as a different encoding. Certain other code seems to actually depend on this.

Strictly speaking, this guarantee is still in place: lines result is ASCII-8BIT just as the files are read. However, that wouldn't be the case if an in-memory blob is created. I lack context of the original issue about this behavior, so it's hard to tell if this change is breaking any third party code.

Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in github-linguist#1211). So the
error would only be seen in wild usage (see issue github-linguist#353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
@lildude
Copy link
Member

lildude commented Dec 7, 2019

A quick test suggests this does the trick for the samples mentioned in #353. More extensive testing would be needed on github.com as I don't have the context around the comment you referenced @smola and I have no idea how this change would impact github.com without testing it.

I'll try schedule some time this week to test this locally and in our staging env.

@lildude
Copy link
Member

lildude commented Jan 6, 2020

Missed the deploy window before the Christmas freeze so will try test this sometime in the next two-to-three weeks, depending on availability etc.

@lildude lildude self-assigned this Jan 6, 2020
@lildude lildude mentioned this pull request Mar 11, 2020
4 tasks
@lildude
Copy link
Member

lildude commented Mar 16, 2020

I've done some testing over the past few days and haven't spotted any problems so I'm going to approve and merge this. Thanks @smola 🙇

@lildude lildude merged commit 7a7f01f into github-linguist:master Mar 16, 2020
@smola smola deleted the fix-utf8 branch June 5, 2020 16:05
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants