Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle utf-16 encoding gracefully when querying Snowflake #38

Merged
merged 2 commits into from
May 9, 2022

Conversation

notactuallytreyanastasio
Copy link
Contributor

@notactuallytreyanastasio notactuallytreyanastasio commented May 9, 2022

This stemmed from an issue we saw in production, a result we were
getting included some garbled characters because of a UTF-8 and UTF-16
difference.

We were getting the string:

"CVS Pharmacy®"

Now, Â = U+00C2 and ® = U+00AE in UTF-8.

However, what we really wanted was:

"CVS Pharmacy®"

As the  character is a mistaken extra byte.

Here, we introduce a function to properly handle this and return the
desired results using the latin1 encoding with UTF-8.

I have tried a couple ways of going about testing this, but the coverage
so far does not allow me an easy way to even hit this branch of the
functions in result or snowflex itself. I will confer with someone else
to get a real test written.

JIRA ticket is AUTO-4025

Here is validation run on this using just iex to demonstrate:

iex(1)> string = "CVS Pharmacy®"
"CVS Pharmacy®"
iex(2)>
nil
iex(3)> defmodule Convert do
...(3)>   def parse(string) do
...(3)>     string
...(3)>     |> to_charlist
...(3)>     |> handle_encoding
...(3)>   end
...(3)>
...(3)>   defp handle_encoding(data) when is_list(data) do
...(3)>     raw = :erlang.list_to_binary(data)
...(3)>     case :unicode.characters_to_binary(raw) do
...(3)>       utf8 when is_binary(utf8) -> utf8
...(3)>       _ -> :unicode.characters_to_binary(raw, :latin1)
...(3)>     end
...(3)>   end
...(3)>
...(3)>   defp handle_encoding(data), do: data
...(3)> end
{:module, Convert,
 <<70, 79, 82, 49, 0, 0, 6, 240, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 0, 236,
   0, 0, 0, 21, 14, 69, 108, 105, 120, 105, 114, 46, 67, 111, 110, 118, 101,
   114, 116, 8, 95, 95, 105, 110, 102, 111, 95, ...>>, {:handle_encoding, 1}}
iex(4)>
nil
iex(5)> IO.puts Convert.parse(string)
CVS Pharmacy®
:ok

Copy link
Contributor

@superhawk610 superhawk610 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be helpful to add a similar test for db_connection

@notactuallytreyanastasio
Copy link
Contributor Author

@superhawk610 will do good eye 👍

This stemmed from an issue we saw in production, a result we were
getting included some garbled characters because of a UTF-8 and UTF-16
difference.

We were getting the string:

```
"CVS Pharmacy®"
```

Now, Â = U+00C2 and ® = U+00AE in UTF-8.

However, what we really wanted was:

```
"CVS Pharmacy®"
```

As the  character is a mistaken extra byte.

Here, we introduce a function to properly handle this and return the
desired results using the latin1 encoding with UTF-8.

I have tried a couple ways of going about testing this, but the coverage
so far does not allow me an easy way to even hit this branch of the
functions in result or snowflex itself. I will confer with someone else
to get a real test written.
@notactuallytreyanastasio notactuallytreyanastasio force-pushed the handle-utf-16-from-snowflake branch from bcfa8ac to b9af515 Compare May 9, 2022 22:34
@notactuallytreyanastasio notactuallytreyanastasio force-pushed the handle-utf-16-from-snowflake branch from c926bd4 to f58c68d Compare May 9, 2022 22:38
@Ch4s3 Ch4s3 merged commit cfa79bd into master May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants