Describe the bug
As described in NVIDIA/spark-rapids-benchmarks#170, the TPC-DS raw data files are in ISO-8859 format, but nds_transcode.py was reading them as UTF-8. The customer file has some strings with international characters (Ô and É). With the bug in nds_transcode, the GPU path passes these ISO-8859 bytes through unmodified, while the CPU CSV reader replaces each invalid UTF-8 byte with the replacement character � (U+FFFD, encoded in UTF-8 as 0xefbfbd).
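The byte-level difference can be illustrated outside Spark. The following is a minimal Python sketch, not the code path of either reader: it takes a hypothetical sample value containing Ô, encodes it as ISO-8859-1, and shows what a decoder that substitutes U+FFFD for invalid UTF-8 produces, next to what decoding with the correct charset produces.

```python
# Hypothetical sample value (not taken from the customer file): "CÔTE".
# In ISO-8859-1, Ô is the single byte 0xD4, which is not valid UTF-8 on its own.
raw = "C\u00d4TE".encode("iso-8859-1")

# Decoding as UTF-8 with replacement mimics a reader that substitutes
# U+FFFD for each invalid byte, as described for the CPU CSV reader.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the correct charset recovers the intended text.
correct = raw.decode("iso-8859-1")

print(raw.hex(" "))                      # bytes passed through unmodified
print(replaced.encode("utf-8").hex(" ")) # bytes after U+FFFD substitution
print(correct.encode("utf-8").hex(" "))  # bytes after a proper transcode
```

In this sketch the pass-through bytes contain d4, the replaced form contains ef bf bd, and the properly transcoded form contains c3 94, the UTF-8 encoding of Ô.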
Steps/Code to reproduce bug
I will attach the file iso-8859-example.csv.
In a spark shell:
Then use a tool like xxd to examine the binary data for the output files.
CPU:
GPU:
Note that the CPU output has ef bf bd where the GPU output has d4.
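As an alternative to reading xxd output by eye, the same comparison can be sketched in Python. The helper below, `summarize_bytes`, is a hypothetical name, and the two sample byte strings are made up for illustration; they are not the contents of the actual output files. It assumes the bytes of interest come from a text-based output and simply reports whether they are valid UTF-8 and how many U+FFFD sequences they contain.

```python
# UTF-8 encoding of the replacement character U+FFFD.
REPLACEMENT = b"\xef\xbf\xbd"


def summarize_bytes(data: bytes) -> dict:
    """Report a hex dump, the U+FFFD count, and UTF-8 validity for data."""
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "hex": data.hex(" "),
        "replacement_chars": data.count(REPLACEMENT),
        "valid_utf8": valid_utf8,
    }


# Hypothetical samples: one with the replacement character (CPU-like),
# one with the raw ISO-8859-1 byte 0xD4 passed through (GPU-like).
cpu_like = b"C\xef\xbf\xbdTE"
gpu_like = b"C\xd4TE"

print(summarize_bytes(cpu_like))
print(summarize_bytes(gpu_like))
```

For real files, the bytes could be obtained with `open(path, "rb").read()`, where `path` points at one of the output part files.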
Expected behavior
Ideally, we should produce the same output as the CPU.
This is a difference in the handling of invalid UTF-8 bytes in the input file (the result of reading an ISO-8859 file as UTF-8), so it's not clear we need to fix it. We might be able to document the difference instead.