Source Salesforce: decode(chunk) produces garbled text #15950

amrael · 2022-08-25T07:03:18Z

Environment

Airbyte version: 0.40.0-alpha
OS Version / Instance: Ubuntu 18.04, Airbyte Cloud
Deployment: Plural, Airbyte Cloud
Source Connector and version: Salesforce 1.0.13
Step where error happened: Sync job

Current Behavior

UTF-8 text appears to be mistakenly decoded as ISO-8859-1, which produces garbled characters.

Expected Behavior

Multi-byte text should always be decoded as UTF-8.

Logs

Steps to Reproduce

A Salesforce object such as OPPORTUNITY needs to contain hundreds of records or more (over 1 KB of data).
Create a connection with the Salesforce connector and any destination.
Run a sync.

The root cause might be this line of code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L297
Splitting the data into small chunks and decoding each chunk separately is risky, because a chunk boundary can fall in the middle of a multi-byte character. In UTF-8 a single character takes one to four bytes.
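
A minimal sketch of the failure mode and one possible way around it, using only the Python standard library. The helper names `decode_chunks_naive` and `decode_chunks_incremental` are hypothetical and are not taken from the connector. The sketch assumes the response body is valid UTF-8 that arrives as arbitrary byte chunks.

```python
import codecs


def decode_chunks_naive(chunks):
    # Decodes every chunk on its own; breaks when a chunk ends
    # in the middle of a multi-byte character.
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_chunks_incremental(chunks, encoding="utf-8"):
    # An incremental decoder buffers the trailing bytes of an
    # incomplete character until the next chunk supplies the rest.
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the stream ends with a truncated character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "東京,Zürich\n".encode("utf-8")
# Each of the first two characters is 3 bytes long, so cutting after
# byte 4 splits the second character across the two chunks.
chunks = [data[:4], data[4:]]

try:
    decode_chunks_naive(chunks)
except UnicodeDecodeError as exc:
    print("naive decoding failed:", exc)

print(decode_chunks_incremental(chunks))
```

Another option, under the same assumption, is to write the raw bytes to the temporary file and decode only when reading the file back as a whole.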

--
source: https://stackoverflow.com/a/10229225

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean [CJK] characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
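
The byte lengths quoted above can be checked with a short sketch. The sample characters are arbitrary picks, one from each range, and are written as escapes so the code points are unambiguous.

```python
# One sample character per UTF-8 length class.
samples = [
    "A",           # U+0041, US-ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u6771",      # U+6771, a CJK character in the Basic Multilingual Plane
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

Because the length varies per character, a fixed chunk size in bytes gives no guarantee that a chunk ends on a character boundary.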

Are you willing to submit a PR?

No

marcosmarxm · 2022-08-29T19:26:39Z

Thanks for reporting this, @amrael. I added the issue to the connector team's backlog.

amrael · 2022-09-12T01:25:57Z

@marcosmarxm
I tried downgrading the connector to v1.0.2 and v1.0.6.

1.0.6
- Failed with an error "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5653-5654: unexpected end of data"
1.0.2
- Succeeded

The result proves that this part, https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L296-L297 is causing the issue.
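
A hedged sketch of how the two symptoms could come from the same chunk boundary. `decode_with_fallback` is a hypothetical helper written for illustration; that the newer connector version falls back to ISO-8859-1 after a failed UTF-8 decode is an assumption here, not something confirmed in this thread.

```python
def decode_with_fallback(chunk: bytes) -> str:
    # Hypothetical: try UTF-8 first, then fall back to ISO-8859-1.
    # ISO-8859-1 accepts every byte value, so the fallback never raises;
    # it silently yields wrong characters instead.
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return chunk.decode("iso-8859-1")


data = "東京".encode("utf-8")   # 6 bytes: e6 9d b1 e4 ba ac
chunks = [data[:4], data[4:]]   # boundary falls inside the second character

# Strict per-chunk decoding: the first chunk ends with a truncated character.
try:
    chunks[0].decode("utf-8")
    strict_error = ""
except UnicodeDecodeError as exc:
    strict_error = str(exc)
print(strict_error)

# Per-chunk decoding with the fallback: no error, but the text is garbled.
garbled = "".join(decode_with_fallback(chunk) for chunk in chunks)
print(repr(garbled))
```

Under that assumption, a strict decoder would fail with "unexpected end of data", while a decoder with a fallback would complete the sync and emit garbled text.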

marcosmarxm · 2022-09-12T17:47:59Z

@alexander-marquardt do you mind giving your opinion on this Salesforce issue?

amrael added the needs-triage and type/bug (Something isn't working) labels Aug 25, 2022

octavia-squidington-iii added the team/triage, autoteam, community, and team/tse (Technical Support Engineers) labels and removed the team/triage label Aug 25, 2022

amrael changed the title ~~source_salesforce decode produces garbled text~~ source_salesforce decode(chunk) produces garbled text Aug 25, 2022

marcosmarxm changed the title ~~source_salesforce decode(chunk) produces garbled text~~ Source Salesforce: decode(chunk) produces garbled text Aug 25, 2022

marcosmarxm added the python (Pull requests that update Python code), connectors/source/salesforce, and team/connectors-python labels and removed the needs-triage and team/tse (Technical Support Engineers) labels Aug 25, 2022

Nakachi-S mentioned this issue Sep 14, 2022

🐛 Source Salesforce: Write binary file without decoding #16684

Closed

14 tasks

artem1205 self-assigned this Sep 27, 2022

artem1205 mentioned this issue Sep 28, 2022

🐛 Source Salesforce: fix response encoding #17314

Merged

19 tasks

artem1205 closed this as completed Sep 30, 2022
