Source Salesforce: decode(chunk) produces garbled text #15950

Closed
amrael opened this issue Aug 25, 2022 · 3 comments

amrael commented Aug 25, 2022

Environment

  • Airbyte version: 0.40.0-alpha
  • OS Version / Instance: Ubuntu 18.04, Airbyte Cloud
  • Deployment: Plural, Airbyte Cloud
  • Source Connector and version: Salesforce 1.0.13
  • Step where error happened: Sync job

Current Behavior

Multi-byte UTF-8 characters appear to be mistakenly decoded as ISO-8859-1, which produces garbled text in the synced records.
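For illustration only, here is a minimal sketch of what UTF-8 bytes look like when they are decoded as ISO-8859-1. The sample string is an assumption and is not taken from the affected Salesforce data.

```python
# Hypothetical sample text; any non-ASCII string shows the same effect.
original = "こんにちは"
raw = original.encode("utf-8")  # 5 characters, 3 bytes each

# ISO-8859-1 maps every byte to exactly one character, so decoding never
# fails, but each multi-byte character turns into several wrong ones.
garbled = raw.decode("iso-8859-1")

# The damage is reversible as long as no bytes were dropped on the way.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))
print(recovered == original)
```

Because the wrong decode raises no error, this kind of garbling can go unnoticed until someone looks at the destination data.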

Expected Behavior

Multi-byte text should always be decoded as UTF-8.

Logs

Steps to Reproduce

  1. Pick a Salesforce object, such as OPPORTUNITY, that has at least several hundred records (more than 1 KB of data).
  2. Create a connection with the salesforce connector and any destination.
  3. Sync

The root cause might be this line of code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L297
Splitting the data into small chunks and decoding each chunk separately is unsafe, because a chunk boundary can fall in the middle of a multi-byte character. A UTF-8 character takes from one to four bytes, depending on the character.
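A minimal sketch of the problem and one possible way around it, using only the Python standard library. This is not the connector's actual code: the function names `decode_naive` and `decode_incremental`, the sample text, and the 4-byte chunk size are all assumptions chosen to force a split inside a character.

```python
import codecs

# Hypothetical sample: every character below is 3 bytes in UTF-8.
text = "日本語のテキスト"
data = text.encode("utf-8")

# A 4-byte chunk size (arbitrary) guarantees that chunk boundaries fall
# in the middle of some characters.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]


def decode_naive(chunks):
    # Decodes each chunk on its own; fails when a chunk ends mid-character.
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_incremental(chunks):
    # An incremental decoder buffers a trailing partial character and
    # completes it when the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the stream itself ends with an incomplete character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


try:
    decode_naive(chunks)
except UnicodeDecodeError as exc:
    print("naive decode failed:", exc)

print(decode_incremental(chunks) == text)
```

Another option would be to collect all bytes first and decode once at the end, at the cost of holding the whole response in memory.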

--
source: https://stackoverflow.com/a/10229225

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean [CJK] characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
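The byte lengths quoted above can be checked quickly in Python. The four sample characters are my own picks, one from each range:

```python
# One sample character per UTF-8 length class (hypothetical picks).
samples = [
    "A",           # U+0041, US-ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u65e5",      # U+65E5, a CJK character in the Basic Multilingual Plane
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Any chunk boundary that lands inside one of the 2-, 3- or 4-byte sequences leaves an incomplete character at the end of that chunk.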

Are you willing to submit a PR?

No

@amrael amrael added needs-triage type/bug Something isn't working labels Aug 25, 2022
@amrael amrael changed the title source_salesforce decode produces garbled text source_salesforce decode(chunk) produces garbled text Aug 25, 2022
@marcosmarxm marcosmarxm changed the title source_salesforce decode(chunk) produces garbled text Source Salesforce: decode(chunk) produces garbled text Aug 25, 2022
@marcosmarxm marcosmarxm added python Pull requests that update Python code connectors/source/salesforce team/connectors-python and removed needs-triage team/tse Technical Support Engineers labels Aug 25, 2022
@marcosmarxm
Member

Thanks for reporting this, @amrael. I added the issue to the connector team's backlog.

@amrael
Author

amrael commented Sep 12, 2022

@marcosmarxm
I tried downgrading the connector to v1.0.2 and v1.0.6.

  • 1.0.6
    • Failed with the error "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5653-5654: unexpected end of data"
  • 1.0.2
    • Succeeded

These results indicate that this part of the code is causing the issue: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L296-L297

@marcosmarxm
Member

@alexander-marquardt would you mind giving your opinion on this Salesforce issue?
