[BUG] UTF-8 Character Encoding Issues in opensearchproject/data-prepper container #5238

voronin-ilya · 2024-12-02T20:31:27Z

Bug Description
The opensearchproject/data-prepper container image mishandles UTF-8 characters when streaming data from DynamoDB to S3 buckets in NDJSON format. Non-ASCII characters are replaced with question marks (?) in the output files.

Steps to Reproduce

1. Set up data-prepper using the opensearchproject/data-prepper container image.
2. Create a DynamoDB table with items containing strings with non-ASCII characters (e.g., Mandarin, Tamil).
3. Configure data-prepper to stream changes from the DynamoDB table to an S3 bucket using NDJSON format.
4. Observe the resulting S3 objects.

Actual Behavior
All non-ASCII characters in the original DynamoDB data are replaced with question marks (?) in the S3 output files.

Expected Behavior
All UTF-8 characters, including non-ASCII characters, should be preserved in the output NDJSON files exactly as they appear in the source DynamoDB table.
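One way to check the expected behavior is to compare the source items against the NDJSON object downloaded from S3. The Python sketch below is only an illustration: the helper name find_mismatched_fields is hypothetical, and it assumes the output lines appear in the same order as the source records and use the same field names, which may not hold for a real pipeline.

```python
import json


def find_mismatched_fields(source_records, ndjson_bytes):
    """Return (line_index, field) pairs whose output value differs from the source.

    Assumes (hypothetically) that output lines are in source order and that
    field names are unchanged between DynamoDB and the NDJSON output.
    """
    mismatches = []
    text = ndjson_bytes.decode("utf-8")
    # Split on "\n" only, since NDJSON uses newline as the record separator.
    lines = [line for line in text.split("\n") if line]
    for index, (source, line) in enumerate(zip(source_records, lines)):
        output = json.loads(line)
        for field, value in source.items():
            if output.get(field) != value:
                mismatches.append((index, field))
    return mismatches


# Sample records with non-ASCII text, as in the reproduction steps.
source = [{"id": 1, "name": "தமிழ்"}, {"id": 2, "name": "abc"}]

# Output that preserves UTF-8, and output where characters became "?".
good_output = "".join(
    json.dumps(r, ensure_ascii=False) + "\n" for r in source
).encode("utf-8")
bad_output = b'{"id": 1, "name": "?????"}\n{"id": 2, "name": "abc"}\n'

good_result = find_mismatched_fields(source, good_output)
bad_result = find_mismatched_fields(source, bad_output)
```

An empty result means every compared field survived intact; any returned pair points at a line and field that were altered on the way to S3.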

Workaround
Adding the environment variable LC_ALL=C.UTF-8 to the container configuration resolves the issue. The container image should set this variable by default to ensure proper UTF-8 handling.
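The symptom is consistent with text being encoded using a charset that cannot represent non-ASCII characters, which is one plausible reason a UTF-8 locale fixes it. The thread does not confirm the root cause. The Python sketch below only illustrates that effect and is not Data Prepper code; the function name encode_ndjson_line and the sample record are hypothetical.

```python
import json


def encode_ndjson_line(record, charset):
    """Serialize one record as an NDJSON line and encode it with `charset`.

    errors="replace" stands in for an encoder that substitutes "?" for
    characters the target charset cannot represent.
    """
    line = json.dumps(record, ensure_ascii=False) + "\n"
    return line.encode(charset, errors="replace")


record = {"id": 1, "greeting": "你好"}

# With an ASCII charset the two Mandarin characters are lost.
ascii_bytes = encode_ndjson_line(record, "ascii")

# With UTF-8 the original text round-trips unchanged.
utf8_bytes = encode_ndjson_line(record, "utf-8")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
```

In the ASCII case each unmappable character becomes a single question mark, matching the output described in this report, while the UTF-8 case preserves the text.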


dlvenable · 2024-12-10T20:36:54Z

We have had similar issues with the DynamoDB source and control characters. These might be related: #5027

voronin-ilya added the bug and untriaged labels Dec 2, 2024

github-project-automation bot added this to Data Prepper Tracking Board Dec 2, 2024

github-project-automation bot moved this to Unplanned in Data Prepper Tracking Board Dec 2, 2024

dlvenable removed the untriaged label Dec 10, 2024
