[BUG] UTF-8 Character Encoding Issues in opensearchproject/data-prepper container #5238

Open
voronin-ilya opened this issue Dec 2, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@voronin-ilya

Bug Description
The opensearchproject/data-prepper container image mishandles UTF-8 characters when streaming data from DynamoDB to S3 buckets in NDJSON format. Each non-ASCII character is replaced with a question mark (?) in the output files.

Steps to Reproduce

  1. Set up data-prepper using the opensearchproject/data-prepper container image
  2. Create a DynamoDB table with items containing strings with non-ASCII characters (e.g., Mandarin, Tamil)
  3. Configure data-prepper to stream changes from the DynamoDB table to an S3 bucket using NDJSON format
  4. Observe the resulting S3 objects

Actual Behavior
All non-ASCII characters in the original DynamoDB data are replaced with question marks (?) in the S3 output files.

Expected Behavior
All UTF-8 characters, including non-ASCII characters, should be preserved in the output NDJSON files exactly as they appear in the source DynamoDB table.
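
To make the difference between the actual and expected behavior concrete, here is a minimal Python sketch of the symptom. It is not Data Prepper code, and the issue does not confirm the root cause; it only assumes that somewhere on the output path text is encoded with an ASCII-only charset that substitutes `?` for anything it cannot represent. The item and the helper name `to_ndjson_line` are hypothetical.

```python
import json


def to_ndjson_line(item: dict, encoding: str) -> bytes:
    """Serialize one record as an NDJSON line in the given encoding.

    Characters the encoding cannot represent are replaced with '?',
    which mirrors the symptom described in this report.
    """
    text = json.dumps(item, ensure_ascii=False)
    return text.encode(encoding, errors="replace") + b"\n"


# Hypothetical item standing in for a DynamoDB record.
item = {"id": "1", "greeting": "你好"}

lossy = to_ndjson_line(item, "ascii")     # actual behavior: '?' per character
faithful = to_ndjson_line(item, "utf-8")  # expected behavior: text preserved

print(lossy)
print(json.loads(faithful.decode("utf-8")) == item)
```

The replacement is irreversible: once the characters have been written as `?`, the original text cannot be recovered from the S3 object.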

Workaround
Adding the environment variable LC_ALL=C.UTF-8 to the container configuration resolves the issue. The container image should set this variable by default to ensure proper UTF-8 handling.
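
One way to confirm that the workaround took effect is to compare a few known source items against the NDJSON written to S3. The sketch below is only an illustration under stated assumptions: each output line is assumed to be the plain item as a JSON object with an `id` field, which may not match the real output layout, and `find_lossy_records` is a hypothetical helper, not part of Data Prepper.

```python
import json


def find_lossy_records(source_items, ndjson_bytes, key="id"):
    """Return the keys of source items that are missing or altered in the output."""
    output = {}
    for line in ndjson_bytes.decode("utf-8").split("\n"):
        if line.strip():
            record = json.loads(line)
            output[record[key]] = record
    return [item[key] for item in source_items if output.get(item[key]) != item]


# Hypothetical source items and a hypothetical damaged output object.
source = [{"id": "1", "name": "வணக்கம்"}, {"id": "2", "name": "plain"}]
damaged = b'{"id": "1", "name": "???????"}\n{"id": "2", "name": "plain"}\n'
intact = "".join(
    json.dumps(item, ensure_ascii=False) + "\n" for item in source
).encode("utf-8")

print(find_lossy_records(source, damaged))
print(find_lossy_records(source, intact))
```

Records containing only ASCII pass either way, so the check is only meaningful when the sample includes non-ASCII strings.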

@dlvenable
Member

We have had similar issues with the DynamoDB source and control characters. These might be related: #5027
