Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surrogate escape strings #143

Closed
pyrco opened this issue Sep 30, 2024 · 0 comments · Fixed by #144
Closed

Surrogate escape strings #143

pyrco opened this issue Sep 30, 2024 · 0 comments · Fixed by #144
Assignees

Comments

@pyrco
Copy link
Contributor

pyrco commented Sep 30, 2024

String record fields could hold unicode surogate escaped values. E.g. filenames from a filesystem that are not proper utf-8.

Python will surrogate escape these filenames to have them still be a proper Python string type while not losing any information on what the actual filename is.

We also want to preserve this information, so flow.record should deal with surrogate sequences in string fields. It currently converts Python strings to utf-8 encoded byte strings and v.v. when creating the binary record format (or other output formats), and it will throw a UnicodeDecodeError when encountering such strings or byte strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant