-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop HTML encoding unicode characters when serializing XForm #685
Conversation
To evaluate risk and potential downstream effects, @seadowg will look into:
|
Strangely, someone just reported running into a related issue. Enketo can't parse an XML submission with complex emojis. That led me to poke around some. No encoding on the XML root. I believe then the resulting files are ASCII which is a subset of UTF-8. Currently all characters in submissions are guaranteed to only include the 128 ASCII chars (which are also part of UTF-8). Actually UTF-8 encoding unicode characters instead of writing their HTML codes would be a change for sure. We would probably want to add the |
A submission that fails has Similarly, the issue you saw was with 👍, right? When I add that in Collect it is I can't create a submission that fails from my device no matter what. 👍🏿👨👨👧👦🐼 -- all fine. See https://test.getodk.cloud/#/projects/149/forms/all-widgets/submissions and feel free to make additional submissions. |
Ok so after doing a bunch of (probably too frantic) research, it seems that the only way to flag that a file uses UTF-8 or ASCII (or something else) is to add a BOM (byte-order marker) to the beginning of the file. It seems like that's not actually that common and most text editors etc just assume UTF-8. Collect's instance/submission files don't have a BOM as far as I can see (using
...but it doesn't seem like Collect (or Enketo) include that. I'd definitely be happy for someone to double-check all my thinking here, but from all this I think we should do a few things:
Escaping feels more scary to me as it creates more uncertainty downstream around what kind of HTML escaping as been used. If we're able to flag UTF-8 using standard XML means, we're making it a lot cleaner to any consumers what to expect when reading the file. For the golden path of Collect/Enketo -> Central this shouldn't change anything (other than fixing problems with complex emoji) as Central as people will most likely be using CSV exports/OData APIs and these were already decoding the HTML encoded characters. |
I agree with your reasoning. Let's make sure this is in our next beta. Happy to write the forum post unless you'd rather take this one all the way now that you've done a deep dive! |
Can do! I'll write it up as part of the release notes for the v2022.3 Beta so that it can go in Github Releases and the Forum (for maximum 👀). |
Closes getodk/collect#4554
What has been done to verify that this works as intended?
New test and verified it fixed the issue using Collect.
Why is this the best possible solution? Were any other approaches considered?
We could attempt to only stop HTML encoding for
formid
andversion
, but that would actually be more difficult as we'd have to change the way we serialize -KXmlSerializer
only allows us to switch Unicode escaping on or off for the whole document (based on the desired output encoding). This solution also removes weirdness with Emoji (and other high digit Unciode characters) being encoded to two HTML characters which causes problems downstream for Central.How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
People doing analysis on submission XML directly will be affected as they might expect HTML encoded files (although this wasn't the case for Enketo submissions anyway). There could be other unforeseen consequences, so I'd advocate for us getting this into an early Collect beta and then announcing the change on the forum.
Do we need any specific form for testing your changes? If so, please attach one.
Any form that has Unicode characters anywhere and also submissions that include Unicode characters (like an Emoji in a text question)