Stop HTML encoding unicode characters when serializing XForm #685

seadowg · 2022-05-24T16:17:04Z

What has been done to verify that this works as intended?

New test and verified it fixed the issue using Collect.

Why is this the best possible solution? Were any other approaches considered?

We could attempt to stop HTML encoding only for formid and version, but that would actually be more difficult as we'd have to change the way we serialize: KXmlSerializer only allows us to switch Unicode escaping on or off for the whole document (based on the desired output encoding). This solution also removes weirdness with emoji (and other high code point Unicode characters) being encoded as two HTML character references, which causes problems downstream for Central.
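As a rough illustration of the "two HTML characters" problem, here is a minimal Python sketch. It is not JavaRosa or KXmlSerializer code; it assumes the serializer escapes one UTF-16 code unit at a time, which matches the escaped values reported in this thread. The helper name `escape_non_ascii` is hypothetical.

```python
def escape_non_ascii(text: str) -> str:
    """Escape each non-ASCII UTF-16 code unit as a decimal character reference.

    Assumption: this mimics a serializer that works on 16-bit code units,
    so a character outside the Basic Multilingual Plane becomes two references.
    """
    data = text.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(chr(u) if u < 128 else f"&#{u};" for u in units)


print(escape_non_ascii("ok 😔"))   # escaped form: two references for one emoji
print("ok 😔".encode("utf-8"))     # unescaped form: the emoji is four UTF-8 bytes
```

With escaping switched off, the emoji is written once as a single UTF-8 sequence instead of as a pair of references.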

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

People doing analysis on submission XML directly will be affected as they might expect HTML encoded files (although this wasn't the case for Enketo submissions anyway). There could be other unforeseen consequences, so I'd advocate for us getting this into an early Collect beta and then announcing the change on the forum.

Do we need any specific form for testing your changes? If so, please attach one.

Any form that has Unicode characters anywhere and also submissions that include Unicode characters (like an Emoji in a text question)

lognaturel · 2022-05-25T17:34:10Z

To evaluate risk and potential downstream effects, @seadowg will look into:

Whether instances written by Collect are currently UTF-8 encoded
Whether instances written by either Enketo or Collect have anything in the encoding XML root attribute

lognaturel · 2022-05-25T20:47:04Z

Strangely, someone just reported running into a related issue. Enketo can't parse an XML submission with complex emojis. That led me to poke around some. No encoding on the XML root.

I believe then the resulting files are ASCII, which is a subset of UTF-8. Currently, submissions are guaranteed to contain only the 128 ASCII characters (which are also part of UTF-8). Actually UTF-8 encoding Unicode characters instead of writing their HTML codes would be a change for sure. We would probably want to add the encoding attribute as well. We know Central will be fine with this but it's hard to reason about possible implications.
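The "ASCII is a subset of UTF-8" point can be checked with a small Python sketch. The sample documents and the helper `is_ascii_only` are made up for illustration and are not taken from a real submission.

```python
def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 128 for b in data)


escaped = "<note>caf&#233;</note>"   # escaped form: only ASCII characters
unescaped = "<note>café</note>"      # unescaped form: contains a non-ASCII character

# An ASCII-only document has the same bytes whether it is written as ASCII or UTF-8.
print(escaped.encode("ascii") == escaped.encode("utf-8"))

# The unescaped document is still valid UTF-8, but it is no longer pure ASCII.
print(is_ascii_only(escaped.encode("utf-8")), is_ascii_only(unescaped.encode("utf-8")))
```

A consumer that already reads the escaped files as UTF-8 would see identical bytes; only a consumer that assumes pure ASCII would notice the change.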

lognaturel · 2022-05-25T21:39:29Z

A submission that fails has &#55357;&#56852; (UTF-16 surrogate pairs). But when I try to add the 😔 emoji my XML has 😔 in it.
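One plausible reason such a submission fails to parse is that XML does not allow character references to surrogate code points. The sketch below uses Python's standard-library parser as a stand-in; whether Enketo's parser fails for the same reason is an assumption, and the `<data>`/`<note>` document is invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the standard-library XML parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two references to UTF-16 surrogates: not legal XML characters.
print(parses("<data><note>&#55357;&#56852;</note></data>"))
# One reference to the actual code point U+1F614 (decimal 128532).
print(parses("<data><note>&#128532;</note></data>"))
# The literal character, as written when escaping is switched off.
print(parses("<data><note>😔</note></data>"))
```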

Similarly, the issue you saw was with 👍, right? When I add that in Collect it is 👍 and works end-to-end. You got &#55357;&#56397;. Seems like it's device-related which encoding is used? This explains why we haven't heard of this issue. I was pretty sure I'd tried emoji before and seen people use them.
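For anyone who needs to read already-escaped files, the pairs above can be recombined with the standard UTF-16 surrogate formula. This is a minimal sketch; `decode_surrogate_refs` is a hypothetical helper, not something Collect, Enketo, or Central provides.

```python
import re

# Candidate high (55296-56319) and low (56320-57343) surrogate references, in decimal.
_PAIR = re.compile(r"&#(5[56]\d{3});&#(5[67]\d{3});")


def decode_surrogate_refs(text: str) -> str:
    """Replace each escaped surrogate pair with the single character it encodes."""
    def join(match: re.Match) -> str:
        high, low = int(match.group(1)), int(match.group(2))
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        return match.group(0)  # not a valid pair, leave untouched
    return _PAIR.sub(join, text)


print(decode_surrogate_refs("&#55357;&#56852;"))  # the 😔 from the failing submission
print(decode_surrogate_refs("&#55357;&#56397;"))  # the 👍 from the earlier report
```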

I can't create a submission that fails from my device no matter what. 👍🏿👨‍👨‍👧‍👦🐼 -- all fine. See https://test.getodk.cloud/#/projects/149/forms/all-widgets/submissions and feel free to make additional submissions.

seadowg · 2022-05-26T15:54:51Z

> Whether instances written by Collect are currently UTF-8 encoded

Ok so after doing a bunch of (probably too frantic) research, it seems that the only way to flag that a file uses UTF-8 or ASCII (or something else) is to add a BOM (byte order mark) to the beginning of the file. It seems like that's not actually that common and most text editors etc. just assume UTF-8. Collect's instance/submission files don't have a BOM as far as I can see (using hexdump -n 3 -C <file> and then comparing to https://www.garykessler.net/library/file_sigs.html), and that seems pretty normal judging by Stack Overflow questions where people are having a hard time reading files WITH a BOM. My impression is that the best way of flagging "this file is UTF-8" is to use the encoding attribute in XML...
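The same BOM check can be scripted instead of comparing hexdump output by eye. This is a minimal sketch using the BOM constants from Python's `codecs` module; `detect_bom` is a hypothetical helper, and returning `None` only means no BOM is present, not that the file is or isn't UTF-8.

```python
import codecs

# UTF-32 must be checked before UTF-16 because its little-endian BOM
# starts with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = (
    ("utf-8", codecs.BOM_UTF8),
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf<data/>"))  # file written with a UTF-8 BOM
print(detect_bom(b"<data/>"))              # no BOM, as observed for Collect's files
```

To check a file on disk, pass its first few bytes, for example `detect_bom(open(path, "rb").read(4))`, where `path` is whatever instance file is being inspected.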

> Whether instances written by either Enketo or Collect have anything in the encoding XML root attribute

...but it doesn't seem like Collect (or Enketo) include that.

I'd definitely be happy for someone to double-check all my thinking here, but from all this I think we should do a few things:

Merge this PR (which as far as we can see creates UTF-8 files)
Add the encoding attribute to instance/submission files in Collect and Enketo
Announce on the forum that Collect submissions files will be UTF-8 and no longer have escaped unicode
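As a sketch of the second step, assuming "the encoding attribute" refers to the `encoding` pseudo-attribute of the XML declaration, a UTF-8 file that declares itself could look like the output below. This uses Python's standard library rather than the serializers in Collect or Enketo, and the `<data>`/`<note>` document and `serialize_with_declaration` helper are invented for the example.

```python
import xml.etree.ElementTree as ET


def serialize_with_declaration(root: ET.Element) -> bytes:
    """Serialize without escaping non-ASCII characters and declare UTF-8 up front."""
    body = ET.tostring(root, encoding="unicode")  # str, characters left unescaped
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + body).encode("utf-8")


root = ET.Element("data")
note = ET.SubElement(root, "note")
note.text = "👍"

payload = serialize_with_declaration(root)
print(payload.decode("utf-8"))

# A consumer can parse the bytes directly; the declaration tells it how to decode them.
print(ET.fromstring(payload).find("note").text)
```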

Escaping feels more scary to me as it creates more uncertainty downstream around what kind of HTML escaping has been used. If we're able to flag UTF-8 using standard XML means, we're making it a lot clearer to any consumers what to expect when reading the file. For the golden path of Collect/Enketo -> Central this shouldn't change anything (other than fixing problems with complex emoji), as people will most likely be using Central's CSV exports/OData APIs and these were already decoding the HTML encoded characters.

lognaturel · 2022-05-31T15:36:08Z

I agree with your reasoning. Let's make sure this is in our next beta.

Happy to write the forum post unless you'd rather take this one all the way now that you've done a deep dive!

seadowg · 2022-06-02T09:29:41Z

> Happy to write the forum post unless you'd rather take this one all the way now that you've done a deep dive!

Can do! I'll write it up as part of the release notes for the v2022.3 Beta so that it can go in GitHub Releases and the Forum (for maximum 👀).

seadowg added 2 commits May 24, 2022 17:14

Stop HTML encoding unicode characters when serializing XForm

a5e9e01

Add test to check XForm serialization preserves unicode characters

ca182a8

seadowg marked this pull request as ready for review May 25, 2022 12:35

seadowg requested a review from lognaturel May 25, 2022 12:35

lognaturel approved these changes May 31, 2022

lognaturel merged commit adde512 into getodk:master May 31, 2022

seadowg mentioned this pull request Jun 1, 2022

Upgrade JavaRosa getodk/collect#5155

Merged

3 tasks

seadowg deleted the utf branch June 2, 2022 09:27

lognaturel pushed a commit to lognaturel/javarosa that referenced this pull request Jul 13, 2022

Stop HTML encoding unicode characters when serializing XForm (getodk#685)

bf9385e
