Allow `&#0;` character references #750

Mingun · 2024-06-01T15:52:13Z

We should either restrict all invalid characters, both in literal form and as character references, or none of them. Disallowing only one character is inconsistent. Checking literal forms means decoding and checking all of the input, which would hurt performance, and we are not ready to accept that performance loss for now. Users of the Reader API can perform their own checks.

Ironically, this is what was originally proposed in #496. After trying to finish that PR, I found that we have some unanswered questions (above) that we should work out before we can create a consistent solution. I think it will be tightly linked with #749.

I think the best we can do now is to not check the validity of any characters and let users do that themselves.
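
A user-side check of this kind could look roughly like the sketch below. It assumes the caller already holds decoded text (for example, the unescaped content of a text event) and tests it against the XML 1.0 `Char` production. The names `is_xml10_char` and `first_invalid_char` are hypothetical helpers for illustration, not part of the library.

```rust
/// Returns true if `c` matches the XML 1.0 `Char` production:
/// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
fn is_xml10_char(c: char) -> bool {
    matches!(
        c,
        '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}'
    )
}

/// Returns the byte offset and value of the first character that
/// XML 1.0 does not allow, or `None` if the whole text is valid.
fn first_invalid_char(text: &str) -> Option<(usize, char)> {
    text.char_indices().find(|&(_, c)| !is_xml10_char(c))
}
```

A caller that cares about conformance can run `first_invalid_char` on each decoded value and turn a `Some` result into its own error, while callers that need NUL to pass through simply skip the check.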

cc @sashka

codecov-commenter · 2024-06-01T15:59:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.67%. Comparing base (5d76174) to head (4522d9d).
Report is 23 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
+ Coverage   61.24%   61.67%   +0.42%     
==========================================
  Files          39       39              
  Lines       16277    16626     +349     
==========================================
+ Hits         9969    10254     +285     
- Misses       6308     6372      +64

Flag	Coverage Δ
unittests	`61.67% <100.00%> (+0.42%)`	⬆️

dralley · 2024-06-04T00:51:37Z

> We should either restrict all invalid characters, both in literal form and as character references, or none of them. Disallowing only one character is inconsistent.

Some characters (like control characters, at least in XML 1.1) are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases.

While we definitely are not 100% conformant with the standard, I disagree that this particular "inconsistency" is necessarily wrong. As far as I can tell, it's one we're going to have even once we do conform, because the XML standard doesn't treat all invalid characters as equally invalid.
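
The distinction can be sketched as a three-way classification based on the XML 1.1 `Char` and `RestrictedChar` productions. This is only an illustration of the rule described above; the type `Xml11Status` and the function `classify_xml11` are hypothetical names, not something the library provides.

```rust
/// How a code point may appear in an XML 1.1 document.
#[derive(Debug, PartialEq, Eq)]
enum Xml11Status {
    /// Legal both literally and as a character reference.
    Allowed,
    /// Legal only as a character reference (matches `RestrictedChar`).
    ReferenceOnly,
    /// Illegal in both forms (does not match `Char`), for example NUL.
    Forbidden,
}

/// XML 1.1 `Char`: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
fn is_xml11_char(c: char) -> bool {
    matches!(
        c,
        '\u{1}'..='\u{D7FF}' | '\u{E000}'..='\u{FFFD}' | '\u{10000}'..='\u{10FFFF}'
    )
}

/// XML 1.1 `RestrictedChar`:
/// [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
fn is_xml11_restricted(c: char) -> bool {
    matches!(
        c,
        '\u{1}'..='\u{8}' | '\u{B}'..='\u{C}' | '\u{E}'..='\u{1F}'
            | '\u{7F}'..='\u{84}' | '\u{86}'..='\u{9F}'
    )
}

fn classify_xml11(c: char) -> Xml11Status {
    if !is_xml11_char(c) {
        Xml11Status::Forbidden
    } else if is_xml11_restricted(c) {
        Xml11Status::ReferenceOnly
    } else {
        Xml11Status::Allowed
    }
}
```

Under this sketch U+0001 comes out as `ReferenceOnly` while NUL comes out as `Forbidden`, which is the asymmetry the standard itself has.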

Mingun · 2024-06-04T15:43:16Z

> Some characters (like control characters, at least in XML 1.1) are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases.

Yes, that is true. My point is that:

- we agreed in #496 that these checks should be optional, although disabling them will create a non-conforming parser;
- creating a good solution requires some investigation;
- checking NUL only when resolving character references does only half of the work anyway; we still need to check literal NULs, but this will impact performance, so it should be done carefully and maybe even as an explicit preprocessing step on the user side;
- the restriction seems artificial anyway, with roots in C null-terminated strings;
- for some of our users this check is undesirable.

So I think that removing this check should not make things worse, and I propose accepting a short-term solution (yes, I know what you are thinking now): remove this check and add it back later with understanding and formal tests (which we should get after resolving #749).
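
As a rough idea of what an explicit user-side preprocessing step for literal NULs might be, here is a minimal sketch. It assumes the input is UTF-8 or another encoding in which a zero byte can only mean U+0000 (this does not hold for UTF-16), and `find_literal_nul` is a hypothetical helper name.

```rust
/// Returns the byte offset of the first literal NUL byte in `input`, if any.
/// Assumes an encoding where a zero byte always means U+0000 (e.g. UTF-8).
fn find_literal_nul(input: &[u8]) -> Option<usize> {
    input.iter().position(|&b| b == 0)
}
```

Such a scan touches every input byte, which is the performance cost mentioned above, and it does not see the `&#0;` reference form, so it covers only the literal half of the check.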

dralley · 2024-06-04T17:29:32Z

> the restriction seems artificial anyway, with roots in C null-terminated strings; for some of our users this check is undesirable.

I'm still not certain that disabling them will help this user. I have a vague recollection (though I cannot find where it was written) that whatever weird nonstandard format they were using was dumping raw binary data into the space between tags (i.e. text event in our parlance) and using an attribute to denote the length so that the parser could entirely skip over that region.

I could be mixing it up with another issue though.

Mingun · 2024-06-04T18:18:07Z

You are talking about #623; I meant this and this comment. With this change we can make life easier for some people until we implement a proper solution that checks everything according to the standard.

dralley · 2024-06-05T17:41:30Z

> You are talking about #623

Ah, correct.

dralley

I still have mixed feelings, but I'm OK with going forward with this if @sashka confirms that it would be helpful to them.

Mingun requested a review from dralley June 1, 2024 15:52

dralley approved these changes Jun 5, 2024
