-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow �
character references
#750
base: master
Are you sure you want to change the base?
Conversation
We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently. Because checking literal forms means that we should decode and check all the input, this will influence performance. We are not ready to get that performance lost for now. Users of the Reader API could do their own checks themselves.
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #750 +/- ##
==========================================
+ Coverage 61.24% 61.67% +0.42%
==========================================
Files 39 39
Lines 16277 16626 +349
==========================================
+ Hits 9969 10254 +285
- Misses 6308 6372 +64
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Some characters (like control characters, at least in XML 1.1), are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases. While we definitely are not 100% conformant with the standard, I disagree that this particular "inconsistency" is necessarily wrong. As far as I can tell, it's one we're going to have even once we do conform, because the XML standard doesn't treat all invalid characters as equally invalid. |
Yes, it is. My point is that:
So I think that removing this check should not make things worse and I propose to accept a short-term solution (yes, I know what you thought now) of removing this check and add it later with understanding and formal tests (which we should get after resolving #749). |
I'm still not certain that disabling them will help this user. I have a vague recollection (though I cannot find where it was written) that whatever weird nonstandard format they were using was dumping raw binary data into the space between tags (i.e. text event in our parlance) and using an attribute to denote the length so that the parser could entirely skip over that region. I could be mixing it up with another issue though. |
Ah, correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have mixed feelings but I'm OK with going forwards with this if @sashka confirms that it would be helpful to them
We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently. Because checking literal forms means that we should decode and check all the input, this will influence performance. We are not ready to get that performance lost for now. Users of the Reader API could do their own checks themselves.
Ironically, this is what firstly was proposed in #496. After trying to finish that PR I found that we have some unanswered questions (above) which we should work out before we will be create a consistent solution. I think, it will be tight linked with #749.
I think, the best what we can do now is to not check validity of any characters and allow users to that themselves.
cc @sashka