Formatter: Escape non-printable characters in literals #11520
Merged
With this patch, the formatter escapes all non-printable characters in string, char, and symbol literals.
Non-printable characters in source code are confusing because they typically don't show up in the editor or other displays. This can lead to misunderstanding of the code and can be actively exploited to plant malicious code that goes unnoticed in review (see #11392 for example).
Ideally, editors and source code viewers could take care of this and make sure to indicate non-printable characters. There have been some improvements to that lately. But it's probably impossible or at least very hard to cover every angle. So it's a good idea to mitigate any issues by making non-printable characters more explicit in source code.
In Crystal's grammar (and after #11508 is merged), non-printable characters should technically only be valid inside literals or doc comments. We can easily replace any character in a literal with an equivalent escape sequence without changing the semantics.
This patch implements that. The only exception is that the non-printable characters `\n` and `\t` are allowed in string and symbol literals in order to avoid messing up the line format. I don't think that makes any sense for symbol literals, but they share the implementation with strings 🤷 Not sure it's worth making a special case.
For demonstration, the two examples from #11392 format like this now:
That's the first step of #11478. A follow-up could make the parser reject non-printables, but that's up for debate.
Some non-printables such as BIDI control characters could be considered to be allowed if we implement a BIDI algorithm in the parser that restricts the context for such controls to the boundaries of the literal.
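To make the escaping rule described above concrete, here is a minimal sketch in Python. It is not the formatter's actual implementation, which is written in Crystal. The function name `escape_non_printable`, the `keep` parameter, and the `\u{...}` output form are assumptions for illustration, and Python's `str.isprintable()` stands in for whatever definition of "printable" the formatter uses.

```python
def escape_non_printable(text: str, keep: str = "\n\t") -> str:
    """Replace non-printable characters with \\u{...} escapes.

    Hypothetical helper: characters listed in `keep` stay untouched,
    mirroring the exception for newline and tab in string and symbol
    literals. The escape form is an assumption, not confirmed output
    of the Crystal formatter.
    """
    out = []
    for ch in text:
        if ch in keep or ch.isprintable():
            # Visible characters and explicitly allowed ones pass through.
            out.append(ch)
        else:
            # Spell out the code point so it is visible in review.
            out.append("\\u{%x}" % ord(ch))
    return "".join(out)


# A zero-width space (U+200B) hidden between two letters becomes explicit.
print(escape_non_printable("a\u200bb"))  # prints a\u{200b}b
```

Since the exception for newline and tab covers only string and symbol literals, a char literal would correspond to calling the sketch with `keep=""`.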