-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mitigating underhanded code #19
Comments
Thanks for splitting this off into a separate issue! In #17 (comment), you wrote:
Yeah. Most programming languages today just let users deal with this on their own, with the idea that a project that wants to ban non-ASCII characters can make their own commit hook or whatever. I don't think this is an effective solution in practice. Although language designers think of it as delegating the choice to individual projects, what actually happens is projects don't do anything and thus remain insecure. Even worse, projects that use non-English languages don't even have the option to be secure! This doesn't feel very inclusive. I don't have an answer here, but I'd love to brainstorm some possibilities. Here's one potential direction: Assuming we have some sort of package manager, we could require each package to declare a list of (natural) languages they'll allow in identifiers, strings, and comments. We'd probably only support a small set of languages at first; for each of those, we'd have some sort of rules for valid character sequences that avoid confusion with language syntax or with other valid text in the set of allowed languages. (We could use looser rules for comments, which only care about confusability with language syntax; in strings we probably also care about at least confusability with ASCII, so we don't end up with programs that look like they're printing a particular HTTP header or shell command but are actually printing something else. For identifiers we need full confusability detection.) The rules would be enforced by the compiler. Not all combinations of languages would be supported, so we don't need to analyze the full matrix of language pairs. For right-to-left languages, in addition to restricting which characters can be used, we'd require the user to add an LTR mark after any string or identifier where it would affect the display ordering. (This might be annoying, but it has some big benefits -- not only does it prevent security issues, it fixes all the display problems that frequently come up even in legitimate code using right-to-left text. Let me know if you want some examples of these; they're pretty terrible.) Note that our syntax choices affect how permissive we can be. For example, if we have C-style Does this seem like a plausible direction? And more generally, is this a problem worth thinking about? I'd love to hear people's thoughts. |
This sounds like a promising idea. We should find people with internationalization expertise to discuss this with -- perhaps some of the Unicode folks -- to make sure that what we're considering is reasonable. Certainly as a mechanism to avoid homograph issues, limiting the set of available characters seems like a plausible step to take, and we could augment that with a manual list of confusables if our language support extends to a language with such characters. I'm not sure whether LTR marks are enough to force the context enclosing an RTL string literal to be treated as LTR text. As an alternative, we could run the Unicode bidi algorithm on the source text and verify that it infers left-to-right directionality for all text outside comments and string literals -- that way we don't need to care how the user ensures that the non-text source code is interpreted as left-to-right. However, it's not clear to me whether the directionality portion of this issue should properly be viewed as an encoding issue or as a display issue. A smart renderer could tokenize the source code and ensure that tokens are always displayed left-to-right (or always right-to-left, if that is the viewer's preference), with comments and string literals treated as independent bodies of text. Would we be working around a deficiency in context-unaware text editors by requiring explicit LTR marks (or isolates or whatever else), or is this simply the right way to do it? In any case, to the extent that we can do so, I'd like for us to delegate the thinking and approach for this problem to another group. I don't think Carbon should be defining its own language profiles -- that really seems like something that some other group should be providing for us and that we should only be consuming. Does anything like this already exist? |
My two cents:
So lets consider contexts from what I think would be least restrictive toward those that are probably most restrictive:
For multiline strings and comments, I think that anything must be OK. And we should design the delimiters with this reality in mind. Notably, in both cases, I think that the start and end should be both vertically and horizontally identifiable. I believe the direction is already going in this way, so I'm not worried here. Put differently, the context doesn't start until a new line, and it ends before the line of the end delimiter. This seems like a strong mitigation for making sure the delimiters can be recognized no matter the contents. For single-line, I think that we need to block all vertical breaks and/or whitespace. I think Unicode already gives us strong categorizations for this, and we always have the multiline context to allow it. The result should be that again, the delimiters can be engineered to be distinguishable from the content. For identifiers, I think Unicode already gives us strong sets of characters that can be meaningfully delimited by both whitespace and the punctuation set we are using. I think the interesting questions arise for Carbon's fixed syntax. Here, we don't really have a compelling i18n story, and I'm not aware of any mainstream languages that do... They all pick a set of linguistic rules and adhere to them. While I'm open to adopting innovations in this space if & when they arise, I'm not personally motivated to try to do that innovation with Carbon. Instead, I would suggest making this context extremely limited. I think we should allow exactly the set of whitespace that we design the fixed syntax to interact with, and disallow everything else. I don't think this will be a burden because the context is only that of the fixed syntax of the language. I honestly think we might want to go so far as to preclude tabs, and just use spaces and newline characters. The ambiguity and lack of consistency seems to dramatically outweigh the benefits of allowing this flexibility (to me). (The only alternative that seems appealing is to insist on tabs for indentation rather than spaces.) I think the result is that we pick a near-lowest-common-denominator character set (including whitespace!) for the fixed syntax. Concerns like RTL seem well handled with this kind of approach because the RTL would be within one of these contexts, not outside of it. Admitedly, it isn't beautifully internationalized through this approach, but I don't think anything short of innovating toward internationalized syntax will accomplish that goal. And it should still allow arbitrary alphabet usage within the user-defined contexts, with minimal restrictions necessary to delimit those contexts. WDYT? |
I don't think that approach prevents us from having source code that appears to do one thing but does another -- the Unicode bidirectional algorithm knows nothing about our different syntactic contexts. If it sees
then it might look like the string starts with In order to fix this, we can ask users to insert explicit left-to-right marks after a right-to-left string. That would then display properly:
... but would require that we treat such marks as whitespace, which is an option suggested by Unicode TR31. (Effectively, this approach would amount to "we will verify that the code does what it looks like it does, but making it look right is the user's problem not ours".) If we want to force the matter, we could do so by, for example, including a character with strong left-to-right directionality as part of at least the closing delimiter of any context that contains right-to-left text. For example, we could permit an optional (no-op) trailing character with strong LTR ordering (say
with no need for any invisible LTR marks. And similarly we could require identifiers to end with a character with strong LTR directionality. Perhaps we could combine this with my previous suggestion:
This still allows some amount of underhandedness, for example:
... might look like it's comparing
... so But this is far from my area of expertise; we should find someone who knows this stuff deeply and ask. |
Agreed about finding an expert to help here, or at least running our ideas by some experts. But I really like forbidding LTR and RTL marks outside of specific contexts where they can be well handled. In essence, I continue to like fixed-syntax-whitespace (IE, outside of any comment or literal) to be an extremely limited character set. I also really like requiring an explicit LTR demarkation (so not a whitespace LTR mark) at the end of any context which contains non-LTR sequences. This goes to the principle I'm suggesting of creating unambiguous delimiters for contexts where we enable more things. So there might be an LTR-string-literal context that uses just |
Only change is to update the path to the fuzzer build extension. Original main commit message: > Add an initial parser library. (#30) > > This library builds a parse tree, very similar to a concrete syntax > tree. There are no semantics here, simply introducing the basic > syntactic structure. > > The current focus has been on the APIs and the data structures used to > represent the parse tree, and not on the actual code doing the > parsing. The code doing the parsing tries to be reasonably efficient > and reasonably easy to understand recursive descent parser. But there > is likely much that can be done to improve this code path. A notable > area where very little thought has been given yet are emitting good > diagnostics and doing good recovery in the event of parse errors. > > Also, this code does not try to match the current under-discussion > grammar closely. It is only partial and reflects discussions from some > time ago. It should be updated incrementally to reflect the current > expected grammar. > > The data structure used for the parse tree is unusual. The first > constraint is that there is a precise one-to-one correspondence > between the tokens produced by the lexer and the nodes in the parse > tree. Every token results in exactly one node. In that way, the parse > tree can be thought of as merely shaping the token stream into a tree. > > Each node is also represented with a fixed set of data that is densely > packed. Combined with the exact relationship to tokens, this allows us > to fully allocate the parse tree's storage, and to use a dense array > rather than a pointer-based tree structure. > > The tree structure itself is implicitly defined by tracking the size > of each subtree rooted at a particular node. See the code comments for > more details (and I'm happy to add more comments where necessary). The > goal is to minimize both the allocations (one), the working set size > of the tree as a whole, and optimize common iteration patterns. The > tree is stored in postorder. This allows depth-first postorder > iteration as well as topological iteration by walking in reverse. > > Building the parse tree in postorder is a natural consequence of the > grammar being LR rather than LL, which is a consequence of supporting > infix operators. > > As with the Lexer, the parser supports an API for operating on the > parse tree, as well as the ability to print the tree in both > a human-readable and machine-readable format (YAML-based). It includes > significant unit tests and a fuzz tester. The fuzzer's corpus will be > in a follow-up commit. > > This is the largest chunk of code already written by several of us > prior to open sourcing. (There are a few more pieces, but they are > significantly smaller and less interesting.) If there are major things > that folks would like to see happen here, it may make sense to move > them into issues for tracking. I have tried to update the code to > follow the style guidelines, but apologies if I missed anything, just > let me know. We also have issues #19 and #29 to track things that > already came up with the lexer. Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the |
Only change is to update the path to the fuzzer build extension. Original main commit message: > Add an initial parser library. (#30) > > This library builds a parse tree, very similar to a concrete syntax > tree. There are no semantics here, simply introducing the basic > syntactic structure. > > The current focus has been on the APIs and the data structures used to > represent the parse tree, and not on the actual code doing the > parsing. The code doing the parsing tries to be reasonably efficient > and reasonably easy to understand recursive descent parser. But there > is likely much that can be done to improve this code path. A notable > area where very little thought has been given yet are emitting good > diagnostics and doing good recovery in the event of parse errors. > > Also, this code does not try to match the current under-discussion > grammar closely. It is only partial and reflects discussions from some > time ago. It should be updated incrementally to reflect the current > expected grammar. > > The data structure used for the parse tree is unusual. The first > constraint is that there is a precise one-to-one correspondence > between the tokens produced by the lexer and the nodes in the parse > tree. Every token results in exactly one node. In that way, the parse > tree can be thought of as merely shaping the token stream into a tree. > > Each node is also represented with a fixed set of data that is densely > packed. Combined with the exact relationship to tokens, this allows us > to fully allocate the parse tree's storage, and to use a dense array > rather than a pointer-based tree structure. > > The tree structure itself is implicitly defined by tracking the size > of each subtree rooted at a particular node. See the code comments for > more details (and I'm happy to add more comments where necessary). The > goal is to minimize both the allocations (one), the working set size > of the tree as a whole, and optimize common iteration patterns. The > tree is stored in postorder. This allows depth-first postorder > iteration as well as topological iteration by walking in reverse. > > Building the parse tree in postorder is a natural consequence of the > grammar being LR rather than LL, which is a consequence of supporting > infix operators. > > As with the Lexer, the parser supports an API for operating on the > parse tree, as well as the ability to print the tree in both > a human-readable and machine-readable format (YAML-based). It includes > significant unit tests and a fuzz tester. The fuzzer's corpus will be > in a follow-up commit. > > This is the largest chunk of code already written by several of us > prior to open sourcing. (There are a few more pieces, but they are > significantly smaller and less interesting.) If there are major things > that folks would like to see happen here, it may make sense to move > them into issues for tracking. I have tried to update the code to > follow the style guidelines, but apologies if I missed anything, just > let me know. We also have issues #19 and #29 to track things that > already came up with the lexer. Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
We should decide to what extent and in what ways we want to mitigate the risk of underhanded Carbon code -- code that intentionally behaves in a manner different from its appearance.
See https://github.com/carbon-language/carbon-lang/pull/17/files#r428333473 for a case where this question arose.
The text was updated successfully, but these errors were encountered: