Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigating underhanded code #19

Open
zygoloid opened this issue May 20, 2020 · 6 comments
Open

Mitigating underhanded code #19

zygoloid opened this issue May 20, 2020 · 6 comments
Labels
design idea An issue recording a specific language design idea that folks can potentially pick up. long term Issues expected to take over 90 days to resolve.

Comments

@zygoloid
Copy link
Contributor

We should decide to what extent and in what ways we want to mitigate the risk of underhanded Carbon code -- code that intentionally behaves in a manner different from its appearance.

See https://github.com/carbon-language/carbon-lang/pull/17/files#r428333473 for a case where this question arose.

@mconst
Copy link
Contributor

mconst commented May 21, 2020

Thanks for splitting this off into a separate issue!

In #17 (comment), you wrote:

We need to decide to what extent we want to worry about the second class of problems. I could certainly imagine changes we could make to the specification to avoid these problems -- restricting source files to ASCII eliminates a lot of them -- but those changes come with associated costs, some of which would directly harm our inclusivity goal (for example: you cannot write identifiers -- or string literals! -- in your native language). For a project that expects to see only English-language identifiers and comments, restricting source files to ASCII might be a totally reasonable option. My inclination is that that should be a project-wide decision, though, not a language-wide one.

Yeah. Most programming languages today just let users deal with this on their own, with the idea that a project that wants to ban non-ASCII characters can make their own commit hook or whatever.

I don't think this is an effective solution in practice. Although language designers think of it as delegating the choice to individual projects, what actually happens is projects don't do anything and thus remain insecure. Even worse, projects that use non-English languages don't even have the option to be secure! This doesn't feel very inclusive.

I don't have an answer here, but I'd love to brainstorm some possibilities. Here's one potential direction:

Assuming we have some sort of package manager, we could require each package to declare a list of (natural) languages they'll allow in identifiers, strings, and comments. We'd probably only support a small set of languages at first; for each of those, we'd have some sort of rules for valid character sequences that avoid confusion with language syntax or with other valid text in the set of allowed languages. (We could use looser rules for comments, which only care about confusability with language syntax; in strings we probably also care about at least confusability with ASCII, so we don't end up with programs that look like they're printing a particular HTTP header or shell command but are actually printing something else. For identifiers we need full confusability detection.) The rules would be enforced by the compiler.

Not all combinations of languages would be supported, so we don't need to analyze the full matrix of language pairs. For right-to-left languages, in addition to restricting which characters can be used, we'd require the user to add an LTR mark after any string or identifier where it would affect the display ordering. (This might be annoying, but it has some big benefits -- not only does it prevent security issues, it fixes all the display problems that frequently come up even in legitimate code using right-to-left text. Let me know if you want some examples of these; they're pretty terrible.)

Note that our syntax choices affect how permissive we can be. For example, if we have C-style /* comments, we need to ban any text inside them that could simulate a */ sequence, or hide the real */ sequence, or cause text to be reordered across the */. But if we only have // comments, we can be much more generous.

Does this seem like a plausible direction? And more generally, is this a problem worth thinking about? I'd love to hear people's thoughts.

@zygoloid
Copy link
Contributor Author

This sounds like a promising idea. We should find people with internationalization expertise to discuss this with -- perhaps some of the Unicode folks -- to make sure that what we're considering is reasonable. Certainly as a mechanism to avoid homograph issues, limiting the set of available characters seems like a plausible step to take, and we could augment that with a manual list of confusables if our language support extends to a language with such characters.

I'm not sure whether LTR marks are enough to force the context enclosing an RTL string literal to be treated as LTR text. As an alternative, we could run the Unicode bidi algorithm on the source text and verify that it infers left-to-right directionality for all text outside comments and string literals -- that way we don't need to care how the user ensures that the non-text source code is interpreted as left-to-right. However, it's not clear to me whether the directionality portion of this issue should properly be viewed as an encoding issue or as a display issue. A smart renderer could tokenize the source code and ensure that tokens are always displayed left-to-right (or always right-to-left, if that is the viewer's preference), with comments and string literals treated as independent bodies of text. Would we be working around a deficiency in context-unaware text editors by requiring explicit LTR marks (or isolates or whatever else), or is this simply the right way to do it?

In any case, to the extent that we can do so, I'd like for us to delegate the thinking and approach for this problem to another group. I don't think Carbon should be defining its own language profiles -- that really seems like something that some other group should be providing for us and that we should only be consuming. Does anything like this already exist?

@chandlerc
Copy link
Contributor

My two cents:

  1. break this down into contexts where different things need to apply
  2. use reasonably tiered rules for these contexts to make the scope of damage contained and tractable

So lets consider contexts from what I think would be least restrictive toward those that are probably most restrictive:

  • multiline string literals & comments
  • single line string literals & comments
  • identifiers
  • fixed language syntax

For multiline strings and comments, I think that anything must be OK. And we should design the delimiters with this reality in mind. Notably, in both cases, I think that the start and end should be both vertically and horizontally identifiable. I believe the direction is already going in this way, so I'm not worried here. Put differently, the context doesn't start until a new line, and it ends before the line of the end delimiter. This seems like a strong mitigation for making sure the delimiters can be recognized no matter the contents.

For single-line, I think that we need to block all vertical breaks and/or whitespace. I think Unicode already gives us strong categorizations for this, and we always have the multiline context to allow it. The result should be that again, the delimiters can be engineered to be distinguishable from the content.

For identifiers, I think Unicode already gives us strong sets of characters that can be meaningfully delimited by both whitespace and the punctuation set we are using.

I think the interesting questions arise for Carbon's fixed syntax. Here, we don't really have a compelling i18n story, and I'm not aware of any mainstream languages that do... They all pick a set of linguistic rules and adhere to them. While I'm open to adopting innovations in this space if & when they arise, I'm not personally motivated to try to do that innovation with Carbon.

Instead, I would suggest making this context extremely limited. I think we should allow exactly the set of whitespace that we design the fixed syntax to interact with, and disallow everything else. I don't think this will be a burden because the context is only that of the fixed syntax of the language. I honestly think we might want to go so far as to preclude tabs, and just use spaces and newline characters. The ambiguity and lack of consistency seems to dramatically outweigh the benefits of allowing this flexibility (to me). (The only alternative that seems appealing is to insist on tabs for indentation rather than spaces.) I think the result is that we pick a near-lowest-common-denominator character set (including whitespace!) for the fixed syntax.

Concerns like RTL seem well handled with this kind of approach because the RTL would be within one of these contexts, not outside of it. Admitedly, it isn't beautifully internationalized through this approach, but I don't think anything short of innovating toward internationalized syntax will accomplish that goal. And it should still allow arbitrary alphabet usage within the user-defined contexts, with minimal restrictions necessary to delimit those contexts.

WDYT?

@zygoloid
Copy link
Contributor Author

I don't think that approach prevents us from having source code that appears to do one thing but does another -- the Unicode bidirectional algorithm knows nothing about our different syntactic contexts. If it sees

var String: x = "مرحبا" + "بالعالم";

then it might look like the string starts with "بالعالم" and ends with "مرحبا", but that is not the case: the string starts with "مرحبا" and ends with "بالعالم". The reason is that all of the text between the first " and the last ", including the +, is displayed right-to-left. (The ", +, and space characters all have weak or neutral directionality and do not force a switch to left-to-right.)

In order to fix this, we can ask users to insert explicit left-to-right marks after a right-to-left string. That would then display properly:

var String: x = "مرحبا"‎ + "‎بالعالم";

... but would require that we treat such marks as whitespace, which is an option suggested by Unicode TR31. (Effectively, this approach would amount to "we will verify that the code does what it looks like it does, but making it look right is the user's problem not ours".)

If we want to force the matter, we could do so by, for example, including a character with strong left-to-right directionality as part of at least the closing delimiter of any context that contains right-to-left text. For example, we could permit an optional (no-op) trailing character with strong LTR ordering (say r for "RTL") on string literals. Then the above example could be written as:

var String: x = "مرحبا"r + "بالعالم"r;

with no need for any invisible LTR marks. And similarly we could require identifiers to end with a character with strong LTR directionality.

Perhaps we could combine this with my previous suggestion:

  • Reject if the Unicode bidi algorithm would not order the tokens left-to-right.
  • Allow an optional suffix character on character and string literals to make it easy to return to an LTR context without resorting to an LTR mark.

This still allows some amount of underhandedness, for example:

var auto: x = ‏"==0".size()‎;

... might look like it's comparing 0 against "".size(), but the source code is actually this:

var auto: x = <RTL mark>"==0".size();

... so x is 3, not True. I think this too can be handled by requiring the opening delimiter of a context in which we permit RTL text be classified as left-to-right. But someone would need to study the Unicode bidi algorithm to be sure whether that's sufficient. Disallowing LTR and RTL marks outside of literals and comments does seem like a safer choice.

But this is far from my area of expertise; we should find someone who knows this stuff deeply and ask.

@chandlerc
Copy link
Contributor

Agreed about finding an expert to help here, or at least running our ideas by some experts.

But I really like forbidding LTR and RTL marks outside of specific contexts where they can be well handled. In essence, I continue to like fixed-syntax-whitespace (IE, outside of any comment or literal) to be an extremely limited character set.

I also really like requiring an explicit LTR demarkation (so not a whitespace LTR mark) at the end of any context which contains non-LTR sequences.

This goes to the principle I'm suggesting of creating unambiguous delimiters for contexts where we enable more things. So there might be an LTR-string-literal context that uses just ", but a more general string-literal context that allows RTL but uses a delimiting style (much like your suggestion) that forces the delimiters themselves to be unambiguous and definitive when correctly rendered.

chandlerc added a commit that referenced this issue Dec 8, 2020
Only change is to update the path to the fuzzer build extension.

Original main commit message:

> Add an initial parser library. (#30)
>
> This library builds a parse tree, very similar to a concrete syntax
> tree. There are no semantics here, simply introducing the basic
> syntactic structure.
>
> The current focus has been on the APIs and the data structures used to
> represent the parse tree, and not on the actual code doing the
> parsing. The code doing the parsing tries to be reasonably efficient
> and reasonably easy to understand recursive descent parser. But there
> is likely much that can be done to improve this code path. A notable
> area where very little thought has been given yet are emitting good
> diagnostics and doing good recovery in the event of parse errors.
>
> Also, this code does not try to match the current under-discussion
> grammar closely. It is only partial and reflects discussions from some
> time ago. It should be updated incrementally to reflect the current
> expected grammar.
>
> The data structure used for the parse tree is unusual. The first
> constraint is that there is a precise one-to-one correspondence
> between the tokens produced by the lexer and the nodes in the parse
> tree. Every token results in exactly one node. In that way, the parse
> tree can be thought of as merely shaping the token stream into a tree.
>
> Each node is also represented with a fixed set of data that is densely
> packed. Combined with the exact relationship to tokens, this allows us
> to fully allocate the parse tree's storage, and to use a dense array
> rather than a pointer-based tree structure.
>
> The tree structure itself is implicitly defined by tracking the size
> of each subtree rooted at a particular node. See the code comments for
> more details (and I'm happy to add more comments where necessary). The
> goal is to minimize both the allocations (one), the working set size
> of the tree as a whole, and optimize common iteration patterns. The
> tree is stored in postorder. This allows depth-first postorder
> iteration as well as topological iteration by walking in reverse.
>
> Building the parse tree in postorder is a natural consequence of the
> grammar being LR rather than LL, which is a consequence of supporting
> infix operators.
>
> As with the Lexer, the parser supports an API for operating on the
> parse tree, as well as the ability to print the tree in both
> a human-readable and machine-readable format (YAML-based). It includes
> significant unit tests and a fuzz tester. The fuzzer's corpus will be
> in a follow-up commit.
>
> This is the largest chunk of code already written by several of us
> prior to open sourcing. (There are a few more pieces, but they are
> significantly smaller and less interesting.) If there are major things
> that folks would like to see happen here, it may make sense to move
> them into issues for tracking. I have tried to update the code to
> follow the style guidelines, but apologies if I missed anything, just
> let me know. We also have issues #19 and #29 to track things that
> already came up with the lexer.

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
@github-actions
Copy link

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time.
This issue is labeled inactive because the last activity was over 90 days ago.

@github-actions github-actions bot added the inactive Issues and PRs which have been inactive for at least 90 days. label Aug 16, 2021
chandlerc added a commit that referenced this issue Jun 28, 2022
Only change is to update the path to the fuzzer build extension.

Original main commit message:

> Add an initial parser library. (#30)
>
> This library builds a parse tree, very similar to a concrete syntax
> tree. There are no semantics here, simply introducing the basic
> syntactic structure.
>
> The current focus has been on the APIs and the data structures used to
> represent the parse tree, and not on the actual code doing the
> parsing. The code doing the parsing tries to be reasonably efficient
> and reasonably easy to understand recursive descent parser. But there
> is likely much that can be done to improve this code path. A notable
> area where very little thought has been given yet are emitting good
> diagnostics and doing good recovery in the event of parse errors.
>
> Also, this code does not try to match the current under-discussion
> grammar closely. It is only partial and reflects discussions from some
> time ago. It should be updated incrementally to reflect the current
> expected grammar.
>
> The data structure used for the parse tree is unusual. The first
> constraint is that there is a precise one-to-one correspondence
> between the tokens produced by the lexer and the nodes in the parse
> tree. Every token results in exactly one node. In that way, the parse
> tree can be thought of as merely shaping the token stream into a tree.
>
> Each node is also represented with a fixed set of data that is densely
> packed. Combined with the exact relationship to tokens, this allows us
> to fully allocate the parse tree's storage, and to use a dense array
> rather than a pointer-based tree structure.
>
> The tree structure itself is implicitly defined by tracking the size
> of each subtree rooted at a particular node. See the code comments for
> more details (and I'm happy to add more comments where necessary). The
> goal is to minimize both the allocations (one), the working set size
> of the tree as a whole, and optimize common iteration patterns. The
> tree is stored in postorder. This allows depth-first postorder
> iteration as well as topological iteration by walking in reverse.
>
> Building the parse tree in postorder is a natural consequence of the
> grammar being LR rather than LL, which is a consequence of supporting
> infix operators.
>
> As with the Lexer, the parser supports an API for operating on the
> parse tree, as well as the ability to print the tree in both
> a human-readable and machine-readable format (YAML-based). It includes
> significant unit tests and a fuzz tester. The fuzzer's corpus will be
> in a follow-up commit.
>
> This is the largest chunk of code already written by several of us
> prior to open sourcing. (There are a few more pieces, but they are
> significantly smaller and less interesting.) If there are major things
> that folks would like to see happen here, it may make sense to move
> them into issues for tracking. I have tried to update the code to
> follow the style guidelines, but apologies if I missed anything, just
> let me know. We also have issues #19 and #29 to track things that
> already came up with the lexer.

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
@jonmeow jonmeow added long term Issues expected to take over 90 days to resolve. design idea An issue recording a specific language design idea that folks can potentially pick up. and removed inactive Issues and PRs which have been inactive for at least 90 days. labels Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design idea An issue recording a specific language design idea that folks can potentially pick up. long term Issues expected to take over 90 days to resolve.
Projects
None yet
Development

No branches or pull requests

4 participants