Simple constant propagation AST-based analysis #852
…It only refactors the interpreter so that its semantics can be changed or extended with additional functionality. At first, I tried to write a monad-style interpreter, but TypeScript does not have do-notation, and the resulting bind calls quickly make the code unreadable. Although there are suggestions online on how to emulate do-notation using generators, they look ad hoc, hacky, and unnatural. So I decided to take the idea of a monad and adapt it to an imperative flavor. The idea is to define the semantics of an interpreter as an abstract class parametric over two generic types: "V" (the type of expressions' results) and "S" (the type of statements' results). At the moment, module-level declarations do not have an associated generic type, but one will probably be added. Specific semantics are represented by implementing the abstract class. For example, to create an interpreter with the standard semantics, just write: const interpreter = new Interpreter(new StandardSemantics(ctx)); If we want to change the behavior, for example, by collecting special values when statements execute, just pass another semantics: const interpreter = new Interpreter(new YourCustomSemantics(ctx)); This is a first version of the "monadish" interpreter. As such, it will probably change as more semantics are added, since they will likely force changes to the interface of the abstract class.
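As a rough illustration of the idea in this commit message, here is a minimal sketch of an interpreter parametric over a semantics class. The toy AST, the method names, and both semantics classes are hypothetical and far smaller than anything in the PR; only the "V"/"S" parametrization comes from the description above.

```typescript
// Hypothetical toy AST (not the PR's AST types).
type ToyExpr =
    | { kind: "num"; value: number }
    | { kind: "add"; left: ToyExpr; right: ToyExpr };
type ToyStmt = { kind: "let"; name: string; expr: ToyExpr };

// V: type of expression results; S: type of statement results.
abstract class Semantics<V, S> {
    abstract num(value: number): V;
    abstract add(left: V, right: V): V;
    abstract letStmt(name: string, value: V): S;
}

// The traversal is written once; the meaning comes from the semantics.
class Interpreter<V, S> {
    private readonly semantics: Semantics<V, S>;
    constructor(semantics: Semantics<V, S>) {
        this.semantics = semantics;
    }
    interpretExpr(e: ToyExpr): V {
        switch (e.kind) {
            case "num":
                return this.semantics.num(e.value);
            case "add":
                return this.semantics.add(
                    this.interpretExpr(e.left),
                    this.interpretExpr(e.right),
                );
        }
    }
    interpretStmt(s: ToyStmt): S {
        return this.semantics.letStmt(s.name, this.interpretExpr(s.expr));
    }
}

// "Standard" semantics: evaluate and store the binding.
class ToyStandardSemantics extends Semantics<number, void> {
    readonly env = new Map<string, number>();
    num(value: number): number { return value; }
    add(left: number, right: number): number { return left + right; }
    letStmt(name: string, value: number): void { this.env.set(name, value); }
}

// Alternative semantics: render the program as text instead of running it.
class ToyTracingSemantics extends Semantics<string, string> {
    num(value: number): string { return `${value}`; }
    add(left: string, right: string): string { return `(${left} + ${right})`; }
    letStmt(name: string, value: string): string { return `${name} = ${value}`; }
}

const toyProgram: ToyStmt = {
    kind: "let",
    name: "x",
    expr: { kind: "add", left: { kind: "num", value: 1 }, right: { kind: "num", value: 2 } },
};
const standard = new ToyStandardSemantics();
new Interpreter(standard).interpretStmt(toyProgram);
const traced = new Interpreter(new ToyTracingSemantics()).interpretStmt(toyProgram);
```

Swapping the semantics object changes what the same traversal computes, which is the property the commit message relies on.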
…hod. It also supports variable tracing in structs.
… did not show it. This was due to partial evaluator making use of dummySrcInfo.
- forgot to change initialSctx in foreach statement to foreachSctx
Added negative tests.
The main idea of the approach is to keep a map from variable names to either a value or undefined:
This map is stored in the statement context, and it is used to track the value that variables have so far in the program. We say that a variable is "undetermined" if either:
The approach keeps track of the value of each variable so far. For example, in this program snippet:
After line During this trace analysis, a variable can become undetermined mainly because of the following reasons:
I'll explain each case now.
Case 1
Consider this function:
After line
Because this makes the process of merging different branches in the code easier (as will be explained in Case 3). After line
Case 2
Consider this program:
After line
Case 3
If control flow branches and then joins, then the binding maps of each branch will merge at the joint point using the following rule:
This is best illustrated with an example. Consider this function:
Control flow at We would compute the map at Hence, at Sometimes the analyzer is able to determine that a particular branch will always be taken. In those cases, instead of merging the binding maps at the joint point, the analyzer simply takes the map of the executed branch, by using the following rule:
For example, in the above program, condition
Then, the bindings map at One last important note. Observe that the rules always start by stating:
variables
Handling loops
So far, I haven't talked about how loops are handled by the analyzer. Consider the following function:
There are two possible branches at If we follow branch So, in the above example, the map at If instead, we have the program:
Then, the map at The above examples suggest a general procedure to handle the branch inside the loop:
As was the case with conditionals, sometimes the analyzer is able to determine if a loop will execute or not. In that case, it will take the binding map of the corresponding branch. For example, consider this function:
In this case, the condition
Some pointers to the code
Partial commit (does not compile).
…elated to variable tracing to a test folder inside interpreters folder (since constant propagation analyzer is an instance of interpreter). However, one test case is still failing, but it does not have to do with the analyzer. Will merge with main to see if it solves the issue.
Please DO NOT review this PR yet. It still lacks some code documentation. I will also add some modifications to the handling of loops. I will write a comment in the morning explaining the rationale. |
The current code in the PR still implements the analysis of loops as I described them in the looooong comment I did above. I was still in doubt that loop analysis using ASTs could be done by carrying out iterations because I was trapped in the idea of the naive approach that I describe in these new notes below. But it turns out that there is a way to modify the naive approach to actually make it work (as I describe in these notes as well). So, yes, you were correct @anton-trunov. I just needed to carry out the proofs to convince myself =). I will incorporate this way of analyzing the loops later, since now I will switch to reviewer mode. The change should not take too long because all the infrastructure is already in the PR. The following notes are just a way for myself to remember (and to convince myself of) the details of the idea and to explain to the reviewers the rationale of the analyzer.
The semilattice of environments
Denote by
It assigns to An environment is a partial function from For example, in function The analyzer uses the following join operation on environments (denoted by the symbol In other words, The rationale for using such join has to do with how branches are joined during analysis. For example, consider this function with a single conditional:
Since it is not possible to know at compile time the value of The For example,
In other words, Also, note that,
Handling loops
Naive Approach (Fails, but gives insight into what needs to be done)
Given a loop like:
Suppose for a moment that the value of condition where where
So, if the sequence of computations: eventually stabilizes, we get our join for the loop. But notice that in our semilattice, we have Even though it is possible to prove that this increasing sequence eventually stabilizes (i.e., there is a step in the sequence from which the environment never changes again), the sequence may have several "fake stabilization" periods in which the sequence seems to have stabilized for several steps before abruptly changing into another "fake stabilization" period. Therefore, it is very difficult (if not impossible) to detect whether we are in the final stabilization period or within a fake one, which may lead to a premature stop of the iterative process. The above problem is best illustrated with an example. Consider this function:
We will have the following environments after the respective iterations:
So, the increasing sequence will be:
Note that from the second step to the sixth step, the sequence seemed to have stabilized (i.e., it is in a "fake stabilization" period), and it would be unsound to stop the iterative process before the seventh step because we would incorrectly conclude that The problem with this naive approach is that the information computed by the joins is never given to the environments computed during the iterations. For example, the data
Second Approach (Successful)
The idea is that at each step we should take the result of the join of the previous step as input to the next iteration of the loop. More specifically:
In other words, we are computing the following sequence of environments:
The notation can be improved by introducing the following function: together with the superindex notation So, the sequence becomes:
Therefore, if this sequence eventually stabilizes, we would get our result. Note that requiring a stabilized sequence is the same as requiring that the sequence eventually reaches a fixpoint for function In fact, it is possible to prove that the above sequence is actually an increasing sequence that reaches a fixpoint for eventually reaches a fixpoint for The fact that the above result works for any starting environment is actually very important, because when starting the analysis of a loop, we are in some arbitrary environment. Moreover, the fact that the sequence reaches an environment I will not write the proofs in this note, but I will illustrate how this approach behaves with the problematic function described in the naive approach. Here is the function again for convenience:
We will have the following environments after the respective steps:
The big question here is: is this analysis too restrictive in the sense that it will assign to a lot of variables the value
Here, a human programmer can see that since In spite of the above, I think that this analysis is enough to cover the FunC errors. |
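To make the second approach concrete, here is a minimal sketch of the iteration, not the PR's implementation. It assumes that an undetermined variable is simply absent from the environment map, that values are plain numbers, and that a hypothetical `body` function stands in for the abstract execution of the loop body.

```typescript
// Assumption: a variable that is absent from the map is "undetermined".
type Env = Map<string, number>;

// Join keeps a binding only when both environments agree on its value.
function joinEnv(a: Env, b: Env): Env {
    const result: Env = new Map();
    for (const [name, value] of a) {
        if (b.has(name) && b.get(name) === value) {
            result.set(name, value);
        }
    }
    return result;
}

function sameEnv(a: Env, b: Env): boolean {
    if (a.size !== b.size) return false;
    for (const [name, value] of a) {
        if (!b.has(name) || b.get(name) !== value) return false;
    }
    return true;
}

// Feed the joined environment back into the next iteration until nothing changes.
// Each join can only drop bindings, so the loop stops after finitely many steps.
function analyzeLoop(start: Env, body: (env: Env) => Env): Env {
    let current = start;
    for (;;) {
        const next = joinEnv(current, body(current));
        if (sameEnv(next, current)) return current;
        current = next;
    }
}

// Hypothetical loop body: x = x + 1; y = 5;
const loopBody = (env: Env): Env => {
    const out: Env = new Map(env);
    const x = env.get("x");
    if (x === undefined) {
        out.delete("x"); // x + 1 is undetermined when x is undetermined
    } else {
        out.set("x", x + 1);
    }
    out.set("y", 5);
    return out;
};

const afterLoop = analyzeLoop(new Map([["x", 0], ["y", 5], ["z", 7]]), loopBody);
```

In this run `x` becomes undetermined after the first join, while `y` and `z` keep their values, and the second step already confirms the fixpoint.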
@jeshecdom btw, the traditional approach to assigning meanings to the lattice elements is as follows: "bottom" means "unreachable" or some kind of type error (like not a number if you are only tracking numbers in your analysis) and you also want the top element, which in this case would mean "any value", so your (semi-)lattice would look like
|
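A flat lattice of this traditional kind can be sketched as below. The type and function names are hypothetical; the only things taken from the comment are the meanings of "bottom" (unreachable) and "top" (any value).

```typescript
// bottom: unreachable; value: exactly one known constant; top: any value.
type Flat<T> =
    | { kind: "bottom" }
    | { kind: "value"; value: T }
    | { kind: "top" };

function joinFlat<T>(a: Flat<T>, b: Flat<T>): Flat<T> {
    if (a.kind === "bottom") return b; // bottom is the identity of join
    if (b.kind === "bottom") return a;
    if (a.kind === "top" || b.kind === "top") return { kind: "top" };
    // Two constants: they stay a constant only if they agree.
    return a.value === b.value ? a : { kind: "top" };
}

const bottom: Flat<number> = { kind: "bottom" };
const one: Flat<number> = { kind: "value", value: 1 };
const two: Flat<number> = { kind: "value", value: 2 };
const top: Flat<number> = { kind: "top" };
```

With this shape, joining two branches is total: there is no need to special-case missing bindings, because "unknown" is an explicit element.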
Thank you! I will attempt to restate the problem in the more traditional way. So, in the traditional way, you start by describing a complete lattice on the values first, and then extend it to the environments, I see. But this would imply that in my environment map |
}

public interpretModuleItem(ast: AstModuleItem): void {
export abstract class InterpreterInterface<T> {
OOP in TypeScript, and abstract in particular, is heavily unsound. If that wasn't bad enough, it is broken even in JavaScript, where o.f() and (0, o.f)() have different meanings, and to add insult to injury, this kind of problem is not checked by TypeScript at all.
An abstract interpreter is best expressed using the "tagless final" pattern. Here's a simple example.
Imagine we have a language of numbers and addition. We define
interface Expr<T> {
    num: (value: number) => T;
    add: (left: T, right: T) => T;
}
and can now express 1 + (2 + 3) as
const example = <T>({ num, add }: Expr<T>) => add(num(1), add(num(2), num(3)));
Then we can define concrete interpreters for Expr:
const show: Expr<string> = {
    num: (value) => `${value}`,
    add: (left, right) => `${left} + ${right}`,
};
const run: Expr<number> = {
    num: (value) => value,
    add: (left, right) => left + right,
};
The only weird thing here is that whenever we apply an interpreter to a value, we are actually applying the value to the interpreter: example(show). If that makes anyone uncomfortable, it's easy to add another wrapper
const showAlg: Expr<string> = { ... };
const show = (value: (algebra: Expr<string>) => string): string => value(showAlg);
From a theory side of things, this is just a regular encoding for existential type through double negation of universal quantification.
Now to the more obscure parts of this encoding. You may have noticed that show would display 1 + 2 + 3 instead of 1 + (2 + 3), and we'll make a strawman issue out of it. Let's add some context that keeps a boolean flag saying whether a subexpression should be wrapped in parens.
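const showAlg: Expr<(topLevel: boolean) => string> = {
    num: (value) => () => `${value}`,
    add: (left, right) => (topLevel) => {
        // subexpressions are never top-level, so they get parenthesized
        const tmp = `${left(false)} + ${right(false)}`;
        return topLevel ? tmp : `(${tmp})`;
    },
};
const show = (
    value: (algebra: Expr<(topLevel: boolean) => string>) => (topLevel: boolean) => string,
): string => value(showAlg)(true);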
That kind of Reader monad pattern comes up most of the time we use a tagless final encoding, and in fact we very often want to add a few other values to our context. To avoid the tedium of adjusting every single place in the program, it's best to make the context an object the first time the code is implemented
type FooContext = { flag: boolean }; // only one... for now
type Foo = (ctx: FooContext) => string;
const fooAlg: Expr<Foo> = { ... };
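As an illustration of the context-object variant, the parenthesizing printer could be written as follows. This is a sketch: `ShowContext`, `ShowFn`, `showWithContext`, and `sample` are names invented here, and `ExprAlg` is a local copy of the `Expr` interface from earlier in this comment.

```typescript
// Local copy of the Expr<T> algebra, renamed to keep the block self-contained.
interface ExprAlg<T> {
    num: (value: number) => T;
    add: (left: T, right: T) => T;
}

// The context is an object, so adding a field later does not touch call sites.
type ShowContext = { topLevel: boolean };
type ShowFn = (ctx: ShowContext) => string;

const showWithContext: ExprAlg<ShowFn> = {
    num: (value) => () => `${value}`,
    add: (left, right) => (ctx) => {
        // children inherit the context, except that they are no longer top-level
        const inner: ShowContext = { ...ctx, topLevel: false };
        const text = `${left(inner)} + ${right(inner)}`;
        return ctx.topLevel ? text : `(${text})`;
    },
};

const sample = <T>({ num, add }: ExprAlg<T>): T =>
    add(num(1), add(num(2), num(3)));

const rendered = sample(showWithContext)({ topLevel: true });
// rendered is "1 + (2 + 3)"
```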
The best part of the pattern (and in fact the reason it was initially invented) comes from modularity of these interpreters. We have both sequential and parallel composition here:
const showButAll1: Expr<string> = {
    ...show,
    num: () => '1',
};
const withLogging = <T>({ num, add }: Expr<T>): Expr<(log: (message: string) => void) => T> => ({
    // log every literal before delegating to the wrapped algebra
    num: (value) => (log) => { log(value.toString()); return num(value); },
    add: (left, right) => (log) => add(left(log), right(log)),
});
const showWithLog = withLogging(show);
const translate = <T>(algebra: Expr1<T>): Expr2<T> => ({ ... });
const biinterpret = <T, U>(algebra1: Expr<T>, algebra2: Expr<U>): Expr<[T, U]> => ({ ... });
and also we can extend (or even reduce) the language without modifying any of the previously working code. Together these two solve the expression problem
interface ExtendedExpr<T> extends Expr<T> {
    mul: (left: T, right: T) => T;
}
const showExtended: ExtendedExpr<string> = {
    ...show,
    mul: (left, right) => `${left} * ${right}`,
};
// none of original code was changed
Another place where the pattern excels is forcing the implementer of an interpreter to list all the properties of AST nodes, provided that by agreement (or linter rule) all new properties are added as first arguments. When another field is introduced to an AST node, it's usually way too easy to forget to handle it somewhere.
I suspect biinterpret(evaluate, generate) is somewhat close to what is going on in this PR (though I haven't read it properly).
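The body of biinterpret is elided above; one plausible way to fill it in, offered only as a sketch, is to run both algebras side by side and pair their results. `ExprAlg`, `showAlgebra`, `runAlgebra`, and `sample` are local stand-ins for the `Expr`, `show`, `run`, and `example` definitions earlier in this comment.

```typescript
// Local copy of the Expr<T> algebra.
interface ExprAlg<T> {
    num: (value: number) => T;
    add: (left: T, right: T) => T;
}

const showAlgebra: ExprAlg<string> = {
    num: (value) => `${value}`,
    add: (left, right) => `${left} + ${right}`,
};
const runAlgebra: ExprAlg<number> = {
    num: (value) => value,
    add: (left, right) => left + right,
};

// Parallel composition: interpret once, get both results as a pair.
const biinterpret = <T, U>(a: ExprAlg<T>, b: ExprAlg<U>): ExprAlg<[T, U]> => ({
    num: (value) => [a.num(value), b.num(value)],
    add: ([l1, l2], [r1, r2]) => [a.add(l1, r1), b.add(l2, r2)],
});

const sample = <T>({ num, add }: ExprAlg<T>): T =>
    add(num(1), add(num(2), num(3)));

const [text, total] = sample(biinterpret(showAlgebra, runAlgebra));
// text is "1 + 2 + 3" and total is 6
```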
When this pattern does not work that well:
- If we need any operations with more than 1 argument (for example, equality), it takes huge effort to encode it, and even then we're guaranteed (proofs omitted) to spend at least O(N^2) operations. I wish Kiselyov had mentioned it in the original paper instead of a passing mention in related source code on his blog :/
- If we have to consider more than one node at a time. Usually it signifies that we have our AST types wrong, but we already have a codebase with whatever AST is there, and I suspect you're not a fan of the idea of refactoring it. It's definitely possible to use the context to keep information about the nodes around the current one. Like, when we want to do something on a node of type B inside a node of type A, we just add an insideA: boolean flag. It's just a bit tedious, so the general rule is not to use tagless final whenever lots of pattern-matching is expected. I should probably mention that there is a Boehm-Berarducci encoding for arbitrary pattern-matching, but it doesn't look very practical to me.
- The approach is best suited for an "everything is an expression" kind of language. If there are AST types with drastically different models (expressions and statements, for example), you'd have types of the Expr<E> & Stmt<S, E> kind (expressions get interpreted as E, statements as S, and statements also have to know what expressions are modelled as). If the set of concrete interpreters is expected to have a lot of different models for different AST nodes, the types can get quite peculiar.
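To show what the Expr<E> & Stmt<S, E> shape looks like in practice, here is a small sketch with a hypothetical statement algebra (`assign` and `seq` are invented for illustration and are not taken from the PR).

```typescript
interface ExprAlg<E> {
    num: (value: number) => E;
    add: (left: E, right: E) => E;
}
// Statements are interpreted as S but must know how expressions are modelled (E).
interface StmtAlg<S, E> {
    assign: (name: string, value: E) => S;
    seq: (first: S, second: S) => S;
}
type Lang<S, E> = ExprAlg<E> & StmtAlg<S, E>;

// x = 1; y = 2 + 3;
const program = <S, E>({ num, add, assign, seq }: Lang<S, E>): S =>
    seq(assign("x", num(1)), assign("y", add(num(2), num(3))));

// Pretty-printer: both expressions and statements become strings.
const pretty: Lang<string, string> = {
    num: (value) => `${value}`,
    add: (left, right) => `${left} + ${right}`,
    assign: (name, value) => `${name} = ${value};`,
    seq: (first, second) => `${first} ${second}`,
};

// Evaluator: expressions become numbers, statements become store updates.
type Exec = (env: Map<string, number>) => void;
const evaluate: Lang<Exec, number> = {
    num: (value) => value,
    add: (left, right) => left + right,
    assign: (name, value) => (env) => { env.set(name, value); },
    seq: (first, second) => (env) => { first(env); second(env); },
};

const printed = program(pretty);
const store = new Map<string, number>();
program(evaluate)(store);
```

Even in this tiny case the evaluator needs two unrelated models (`number` and `Exec`), which hints at how the types grow once more node categories are involved.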
Now that I have described in some detail what might be done here, let's come back to the topic of OOP and where it makes sense to use it (I mean, anything that has class or new in it).
The short answer would be "never", but it's not entirely true. With prototypes, JS holds only one object with all the method pointers, whereas when we do not use a class, we have to create a closure object for every method in every instance of the "class". In fact, it is exceptionally rare for an object to hold more than a couple of methods (very likely they should have been just standalone functions), or for the extra memory usage and runtime overhead to outweigh the extra soundness. It usually happens in code that operates on billions of objects.
The short answer would be "never", but it's not entirely true
Sorry to interrupt your thoughts, but I want to add that a minimal amount of OOP is very handy when dealing with the Object Pool pattern, where you minimize new memory allocations and greatly reduce GC spikes by reusing already created objects. For those objects, having a method like .reset() that operates on the fields is super handy :)
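For readers unfamiliar with the pattern, a minimal object pool might look like the sketch below. `Particle` and `Pool` are hypothetical names; the only element taken from the comment is the `reset()` method that clears the fields before an object is reused.

```typescript
class Particle {
    x = 0;
    y = 0;
    // Restore the initial state so the instance can be reused safely.
    reset(): void {
        this.x = 0;
        this.y = 0;
    }
}

class Pool<T extends { reset(): void }> {
    private readonly free: T[] = [];
    private readonly create: () => T;
    constructor(create: () => T) {
        this.create = create;
    }
    // Reuse a released object if one is available; allocate only otherwise.
    acquire(): T {
        return this.free.pop() ?? this.create();
    }
    release(item: T): void {
        item.reset();
        this.free.push(item);
    }
}

const pool = new Pool(() => new Particle());
const first = pool.acquire();
first.x = 5;
pool.release(first);
const second = pool.acquire(); // same instance as `first`, with fields reset
```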
Added lattice for values: this simplified join operations.
I still need to explore the refactoring suggested by @verytactical. My main worry right now is that I need to add more tests for the constant propagation analyzer. So I will probably add the tests first and then start exploring the suggested refactoring. |
…backward compatibility.
…for backwards compatibility.
… for backwards compatibility.
Looking at you fixing those backwards compatibility tests, I believe it would be nice to resolve #1019 for us not to stumble upon features not in Node.js 18 when writing code |
…Crossing fingers now :)
How should we proceed? I have no idea what to do :) |
Fret not, that's on @verytactical |
…d cancel branches during joining. - Added a lot more tests to cover return statements, struct joining, short-circuiting in && and ||, ternary conditional operator ? _ : _, static calls, null dereferencing and integer overflow.
…viously it received an array of environments). Makes the code simpler. There is no place in the code where more than two environments are joined.
Issue
Closes #716.
The solution is able to detect not only division-by-zero problems, but any kind of problem that depends on variable tracing, such as null dereferences and number overflows, although I still need to add tests for all the other possibilities.
Checklist