---
title: Using OpenAI Codex for Gradual Type Inference
subtitle: Final Report
author: Federico Cassano and Noah Shinn
geometry: left=3cm,right=3cm,top=3cm,bottom=3cm
output: pdf_document
---
This report was written before publication and served as our final submission for our AI class at Northeastern University. Our final work has since diverged from this report and can be found on arXiv here.
The work in this report was evaluated using OpenAI's Codex, which is now a defunct product. It was also evaluated with our old evaluation harness, which has since been replaced by a much better multi-threaded evaluation framework that can be found here.
Type inference for gradually-typed languages such as TypeScript and Python has become increasingly prevalent in the field of programming languages. However, current approaches often struggle to infer descriptive types when user-defined type annotations are absent, especially when inferring function signatures. On our dataset, TypeScript's built-in inference procedure was able to correctly type-infer only 59% of the files. Furthermore, the quality of its type annotations was low: the types were too permissive, which can lead to an increased number of dynamic type errors. This makes the built-in procedure ineffective in practice. In this report, we show an effective use of large language models to aid these type inference procedures. Our approach statically inserts type holes to generate a prompt, which a language model then edits. This report mainly uses Codex's code-davinci-edit model for TypeScript type inference; we also explore other language models such as Facebook's InCoder. Across our dataset, we were able to type-infer 91% of the files with descriptive, high-quality type annotations.
\newpage
Given this untyped file:

```ts
const kClosest = (points, K) => {
  let len = points.length,
    l = 0,
    r = len - 1
  while (l <= r) {
    let mid = helper(points, l, r)
    if (mid === K) break
    if (mid < K) {
      l = mid + 1
    } else {
      r = mid - 1
    }
  }
  return points.slice(0, K)
}

function helper(A, l, r) {
  ... // omitted for length (uses compare on an element of A)
}

function compare(p1, p2) {
  return p1[0] * p1[0] + p1[1] * p1[1] - p2[0] * p2[0] - p2[1] * p2[1];
}
```
TypeScript's inference procedure gives this completion:

```ts
const kClosest: (
  points: any,
  K: any
) => any = (points, K) => {
  let len: any = points.length,
    l: number = 0,
    r: number = len - 1;
  while (l <= r) {
    let mid: any = helper(points, l, r);
    ... // omitted for length
  }
};

function helper(A: any, l: any, r: any): any;

function compare(p1: any, p2: any): number;
```
While the completion from TypeScript's inference procedure may successfully type-check, note the overuse of the `any` type annotation. Consider, for example, the `any` annotation on the `points` parameter. On the first line of the function, `points.length` is accessed, so it would seem rational to annotate `points` with `any[]`. However, the `kClosest` function could also be given `{ length: 3 }` as the `points` argument, and `{ length: 3 }` is clearly not an `any[]`. A deterministic type inference procedure must account for this case, so annotating `points` with `any` is the most rational decision it can make. This problem motivates the use of a probabilistic approach to overcome such cases.
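To make this concrete, here is a call (our illustration, not part of the report's dataset) that the permissive `any` signature admits but that fails at runtime:

```ts
// With points typed as any, this call type-checks, yet points.slice is
// undefined at runtime, so the program crashes dynamically.
kClosest({ length: 3 }, 1);

// Had points been annotated as [number, number][], the same call would have
// been rejected statically.
```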
As an initial approach, we could feed the program to a language model that can edit a given prompt. We would instruct the model to insert type annotations into the untyped TypeScript program, and the model would output a set of candidate completions. We would then type-check each completion to find those that pass.
We have employed the approach described above in the following manner:
- We insert the identifier `_hole_` in place of missing types in our input JavaScript program. To do this, we use a compiler $\mathcal{K} : \text{File} \rightarrow \mathcal{P}$.
- We define an instruction $\mathcal{I}$, which is the constant string:
  $\mathcal{I} = \text{"Substitute the identifier \_hole\_ with the correct type."}$
- We query the `davinci-edit` model using the compiled prompt $\mathcal{P}$ and instruction $\mathcal{I}$. We receive back a set of completions $\mathcal{C}$, $0 \leq |\mathcal{C}| \leq n$, where $n$ is a pre-defined maximum number of completions.
- We use a cheap and admissible heuristic $h : c \rightarrow (\text{Boolean},\ \mathcal{N})$ that determines whether a given completion $c$ is plausibly correct (such a completion may still fail to type-check) and scores the quality $q$ of its type annotations, where lower $q$ is better.
- We apply $h$ to every element of $\mathcal{C}$ and add the completions that produced $\text{True}$ to an ordered set $\mathcal{Q}$ sorted by $q$.
- Using the command `tsc --allowJs --checkJs --noEmit --target es2022 <file.ts>`, we run a full type-check on every completion in $\mathcal{Q}$, terminating as soon as the command returns exit code 0, meaning that the completion is type-correct. By terminating early on the sorted set, we guarantee that our solution is optimal with respect to $\mathcal{Q}$. We let $c^*$ be the optimal type-correct solution.
- We produce $c^*$ if it exists; otherwise we produce an error.
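As a concrete sketch of the query and type-check steps (our illustration in TypeScript for Node 18+, using the edits endpoint Codex exposed before its retirement; error handling omitted):

```ts
import { execSync } from "child_process";
import { writeFileSync } from "fs";

// Query the (now retired) edits endpoint for up to n completions of a prompt
// containing _hole_ identifiers.
async function queryEdits(prompt: string, n: number): Promise<string[]> {
  const res = await fetch("https://api.openai.com/v1/edits", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "code-davinci-edit-001",
      input: prompt,
      instruction: "Substitute the identifier _hole_ with the correct type.",
      n,
      temperature: 0.8,
    }),
  });
  const json: any = await res.json();
  // Some of the n requested completions may be missing, hence 0 <= |C| <= n.
  return (json.choices ?? []).map((c: { text: string }) => c.text);
}

// Full type-check of one candidate: exit code 0 means it is type-correct.
function typeChecks(completion: string): boolean {
  writeFileSync("/tmp/candidate.ts", completion);
  try {
    execSync("tsc --allowJs --checkJs --noEmit --target es2022 /tmp/candidate.ts");
    return true;
  } catch {
    return false;
  }
}
```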
Using this strategy, we were able to type-infer our motivating example as follows:

```ts
const kClosest: (
  points: [number, number][],
  K: number
) => [number, number][] = (points, K) => {
  let len: number = points.length,
    l: number = 0,
    r: number = len - 1;
  while (l <= r) {
    let mid: number = helper(points, l, r);
    ... // omitted for length
  }
};

function helper(A: [number, number][], l: number, r: number): number;

function compare(p1: [number, number], p2: [number, number]): number;
```
We believe that these types are as descriptive as they can get. The `compare` function only ever accesses the first two elements of `p1` and `p2`, so a pair type is a good choice.
However, this approach does not scale to large inputs, such as longer functions or files, for several reasons. First, the Codex model has a maximum input limit of 2048 tokens. Second, the quality of the infilled type annotations declines as the size of the completion context increases.
An obvious way to solve this problem is to split the original program into chunks, send each chunk to the Codex model, and reassemble the completed chunks into a set of candidate completions. A systematic way of chunking the program is to create a tree of code blocks in which each level represents an inner scope of the level above. With this approach, we lose the context of other code blocks that may be referenced in the current code block. To alleviate this problem, we include the signatures of child scopes in the current scope, and we include usages of identifiers that are defined in the current scope but used in other scopes. Our step-by-step algorithm is explained below:
- We create a tree of nested code blocks $T$ from our JavaScript program, where the top level is the root of the tree and each function, class, or method is a node below its parent block. A node is identified by $\text{id-}\alpha$, the alpha-renamed name of its function/class/method. The root node's id is defined as $\text{id-}\alpha = \text{root}$.
- For each $\text{id-}\alpha$, we find usages of the identifier and save them in the node for prompt-engineering purposes.
- We traverse the tree bottom-up, starting from the leaves, and process each level in parallel (see the sketch after this list). To each node we apply the following steps:
  - If the current node is not a leaf, we extract the completions of all its direct children and type-weave the children's types into this node. This enumerates every combination of the direct children's completions; for each combination, we create a prompt $\mathcal{P}_i$. We use the type-weaving procedure $\mathcal{W} : \text{Original File},\ \text{Nettle File} \rightarrow \text{Resulting File}$.
  - For each prompt, we insert the saved usages of $\text{id-}\alpha$ at the bottom of the prompt to help Codex infer the types.
  - For each prompt, we insert the identifier `_hole_` in place of missing types. To do this, we use the compiler $\mathcal{K} : \text{File} \rightarrow \mathcal{P}$.
  - We define an instruction $\mathcal{I}$, which is the constant string:
    $\mathcal{I} = \text{"Substitute the identifier \_hole\_ with the correct type."}$
  - For each prompt computed for the node, we query the `davinci-edit` or `incoder` model using the prompt $\mathcal{P}_i$ and instruction $\mathcal{I}$. We receive back a set of completions $\mathcal{C}_{\text{id-}\alpha}$, $0 \leq |\mathcal{C}_{\text{id-}\alpha}| \leq n$, where $n$ is a pre-defined maximum number of completions.
  - We use a cheap and admissible heuristic $h : c \rightarrow (\text{Boolean},\ \mathcal{N})$ that determines whether a given completion $c$ is plausibly correct (such a completion may still fail to type-check) and scores the quality $q$ of its type annotations, where lower $q$ is better.
  - We apply $h$ to every element of $\mathcal{C}_{\text{id-}\alpha}$ and add the completions that produced $\text{True}$ to an ordered set $\mathcal{Q}_{\text{id-}\alpha}$ sorted by $q$.
- Using the command `tsc --allowJs --checkJs --noEmit --target es2022 <file.ts>`, we run a full type-check on every completion in $\mathcal{Q}_{\text{root}}$, terminating as soon as the command returns exit code 0, meaning that the completion is type-correct. By terminating early on the sorted set, we guarantee that our solution is optimal with respect to $\mathcal{Q}_{\text{root}}$. We let $c^*$ be the optimal type-correct solution.
- We produce $c^*$ if it exists; otherwise we produce an error.
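As referenced above, here is a sketch of the level-parallel, bottom-up traversal (our TypeScript illustration; `completeNode` is a hypothetical stand-in for the prompt-building and querying steps, and the real implementation lives in our Rust client):

```ts
// Nodes are grouped by depth; each level is processed concurrently,
// deepest level first, so children are completed before their parents.
interface TreeNode {
  idAlpha: string;
  children: TreeNode[];
  completions: string[];
}

async function traverseBottomUp(root: TreeNode): Promise<void> {
  // Group nodes by depth.
  const levels: TreeNode[][] = [];
  const walk = (node: TreeNode, depth: number) => {
    (levels[depth] ??= []).push(node);
    node.children.forEach((c) => walk(c, depth + 1));
  };
  walk(root, 0);

  // Deepest level first; all nodes in one level are queried in parallel.
  for (const level of levels.reverse()) {
    await Promise.all(level.map((node) => completeNode(node)));
  }
}

// Hypothetical stand-in for weaving, hole insertion, and model querying.
declare function completeNode(node: TreeNode): Promise<void>;
```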
We have implemented the compiler $\mathcal{K}$, which replaces each missing type annotation with the `_hole_` identifier. For instance, given this input:

```ts
function hello(name: string) {
  let msg = "Hello";
  return msg + ", " + name + "!";
}
```

the compiler produces:

```ts
function hello(name: string): _hole_ {
  let msg: _hole_ = "Hello";
  return msg + ", " + name + "!";
}
```
We have implemented our heuristic $h$ to score the type annotations that appear in a completion according to the following table:
| Type | Score | Correct |
|---|---|---|
| missing (example: `let x = 1`) | N/A | False, terminate |
| literal type (example: `let x: 3 = 1`) | N/A | False, terminate |
| `_hole_` | N/A | False, terminate |
| `unknown` | +10 | True, continue |
| `any` | +5 | True, continue |
| `undefined` | +2 | True, continue |
| `Function` (interface type) | +5 | True, continue |
For example, with the completion:

```ts
function hello(name: string): any {
  let msg: undefined = "Hello";
  return msg + ", " + name + "!";
}
```

the `any` and `undefined` annotations give a summed score of $q = 7$, and the heuristic produces $\text{True}$.
In contrast, with the completion:

```ts
function hello(name: string): _hole_ {
  let msg: string = "Hello";
  return msg + ", " + name + "!";
}
```

the leftover `_hole_` type terminates the heuristic early, producing $\text{False}$.
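A sketch of $h$'s scoring logic in TypeScript (our illustration; it assumes the type annotations have already been extracted from the completion, which our tooling does in practice):

```ts
type Annotation = string | null; // null = annotation missing entirely

function h(annotations: Annotation[]): { correct: boolean; score: number } {
  let score = 0;
  for (const ann of annotations) {
    // A missing annotation, a leftover hole, or a literal type makes the
    // completion unusable: terminate early with False.
    if (ann === null || ann === "_hole_" || isLiteralType(ann)) {
      return { correct: false, score: Infinity };
    }
    // Permissive types are allowed but penalized (lower scores are better).
    if (ann === "unknown") score += 10;
    else if (ann === "any") score += 5;
    else if (ann === "undefined") score += 2;
    else if (ann === "Function") score += 5;
  }
  return { correct: true, score };
}

// Hypothetical literal-type check, e.g. the `3` in `let x: 3 = 1`.
function isLiteralType(ann: string): boolean {
  return /^-?\d+(\.\d+)?$/.test(ann) || /^["'`].*["'`]$/.test(ann);
}
```

On the two completions above, this sketch returns $(\text{True}, 7)$ and $\text{False}$, respectively.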
Additionally, because $h$ is cheap to compute, we can afford to apply it to every candidate completion before invoking the full type checker.
We have implemented a procedure that builds the tree of nested code blocks $T$ described above. For instance, given this input:

```ts
function hello2(name) {
  const helloInner = (name) => {
    return "Hello " + name;
  };
  return helloInner(name);
}

const hello1 = (name) => {
  return "Hello " + name;
};
```
the procedure outputs the following tree:

(figure: the code-block tree, with `root` at the top, children `hello2` and `hello1`, and `hello2$helloInner` below `hello2`)
where each node in the tree is annotated using the definition:

```rust
pub struct CompNode {
    pub children_idxs: Vec<usize>, // indices of the children
    pub name: String,              // the alpha-renamed id
    pub code: String,              // the code of this node
    pub completed: Vec<String>,    // the completions of this node (starts empty)
    pub usages: String,            // the usages of the alpha-renamed id of this node
}
```
\newpage
We have implemented a type-weaving procedure $\mathcal{W}$, which transplants the type annotations of one program (the nettle) into another. For instance, the given program:

```ts
function hello(name: string): string {
  function inner(): string {
    let my_string: string = "Hello " + name;
    return my_string;
  }
  return inner();
}
```

has the following table of alpha-renamed identifiers and types:
| $\text{id-}\alpha$ | Type |
|---|---|
| `hello` | `(name: string) => string` |
| `hello$inner` | `() => string` |
| `hello$inner$my_string` | `string` |
Using this program as our nettle (the reference program that donates the types to be transplanted), we can transplant types into the following program, starting from scope `hello`:

```ts
function hello(name) {
  function inner() {
    let my_string = "Hello " + name;
    return my_string;
  }
  return inner();
}
```
We will receive back the program:

```ts
function hello(name) {
  function inner(): string {
    let my_string: string = "Hello " + name;
    return my_string;
  }
  return inner();
}
```
This allows us to ignore any types given to the `hello` function itself, while transplanting the types given to the `hello$inner` function and the `hello$inner$my_string` variable.
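A minimal sketch of the idea behind $\mathcal{W}$, assuming the nettle has already been summarized into a table like the one above (the real procedure operates on syntax trees, and `annotate` is a hypothetical helper that rewrites one declaration):

```ts
// Maps alpha-renamed identifiers (e.g. "hello$inner") to type annotations.
type TypeTable = Map<string, string>;

function weave(
  nettle: TypeTable,
  startScope: string,
  annotate: (idAlpha: string, type: string) => void
): void {
  for (const [idAlpha, type] of nettle) {
    // Transplant only identifiers strictly inside the starting scope;
    // the scope itself (here "hello") keeps its holes for re-inference.
    if (idAlpha.startsWith(startScope + "$")) {
      annotate(idAlpha, type);
    }
  }
}

// Starting from scope "hello", hello$inner and hello$inner$my_string are
// annotated while hello itself is left untouched, as in the example above.
```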
\newpage
We have implemented a client in Rust that manages the pipeline and queries the Codex API. The client communicates with the compiler $\mathcal{K}$ through the language-server protocol described below. Its command-line interface is as follows:
```text
USAGE:
    client [OPTIONS] --tokens <TOKENS> --file <FILE> --output <OUTPUT>

OPTIONS:
    -c, --cache <CACHE>          The Redis URL for the cache
        --disable-rate-limit     Whether or not to prevent rate limits.
                                 You may want to set this to false if you are using
                                 your own model. By default, we try to
                                 prevent rate limits; by using this
                                 flag you can disable this behavior
    -e, --endpoint <ENDPOINT>    The url of the codex endpoint [default:
                                 https://api.openai.com/v1/edits]
    -f, --file <FILE>            The target file path
        --fallback               Whether to fallback to "any" or not
    -h, --help                   Print help information
    -l, --lang <LANG>            The target language. Either `ts` or `py` [default: ts]
    -n, --n <N>                  The number of completions to return [default: 3]
    -o, --output <OUTPUT>        Output file directory path
    -r, --retries <RETRIES>      The number of requests to send to Codex [default: 1]
    -s, --strategy <STRATEGY>    Completion strategy. Either: {"simple": simple completion,
                                 "tree": tree completion} [default: simple]
        --stop-at <STOP_AT>      The maximum number of type-checkable completions to return
                                 [default: 1]
    -t, --tokens <TOKENS>        Codex tokens to use, separated by commas
        --temp <TEMP>            The temperature to use for the completion [default: 1]
    -V, --version                Print version information
```
Type-correct solutions will be written to the specified directory.
\newpage
We have developed a protocol, similar to Microsoft's LSP, that generalizes our type-inference strategies to any language server implementing the `LangServer` trait in the Rust client. The flow of information through the protocol is described in Figure 1. The server needs to be designed with concurrent requests in mind, as the tree strategy queries each node of the current level of the tree in parallel. Our TypeScript language server implements this protocol.
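The following TypeScript interface is our illustrative guess at the shape of the operations the protocol requires; the method names are hypothetical, not the actual trait definition:

```ts
// Hypothetical rendering of the operations a conforming language server must
// provide; the real `LangServer` trait lives in the Rust client.
interface LangServer {
  // Insert _hole_ in place of missing annotations (the compiler K).
  insertHoles(file: string): Promise<string>;
  // Build the nested code-block tree used by the tree strategy.
  toTree(file: string): Promise<unknown>;
  // Transplant types from a nettle file into the original (the weaver W),
  // starting at the given alpha-renamed scope.
  weave(original: string, nettle: string, scope: string): Promise<string>;
  // Run the full type check; resolves to true on exit code 0.
  check(file: string): Promise<boolean>;
}
```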
One limitation of using Codex is its maximum request limit of 20 requests per minute. To work around this issue, we tried a different model that we could self-host. We chose Facebook's InCoder model, as it was the only other model trained to provide edit-style completions on code.
The InCoder query system is implemented as a simple HTTP server that evaluates the InCoder model on incoming infill requests.
Given the following prompt:

```ts
function hello(name: _hole_): _hole_ {
  let msg: string = "Hello";
  return msg + ", " + name + "!";
}
```
and the hyperparameters `max_to_generate=5`, `temperature=0.8`, and `max_retries=1` in a POST request, the server splits the input at the fake type `_hole_` to yield the following list:

```
parts_list = [
  "function hello(name: ",
  "): ",
  ' { let msg: string = "Hello"; return msg + ", " + name + "!";}',
];
```
The `parts_list` and the given hyperparameters form a prompt-infill task. The model's infills, one per hole, are collected into a `type_list`:

```
type_list = ["string", "string"];
```
Then, the `type_list` is interleaved back into the original `parts_list` to build the final completion:

```ts
function hello(name: string): string {
  let msg: string = "Hello";
  return msg + ", " + name + "!";
}
```
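A sketch of the server's split-and-rejoin step in TypeScript (our illustration; `infill` is a hypothetical stand-in for the actual InCoder generation call, and the real server is a separate HTTP service):

```ts
// Splits the prompt at the fake _hole_ type, asks the model for one infill
// per hole, and interleaves the results back into a completion.
function completeHoles(
  prompt: string,
  infill: (parts: string[]) => string[] // hypothetical model call
): string {
  const parts = prompt.split("_hole_"); // k holes yield k + 1 parts
  const types = infill(parts);          // e.g. ["string", "string"]
  let out = parts[0];
  types.forEach((t, i) => (out += t + parts[i + 1]));
  return out;
}
```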
The described process executes once per requested completion, and the server responds with a JSON object of the form:

```json
{
  "choices": [
    { "text": "<completion0>" },
    { "text": "<completion1>" },
    ...
    { "text": "<completionN>" }
  ]
}
```
where `choices` is of length at most $n$, the requested number of completions.
This API mimics the Codex API so that it can serve as a drop-in replacement; all we had to change in our client was the endpoint URL.
In production, we execute the InCoder model on NVIDIA A100 GPUs using Northeastern's Discovery Cluster.
\newpage
We randomly picked 100 files from a dataset of 1934 small-to-medium JavaScript LeetCode solutions. First, we converted the JavaScript files to TypeScript, then manually validated every file to ensure that a type-checkable solution exists; this included linting the untyped files and scanning for logical errors.
We employed a best-of-3 evaluation approach in which we ran each configuration of our client three times and saved the best outcome. We ran both the simple and tree strategies with Codex at temperatures of 0.8 and 1.0, and we tested InCoder with the tree strategy at temperatures of 0.8 and 1.0. The results are shown in Figure 2 below.
From Figure 2, we observe a significantly higher accuracy using Codex with a temperature of 0.8 (91%) compared to TypeScript's built-in static type inference procedure (59%). However, we also note very poor performance from InCoder at temperatures of 0.8 and 1.0 (14%). It is also important to note the increased average quality of the type annotations given by Codex and InCoder in successfully annotated files compared to TypeScript's built-in type inference. Our heuristic $h$ contributes to this: because candidates are ranked by annotation quality before type checking, the first type-correct completion is also the most descriptive one available.
Using a combination of prompt engineering and Codex's `code-davinci-edit` model, we were able to effectively type-infer a medium-sized dataset of TypeScript files: across our experimental dataset, we type-inferred 91% of the files. Furthermore, we observed a significant improvement in the quality of the type annotations, as the types were more descriptive and restrictive than those inferred by TypeScript's built-in type inference procedure. We found that the results from InCoder were significantly worse than those from Codex, likely because the model was not trained on typed code. Additionally, we found that our tree-based prompt-engineering strategy produced results similar to those of the simple strategy, despite the loss of context.
In future work, we plan to expand our prompt-engineering approach with additional static analysis techniques that add context. For example, we could find free variables in inner scopes and include them in the outer scope to expose dependencies. One limitation of our approach is its reliance on a commercial product, Codex, which may render it expensive and infeasible for an individual or a small research group to use. While the results from InCoder were not as good as those from Codex, with further fine-tuning and more precise prompt engineering the model could become an effective replacement for Codex, allowing our approach to be used by individuals and small research groups. Moreover, thanks to our language-server protocol, our approach could be extended to other programming languages, such as Python, Rust, and Java, by training a language model on the task of type inference for those languages. We also plan to explore more language models and techniques to improve the accuracy of the type inference procedure. We believe that our approach has the potential to improve type inference in gradually-typed languages, and we hope that our work will lead to more robust and accurate type inference procedures in the future.