
add support for translating Dart #153

Merged
merged 4 commits into nuprl:main on Sep 9, 2024

Conversation

devoncarew
Contributor

Note that I'm starting this PR as a draft since I have some open questions (and haven't finished all the testing).

My questions are mostly called out in TODO comments in the code, but briefly:

  • I'm not 100% sure what to put in for stop tokens, or what they're used for; the ones in this PR haven't been updated from the TypeScript translator
  • I see many Python terms in the prompt we're sending to the LLMs (called out in the TODOs); I suspect we should instead prompt with terms more specific to Dart
  • I'm translating Python Optional types to Dart's nullable types (a rough sketch of the mapping I mean is below)
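
For concreteness, here's roughly the mapping I have in mind. This is illustrative only; the helper name translate_type and the handful of cases shown are not the exact code in this PR:

import ast

SIMPLE_TYPES = {"int": "int", "float": "double", "bool": "bool", "str": "String"}

def translate_type(t: ast.expr) -> str:
    # Illustrative: map a Python type annotation (AST node) to a Dart type string.
    if isinstance(t, ast.Name) and t.id in SIMPLE_TYPES:
        return SIMPLE_TYPES[t.id]
    if isinstance(t, ast.Subscript) and isinstance(t.value, ast.Name):
        if t.value.id == "List":
            return f"List<{translate_type(t.slice)}>"
        if t.value.id == "Optional":
            # Python's Optional[T] becomes Dart's nullable type T?
            return f"{translate_type(t.slice)}?"
    raise Exception(f"unhandled type: {ast.dump(t)}")

# e.g. translate_type(ast.parse("Optional[List[int]]", mode="eval").body) == "List<int>?"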

@devoncarew
Contributor Author

Here's the output for python3 test.py humaneval_to_dart ../datasets/originals/HumanEval_53_add.py:

// Add two numbers x and y
// >>> add(2, 3)
// 5
// >>> add(5, 7)
// 12
int add(int x, int y) {

********************************************************************************
void main() {
  final candidate = add;

  expect(candidate(0, 1), 1);
  expect(candidate(1, 0), 1);
  expect(candidate(2, 3), 5);
  expect(candidate(5, 7), 12);
  expect(candidate(7, 5), 12);

  print('success');
}

void expect(dynamic a, dynamic b) {
  if (a == b) return;

  if (a is List && b is List) {
    expectList(a, b);
  } else if (a is Map && b is Map) {
    expectMap(a, b);
  } else {
    throw '$a != $b';
  }
}

void expectList(List a, List b) {
  if (a.length != b.length) throw 'list lengths are not equal';

  for (var i = 0; i < a.length; i++) {
    expect(a[i], b[i]);
  }
}

void expectMap(Map a, Map b) {
  if (a.length != b.length) throw 'map lengths are not equal';

  for (var key in a.keys) {
    expect(a[key], b[key]);
  }
}
********************************************************************************
['\nfunction ', '\n/*', '\n//', '\nclass']
********************************************************************************
Translation succeeded. Examine the output above to verify that it is correct.

And a DartPad snippet for the above: https://dartpad.dev/?id=8f9fe06a67328d6c82c35c202915227e

@devoncarew devoncarew marked this pull request as draft July 31, 2024 22:24
@arjunguha
Member

Let's start with the stop tokens.

Short version

Use \n} as the single stop token. This will make LLMs generate a complete function, with the exception of the final }. But, you can append that in the execution script.

Long version

  • The LLM doesn't parse, or even tokenize in the sense that a compiler writer would recognize.
  • The model will generate text indefinitely in principle. In practice, it will stop when it runs out of GPU memory / hits the configured maximum length, which for current models exceeds 16KB of text.

So, the stop tokens are the accepted hack for telling the model when to stop generating text. Since the task is to generate a top-level function, the stop tokens are the text that typically follows a top-level function. Hopefully that explains why we use \nfunction, \nclass, etc. as stop tokens for TypeScript.

But, there is a simpler approach that we later realized works for any curly-brace language: just use \n} as the stop token and add \n} before execution.
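
For what it's worth, here is a minimal sketch of how that plays out. The function names are just for illustration; the actual logic lives in the generation and evaluation scripts:

STOP_TOKENS = ["\n}"]

def truncate_at_stop(completion: str, stop_tokens=STOP_TOKENS) -> str:
    # Cut the raw LLM output at the first stop token, mimicking what the
    # generation backend does when a stop sequence is configured.
    for stop in stop_tokens:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    return completion

def build_program(prompt: str, completion: str) -> str:
    # Re-attach the final '}' that the stop token swallowed; the test harness
    # is appended after this, before the program is executed.
    return prompt + truncate_at_stop(completion) + "\n}\n"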

@arjunguha
Member

The prompt terminology stuff: as mentioned in the MultiPL-E paper, it hardly matters. The LLMs are very robust to terminology, especially on high-resource languages (and Dart probably is high-resource).

However, you can add Dart terms here: https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/terms.csv
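
If it helps, here is the kind of substitution I mean, as a rough sketch. I'm assuming a CSV with one row per Python term and one column per target language; the column names below are guesses, the real layout is whatever the header in terms.csv says, and the real rewording logic is in generic_translator.py:

import csv
import re

def load_terms(path: str, lang: str) -> dict:
    # Assumed layout: a "term" column with the Python word, plus one column per
    # target language. Check the actual header in terms.csv.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {row["term"]: row[lang] for row in rows if row.get(lang)}

def reword(doc: str, terms: dict) -> str:
    # Replace whole-word Python terms (e.g. "list", "dictionary") with the
    # target-language equivalents (e.g. "List", "Map") in a prompt's docstring.
    for py_term, target_term in terms.items():
        doc = re.sub(rf"\b{re.escape(py_term)}\b", target_term, doc)
    return doc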

@devoncarew
Contributor Author

But, there is a simpler approach that we later realized works for any curly-brace language: just use \n} as the stop token and add \n} before execution.

Gotcha, thanks for the explanation; I updated the PR to use the closing curly as the stop token.

@devoncarew
Contributor Author

devoncarew commented Aug 1, 2024

The prompt terminology stuff: as mentioned in the MultiPL-E paper, it hardly matters. The LLMs are very robust to terminology, especially on high-resource languages (and Dart probably is high-resource).

However, you can add Dart terms here: https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/terms.csv

Gotcha. I do have the terms added for Dart in that file, but I don't see any translation when running something like 'test.py humaneval_to_dart ../datasets/originals/HumanEval_136_largest_smallest_integers.py'; perhaps that's not where the translation should show up. If this does look configured correctly, it may be worth double-checking the logic in https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/generic_translator.py#L273. In any case, I understand that translating the Python terms to Dart isn't critical.

elif "SyntaxError" in r.stderr:
status = "SyntaxError"
elif "ReferenceError" in r.stderr:
status = "ReferenceError"
Member

At this point, for benchmarking, it only really matters that you classify each run as pass or fail. The MultiPL-E paper had a finer-grained analysis of the types of errors, which is what this code was for. So, this fine-grained categorization is optional at this point.
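
A minimal pass/fail version could look something like this (a sketch, assuming r is the subprocess.CompletedProcess from running the generated Dart program; the finer-grained categories can stay or be dropped):

import subprocess

def classify(r: subprocess.CompletedProcess) -> str:
    # Coarse classification: for the benchmark numbers, only pass vs. fail matters.
    if r.returncode == 0:
        return "OK"
    if r.returncode == 124:  # exit code used by `timeout` when the run is killed
        return "Timeout"
    return "Exception"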

@devoncarew
Contributor Author

So, I was able to spot-check output against the list here: #152 (comment), and things seemed reasonable (the output analyzed correctly when given a stub implementation for the method under test).

What's a good way to do some end-to-end testing? To drive this against a specific LLM and single file? Or, an LLM and one run of all the HumanEval benchmark scripts? I'd like to get a bit more confidence that this translation is generating reasonable code / the benchmark results would be reliable.

@arjunguha
Member

You can create a local dataset of Dart prompts like this:

cd MultiPL-E/dataset_builder
python3 prepare_prompts_json.py \
     --lang humaneval_to_dart.py \
     --doctests transform \
     --prompt-terminology reworded \
     --output ../dart_prompts.jsonl

When I try this, I get a bunch of errors. Maybe something is not yet pushed?

@arjunguha
Member

arjunguha commented Aug 5, 2024

Here is what I've done:

  1. I have this patch to your code, which I think is fine:

    diff --git a/dataset_builder/humaneval_to_dart.py b/dataset_builder/humaneval_to_dart.py
    index f675be8..23de993 100644
    --- a/dataset_builder/humaneval_to_dart.py
    +++ b/dataset_builder/humaneval_to_dart.py
    @@ -82,11 +82,7 @@ class Translator:
     
         def translate_prompt(self, name: str, args: List[ast.arg], returns, description: str) -> str:
             global needs_hashmap
    -        description = (
    -            "// " + re.sub(DOCSTRING_LINESTART_RE + "\n",
    -            "// ",
    -            description.strip()) + "\n",
    -        )
    +        description = "//" + re.sub(DOCSTRING_LINESTART_RE, "\n// ", description.strip()) + "\n"
             # Store this for later coercions on tests
             needs_hashmap = False
             self.type = [[arg.annotation for arg in args], returns]
  2. Build a JSON file with all translated HumanEval prompts. The settings below
    are the ones that have been established for the benchmark: translating
    doctests and rewording Python terminology to the target language.

    cd MultiPL-E/dataset_builder
    python3 prepare_prompts_json.py \
      --lang humaneval_to_dart.py \
      --doctests transform \
      --prompt-terminology reworded \
      --output ../dart_prompts.jsonl
    

    You get a log with a bunch of errors. Some are expected: not every prompt
    will translate to a strictly typed language. My log says
    150 / 161 problems translated. This is okay, but it is on the low end for a
    translation rate. E.g., I see that TypeScript has 159 / 161 problems translated
    (https://huggingface.co/datasets/nuprl/MultiPL-E).

  3. With a GPU, generate completions for some model. I ran StarCoder2-15B:

    cd MultiPL-E
    python3 automodel_vllm.py \
         --name bigcode/starcoder2-15b \
         --root-dataset humaneval \
         --use-local \
         --dataset ./dart_prompts.jsonl \
         --temperature 0.2 \
         --batch-size 50 \
         --completion-limit 50 \
         --output-dir-prefix out
    

    (This requires vLLM and is significantly faster than using Transformers.)

    You can safely set --completion-limit 20 and get a reasonably stable
    result. Any lower and you'll get variations greater than 1%.

    At this point, you can start looking at the .json.gz files in the out
    directory to see if they look reasonable (see the sketch after this list).

  4. If you want to run executions, you can add dart to evaluation/Dockerfile.
    However, the likelihood of an LLM producing something destructive given a
    HumanEval prompt is really low. So, I just ran them directly without a
    container. (Dart installed from Conda.)

    cd MultiPL-E
    python3 evaluation/src/main.py --dir out --output-dir out  --recursive
    

    This creates several .results.json.gz files, alongside the .json.gz files.

    I am getting a bunch of these errors. It may be that the Conda version
    of Dart has a problem, or it could be the weirdness of my particular
    cluster:

      Unhandled exception:
    FileSystemException(path=/work/arjunguha-research-group/arjun/projects/MultiPL-E-Dart/condaenv/version; message=Cannot open file)
    #0      _PhysicalFile.readAsStringSync (package:analyzer/file_system/physical_file_system.dart:158:7)
    #1      FolderBasedDartSdk.languageVersion (package:analyzer/src/dart/sdk/sdk.dart:427:12)
    #2      Driver.start (package:analysis_server/src/server/driver.dart:295:18)
    #3      main (file:///b/s/w/ir/cache/builder/sdk/pkg/analysis_server/bin/server.dart:10:11)
    #4      _delayEntrypointInvocation.<anonymous closure> (dart:isolate-patch/isolate_patch.dart:295:32)
    #5      _RawReceivePortImpl._handleMessage (dart:isolate-patch/isolate_patch.dart:192:12)
    Bad state: The analysis server crashed unexpectedly
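
In case it helps with the "look at the .json.gz files" step above, here's a small sketch for peeking at them. The field names are assumptions on my part, so printing the top-level keys first is the safe way to check what's actually inside:

import gzip
import json
import pathlib

for path in sorted(pathlib.Path("out").rglob("*.json.gz")):
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    print(path.name, sorted(data.keys()))
    # Assumed field names: completion files should hold the prompt and a list of
    # generated bodies; the .results.json.gz files should hold per-run statuses.
    for completion in data.get("completions", [])[:1]:
        print(data.get("prompt", "") + completion)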
    

@devoncarew
Contributor Author

Thanks for the feedback! I was OOO for a bit, but will circle back to this PR.

@devoncarew
Contributor Author

I'm still digging through this a bit, but when I run the TypeScript translation locally for comparison, I see similar numbers to the Dart translator:

Translation stats:
  Num originals: 161
  Num translated: 152
  Translation ratio: 0.94

@devoncarew devoncarew marked this pull request as ready for review September 3, 2024 20:05
@devoncarew
Contributor Author

@arjunguha - I may be done tinkering with this PR now; I converted it from a draft to 'ready for review'.

@arjunguha
Member

I'll plan to do this Monday. Thanks!

@arjunguha
Member

I'm still digging through this a bit, but when I run the TypeScript translation locally for comparison, I see similar numbers to the Dart translator:

Translation stats:
  Num originals: 161
  Num translated: 152
  Translation ratio: 0.94

Note to self: I see that I'm also getting 152 translations for TypeScript. This is a regression as of the last release a few months ago. I'll try to see what's up before merging this one.

@devoncarew
Contributor Author

There were a few language features I didn't generate code for (unions, ellipsis) that might be handled with some creative code generation. If those do end up skewing the benchmark numbers, I could revisit.

@arjunguha
Member

Nope, sorted this out. The translation script was using the prompt dataset by default. This was the fix:

6a04908

This dataset has the doctests in the Python originals manually cleaned by an undergraduate and a high school student, which is what we use for everything else.

With this dataset, here is what I get with Dart:

Translation stats:
  Num originals: 161
  Num translated: 157
  Translation ratio: 0.98

I'm going to merge this in and add some results.

@arjunguha arjunguha merged commit 1e09ff8 into nuprl:main Sep 9, 2024
Successfully merging this pull request may close these issues.

Add support for the Dart language