
add support for translating Dart #153

Merged
merged 4 commits into nuprl:main on Sep 9, 2024

Conversation

devoncarew
Contributor

Note that I'm starting this PR as a draft since I have some open questions (and haven't finished all the testing).

My questions are mostly called out in TODO comments in the code, but briefly:

  • I'm not 100% sure what to put in for stop tokens, or what they're used for; the ones in this PR haven't been updated from the TypeScript translator
  • I see many Python terms in the prompt we're sending to the LLMs (called out in the TODOs); I suspect we should instead prompt with terms more specific to Dart
  • I'm translating Python Optional types to Dart's nullable types (a rough sketch of the mapping I mean is below)
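
For concreteness, here's roughly the mapping I have in mind. This is illustrative only; the helper name translate_type and the handful of cases shown are not the exact code in this PR:

import ast

SIMPLE_TYPES = {"int": "int", "float": "double", "bool": "bool", "str": "String"}

def translate_type(t: ast.expr) -> str:
    # Illustrative: map a Python type annotation (AST node) to a Dart type string.
    if isinstance(t, ast.Name) and t.id in SIMPLE_TYPES:
        return SIMPLE_TYPES[t.id]
    if isinstance(t, ast.Subscript) and isinstance(t.value, ast.Name):
        if t.value.id == "List":
            return f"List<{translate_type(t.slice)}>"
        if t.value.id == "Optional":
            # Python's Optional[T] becomes Dart's nullable type T?
            return f"{translate_type(t.slice)}?"
    raise Exception(f"unhandled type: {ast.dump(t)}")

# e.g. translate_type(ast.parse("Optional[List[int]]", mode="eval").body) == "List<int>?"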

@devoncarew
Contributor Author

Here's the output for python3 test.py humaneval_to_dart ../datasets/originals/HumanEval_53_add.py:

// Add two numbers x and y
// >>> add(2, 3)
// 5
// >>> add(5, 7)
// 12
int add(int x, int y) {

********************************************************************************
void main() {
  final candidate = add;

  expect(candidate(0, 1), 1);
  expect(candidate(1, 0), 1);
  expect(candidate(2, 3), 5);
  expect(candidate(5, 7), 12);
  expect(candidate(7, 5), 12);

  print('success');
}

void expect(dynamic a, dynamic b) {
  if (a == b) return;

  if (a is List && b is List) {
    expectList(a, b);
  } else if (a is Map && b is Map) {
    expectMap(a, b);
  } else {
    throw '$a != $b';
  }
}

void expectList(List a, List b) {
  if (a.length != b.length) throw 'list lengths are not equal';

  for (var i = 0; i < a.length; i++) {
    expect(a[i], b[i]);
  }
}

void expectMap(Map a, Map b) {
  if (a.length != b.length) throw 'map lengths are not equal';

  for (var key in a.keys) {
    expect(a[key], b[key]);
  }
}
********************************************************************************
['\nfunction ', '\n/*', '\n//', '\nclass']
********************************************************************************
Translation succeeded. Examine the output above to verify that it is correct.

And a DartPad snippet for the above: https://dartpad.dev/?id=8f9fe06a67328d6c82c35c202915227e

@devoncarew devoncarew marked this pull request as draft July 31, 2024 22:24
@arjunguha
Member

Let's start with the stop tokens.

Short version

Use \n} as the single stop token. This will make LLMs generate a complete function, with the exception of the final }. But, you can append that in the execution script.

Long version

  • The LLM doesn't parse, or even tokenize in the sense that a compiler writer would recognize.
  • The model will generate text indefinitely in principle. In practice, it will stop when it runs out of GPU memory / hits the configured maximum length, which for current models exceeds 16KB of text.

So, the stop tokens are the accepted hack for telling the model when to stop generating text. Since the task is to generate a top-level function, the stop tokens are the text that typically follows a top-level function. Hopefully that explains why we use \nfunction, \nclass, etc. as stop tokens for TypeScript.

But, there is a simpler approach that we later realized works for any curly-brace language: just use \n} as the stop token and add \n} before execution.
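
For what it's worth, here is a minimal sketch of how that plays out. The function names are just for illustration; the actual logic lives in the generation and evaluation scripts:

STOP_TOKENS = ["\n}"]

def truncate_at_stop(completion: str, stop_tokens=STOP_TOKENS) -> str:
    # Cut the raw LLM output at the first stop token, mimicking what the
    # generation backend does when a stop sequence is configured.
    for stop in stop_tokens:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    return completion

def build_program(prompt: str, completion: str) -> str:
    # Re-attach the final '}' that the stop token swallowed; the test harness
    # is appended after this, before the program is executed.
    return prompt + truncate_at_stop(completion) + "\n}\n"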

@arjunguha
Member

The prompt terminology stuff: as mentioned in the MultiPL-E paper, it hardly matters. The LLMs are very robust to terminology, especially on high-resource languages (and Dart probably is high-resource).

However, you can add Dart terms here: https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/terms.csv
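
If it helps, here is the kind of substitution I mean, as a rough sketch. I'm assuming a CSV with one row per Python term and one column per target language; the column names below are guesses, the real layout is whatever the header in terms.csv says, and the real rewording logic is in generic_translator.py:

import csv
import re

def load_terms(path: str, lang: str) -> dict:
    # Assumed layout: a "term" column with the Python word, plus one column per
    # target language. Check the actual header in terms.csv.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {row["term"]: row[lang] for row in rows if row.get(lang)}

def reword(doc: str, terms: dict) -> str:
    # Replace whole-word Python terms (e.g. "list", "dictionary") with the
    # target-language equivalents (e.g. "List", "Map") in a prompt's docstring.
    for py_term, target_term in terms.items():
        doc = re.sub(rf"\b{re.escape(py_term)}\b", target_term, doc)
    return doc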

@devoncarew
Contributor Author

But, there is a simpler approach that we later realized works for any curly-brace language: just use \n} as the stop token and add \n} before execution.

Gotcha, thanks for the explanation; I updated the PR to use the closing curly as the stop token.

@devoncarew
Contributor Author

devoncarew commented Aug 1, 2024

The prompt terminology stuff: as mentioned in the MultiPL-E paper, it hardly matters. The LLMs are very robust to terminology, especially on high-resource languages (and Dart probably is high-resource).

However, you can add Dart terms here: https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/terms.csv

Gotcha. I do have the terms added for Dart in that file, but I don't see any translation when running something like 'test.py humaneval_to_dart ../datasets/originals/HumanEval_136_largest_smallest_integers.py'; perhaps that's not where the translation should show up. If this does look configured correctly, it may be worth double-checking the logic in https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/generic_translator.py#L273. In any case, I understand that translating the Python terms to Dart isn't critical.

elif "SyntaxError" in r.stderr:
status = "SyntaxError"
elif "ReferenceError" in r.stderr:
status = "ReferenceError"
Member

At this point, for benchmarking, it only really matters that you classify each run as pass or fail. The MultiPL-E paper had a finer-grained analysis of the types of errors, which is what this code was for. So, this fine-grained categorization is optional at this point.
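
A minimal pass/fail version could look something like this (a sketch, assuming r is the subprocess.CompletedProcess from running the generated Dart program; the finer-grained categories can stay or be dropped):

import subprocess

def classify(r: subprocess.CompletedProcess) -> str:
    # Coarse classification: for the benchmark numbers, only pass vs. fail matters.
    if r.returncode == 0:
        return "OK"
    if r.returncode == 124:  # exit code used by `timeout` when the run is killed
        return "Timeout"
    return "Exception"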

@devoncarew
Contributor Author

So, I was able to spot-check output against the list here: #152 (comment), and things seemed reasonable (the output analyzed correctly when given a stub implementation for the method under test).

What's a good way to do some end-to-end testing? To drive this against a specific LLM and single file? Or, an LLM and one run of all the HumanEval benchmark scripts? I'd like to get a bit more confidence that this translation is generating reasonable code / the benchmark results would be reliable.

@arjunguha
Member

You can create a local dataset of Dart prompts like this:

cd MultiPL-E/dataset_builder
python3 prepare_prompts_json.py \
     --lang humaneval_to_dart.py \
     --doctests transform \
     --prompt-terminology reworded \
     --output ../dart_prompts.jsonl

When I try this, I get a bunch of errors. Maybe something is not yet pushed?

@arjunguha
Member

arjunguha commented Aug 5, 2024

Here is what I've done:

  1. I have this patch to your code, which I think is fine:

    diff --git a/dataset_builder/humaneval_to_dart.py b/dataset_builder/humaneval_to_dart.py
    index f675be8..23de993 100644
    --- a/dataset_builder/humaneval_to_dart.py
    +++ b/dataset_builder/humaneval_to_dart.py
    @@ -82,11 +82,7 @@ class Translator:
     
         def translate_prompt(self, name: str, args: List[ast.arg], returns, description: str) -> str:
             global needs_hashmap
    -        description = (
    -            "// " + re.sub(DOCSTRING_LINESTART_RE + "\n",
    -            "// ",
    -            description.strip()) + "\n",
    -        )
    +        description = "//" + re.sub(DOCSTRING_LINESTART_RE, "\n// ", description.strip()) + "\n"
             # Store this for later coercions on tests
             needs_hashmap = False
             self.type = [[arg.annotation for arg in args], returns]
  2. Build a JSON file with all translated HumanEval prompts. The settings below
    are the ones that have been established for the benchmark: translating
    doctests and rewording Python terminology to the target language.

    cd MultiPL-E/dataset_builder
    python3 prepare_prompts_json.py \
      --lang humaneval_to_dart.py \
      --doctests transform \
      --prompt-terminology reworded \
      --output ../dart_prompts.jsonl
    

    You get a log with a bunch of errors. Some are expected: not every prompt
    will translate to a strictly typed language. My log says
    150 / 161 problems translated. This is okay, but it is on the low end for a
    translation rate. E.g., I see that TypeScript has 159 / 161 problems translated
    (https://huggingface.co/datasets/nuprl/MultiPL-E).

  3. With a GPU, generate completions for some model. I ran StarCoder2-15B:

    cd MultiPL-E
    python3 automodel_vllm.py \
         --name bigcode/starcoder2-15b \
         --root-dataset humaneval \
         --use-local \
         --dataset ./dart_prompts.jsonl \
         --temperature 0.2 \
         --batch-size 50 \
         --completion-limit 50 \
         --output-dir-prefix out
    

    (This requires vLLM and is significantly faster than using Transformers.)

    You can safely set --completion-limit 20 and get a reasonably stable
    result. Any lower and you'll get variations greater than 1%.

    At this point, you can start looking at the .json.gz files in the out
    directory to see if they look reasonable (see the sketch after this list).

  4. If you want to run executions, you can add dart to evaluation/Dockerfile.
    However, the likelihood of an LLM producing something destructive given a
    HumanEval prompt is really low. So, I just ran them directly without a
    container. (Dart installed from Conda.)

    cd MultiPL-E
    python3 evaluation/src/main.py --dir out --output-dir out  --recursive
    

    This creates several .results.json.gz files, alongside the .json.gz files.

    I am getting a bunch of these errors. It may be that the Conda version
    of Dart has a problem, or it could be the weirdness of my particular
    cluster:

      Unhandled exception:
    FileSystemException(path=/work/arjunguha-research-group/arjun/projects/MultiPL-E-Dart/condaenv/version; message=Cannot open file)
    #0      _PhysicalFile.readAsStringSync (package:analyzer/file_system/physical_file_system.dart:158:7)
    #1      FolderBasedDartSdk.languageVersion (package:analyzer/src/dart/sdk/sdk.dart:427:12)
    #2      Driver.start (package:analysis_server/src/server/driver.dart:295:18)
    #3      main (file:///b/s/w/ir/cache/builder/sdk/pkg/analysis_server/bin/server.dart:10:11)
    #4      _delayEntrypointInvocation.<anonymous closure> (dart:isolate-patch/isolate_patch.dart:295:32)
    #5      _RawReceivePortImpl._handleMessage (dart:isolate-patch/isolate_patch.dart:192:12)
    Bad state: The analysis server crashed unexpectedly
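
In case it helps with the "look at the .json.gz files" step above, here's a small sketch for peeking at them. The field names are assumptions on my part, so printing the top-level keys first is the safe way to check what's actually inside:

import gzip
import json
import pathlib

for path in sorted(pathlib.Path("out").rglob("*.json.gz")):
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    print(path.name, sorted(data.keys()))
    # Assumed field names: completion files should hold the prompt and a list of
    # generated bodies; the .results.json.gz files should hold per-run statuses.
    for completion in data.get("completions", [])[:1]:
        print(data.get("prompt", "") + completion)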
    

@devoncarew
Contributor Author

Thanks for the feedback! I was OOO for a bit, but will circle back to this PR.

@devoncarew
Contributor Author

I'm still digging through this a bit, but when I run the TypeScript translation locally for comparison, I see similar numbers to the Dart translator:

Translation stats:
  Num originals: 161
  Num translated: 152
  Translation ratio: 0.94

@devoncarew devoncarew marked this pull request as ready for review September 3, 2024 20:05
@devoncarew
Contributor Author

@arjunguha - I may be done tinkering with this PR now; I converted it from a draft to 'ready for review'.

@arjunguha
Member

I'll plan to do this Monday. Thanks!

@arjunguha
Member

I'm still digging through this a bit, but when I run the TypeScript translation locally for comparison, I see similar numbers to the Dart translator:

Translation stats:
  Num originals: 161
  Num translated: 152
  Translation ratio: 0.94

Note to self: I see that I'm also getting 152 translations for TypeScript. This is a regression as of the last release a few months ago. I'll try to see what's up before merging this one.

@devoncarew
Contributor Author

There were a few language features I didn't generate code for (unions, ellipsis) that might be handled with some creative code generation. If those do end up skewing the benchmark numbers, I could revisit.

@arjunguha
Member

Nope, sorted this out. The translation script was using the prompt dataset by default. This was the fix:

6a04908

This dataset has the doctests in the Python originals manually cleaned by an undergraduate and a high school student, which is what we use for everything else.

With this dataset, here is what I get with Dart:

Translation stats:
  Num originals: 161
  Num translated: 157
  Translation ratio: 0.98

I'm going to merge this in and add some results.

@arjunguha arjunguha merged commit 1e09ff8 into nuprl:main Sep 9, 2024
Successfully merging this pull request may close these issues.

Add support for the Dart language