Use LLMs to generate corpus #482

Open
DavidKorczynski opened this issue Jul 14, 2024 · 1 comment

DavidKorczynski commented Jul 14, 2024

A seed corpus can significantly improve some harnesses. We should explore generating seeds by way of LLMs. Some ideas:

  1. Integrate seed generation into the workflow so it becomes possible to experiment with this.
  2. Ask the LLM to generate a small Python program that generates a seed corpus.
  3. Ask the LLM to generate actual files for a seed corpus.
  4. Try to assemble a data set of sample seed files and see if the LLM can make use of it.
  5. Enable this to be used for the auto-generated harnesses.
  6. Enable this to be used for the existing harnesses, i.e. we can do "seed generation optimization" for any harness in OSS-Fuzz.
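
Idea (2) could be sketched roughly as follows. This is a minimal illustration, not the implementation from #479: the prompt wording, `generate_seed_corpus`, `query_llm`, and `fake_llm` are all hypothetical names, and `fake_llm` stands in for a real model call so the example runs offline.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical prompt; the wording is an assumption, not the project's prompt.
PROMPT_TEMPLATE = (
    "Write a small Python program that writes at most {count} seed files "
    "for the fuzz harness below into the current working directory.\n\n"
    "{harness_source}"
)


def generate_seed_corpus(
    harness_source: str,
    query_llm: Callable[[str], str],
    corpus_dir: Path,
    count: int = 10,
    timeout_s: int = 30,
) -> List[Path]:
    """Ask an LLM for a seed-generating script, run it, and collect the seeds."""
    prompt = PROMPT_TEMPLATE.format(count=count, harness_source=harness_source)
    script_source = query_llm(prompt)

    corpus_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as work:
        work_dir = Path(work)
        script = work_dir / "gen_seeds.py"
        script.write_text(script_source)
        # Run in a separate process with a timeout. LLM output is untrusted,
        # so a real setup would want a proper sandbox rather than just this.
        subprocess.run(
            [sys.executable, script.name],
            cwd=work_dir,
            timeout=timeout_s,
            check=True,
        )
        seeds = []
        # Copy every file the script produced (except the script itself).
        for path in sorted(work_dir.iterdir()):
            if path == script or not path.is_file():
                continue
            target = corpus_dir / path.name
            target.write_bytes(path.read_bytes())
            seeds.append(target)
    return seeds


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed seed-writing script."""
    return (
        "for i, data in enumerate([b'{}', b'[1, 2]', b'\"a\"']):\n"
        "    with open(f'seed_{i}.json', 'wb') as f:\n"
        "        f.write(data)\n"
    )
```

Swapping `fake_llm` for a real model client would leave the rest unchanged; the directory passed as `corpus_dir` could then be handed to the fuzzer as its seed corpus.
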
@DavidKorczynski DavidKorczynski self-assigned this Jul 14, 2024

DavidKorczynski commented

(1) and (2) have been set up in #479. There is still a lot more work to do here, but there are initial results.

mihaimaruseac pushed a commit that referenced this issue Jul 14, 2024
The main reason for this is that `generate_code` does not really
generate any code; rather, it queries a given LLM using a specified
prompt. Since we now have prompts of various sorts, I feel the name
might be a bit misplaced. This could also make it clearer which API to
use if you're working on using LLMs for tasks other than explicit code
generation.

This came up while doing #479,
where one consideration was to have the LLM generate a corpus explicitly
without going through seed-corpus-by-way-of-Python generation.

Ref: #482

---------

Signed-off-by: David Korczynski <david@adalogics.com>