Add set up for corpus generator generation by way of LLMs #479
Signed-off-by: David Korczynski <david@adalogics.com>
The main reason for this is that `generate_code` does not really generate any code; rather, it queries a given LLM using a specified prompt. Since we now have prompts of various sorts, I feel the name is a bit misplaced. This could also make it clearer which API to use if you're working on using LLMs for tasks other than explicit code generation. This came up while doing #479, where one consideration was to have the LLM generate a corpus explicitly, without going through seed-corpus-by-way-of-Python generation. Ref: #482
---------
Signed-off-by: David Korczynski <david@adalogics.com>
/gcbrun exp -n dk-test-infra-99x1 -m vertex_ai_gemini-1-5 -b minor-for-ci -i
nice! do you have any preliminary results on this to share?
The target code of the harness is:
<code>
{TARGET_FUNCTION_SOURCE}
Would this benefit from more context? e.g. relevant data type information? existing seeds in the repository?
+1 to data type, e.g., `struct` detail, if this is ready for functions with complex/project-specific data types.
Yup. Some sample scripts:

Example 1

For the harness:

#include "src/cjson/cJSON.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  cJSON *item = cJSON_Parse((const char*)data);
  cJSON_bool recurse = (cJSON_bool)(data[size-1] % 2);
  cJSON *result = cJSON_Duplicate(item, recurse);
  cJSON_Delete(result);
  cJSON_Delete(item);
  return 0;
}

This generator was created:

import json
# Generate 10 initial seeds for the fuzzing harness
seeds = []
# Seed 1: Empty input
seed1 = b""
seeds.append(seed1)
# Seed 2: Simple JSON object
seed2 = b'{"key": "value"}'
seeds.append(seed2)
# Seed 3: JSON object with nested object
seed3 = b'{"key": {"nested_key": "nested_value"}}'
seeds.append(seed3)
# Seed 4: JSON array
seed4 = b'[1, 2, 3]'
seeds.append(seed4)
# Seed 5: JSON array with nested objects
seed5 = b'[{"key1": "value1"}, {"key2": "value2"}]'
seeds.append(seed5)
# Seed 6: JSON object with array
seed6 = b'{"key": [1, 2, 3]}'
seeds.append(seed6)
# Seed 7: Complex nested JSON object
seed7 = b'{"key1": {"nested_key1": [1, 2, 3]}, "key2": {"nested_key2": {"deep_key": "deep_value"}}}'
seeds.append(seed7)
# Seed 8: JSON array with null values
seed8 = b'[null, null, null]'
seeds.append(seed8)
# Seed 9: JSON object with boolean values
seed9 = b'{"key1": true, "key2": false}'
seeds.append(seed9)
# Seed 10: JSON array with mixed data types
seed10 = b'[1, "two", true, false, null]'
seeds.append(seed10)
# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f'seed_{i+1}.json', 'wb') as f:
        f.write(seed)

Example 2

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include "/src/lighttpd1.4/src/buffer.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  if (size < 3) {
    return 0;
  }
  // Prepare buffer
  buffer *b = buffer_init();
  // Prepare format string
  char format[size];
  memcpy(format, data, size - 1);
  format[size - 1] = '\0';
  // Prepare tm structure
  struct tm tm;
  memset(&tm, 0, sizeof(struct tm));
  // Fuzz the target function
  buffer_append_strftime(b, format, &tm);
  // Clean up
  buffer_free(b);
  return 0;
}

The following script was generated:

import os
# Create 10 initial seeds for the fuzzing harness
seeds = [
b"%Y-%m-%d %H:%M:%S", # Typical date and time format
b"%A, %B %d, %Y", # Weekday, month day, year format
b"%I:%M %p", # 12-hour clock format
b"%Y-%m-%dT%H:%M:%S%z", # ISO 8601 format with timezone
b"%Y/%m/%d %H:%M:%S", # Date and time format with slashes
b"%Y-%m-%d %H:%M:%S %Z", # Date and time format with timezone
b"%Y/%m/%d %I:%M %p", # Date and 12-hour clock format with slashes
b"%Y-%m-%d %H:%M:%S %Z", # Date and time format with timezone
b"%Y-%m-%d %H:%M:%S %Z", # Date and time format with timezone
b"%Y-%m-%d %H:%M:%S %Z", # Date and time format with timezone
]
# Write the seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i}.txt", "wb") as f:
        f.write(seed)

Example 3

The following harness:

#include "/path/to/cJSON.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  cJSON *obj = cJSON_CreateObject();
  if (size < 2) {
    return 0;
  }
  // Allocate memory for name and raw strings
  char *name = (char *)malloc(size);
  if (!name) {
    return 0;
  }
  char *raw = (char *)malloc(size);
  if (!raw) {
    free(name);
    return 0;
  }
  // Copy data into name and raw, making sure they are null-terminated
  memcpy(name, data, size-1);
  name[size - 1] = '\0';
  memcpy(raw, data, size-1);
  raw[size - 1] = '\0';
  // Call the target function
  cJSON_AddRawToObject(obj, name, raw);
  // Free allocated memory
  free(name);
  free(raw);
  cJSON_Delete(obj);
  return 0;
}

The following generator was created:

import json
# Generate 10 initial seeds for the fuzzing harness
seeds = [
{
"object": {
"key1": "value1",
"key2": "value2"
},
"name": "name1",
"raw": "raw1"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3"
},
"name": "name2",
"raw": "raw2"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4"
},
"name": "name3",
"raw": "raw3"
},
{
"object": {
"key1": "value1"
},
"name": "name4",
"raw": "raw4"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"key5": "value5"
},
"name": "name5",
"raw": "raw5"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3"
},
"name": "name6",
"raw": "raw6"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"key5": "value5",
"key6": "value6"
},
"name": "name7",
"raw": "raw7"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4"
},
"name": "name8",
"raw": "raw8"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"key5": "value5",
"key6": "value6",
"key7": "value7"
},
"name": "name9",
"raw": "raw9"
},
{
"object": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"key5": "value5",
"key6": "value6",
"key7": "value7",
"key8": "value8"
},
"name": "name10",
"raw": "raw10"
}
]
# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i+1}.json", "w") as file:
        json.dump(seed, file, indent=4)

Example 4

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  if (size < sizeof(__intmax_t)) {
    return 0;
  }
  buffer *b = buffer_init();
  __intmax_t val;
  memcpy(&val, data, sizeof(__intmax_t));
  buffer_append_int(b, val);
  buffer_free(b);
  return 0;
}

The following generator was created:

import struct
# Generate 10 initial seeds for the fuzzing harness
seeds = [
struct.pack('q', 0), # Pack a long long (intmax_t) value of 0
struct.pack('q', 1), # Pack a long long (intmax_t) value of 1
struct.pack('q', -1), # Pack a long long (intmax_t) value of -1
struct.pack('q', 100), # Pack a long long (intmax_t) value of 100
struct.pack('q', -100), # Pack a long long (intmax_t) value of -100
struct.pack('q', 9223372036854775807), # Pack the maximum long long (intmax_t) value
struct.pack('q', -9223372036854775808), # Pack the minimum long long (intmax_t) value
struct.pack('q', 123456789), # Pack a random long long (intmax_t) value
struct.pack('q', -987654321), # Pack a random long long (intmax_t) value
struct.pack('q', 555555555) # Pack a random long long (intmax_t) value
]
# Write the seeds to files in the current working directory
for i, seed in enumerate(seeds):
    with open(f'seed{i}.dat', 'wb') as f:
        f.write(seed)

Example 5

Harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  if (size < sizeof(const struct const_iovec) || size % sizeof(const struct const_iovec) != 0) {
    return 0;
  }
  const struct const_iovec *iov = (const struct const_iovec *)data;
  const size_t n = size / sizeof(const struct const_iovec);
  buffer *b = buffer_init();
  buffer_append_iovec(b, iov, n);
  buffer_free(b);
  return 0;
}

Generator:

import os
import struct
# Define the structure of const_iovec
struct_const_iovec = struct.Struct("Q Q")
# Generate 10 initial seeds
for i in range(10):
    seed_data = b''
    for j in range(3):  # Generate 3 const_iovec structures per seed
        iov_base = os.urandom(8)  # Generate random iov_base (8 bytes)
        iov_len = struct.pack("Q", len(iov_base))  # Get length of iov_base
        seed_data += struct_const_iovec.pack(int.from_bytes(iov_base, byteorder='big'), int.from_bytes(iov_len, byteorder='little'))
    with open(f'seed_{i}.dat', 'wb') as f:
        f.write(seed_data)
/gcbrun skip
llm_toolkit/corpus_generator.py
Outdated
from llm_toolkit import prompt_builder


def get_corpus_generator_script(
one more nit/improvement: this module is called `corpus_generator`. we can reduce some repetition by just calling this function `get_script`. i.e. callers will just do: `corpus_generator.get_script()`.
/gcbrun skip
  """Uses LLMs to generate a python script that will create a seed corpus for a
  harness.

  The script generated is purely generated and should be considered untrusted
"The script generated is purely generated": probably "purely LLM-dependent"?
  The script generated is purely generated and should be considered untrusted
  in the general sense. OSS-Fuzz-gen already executes arbitrary code since
  OSS-Fuzz-gen executes arbitrary open source projects with no checking on
  what code is committed to the given projects."""
Did you mean "OSS-Fuzz-gen already executes arbitrary code from arbitrary open source projects with no checking on what code is committed to the given projects"?
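Since the docstring flags the generated script as untrusted, one low-cost precaution is to run it in a throwaway working directory with a timeout. The following is only a minimal sketch: `run_generator_script` is a hypothetical helper, not something in this PR, and it is not a sandbox. It only confines relative file writes to a temporary folder and bounds the run time.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generator_script(script_source: str,
                         timeout_s: int = 30) -> dict[str, bytes]:
  """Runs a seed-generating script in a temp dir and returns the files it wrote.

  Hypothetical helper for illustration only; this does NOT sandbox the script.
  """
  seeds = {}
  with tempfile.TemporaryDirectory() as workdir:
    script_path = Path(workdir) / 'generate_seeds.py'
    script_path.write_text(script_source)
    # The script takes no input and writes seeds to its current working folder.
    subprocess.run([sys.executable, script_path.name],
                   cwd=workdir,
                   timeout=timeout_s,
                   check=True,
                   capture_output=True)
    # Collect everything the script produced, except the script itself.
    for path in Path(workdir).iterdir():
      if path.is_file() and path != script_path:
        seeds[path.name] = path.read_bytes()
  return seeds
```

A failing or hanging script surfaces as `subprocess.CalledProcessError` or `subprocess.TimeoutExpired`, so the caller can fall back to running without a seed corpus.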
  # Get the corpus generation template
  with open(
      os.path.join(prompt_builder.DEFAULT_TEMPLATE_DIR,
                   'corpus_generation_via_python_script.txt'), 'r') as f:
LGTM for now, but if we have more prompt templates for corpus generation in the future, maybe we could have a separate dir, e.g., `prompts/corpus_generation_template/`?
The target code of the harness is:
<code>
{TARGET_FUNCTION_SOURCE}
</code>

Could you please construct a small Python program that generate 10 initial seeds for my fuzz harness?
How about also generating a dictionary? It can improve the efficiency of mutation. For example, if the fuzzer cannot tell which of two paths will have higher coverage, a dictionary will increase the likelihood of generating inputs for the desired one.
Similarly, suppose the LLM needs to encode some path constraints into the seed (e.g., `if (x=="ABC")`, but `x` is dynamically extracted from a non-fixed part of the input); a dictionary provides a more flexible way to increase the chance of generating that value.
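To make the dictionary idea concrete, here is a rough sketch of how tokens such as `ABC` could be written in the quoted, one-entry-per-line format that libFuzzer and AFL dictionaries use. The helper names `to_dict_entry` and `write_dictionary` are made up for illustration and are not part of this PR.

```python
def to_dict_entry(token: bytes) -> str:
  """Formats one token as a quoted dictionary entry.

  Quotes and backslashes are backslash-escaped; non-printable bytes are
  written as \\xNN hex escapes.
  """
  escaped = []
  for byte in token:
    char = chr(byte)
    if char in ('"', '\\'):
      escaped.append('\\' + char)
    elif 0x20 <= byte < 0x7f:
      escaped.append(char)
    else:
      escaped.append(f'\\x{byte:02x}')
  return '"' + ''.join(escaped) + '"'


def write_dictionary(tokens: list[bytes], path: str) -> None:
  """Writes one dictionary entry per line (hypothetical helper)."""
  with open(path, 'w') as f:
    for token in tokens:
      f.write(to_dict_entry(token) + '\n')
```

For the reviewer's example, `write_dictionary([b'ABC'], 'harness.dict')` would produce a single line containing `"ABC"`, which the fuzzer can splice into inputs wherever `x` happens to be extracted from.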
The program you write to generate seeds should take no input and should output the seeds into the current working folder.

Wrap the program in <results> tags in the reply and do not return any other text.
nit: If the response includes other text (e.g., explanatory sentences), maybe we can simply extract the code within the tags here?
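The extraction suggested above could look roughly like this. It is a sketch only: `extract_results` is a hypothetical name, and it assumes the model wraps the program in a single `<results>...</results>` pair as the prompt requests.

```python
import re


def extract_results(response: str) -> str:
  """Returns the text inside the first <results> tag pair, or '' if absent.

  Hypothetical helper: any explanatory text around the tags is ignored.
  """
  match = re.search(r'<results>(.*?)</results>', response, re.DOTALL)
  if match is None:
    return ''
  return match.group(1).strip()
```

Returning an empty string when the tags are missing lets the caller skip corpus generation for that harness instead of executing unrelated text.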
Adds logic for adding a corpus to a given harness. This works by asking an LLM to generate a Python script that generates a set of seeds for a given target. The script is then appended to the project's build script, wrapped so that zip files are created for an OSS-Fuzz-style seed corpus.
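For readers unfamiliar with the OSS-Fuzz convention, a seed corpus is a zip archive named `<fuzz_target>_seed_corpus.zip` placed next to the fuzz target binary. The sketch below shows that packaging step in Python. It illustrates the naming convention only and is not the wrapper this PR appends to the build script; `zip_seed_corpus` and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_seed_corpus(seed_dir: str, out_dir: str, fuzzer_name: str) -> Path:
  """Packs all files in seed_dir into <fuzzer_name>_seed_corpus.zip in out_dir.

  Hypothetical helper for illustration only.
  """
  archive = Path(out_dir) / f'{fuzzer_name}_seed_corpus.zip'
  with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
    for seed in sorted(Path(seed_dir).iterdir()):
      # Skip subdirectories and, if seed_dir == out_dir, the archive itself.
      if seed.is_file() and seed.resolve() != archive.resolve():
        zf.write(seed, arcname=seed.name)
  return archive
```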