Add set up for corpus generator generation by way of LLMs #479

Merged
DavidKorczynski merged 7 commits into main from add-poc-corpus-generator
Jul 16, 2024

Conversation

DavidKorczynski
Collaborator

Adds logic for attaching a seed corpus to a given harness. An LLM is asked to generate a Python script that produces a set of seeds for the target. That script is then appended to the project's build script, wrapped so that its output is zipped into an OSS-Fuzz-style seed corpus.
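
A minimal sketch of the packaging step described above, assuming the generated script has already written its seeds into a directory. The function name `package_seed_corpus` and its parameters are hypothetical and not taken from this PR; the only convention relied on is that OSS-Fuzz picks up a seed corpus from `<fuzz_target>_seed_corpus.zip` placed next to the fuzz target binary.

```python
import os
import zipfile


def package_seed_corpus(seed_dir: str, out_dir: str, fuzzer_name: str) -> str:
    """Zip every regular file in seed_dir into <fuzzer_name>_seed_corpus.zip.

    Hypothetical helper for illustration; not the code added in this PR.
    """
    zip_path = os.path.join(out_dir, f'{fuzzer_name}_seed_corpus.zip')
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(seed_dir)):
            path = os.path.join(seed_dir, name)
            # Skip sub-directories and, if seed_dir == out_dir, the zip itself.
            if not os.path.isfile(path):
                continue
            if os.path.abspath(path) == os.path.abspath(zip_path):
                continue
            # Store seeds flat, without the directory prefix.
            archive.write(path, arcname=name)
    return zip_path
```

In an OSS-Fuzz build, `out_dir` would typically be the `$OUT` directory and `fuzzer_name` the name of the built fuzz target.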

Signed-off-by: David Korczynski <david@adalogics.com>
@DavidKorczynski DavidKorczynski changed the title Add poc corpus generator Add set up for corpus generator generation by way of LLMs Jul 14, 2024
mihaimaruseac pushed a commit that referenced this pull request Jul 14, 2024
The main reason for this is that `generate_code` does not really
generate any code; rather, it queries a given LLM using a specified
prompt. Since we now have prompts of various sorts, I feel the name is a
bit misplaced. The rename could also make it clearer which API to use if
you're working on using LLMs for tasks other than explicit code
generation.

This came up while doing #479,
where one consideration was to have the LLM generate a corpus directly,
without going through seed-corpus-by-way-of-Python generation.

Ref: #482

---------

Signed-off-by: David Korczynski <david@adalogics.com>
@DavidKorczynski
Collaborator Author

/gcbrun exp -n dk-test-infra-99x1 -m vertex_ai_gemini-1-5 -b minor-for-ci -i

Collaborator

@oliverchang oliverchang left a comment

nice! do you have any preliminary results on this to share?

The target code of the harness is:
<code>
{TARGET_FUNCTION_SOURCE}
Collaborator

Would this benefit from more context? e.g. relevant data type information? existing seeds in the repository?

Collaborator

+1 to data type, e.g., struct detail if this is ready for functions with complex/project-specific data types.
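
One way the extra context suggested here could be threaded into the prompt, as a rough sketch. Only the `{TARGET_FUNCTION_SOURCE}` placeholder comes from the template in this PR; the `build_prompt` helper, the `type_definitions` and `existing_seeds` parameters, and the wording of the added sections are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Shortened stand-in for the template; only the placeholder name is from the PR.
TEMPLATE = ('The target code of the harness is:\n'
            '<code>\n{TARGET_FUNCTION_SOURCE}\n</code>\n')


def build_prompt(harness_source: str,
                 type_definitions: Optional[str] = None,
                 existing_seeds: Sequence[str] = ()) -> str:
    """Fill the template and append optional context sections (hypothetical)."""
    # str.replace is used instead of str.format, which would trip over the
    # curly braces that appear in C/C++ harness code.
    prompt = TEMPLATE.replace('{TARGET_FUNCTION_SOURCE}', harness_source)
    if type_definitions:
        prompt += ('\nRelevant data types:\n<code>\n' + type_definitions +
                   '\n</code>\n')
    if existing_seeds:
        prompt += '\nSeed files already in the repository:\n'
        prompt += ''.join(f'- {path}\n' for path in existing_seeds)
    return prompt
```

For a harness like the `buffer_append_iovec` one, `type_definitions` could carry the definition of `struct const_iovec` so the model knows the layout it has to serialize.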

Signed-off-by: David Korczynski <david@adalogics.com>
@DavidKorczynski
Collaborator Author

do you have any preliminary results on this to share?

Yup.

Some sample scripts:

Example 1

For the following harness:

#include "src/cjson/cJSON.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    cJSON *item = cJSON_Parse((const char*)data);
    cJSON_bool recurse = (cJSON_bool)(data[size-1] % 2);
    
    cJSON *result = cJSON_Duplicate(item, recurse);
    
    cJSON_Delete(result);
    cJSON_Delete(item);

    return 0;
}

This generator was created:

import json

# Generate 10 initial seeds for the fuzzing harness
seeds = []

# Seed 1: Empty input
seed1 = b""
seeds.append(seed1)

# Seed 2: Simple JSON object
seed2 = b'{"key": "value"}'
seeds.append(seed2)

# Seed 3: JSON object with nested object
seed3 = b'{"key": {"nested_key": "nested_value"}}'
seeds.append(seed3)

# Seed 4: JSON array
seed4 = b'[1, 2, 3]'
seeds.append(seed4)

# Seed 5: JSON array with nested objects
seed5 = b'[{"key1": "value1"}, {"key2": "value2"}]'
seeds.append(seed5)

# Seed 6: JSON object with array
seed6 = b'{"key": [1, 2, 3]}'
seeds.append(seed6)

# Seed 7: Complex nested JSON object
seed7 = b'{"key1": {"nested_key1": [1, 2, 3]}, "key2": {"nested_key2": {"deep_key": "deep_value"}}}'
seeds.append(seed7)

# Seed 8: JSON array with null values
seed8 = b'[null, null, null]'
seeds.append(seed8)

# Seed 9: JSON object with boolean values
seed9 = b'{"key1": true, "key2": false}'
seeds.append(seed9)

# Seed 10: JSON array with mixed data types
seed10 = b'[1, "two", true, false, null]'
seeds.append(seed10)

# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f'seed_{i+1}.json', 'wb') as f:
        f.write(seed)

Example 2

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < 3) {
        return 0;
    }

    // Prepare buffer
    buffer *b = buffer_init();
    
    // Prepare format string
    char format[size];
    memcpy(format, data, size - 1);
    format[size - 1] = '\0';

    // Prepare tm structure
    struct tm tm;
    memset(&tm, 0, sizeof(struct tm));

    // Fuzz the target function
    buffer_append_strftime(b, format, &tm);

    // Clean up
    buffer_free(b);

    return 0;
}

The following script was generated:

import os

# Create 10 initial seeds for the fuzzing harness
seeds = [
    b"%Y-%m-%d %H:%M:%S",  # Typical date and time format
    b"%A, %B %d, %Y",  # Weekday, month day, year format
    b"%I:%M %p",  # 12-hour clock format
    b"%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 format with timezone
    b"%Y/%m/%d %H:%M:%S",  # Date and time format with slashes
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y/%m/%d %I:%M %p",  # Date and 12-hour clock format with slashes
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
]

# Write the seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i}.txt", "wb") as f:
        f.write(seed)

Example 3

The following harness:

#include "/path/to/cJSON.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    cJSON *obj = cJSON_CreateObject();
  
    if (size < 2) {
        return 0;
    }

    // Allocate memory for name and raw strings
    char *name = (char *)malloc(size);
    if (!name) {
        return 0;
    }

    char *raw = (char *)malloc(size);
    if (!raw) {
        free(name);
        return 0;
    }

    // Copy data into name and raw, making sure they are null-terminated
    memcpy(name, data, size-1);
    name[size - 1] = '\0';

    memcpy(raw, data, size-1);
    raw[size - 1] = '\0';

    // Call the target function
    cJSON_AddRawToObject(obj, name, raw);

    // Free allocated memory
    free(name);
    free(raw);

    cJSON_Delete(obj);

    return 0;
}

The following generator was created:

import json

# Generate 10 initial seeds for the fuzzing harness
seeds = [
    {
        "object": {
            "key1": "value1",
            "key2": "value2"
        },
        "name": "name1",
        "raw": "raw1"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        },
        "name": "name2",
        "raw": "raw2"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4"
        },
        "name": "name3",
        "raw": "raw3"
    },
    {
        "object": {
            "key1": "value1"
        },
        "name": "name4",
        "raw": "raw4"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5"
        },
        "name": "name5",
        "raw": "raw5"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        },
        "name": "name6",
        "raw": "raw6"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6"
        },
        "name": "name7",
        "raw": "raw7"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4"
        },
        "name": "name8",
        "raw": "raw8"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6",
            "key7": "value7"
        },
        "name": "name9",
        "raw": "raw9"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6",
            "key7": "value7",
            "key8": "value8"
        },
        "name": "name10",
        "raw": "raw10"
    }
]

# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i+1}.json", "w") as file:
        json.dump(seed, file, indent=4)

Example 4

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < sizeof(__intmax_t)) {
        return 0;
    }

    buffer *b = buffer_init();
    __intmax_t val;
    memcpy(&val, data, sizeof(__intmax_t));

    buffer_append_int(b, val);

    buffer_free(b);

    return 0;
}

The following generator was created:

import struct

# Generate 10 initial seeds for the fuzzing harness
seeds = [
    struct.pack('q', 0),  # Pack a long long (intmax_t) value of 0
    struct.pack('q', 1),  # Pack a long long (intmax_t) value of 1
    struct.pack('q', -1),  # Pack a long long (intmax_t) value of -1
    struct.pack('q', 100),  # Pack a long long (intmax_t) value of 100
    struct.pack('q', -100),  # Pack a long long (intmax_t) value of -100
    struct.pack('q', 9223372036854775807),  # Pack the maximum long long (intmax_t) value
    struct.pack('q', -9223372036854775808),  # Pack the minimum long long (intmax_t) value
    struct.pack('q', 123456789),  # Pack a random long long (intmax_t) value
    struct.pack('q', -987654321),  # Pack a random long long (intmax_t) value
    struct.pack('q', 555555555)  # Pack a random long long (intmax_t) value
]

# Write the seeds to files in the current working directory
for i, seed in enumerate(seeds):
    with open(f'seed{i}.dat', 'wb') as f:
        f.write(seed)

Example 5

Harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < sizeof(const struct const_iovec) || size % sizeof(const struct const_iovec) != 0) {
        return 0;
    }

    const struct const_iovec *iov = (const struct const_iovec *)data;
    const size_t n = size / sizeof(const struct const_iovec);
    buffer *b = buffer_init();

    buffer_append_iovec(b, iov, n);

    buffer_free(b);
    return 0;
}

Generator:

import os
import struct

# Define the structure of const_iovec
struct_const_iovec = struct.Struct("Q Q")

# Generate 10 initial seeds
for i in range(10):
    seed_data = b''
    for j in range(3):  # Generate 3 const_iovec structures per seed
        iov_base = os.urandom(8)  # Generate random iov_base (8 bytes)
        iov_len = struct.pack("Q", len(iov_base))  # Get length of iov_base
        seed_data += struct_const_iovec.pack(int.from_bytes(iov_base, byteorder='big'), int.from_bytes(iov_len, byteorder='little'))

    with open(f'seed_{i}.dat', 'wb') as f:
        f.write(seed_data)

Signed-off-by: David Korczynski <david@adalogics.com>
@DavidKorczynski
Collaborator Author

/gcbrun skip

@oliverchang oliverchang requested a review from DonggeLiu July 16, 2024 00:53
from llm_toolkit import prompt_builder


def get_corpus_generator_script(
Collaborator

one more nit/improvement: this module is called corpus_generator. we can reduce some repetition by just calling this function get_script. i.e. callers will just do: corpus_generator.get_script().

Signed-off-by: David Korczynski <david@adalogics.com>
@DavidKorczynski
Collaborator Author

/gcbrun skip

@DavidKorczynski DavidKorczynski merged commit 6954e1e into main Jul 16, 2024
7 checks passed
@DavidKorczynski DavidKorczynski deleted the add-poc-corpus-generator branch July 16, 2024 08:14
"""Uses LLMs to generate a python script that will create a seed corpus for a
harness.

The script generated is purely generated and should be considered untrusted
Collaborator

"The script generated is purely generated": probably "purely LLM-dependent"?

The script generated is purely generated and should be considered untrusted
in the general sense. OSS-Fuzz-gen already executes arbitrary code since
OSS-Fuzz-gen executes arbitrary open source projects with no checking on
what code is committed to the given projects."""
Collaborator

Did you mean "OSS-Fuzz-gen already executes arbitrary code from arbitrary open source projects with no checking on what code is committed to the given projects"?
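
Because the generated script is untrusted, one defensive pattern is to run it in a scratch directory under a timeout. The sketch below illustrates that pattern only; `run_generator_script` and its parameters are hypothetical. In this PR the script is instead appended to the project's build script, so it runs wherever the build runs.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_generator_script(script_source: str,
                         timeout_sec: int = 30) -> List[str]:
    """Run a generated script in a scratch directory and list what it wrote.

    Hypothetical helper. The timeout and scratch directory limit accidents,
    but this is NOT a sandbox: the script runs with the caller's privileges.
    """
    workdir = tempfile.mkdtemp(prefix='corpus_gen_')
    script_path = os.path.join(workdir, 'generate_seeds.py')
    with open(script_path, 'w') as f:
        f.write(script_source)
    try:
        subprocess.run([sys.executable, script_path],
                       cwd=workdir,
                       timeout=timeout_sec,
                       check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    except (subprocess.SubprocessError, OSError):
        # Treat crashes, non-zero exits and timeouts as "no seeds produced".
        return []
    return sorted(
        os.path.join(workdir, name)
        for name in os.listdir(workdir)
        if name != 'generate_seeds.py')
```

Real isolation would need a container or similar boundary, which is outside the scope of this sketch.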

  # Get the corpus generation template
  with open(
      os.path.join(prompt_builder.DEFAULT_TEMPLATE_DIR,
                   'corpus_generation_via_python_script.txt'), 'r') as f:
Collaborator

LGTM for now, but if we have more prompt templates for corpus generation in the future, maybe we could have a separate dir, e.g., prompts/corpus_generation_template/?


The target code of the harness is:
<code>
{TARGET_FUNCTION_SOURCE}
</code>

Could you please construct a small Python program that generates 10 initial seeds for my fuzz harness?
Collaborator

How about also generating a dictionary? It can improve the efficiency of mutation. For example, if the fuzzer cannot tell which of two paths will lead to higher coverage, a dictionary will increase the likelihood of generating inputs for the desired one.
Similarly, suppose the LLM needs to encode some path constraints into the seed (e.g., if (x=="ABC"), but x is dynamically extracted from a non-fixed part of the input); a dictionary provides a more flexible way to increase the chance of generating that value.
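
A small sketch of what emitting such a dictionary could look like, assuming the tokens (for example `b"ABC"` from the constraint above) have already been obtained from the LLM or from the harness source. The helper names are hypothetical; the output follows the quoted-string dictionary syntax shared by libFuzzer and AFL, where `"`, `\` and non-printable bytes are escaped.

```python
from typing import Iterable


def format_dict_entry(token: bytes) -> str:
    """Render one token as a quoted libFuzzer/AFL dictionary line."""
    out = []
    for byte in token:
        char = chr(byte)
        if char in ('"', '\\'):
            out.append('\\' + char)  # Escape quotes and backslashes.
        elif 0x20 <= byte < 0x7f:
            out.append(char)  # Printable ASCII is written as-is.
        else:
            out.append(f'\\x{byte:02x}')  # Everything else uses \xNN.
    return '"' + ''.join(out) + '"'


def write_dictionary(tokens: Iterable[bytes], path: str) -> None:
    """Write unique, non-empty tokens to path, one entry per line."""
    seen = set()
    with open(path, 'w') as f:
        for token in tokens:
            if token and token not in seen:
                seen.add(token)
                f.write(format_dict_entry(token) + '\n')
```

For the `buffer_append_strftime` harness, plausible tokens would be conversion specifiers such as `b"%Y"` or `b"%z"`. OSS-Fuzz looks for a dictionary named `<fuzz_target>.dict` next to the fuzz target.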


The program you write to generate seeds should take no input and should output the seeds into the current working folder.

Wrap the program in <results> tags in the reply and do not return any other text.
Collaborator

nit: If the response includes other text (e.g., explanatory sentences), maybe we can simply extract the code within the tag here?
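
The extraction suggested here can be sketched with a regular expression. This is an illustrative helper, not the parsing code in this PR: `extract_results` is a hypothetical name, and stripping a Markdown fence inside the tags is an added assumption about how models tend to reply.

```python
import re
from typing import Optional

_RESULTS_RE = re.compile(r'<results>(.*?)</results>', re.DOTALL)
_FENCE = '`' * 3  # Markdown code-fence marker.


def extract_results(response: str) -> Optional[str]:
    """Return the text inside the first <results> block, or None if absent."""
    match = _RESULTS_RE.search(response)
    if match is None:
        return None
    body = match.group(1).strip()
    lines = body.splitlines()
    # Models often wrap code in a Markdown fence even when told not to.
    if (len(lines) >= 2 and lines[0].startswith(_FENCE) and
            lines[-1].strip() == _FENCE):
        body = '\n'.join(lines[1:-1])
    return body
```

Returning `None` when the tags are missing lets the caller decide whether to retry the query or skip corpus generation for that harness.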
