Add set up for corpus generator generation by way of LLMs #479

DavidKorczynski · 2024-07-14T17:26:20Z

Adds logic for adding a seed corpus to a given harness. An LLM is asked to generate a Python script that produces a set of seeds for the given target. This script is then appended to the project's build script, wrapped so that the generated seeds are zipped into an OSS-Fuzz-style seed corpus.
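To illustrate the wrapping step, here is a minimal sketch and not the code in this PR. The PR appends the generator to the project's build script; the sketch shows the same idea in standalone Python. The function name `build_seed_corpus` and the 60-second timeout are assumptions. The `<fuzz_target>_seed_corpus.zip` name follows the OSS-Fuzz seed corpus naming convention.

```python
import pathlib
import subprocess
import sys
import tempfile
import zipfile


def build_seed_corpus(generator_script: str, out_dir: str,
                      fuzzer_name: str) -> pathlib.Path:
    """Runs a seed generator in a scratch directory and zips what it wrote.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_path = pathlib.Path(out_dir) / f"{fuzzer_name}_seed_corpus.zip"
    with tempfile.TemporaryDirectory() as work_dir:
        script_path = pathlib.Path(work_dir) / "corpus_generator.py"
        script_path.write_text(generator_script)
        # The generator takes no input and writes seeds to its working
        # directory, so run it with cwd set to the scratch directory.
        subprocess.run([sys.executable, str(script_path)],
                       cwd=work_dir, check=True, timeout=60)
        with zipfile.ZipFile(out_path, "w") as archive:
            for seed in sorted(pathlib.Path(work_dir).iterdir()):
                # Skip subdirectories and the generator script itself.
                if seed.is_file() and seed != script_path:
                    archive.write(seed, arcname=seed.name)
    return out_path
```

The generated script is untrusted, so a real integration would run it only inside the same sandbox that already builds the project.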

Signed-off-by: David Korczynski <david@adalogics.com>

The main reason for this is that `generate_code` does not really generate any code; rather, it queries a given LLM using a specified prompt. Since we now have prompts of various sorts, I feel the name is a bit misplaced. This could also make it clearer which API to use if you're working on using LLMs for tasks other than explicit code generation. This came up while doing #479, where one consideration was to have the LLM generate a corpus explicitly, without going through seed-corpus-by-way-of-Python generation.

Ref: #482

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski · 2024-07-14T22:11:32Z

/gcbrun exp -n dk-test-infra-99x1 -m vertex_ai_gemini-1-5 -b minor-for-ci -i

oliverchang

nice! do you have any preliminary results on this to share?

llm_toolkit/corpus_generator.py

experiment/evaluator.py

oliverchang · 2024-07-15T06:17:09Z

prompts/template_xml/corpus_generation_via_python_script.txt

+
+The target code of the harness is:
+<code>
+{TARGET_FUNCTION_SOURCE}


Would this benefit from more context? e.g. relevant data type information? existing seeds in the repository?

+1 to data type, e.g., struct detail if this is ready for functions with complex/project-specific data types.
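One way the extra context could be passed in is sketched below. This is an assumption, not part of the PR: the `{TYPE_CONTEXT}` placeholder, the "Relevant type definitions" section, and the `build_prompt` helper are hypothetical. Only `{TARGET_FUNCTION_SOURCE}` comes from the template under review.

```python
TEMPLATE = """The target code of the harness is:
<code>
{TARGET_FUNCTION_SOURCE}
</code>

Relevant type definitions:
<code>
{TYPE_CONTEXT}
</code>
"""


def build_prompt(harness_source: str, type_definitions: list[str]) -> str:
    """Fills the template with the harness and any known type definitions."""
    type_context = "\n\n".join(type_definitions) or "(none available)"
    # Use str.replace rather than str.format: C source is full of braces
    # that str.format would try to interpret as fields.
    prompt = TEMPLATE.replace("{TARGET_FUNCTION_SOURCE}", harness_source)
    return prompt.replace("{TYPE_CONTEXT}", type_context)
```

For a harness like Example 5, which casts raw bytes to a project-specific struct, passing the struct definition this way would let the model see the field layout instead of guessing it.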

llm_toolkit/corpus_generator.py

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski · 2024-07-15T19:43:24Z

do you have any preliminary results on this to share?

Yup.

Some sample scripts:

For the harness:

Example 1

#include "src/cjson/cJSON.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    cJSON *item = cJSON_Parse((const char*)data);
    cJSON_bool recurse = (cJSON_bool)(data[size-1] % 2);
    
    cJSON *result = cJSON_Duplicate(item, recurse);
    
    cJSON_Delete(result);
    cJSON_Delete(item);

    return 0;
}

This generator was created:

import json

# Generate 10 initial seeds for the fuzzing harness
seeds = []

# Seed 1: Empty input
seed1 = b""
seeds.append(seed1)

# Seed 2: Simple JSON object
seed2 = b'{"key": "value"}'
seeds.append(seed2)

# Seed 3: JSON object with nested object
seed3 = b'{"key": {"nested_key": "nested_value"}}'
seeds.append(seed3)

# Seed 4: JSON array
seed4 = b'[1, 2, 3]'
seeds.append(seed4)

# Seed 5: JSON array with nested objects
seed5 = b'[{"key1": "value1"}, {"key2": "value2"}]'
seeds.append(seed5)

# Seed 6: JSON object with array
seed6 = b'{"key": [1, 2, 3]}'
seeds.append(seed6)

# Seed 7: Complex nested JSON object
seed7 = b'{"key1": {"nested_key1": [1, 2, 3]}, "key2": {"nested_key2": {"deep_key": "deep_value"}}}'
seeds.append(seed7)

# Seed 8: JSON array with null values
seed8 = b'[null, null, null]'
seeds.append(seed8)

# Seed 9: JSON object with boolean values
seed9 = b'{"key1": true, "key2": false}'
seeds.append(seed9)

# Seed 10: JSON array with mixed data types
seed10 = b'[1, "two", true, false, null]'
seeds.append(seed10)

# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f'seed_{i+1}.json', 'wb') as f:
        f.write(seed)

Example 2

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < 3) {
        return 0;
    }

    // Prepare buffer
    buffer *b = buffer_init();
    
    // Prepare format string
    char format[size];
    memcpy(format, data, size - 1);
    format[size - 1] = '\0';

    // Prepare tm structure
    struct tm tm;
    memset(&tm, 0, sizeof(struct tm));

    // Fuzz the target function
    buffer_append_strftime(b, format, &tm);

    // Clean up
    buffer_free(b);

    return 0;
}

The following script was generated:

import os

# Create 10 initial seeds for the fuzzing harness
seeds = [
    b"%Y-%m-%d %H:%M:%S",  # Typical date and time format
    b"%A, %B %d, %Y",  # Weekday, month day, year format
    b"%I:%M %p",  # 12-hour clock format
    b"%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 format with timezone
    b"%Y/%m/%d %H:%M:%S",  # Date and time format with slashes
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y/%m/%d %I:%M %p",  # Date and 12-hour clock format with slashes
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
    b"%Y-%m-%d %H:%M:%S %Z",  # Date and time format with timezone
]

# Write the seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i}.txt", "wb") as f:
        f.write(seed)

Example 3

The following harness:

#include "/path/to/cJSON.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    cJSON *obj = cJSON_CreateObject();
  
    if (size < 2) {
        return 0;
    }

    // Allocate memory for name and raw strings
    char *name = (char *)malloc(size);
    if (!name) {
        return 0;
    }

    char *raw = (char *)malloc(size);
    if (!raw) {
        free(name);
        return 0;
    }

    // Copy data into name and raw, making sure they are null-terminated
    memcpy(name, data, size-1);
    name[size - 1] = '\0';

    memcpy(raw, data, size-1);
    raw[size - 1] = '\0';

    // Call the target function
    cJSON_AddRawToObject(obj, name, raw);

    // Free allocated memory
    free(name);
    free(raw);

    cJSON_Delete(obj);

    return 0;
}

The following generator was created:

import json

# Generate 10 initial seeds for the fuzzing harness
seeds = [
    {
        "object": {
            "key1": "value1",
            "key2": "value2"
        },
        "name": "name1",
        "raw": "raw1"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        },
        "name": "name2",
        "raw": "raw2"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4"
        },
        "name": "name3",
        "raw": "raw3"
    },
    {
        "object": {
            "key1": "value1"
        },
        "name": "name4",
        "raw": "raw4"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5"
        },
        "name": "name5",
        "raw": "raw5"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        },
        "name": "name6",
        "raw": "raw6"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6"
        },
        "name": "name7",
        "raw": "raw7"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4"
        },
        "name": "name8",
        "raw": "raw8"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6",
            "key7": "value7"
        },
        "name": "name9",
        "raw": "raw9"
    },
    {
        "object": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3",
            "key4": "value4",
            "key5": "value5",
            "key6": "value6",
            "key7": "value7",
            "key8": "value8"
        },
        "name": "name10",
        "raw": "raw10"
    }
]

# Write seeds to files
for i, seed in enumerate(seeds):
    with open(f"seed{i+1}.json", "w") as file:
        json.dump(seed, file, indent=4)

Example 4

For the following harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < sizeof(__intmax_t)) {
        return 0;
    }

    buffer *b = buffer_init();
    __intmax_t val;
    memcpy(&val, data, sizeof(__intmax_t));

    buffer_append_int(b, val);

    buffer_free(b);

    return 0;
}

The following generator was created:

import struct

# Generate 10 initial seeds for the fuzzing harness
seeds = [
    struct.pack('q', 0),  # Pack a long long (intmax_t) value of 0
    struct.pack('q', 1),  # Pack a long long (intmax_t) value of 1
    struct.pack('q', -1),  # Pack a long long (intmax_t) value of -1
    struct.pack('q', 100),  # Pack a long long (intmax_t) value of 100
    struct.pack('q', -100),  # Pack a long long (intmax_t) value of -100
    struct.pack('q', 9223372036854775807),  # Pack the maximum long long (intmax_t) value
    struct.pack('q', -9223372036854775808),  # Pack the minimum long long (intmax_t) value
    struct.pack('q', 123456789),  # Pack a random long long (intmax_t) value
    struct.pack('q', -987654321),  # Pack a random long long (intmax_t) value
    struct.pack('q', 555555555)  # Pack a random long long (intmax_t) value
]

# Write the seeds to files in the current working directory
for i, seed in enumerate(seeds):
    with open(f'seed{i}.dat', 'wb') as f:
        f.write(seed)

Example 5

Harness:

#include <stdint.h>
#include <stddef.h>
#include "/src/lighttpd1.4/src/buffer.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < sizeof(const struct const_iovec) || size % sizeof(const struct const_iovec) != 0) {
        return 0;
    }

    const struct const_iovec *iov = (const struct const_iovec *)data;
    const size_t n = size / sizeof(const struct const_iovec);
    buffer *b = buffer_init();

    buffer_append_iovec(b, iov, n);

    buffer_free(b);
    return 0;
}

Generator:

import os
import struct

# Define the structure of const_iovec
struct_const_iovec = struct.Struct("Q Q")

# Generate 10 initial seeds
for i in range(10):
    seed_data = b''
    for j in range(3):  # Generate 3 const_iovec structures per seed
        iov_base = os.urandom(8)  # Generate random iov_base (8 bytes)
        iov_len = struct.pack("Q", len(iov_base))  # Get length of iov_base
        seed_data += struct_const_iovec.pack(int.from_bytes(iov_base, byteorder='big'), int.from_bytes(iov_len, byteorder='little'))

    with open(f'seed_{i}.dat', 'wb') as f:
        f.write(seed_data)

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski · 2024-07-15T20:09:10Z

/gcbrun skip

oliverchang · 2024-07-16T01:06:45Z

llm_toolkit/corpus_generator.py

+from llm_toolkit import prompt_builder
+
+
+def get_corpus_generator_script(


One more nit/improvement: this module is called corpus_generator, so we can reduce some repetition by just calling this function get_script, i.e. callers will just do corpus_generator.get_script().

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski · 2024-07-16T08:06:49Z

/gcbrun skip

DonggeLiu · 2024-07-16T21:09:40Z

llm_toolkit/corpus_generator.py

+  """Uses LLMs to generate a python script that will create a seed corpus for a
+  harness.
+
+  The script generated is purely generated and should be considered untrusted


"The script generated is purely generated": probably "purely LLM-dependent"?

DonggeLiu · 2024-07-16T21:13:00Z

llm_toolkit/corpus_generator.py

+  The script generated is purely generated and should be considered untrusted
+  in the general sense. OSS-Fuzz-gen already executes arbitrary code since
+  OSS-Fuzz-gen executes arbitrary open source projects with no checking on
+  what code is committed to the given projects."""


Did you mean "OSS-Fuzz-gen already executes arbitrary code from arbitrary open source projects with no checking on what code is committed to the given projects"?

DonggeLiu · 2024-07-16T21:16:05Z

llm_toolkit/corpus_generator.py

+  # Get the corpus generation template
+  with open(
+      os.path.join(prompt_builder.DEFAULT_TEMPLATE_DIR,
+                   'corpus_generation_via_python_script.txt'), 'r') as f:


LGTM for now, but if we have more prompt templates for corpus generation in the future, maybe we could have a separate dir, e.g., prompts/corpus_generation_template/?

DonggeLiu · 2024-07-16T21:25:42Z

prompts/template_xml/corpus_generation_via_python_script.txt

+
+The target code of the harness is:
+<code>
+{TARGET_FUNCTION_SOURCE}


+1 to data type, e.g., struct detail if this is ready for functions with complex/project-specific data types.

DonggeLiu · 2024-07-16T21:48:32Z

prompts/template_xml/corpus_generation_via_python_script.txt

+{TARGET_FUNCTION_SOURCE}
+</code>
+
+Could you please construct a small Python program that generate 10 initial seeds for my fuzz harness?


How about also generating a dictionary? It can improve the efficiency of mutation. For example, if the fuzzer cannot tell which of two paths will have higher coverage, a dictionary will increase the likelihood of generating inputs for the desired one.
Similarly, suppose the LLM needs to encode some path constraints into the seed (e.g., if (x=="ABC"), but x is dynamically extracted from a non-fixed part of the input); a dictionary provides a more flexible way to increase the chance of generating that value.
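A rough sketch of what dictionary output could look like, assuming the LLM (or its generated script) returns a list of byte tokens. The helper names `dict_entry` and `write_dictionary` are hypothetical. The entry format is the libFuzzer dictionary syntax: one quoted token per line, with `\\`, `\"` and `\xNN` escapes.

```python
def dict_entry(token: bytes) -> str:
    """Formats one token as a quoted libFuzzer dictionary entry."""
    out = []
    for byte in token:
        ch = chr(byte)
        if ch in ('"', '\\'):
            # Quotes and backslashes must be backslash-escaped.
            out.append('\\' + ch)
        elif 0x20 <= byte < 0x7F:
            # Printable ASCII is written as-is.
            out.append(ch)
        else:
            # Everything else is hex-escaped.
            out.append(f"\\x{byte:02X}")
    return '"' + "".join(out) + '"'


def write_dictionary(tokens, path):
    """Writes de-duplicated tokens to a dictionary file, one per line."""
    with open(path, "w", encoding="ascii") as f:
        f.write("# Tokens suggested for the harness\n")
        for token in sorted(set(tokens)):
            f.write(dict_entry(token) + "\n")
```

For the `buffer_append_strftime` harness in Example 2, tokens such as `b"%Y"` or `b"%z"` would be natural candidates, since the mutator is unlikely to produce valid format specifiers by chance.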

DonggeLiu · 2024-07-16T21:54:45Z

prompts/template_xml/corpus_generation_via_python_script.txt

+
+The program you write to generate seeds should take no input and should output the seeds into the current working folder.
+
+Wrap the program in <results> tags in the reply and do not return any other text.


nit: If the response includes other text (e.g., explanatory sentences), maybe we can simply extract the code within the tag here?
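The extraction suggested above could look roughly like this. It is a sketch only: `extract_script` is a hypothetical helper, and returning `None` when the tags are missing is one possible design choice, not the PR's behaviour.

```python
import re

# Non-greedy match so only the first <results>...</results> pair is taken.
# DOTALL lets "." span the newlines inside the script.
_RESULTS_RE = re.compile(r"<results>(.*?)</results>", re.DOTALL)


def extract_script(response: str):
    """Returns the text inside the first <results> tag pair, or None."""
    match = _RESULTS_RE.search(response)
    if match is None:
        return None
    return match.group(1).strip()
```

With this in place, explanatory sentences before or after the tags are ignored, and the caller can decide whether a missing tag pair should trigger a retry.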

DavidKorczynski added 3 commits July 14, 2024 10:24

add poc corpus generation

3037058

Signed-off-by: David Korczynski <david@adalogics.com>

add coprus gen module

0e58988

Signed-off-by: David Korczynski <david@adalogics.com>

styling

aa59a04

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski mentioned this pull request Jul 14, 2024

Rename gen code to query llm #480

Merged

DavidKorczynski changed the title ~~Add poc corpus generator~~ Add set up for corpus generator generation by way of LLMs Jul 14, 2024

DavidKorczynski mentioned this pull request Jul 14, 2024

Use LLMs to generate corpus #482

Open

oliverchang reviewed Jul 15, 2024

address review

19fc62e

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski added 2 commits July 15, 2024 12:50

nit

b8b8081

Signed-off-by: David Korczynski <david@adalogics.com>

Merge branch 'main' into add-poc-corpus-generator

0eb1ae8

oliverchang requested a review from DonggeLiu July 16, 2024 00:53

oliverchang approved these changes Jul 16, 2024

fix nit

83acbfc

Signed-off-by: David Korczynski <david@adalogics.com>

DavidKorczynski merged commit 6954e1e into main Jul 16, 2024
7 checks passed

DavidKorczynski deleted the add-poc-corpus-generator branch July 16, 2024 08:14

DonggeLiu reviewed Jul 16, 2024
