c-kzg failing in SP1 due to unaligned access #169

CeciliaZ030 · 2024-05-07T23:46:13Z

Describe the feature request

Following
#148
#157
#162
We still have to fix kzg compilation in SP1. I've included example in pipeline/examples/sp1:
https://github.com/taikoxyz/raiko/blob/12826f2c820346b2c42deb757b359568e7d0f793/pipeline/examples/sp1/src/kzg.rs
The goal is to get the proof generated, although it seems like extern C overloading does proves & verified:

unsafe {
        let c = calloc(100);
        let m = malloc(200);
        free(c);
        free(m);
    }

Possibly a memory or stack limit issue.

Spam policy

I verify that this issue is NOT SPAM and understand SPAM issues will be closed and reported to GitHub, resulting in ACCOUNT TERMINATION.

The text was updated successfully, but these errors were encountered:

Brechtpd · 2024-05-07T23:58:53Z

Fails here: https://github.com/succinctlabs/sp1/blob/53c8915b99e06049f73b4e3c8f83c93296764808/core/src/runtime/mod.rs#L593

risc-v spec says this:

So allowed in the spec, with the warning it could be slow. Though SP1 seems to disallow it.

Brechtpd · 2024-05-08T00:32:11Z

Compiling with -mstrict-align seems to work to prevent unaligned accesses in the c code and the kzg code looks to run okay.

CeciliaZ030 · 2024-05-08T04:09:15Z

Fixed #170

puma314 · 2024-05-08T06:23:20Z

@Brechtpd thanks for this catch. I think we should note this somewhere in our docs, since it is non-obvious behavior that differs from the RISC-V spec.

Brechtpd · 2024-05-08T17:22:54Z

Reopening because seems the problem is not completely fixed for all blocks. Using block 101368 on A7, there is still an alignment issue at the same location. Even more, risc0 also fails for the block but with a different error:

SP1:

risc0:

puma314 · 2024-05-08T21:33:19Z

@Brechtpd I think that this is the same error across both VMs--seems like "LoadAccessFault" might be similar to our address not aligned error?

From a brief perusal online, it seems like in RISC-V, the hardware can implement only aligned accesses (this is true for us) and then the software can handle the unaligned accesses? (Stack overflow).

So perhaps we have to add a handler at the software layer for this. I can look into it / prioritize this since it seems like a blocker.

Brechtpd · 2024-05-08T22:10:00Z

@puma314

For risc0 it seems the error originates here: https://github.com/GregAC/rrs/blob/b23afc16b4e6a1fb5c4a73eb1e337e9400816507/rrs-lib/src/instruction_executor.rs#L185, and when going into read_mem it doesn't seem to hit the alignment error path because then the program would panic instead of returning the error that we see. So I think for some reason the memory location which is close to max int might be the problem and be corrupt (though no idea why that would be the case) but I'm not sure if risc0 manages the memory of different things so maybe the address could make sense. 4122387576 % 4 == 0 so I guess the address would also be aligned correctly.

Looking at risc0, they also do not seem to support unaligned memory accesses (unless they do the software fallback somewhere) which does seem to be very common. But it's also strange that the same blob that required -mstrict-align for sp1 did not need it for risc0. Though -mstrict-align is not enough, so who knows what's going on. :)

So hopefully for both the problem is just unaligned loads, will check with risc0 to see what they think about this error, if it ends up being an alignment problem then it would be great to have your help on this!

intoverflow · 2024-05-09T00:18:44Z

There's two additional flags you might want to pass to the builder:

-march=rv32im
-falign-functions=2

I'm on mobile at the mobile so cannot test, but if you share steps to reproduce the issue I can do some more digging tonight or tomorrow 🫡

flaub · 2024-05-09T00:55:48Z

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

SchmErik · 2024-05-09T01:55:17Z

Here's the memory layout in the risc0 zkvm. We only allow guest code to use 0x400 - 0x0c00_0000 between code, stack, and heap. Within this range, the stack occupies the lowest 2 mb of this range, followed by the code, and then followed by the heap. The heap starts where the code ends and grows upward on each allocation towards 0x0c00_0000. Any address above this is illegal for the guest to access and that's what you're seeing here.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap? Does this work as expected on non-zkvm machines? I also second @intoverflow's advice adding those two flags to the builder.

Brechtpd · 2024-05-09T06:16:23Z

Made a repro branch here: #176. So can checkout the kzg-fun branch of this repo, and then:

export TARGET=risc0 MOCK=1
make install
make build && RUST_BACKTRACE=1 make test

change TARGET to sp1 for testing that.

Output for risc0:

Output for SP1:

There's two additional flags you might want to pass to the builder:
* `-march=rv32im`

* `-falign-functions=2`

Unfortunately they do not seem to fix it, but I added them in the test branch.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap?

Nothing massive being allocated as far as I know, just the KZG trusted setup data and the blob data in memory (next to the block data, but that isn't much more than a couple of megabytes).

Does this work as expected on non-zkvm machines?

Oh good question because I think the zkVMs do use a different code path than the one taken by our "native" prover that also uses c-kzg but using the normal compilation path that I believe uses the x86 assembly hand optimized code. @smtmfft

We'll need to try that out and check it.

johntaiko · 2024-05-09T10:08:21Z

Does this work as expected on non-zkvm machines?

Works well in SGX environment.(block 101368 on A7)

johntaiko · 2024-05-09T13:11:28Z

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

It is quite possible that an arithmetic overflow issue from(c-kzg or its dependencies) was encountered on a 32-bit system(4122387576 approachs the maximum of u32), we can test it by running on other 32-bit virtual machines, like qemu.

johntaiko · 2024-05-09T13:26:21Z

we can test it by running on other 32-bit virtual machines.

So, we can build with debug info, and get the backtrace when panic

jtguibas · 2024-05-13T02:54:57Z

I did some investigating and this is what I found. To explore @johntaiko's hypothesis of the issue being related to compilation on 32-bit machines, I ran the following steps. Note that this isn't a perfect scenario to test things as it isn't being built on the RISC-V target (QEMU is a good next step IMO).

Create a 32-bit docker container and install requirements

sudo docker run --rm -it i686/ubuntu bash
apt-get update
apt-get install git
apt-get install build-essential
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone c-kzg taiko fork and run tests with memory/arithmetic sanitization

Clone repo

git clone https://github.com/smtmfft/c-kzg-4844.git
cd c-kzg-4844

Change bindings/rust/build.rs to compile c-kzg with C arithmetic overflow checks

cc.flag_if_supported("-ftrapv")
cc.flag_if_supported("-fsanitize=undefined")

Run tests with rust overflow checks

rustup toolchain install nightly
RUSTFLAGS="-C overflow-checks=on" cargo test --release

This seems to work FYI.

I also tried pointing raiko to this version of c-kzg-4844 to catch any arithmetic overflows, but it seems to still work on the prover when it's set to TARGET=native and make build && RUST_BACKTRACE=1 make test.

I think a good next step could be to do some more thorough debugging with GDB with the ELFs that get built for SP1/R0.

jtguibas · 2024-05-13T16:07:24Z

From further investigation, it seems like the issue is inside the call to:

blob_to_kzg_commitment

johntaiko · 2024-05-14T05:22:24Z

From further investigation, it seems like the issue is inside the call to:
blob_to_kzg_commitment

fix: #201

Backgroud

Because, we can't use the libc in zkvms, so we hacked the malloc and calloc methods

Case

But, we had a mismatch with the calloc function's signature.https://www.tutorialspoint.com/c_standard_library/c_function_calloc.htm

-unsafe extern "C" fn calloc(size: usize) -> *mut c_void {
+unsafe extern "C" fn calloc(nobj: usize, size: usize) -> *mut c_void {

Call in c-kzg-4844
https://github.com/smtmfft/c-kzg-4844/blob/77e9ba0a65e10e6a470832da2914b17a968da791/src/c_kzg_4844.c#L189

static C_KZG_RET c_kzg_calloc(void **out, size_t count, size_t size) {
    *out = NULL;
    if (count == 0 || size == 0) return C_KZG_BADARGS;
    // need alloc `count * size` but only get `count(4096)` in raiko
    *out = calloc(count, size);
    return *out != NULL ? C_KZG_OK : C_KZG_MALLOC;
}

static C_KZG_RET g1_lincomb_fast(
    g1_t *out, const g1_t *p, const fr_t *coeffs, uint64_t len
) {
    C_KZG_RET ret;
    void *scratch = NULL;
    blst_p1_affine *p_affine = NULL;
    blst_scalar *scalars = NULL;

    /* Tunable parameter: must be at least 2 since blst fails for 0 or 1 */
    if (len < 8) {
        g1_lincomb_naive(out, p, coeffs, len);
    } else {
        /* blst's implementation of the Pippenger method */
        size_t scratch_size = blst_p1s_mult_pippenger_scratch_sizeof(len);
        ret = c_kzg_malloc(&scratch, scratch_size);
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&p_affine, len, sizeof(blst_p1_affine));
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&scalars, len, sizeof(blst_scalar));
        if (ret != C_KZG_OK) goto out;

        /* Transform the points to affine representation */
        const blst_p1 *p_arg[2] = {p, NULL};
        blst_p1s_to_affine(p_affine, p_arg, len);

        /* Transform the field elements to 256-bit scalars */
        for (uint64_t i = 0; i < len; i++) {
            blst_scalar_from_fr(&scalars[i], &coeffs[i]);
        }

        /* Call the Pippenger implementation */
        const byte *scalars_arg[2] = {(byte *)scalars, NULL};
        const blst_p1_affine *points_arg[2] = {p_affine, NULL};
        blst_p1s_mult_pippenger(
            out, points_arg, len, scalars_arg, 255, scratch
        );
    }

    ret = C_KZG_OK;

out:
    c_kzg_free(scratch);
    c_kzg_free(p_affine);
    c_kzg_free(scalars);
    return ret;
}

So, some of the allocated memory overlapped, resulting in a UB.

smtmfft · 2024-06-06T05:45:47Z

Reopen for we still meet unaligned access in ckzg lib in SP1.
Add some findings here: https://hackmd.io/@taik0yue/B15mxaCVC

johntaiko · 2024-06-06T09:37:02Z

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by smtmfft/c-kzg-4844#2

Related update in blst: supranational/blst@master...dyxushuai:blst:master

smtmfft · 2024-06-06T13:31:30Z

Possible, but I actually tried that...not works in my env, maybe I missed sth. will test later.

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by smtmfft/c-kzg-4844#2

Related update in blst: supranational/blst@master...dyxushuai:blst:master

smtmfft · 2024-06-17T01:52:26Z

Adding flag is not enough, another reason is: #291

as sp1 requires strict aligned access, potentially some more issues we will meet later, and then that unsafe memory operation in rust seems inevitable. Hopefully there is a rustflag to let all vec allocation be in a aligned manner. will explore that later. (ref global aligned alloc: https://github.com/Brechtpd/cap/blob/a63589670397f85745daa1a22d2de4d46fd742aa/src/lib.rs#L183)

VM debugging is hard, no debug info, no frame stack, and no libc support (can not print in c native library), any tool can make it easier?

smtmfft · 2024-07-05T06:24:42Z

close this issue as non-reproducible for now.

CeciliaZ030 changed the title ~~c-kzg still failing in SGX & SP1~~ c-kzg still failing in SP1 May 7, 2024

Brechtpd mentioned this issue May 8, 2024

fix(sp1): kzg compilation fix for sp1 #170

Merged

CeciliaZ030 closed this as completed May 8, 2024

Brechtpd reopened this May 8, 2024

Brechtpd changed the title ~~c-kzg still failing in SP1~~ c-kzg still failing in SP1 and risc zero May 14, 2024

johntaiko linked a pull request May 14, 2024 that will close this issue

fix: mismatch method signature of libc's calloc #201

Merged

johntaiko closed this as completed in #201 May 14, 2024

smtmfft changed the title ~~c-kzg still failing in SP1 and risc zero~~ c-kzg failing in SP1 due to unaligned access Jun 6, 2024

smtmfft reopened this Jun 6, 2024

smtmfft closed this as completed Jul 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

c-kzg failing in SP1 due to unaligned access #169

c-kzg failing in SP1 due to unaligned access #169

CeciliaZ030 commented May 7, 2024

Brechtpd commented May 7, 2024

Brechtpd commented May 8, 2024

CeciliaZ030 commented May 8, 2024

puma314 commented May 8, 2024

Brechtpd commented May 8, 2024 •

edited

Loading

puma314 commented May 8, 2024

Brechtpd commented May 8, 2024

intoverflow commented May 9, 2024

flaub commented May 9, 2024

SchmErik commented May 9, 2024

Brechtpd commented May 9, 2024 •

edited

Loading

johntaiko commented May 9, 2024 •

edited

Loading

johntaiko commented May 9, 2024 •

edited

Loading

johntaiko commented May 9, 2024

jtguibas commented May 13, 2024 •

edited

Loading

jtguibas commented May 13, 2024

johntaiko commented May 14, 2024 •

edited

Loading

smtmfft commented Jun 6, 2024 •

edited

Loading

johntaiko commented Jun 6, 2024 •

edited

Loading

smtmfft commented Jun 6, 2024

smtmfft commented Jun 17, 2024 •

edited

Loading

smtmfft commented Jul 5, 2024

c-kzg failing in SP1 due to unaligned access #169

c-kzg failing in SP1 due to unaligned access #169

Comments

CeciliaZ030 commented May 7, 2024

Describe the feature request

Spam policy

Brechtpd commented May 7, 2024

Brechtpd commented May 8, 2024

CeciliaZ030 commented May 8, 2024

puma314 commented May 8, 2024

Brechtpd commented May 8, 2024 • edited Loading

puma314 commented May 8, 2024

Brechtpd commented May 8, 2024

intoverflow commented May 9, 2024

flaub commented May 9, 2024

SchmErik commented May 9, 2024

Brechtpd commented May 9, 2024 • edited Loading

johntaiko commented May 9, 2024 • edited Loading

johntaiko commented May 9, 2024 • edited Loading

johntaiko commented May 9, 2024

jtguibas commented May 13, 2024 • edited Loading

jtguibas commented May 13, 2024

johntaiko commented May 14, 2024 • edited Loading

Backgroud

Case

smtmfft commented Jun 6, 2024 • edited Loading

johntaiko commented Jun 6, 2024 • edited Loading

smtmfft commented Jun 6, 2024

smtmfft commented Jun 17, 2024 • edited Loading

smtmfft commented Jul 5, 2024

Brechtpd commented May 8, 2024 •

edited

Loading

Brechtpd commented May 9, 2024 •

edited

Loading

johntaiko commented May 9, 2024 •

edited

Loading

johntaiko commented May 9, 2024 •

edited

Loading

jtguibas commented May 13, 2024 •

edited

Loading

johntaiko commented May 14, 2024 •

edited

Loading

smtmfft commented Jun 6, 2024 •

edited

Loading

johntaiko commented Jun 6, 2024 •

edited

Loading

smtmfft commented Jun 17, 2024 •

edited

Loading