Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

c-kzg failing in SP1 due to unaligned access #169

Closed
1 task done
CeciliaZ030 opened this issue May 7, 2024 · 22 comments · Fixed by #201
Closed
1 task done

c-kzg failing in SP1 due to unaligned access #169

CeciliaZ030 opened this issue May 7, 2024 · 22 comments · Fixed by #201

Comments

@CeciliaZ030
Copy link
Contributor

Describe the feature request

Following
#148
#157
#162
We still have to fix kzg compilation in SP1. I've included example in pipeline/examples/sp1:
https://github.com/taikoxyz/raiko/blob/12826f2c820346b2c42deb757b359568e7d0f793/pipeline/examples/sp1/src/kzg.rs
The goal is to get the proof generated, although it seems like extern C overloading does proves & verified:

unsafe {
        let c = calloc(100);
        let m = malloc(200);
        free(c);
        free(m);
    }

Possibly a memory or stack limit issue.

Spam policy

  • I verify that this issue is NOT SPAM and understand SPAM issues will be closed and reported to GitHub, resulting in ACCOUNT TERMINATION.
@CeciliaZ030 CeciliaZ030 changed the title c-kzg still failing in SGX & SP1 c-kzg still failing in SP1 May 7, 2024
@Brechtpd
Copy link
Contributor

Brechtpd commented May 7, 2024

Fails here: https://github.com/succinctlabs/sp1/blob/53c8915b99e06049f73b4e3c8f83c93296764808/core/src/runtime/mod.rs#L593

risc-v spec says this:

image

So allowed in the spec, with the warning it could be slow. Though SP1 seems to disallow it.

@Brechtpd
Copy link
Contributor

Brechtpd commented May 8, 2024

Compiling with -mstrict-align seems to work to prevent unaligned accesses in the c code and the kzg code looks to run okay.

@CeciliaZ030
Copy link
Contributor Author

Fixed #170

@puma314
Copy link

puma314 commented May 8, 2024

@Brechtpd thanks for this catch. I think we should note this somewhere in our docs, since it is non-obvious behavior that differs from the RISC-V spec.

@Brechtpd Brechtpd reopened this May 8, 2024
@Brechtpd
Copy link
Contributor

Brechtpd commented May 8, 2024

Reopening because seems the problem is not completely fixed for all blocks. Using block 101368 on A7, there is still an alignment issue at the same location. Even more, risc0 also fails for the block but with a different error:

SP1:

image

risc0:

image

@puma314
Copy link

puma314 commented May 8, 2024

@Brechtpd I think that this is the same error across both VMs--seems like "LoadAccessFault" might be similar to our address not aligned error?

From a brief perusal online, it seems like in RISC-V, the hardware can implement only aligned accesses (this is true for us) and then the software can handle the unaligned accesses? (Stack overflow).

So perhaps we have to add a handler at the software layer for this. I can look into it / prioritize this since it seems like a blocker.

@Brechtpd
Copy link
Contributor

Brechtpd commented May 8, 2024

@puma314

For risc0 it seems the error originates here: https://github.com/GregAC/rrs/blob/b23afc16b4e6a1fb5c4a73eb1e337e9400816507/rrs-lib/src/instruction_executor.rs#L185, and when going into read_mem it doesn't seem to hit the alignment error path because then the program would panic instead of returning the error that we see. So I think for some reason the memory location which is close to max int might be the problem and be corrupt (though no idea why that would be the case) but I'm not sure if risc0 manages the memory of different things so maybe the address could make sense. 4122387576 % 4 == 0 so I guess the address would also be aligned correctly.

Looking at risc0, they also do not seem to support unaligned memory accesses (unless they do the software fallback somewhere) which does seem to be very common. But it's also strange that the same blob that required -mstrict-align for sp1 did not need it for risc0. Though -mstrict-align is not enough, so who knows what's going on. :)

So hopefully for both the problem is just unaligned loads, will check with risc0 to see what they think about this error, if it ends up being an alignment problem then it would be great to have your help on this!

@intoverflow
Copy link
Contributor

There's two additional flags you might want to pass to the builder:

  • -march=rv32im
  • -falign-functions=2

I'm on mobile at the mobile so cannot test, but if you share steps to reproduce the issue I can do some more digging tonight or tomorrow 🫡

@flaub
Copy link

flaub commented May 9, 2024

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

@SchmErik
Copy link

SchmErik commented May 9, 2024

Here's the memory layout in the risc0 zkvm. We only allow guest code to use 0x400 - 0x0c00_0000 between code, stack, and heap. Within this range, the stack occupies the lowest 2 mb of this range, followed by the code, and then followed by the heap. The heap starts where the code ends and grows upward on each allocation towards 0x0c00_0000. Any address above this is illegal for the guest to access and that's what you're seeing here.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap? Does this work as expected on non-zkvm machines? I also second @intoverflow's advice adding those two flags to the builder.

@Brechtpd
Copy link
Contributor

Brechtpd commented May 9, 2024

Made a repro branch here: #176. So can checkout the kzg-fun branch of this repo, and then:

export TARGET=risc0 MOCK=1
make install
make build && RUST_BACKTRACE=1 make test

change TARGET to sp1 for testing that.

Output for risc0:

image

Output for SP1:

image

There's two additional flags you might want to pass to the builder:

* `-march=rv32im`

* `-falign-functions=2`

Unfortunately they do not seem to fix it, but I added them in the test branch.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap?

Nothing massive being allocated as far as I know, just the KZG trusted setup data and the blob data in memory (next to the block data, but that isn't much more than a couple of megabytes).

Does this work as expected on non-zkvm machines?

Oh good question because I think the zkVMs do use a different code path than the one taken by our "native" prover that also uses c-kzg but using the normal compilation path that I believe uses the x86 assembly hand optimized code. @smtmfft

We'll need to try that out and check it.

@johntaiko
Copy link
Contributor

johntaiko commented May 9, 2024

Does this work as expected on non-zkvm machines?

Works well in SGX environment.(block 101368 on A7)

@johntaiko
Copy link
Contributor

johntaiko commented May 9, 2024

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

It is quite possible that an arithmetic overflow issue from(c-kzg or its dependencies) was encountered on a 32-bit system(4122387576 approachs the maximum of u32), we can test it by running on other 32-bit virtual machines, like qemu.

@johntaiko
Copy link
Contributor

we can test it by running on other 32-bit virtual machines.

So, we can build with debug info, and get the backtrace when panic

@jtguibas
Copy link

jtguibas commented May 13, 2024

I did some investigating and this is what I found. To explore @johntaiko's hypothesis of the issue being related to compilation on 32-bit machines, I ran the following steps. Note that this isn't a perfect scenario to test things as it isn't being built on the RISC-V target (QEMU is a good next step IMO).

Create a 32-bit docker container and install requirements

sudo docker run --rm -it i686/ubuntu bash
apt-get update
apt-get install git
apt-get install build-essential
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone c-kzg taiko fork and run tests with memory/arithmetic sanitization

Clone repo

git clone https://github.com/smtmfft/c-kzg-4844.git
cd c-kzg-4844

Change bindings/rust/build.rs to compile c-kzg with C arithmetic overflow checks

cc.flag_if_supported("-ftrapv")
cc.flag_if_supported("-fsanitize=undefined")

Run tests with rust overflow checks

rustup toolchain install nightly
RUSTFLAGS="-C overflow-checks=on" cargo test --release

This seems to work FYI.

I also tried pointing raiko to this version of c-kzg-4844 to catch any arithmetic overflows, but it seems to still work on the prover when it's set to TARGET=native and make build && RUST_BACKTRACE=1 make test.

I think a good next step could be to do some more thorough debugging with GDB with the ELFs that get built for SP1/R0.

@jtguibas
Copy link

From further investigation, it seems like the issue is inside the call to:

blob_to_kzg_commitment

@Brechtpd Brechtpd changed the title c-kzg still failing in SP1 c-kzg still failing in SP1 and risc zero May 14, 2024
@johntaiko
Copy link
Contributor

johntaiko commented May 14, 2024

From further investigation, it seems like the issue is inside the call to:

blob_to_kzg_commitment

fix: #201

Backgroud

Because, we can't use the libc in zkvms, so we hacked the malloc and calloc methods

Case

But, we had a mismatch with the calloc function's signature.https://www.tutorialspoint.com/c_standard_library/c_function_calloc.htm

-unsafe extern "C" fn calloc(size: usize) -> *mut c_void {
+unsafe extern "C" fn calloc(nobj: usize, size: usize) -> *mut c_void {

Call in c-kzg-4844
https://github.com/smtmfft/c-kzg-4844/blob/77e9ba0a65e10e6a470832da2914b17a968da791/src/c_kzg_4844.c#L189

static C_KZG_RET c_kzg_calloc(void **out, size_t count, size_t size) {
    *out = NULL;
    if (count == 0 || size == 0) return C_KZG_BADARGS;
    // need alloc `count * size` but only get `count(4096)` in raiko
    *out = calloc(count, size);
    return *out != NULL ? C_KZG_OK : C_KZG_MALLOC;
}

static C_KZG_RET g1_lincomb_fast(
    g1_t *out, const g1_t *p, const fr_t *coeffs, uint64_t len
) {
    C_KZG_RET ret;
    void *scratch = NULL;
    blst_p1_affine *p_affine = NULL;
    blst_scalar *scalars = NULL;

    /* Tunable parameter: must be at least 2 since blst fails for 0 or 1 */
    if (len < 8) {
        g1_lincomb_naive(out, p, coeffs, len);
    } else {
        /* blst's implementation of the Pippenger method */
        size_t scratch_size = blst_p1s_mult_pippenger_scratch_sizeof(len);
        ret = c_kzg_malloc(&scratch, scratch_size);
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&p_affine, len, sizeof(blst_p1_affine));
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&scalars, len, sizeof(blst_scalar));
        if (ret != C_KZG_OK) goto out;

        /* Transform the points to affine representation */
        const blst_p1 *p_arg[2] = {p, NULL};
        blst_p1s_to_affine(p_affine, p_arg, len);

        /* Transform the field elements to 256-bit scalars */
        for (uint64_t i = 0; i < len; i++) {
            blst_scalar_from_fr(&scalars[i], &coeffs[i]);
        }

        /* Call the Pippenger implementation */
        const byte *scalars_arg[2] = {(byte *)scalars, NULL};
        const blst_p1_affine *points_arg[2] = {p_affine, NULL};
        blst_p1s_mult_pippenger(
            out, points_arg, len, scalars_arg, 255, scratch
        );
    }

    ret = C_KZG_OK;

out:
    c_kzg_free(scratch);
    c_kzg_free(p_affine);
    c_kzg_free(scalars);
    return ret;
}

So, some of the allocated memory overlapped, resulting in a UB.

@johntaiko johntaiko linked a pull request May 14, 2024 that will close this issue
@smtmfft smtmfft changed the title c-kzg still failing in SP1 and risc zero c-kzg failing in SP1 due to unaligned access Jun 6, 2024
@smtmfft smtmfft reopened this Jun 6, 2024
@smtmfft
Copy link
Contributor

smtmfft commented Jun 6, 2024

Reopen for we still meet unaligned access in ckzg lib in SP1.
Add some findings here: https://hackmd.io/@taik0yue/B15mxaCVC

@johntaiko
Copy link
Contributor

johntaiko commented Jun 6, 2024

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by smtmfft/c-kzg-4844#2

Related update in blst: supranational/blst@master...dyxushuai:blst:master

@smtmfft
Copy link
Contributor

smtmfft commented Jun 6, 2024

Possible, but I actually tried that...not works in my env, maybe I missed sth. will test later.

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by smtmfft/c-kzg-4844#2

Related update in blst: supranational/blst@master...dyxushuai:blst:master

@smtmfft
Copy link
Contributor

smtmfft commented Jun 17, 2024

Adding flag is not enough, another reason is: #291

as sp1 requires strict aligned access, potentially some more issues we will meet later, and then that unsafe memory operation in rust seems inevitable. Hopefully there is a rustflag to let all vec allocation be in a aligned manner. will explore that later. (ref global aligned alloc: https://github.com/Brechtpd/cap/blob/a63589670397f85745daa1a22d2de4d46fd742aa/src/lib.rs#L183)

VM debugging is hard, no debug info, no frame stack, and no libc support (can not print in c native library), any tool can make it easier?

@smtmfft
Copy link
Contributor

smtmfft commented Jul 5, 2024

close this issue as non-reproducible for now.

@smtmfft smtmfft closed this as completed Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants