c-kzg failing in SP1 due to unaligned access #169
Comments
The RISC-V spec says this: So unaligned access is allowed by the spec, with the warning that it could be slow, though SP1 seems to disallow it. |
Compiling with |
Fixed #170 |
@Brechtpd thanks for this catch. I think we should note this somewhere in our docs, since it is non-obvious behavior that differs from the RISC-V spec. |
@Brechtpd I think that this is the same error across both VMs. It seems like "LoadAccessFault" might be similar to our address-not-aligned error? From a brief perusal online, it seems like in RISC-V the hardware can implement only aligned accesses (this is true for us) and then the software can handle the unaligned accesses? (Stack Overflow). So perhaps we have to add a handler at the software layer for this. I can look into it / prioritize this since it seems like a blocker. |
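To make the "software handles the unaligned accesses" idea concrete, here is a minimal sketch in Rust. It is not SP1 or risc0 code; `is_aligned` and `load_u32_le` are hypothetical helper names. The load is assembled byte by byte, so it never issues a 4-byte load at a misaligned address.

```rust
/// Returns true when `addr` is a multiple of `align`.
/// `align` is assumed to be a power of two (hypothetical helper).
pub fn is_aligned(addr: usize, align: usize) -> bool {
    debug_assert!(align.is_power_of_two());
    addr & (align - 1) == 0
}

/// Software fallback for a possibly misaligned 32-bit load:
/// copy four bytes out of the buffer and assemble them as little-endian.
/// Returns None when the read would run past the end of `buf`.
pub fn load_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let bytes = buf.get(offset..end)?;
    let mut word = [0u8; 4];
    word.copy_from_slice(bytes);
    Some(u32::from_le_bytes(word))
}
```

A VM that only implements aligned loads would trap on the odd offset; the byte-wise version trades speed for working at any offset.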
For risc0 it seems the error originates here: https://github.com/GregAC/rrs/blob/b23afc16b4e6a1fb5c4a73eb1e337e9400816507/rrs-lib/src/instruction_executor.rs#L185. When going into read_mem it doesn't seem to hit the alignment error path, because then the program would panic instead of returning the error that we see. So I think the memory location, which is close to max int, might be the problem and be corrupt (though I have no idea why that would be the case), but I'm not sure if risc0 manages the memory of different things, so maybe the address could make sense. Looking at risc0, they also do not seem to support unaligned memory accesses (unless they do the software fallback somewhere), which does seem to be very common. But it's also strange that the same blob that required So hopefully for both the problem is just unaligned loads. I will check with risc0 to see what they think about this error; if it ends up being an alignment problem then it would be great to have your help on this! |
There are two additional flags you might want to pass to the builder:
I'm on mobile at the moment so cannot test, but if you share steps to reproduce the issue I can do some more digging tonight or tomorrow 🫡 |
The above address |
Here's the memory layout in the risc0 zkVM. We only allow guest code to use 0x400 - 0x0c00_0000 between code, stack, and heap. Within this range, the stack occupies the lowest 2 MB, followed by the code, and then followed by the heap. The heap starts where the code ends and grows upward on each allocation towards The address being |
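As a rough sanity check against the range described above, a sketch like the following tests whether an address falls inside the guest window. The constant and function names are hypothetical, and treating the upper bound as exclusive is an assumption; only the two bounds come from the comment. The value 4122387576 is the one quoted elsewhere in this thread.

```rust
/// Lowest address guest code may use, per the comment above.
pub const GUEST_MIN: u32 = 0x400;
/// Upper bound of the guest window; treated as exclusive here (assumption).
pub const GUEST_MAX: u32 = 0x0c00_0000;

/// Returns true when `addr` lies inside the guest-usable window.
pub fn in_guest_range(addr: u32) -> bool {
    (GUEST_MIN..GUEST_MAX).contains(&addr)
}
```

Under these assumptions `in_guest_range(4122387576)` is false, which is consistent with a load at that address faulting regardless of alignment.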
Made a repro branch here: #176. So can checkout the
change TARGET to Output for risc0: Output for SP1:
Unfortunately they do not seem to fix it, but I added them in the test branch.
Nothing massive being allocated as far as I know, just the KZG trusted setup data and the blob data in memory (next to the block data, but that isn't much more than a couple of megabytes).
Oh good question, because I think the zkVMs do use a different code path than the one taken by our "native" prover, which also uses c-kzg but goes through the normal compilation path that I believe uses the hand-optimized x86 assembly. @smtmfft We'll need to try that out and check it. |
Works well in the SGX environment (block 101368 on A7). |
It is quite possible that an arithmetic overflow issue from c-kzg (or its dependencies) was encountered on a 32-bit system (4122387576 approaches the maximum of u32). We can test this by running on other 32-bit virtual machines, like QEMU. |
So, we can build with debug info and get the backtrace when it panics. |
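For illustration only, this is the kind of 32-bit size computation that the overflow hypothesis is about. The function name is hypothetical and this is not taken from c-kzg; it only shows how a `count * size` product can exceed `u32::MAX` and how `checked_mul` surfaces that instead of wrapping.

```rust
/// Computes `count * size` in 32-bit arithmetic, returning None on overflow.
/// Hypothetical helper: mimics a size computation on a 32-bit target.
pub fn alloc_size_u32(count: u32, size: u32) -> Option<u32> {
    count.checked_mul(size)
}

/// The same product with wrapping semantics, i.e. what unchecked C
/// `uint32_t` arithmetic would silently produce.
pub fn alloc_size_u32_wrapping(count: u32, size: u32) -> u32 {
    count.wrapping_mul(size)
}
```

Comparing the two results on the same inputs is a cheap way to see whether a given size would silently wrap on a 32-bit target.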
I did some investigating and this is what I found. To explore @johntaiko's hypothesis of the issue being related to compilation on 32-bit machines, I ran the following steps. Note that this isn't a perfect scenario to test things as it isn't being built on the RISC-V target (QEMU is a good next step IMO).
Create a 32-bit Docker container and install requirements
Clone c-kzg taiko fork and run tests with memory/arithmetic sanitization
Clone repo
Change
Run tests with rust overflow checks
This seems to work FYI. I also tried pointing I think a good next step could be to do some more thorough debugging with GDB with the ELFs that get built for SP1/R0. |
From further investigation, it seems like the issue is inside the call to:
|
fix: #201
Background
Because, we can't use the
Case
But, we had a mismatch with the
-unsafe extern "C" fn calloc(size: usize) -> *mut c_void {
+unsafe extern "C" fn calloc(nobj: usize, size: usize) -> *mut c_void {
Call in
static C_KZG_RET c_kzg_calloc(void **out, size_t count, size_t size) {
*out = NULL;
if (count == 0 || size == 0) return C_KZG_BADARGS;
// need alloc `count * size` but only get `count(4096)` in raiko
*out = calloc(count, size);
return *out != NULL ? C_KZG_OK : C_KZG_MALLOC;
}
static C_KZG_RET g1_lincomb_fast(
g1_t *out, const g1_t *p, const fr_t *coeffs, uint64_t len
) {
C_KZG_RET ret;
void *scratch = NULL;
blst_p1_affine *p_affine = NULL;
blst_scalar *scalars = NULL;
/* Tunable parameter: must be at least 2 since blst fails for 0 or 1 */
if (len < 8) {
g1_lincomb_naive(out, p, coeffs, len);
} else {
/* blst's implementation of the Pippenger method */
size_t scratch_size = blst_p1s_mult_pippenger_scratch_sizeof(len);
ret = c_kzg_malloc(&scratch, scratch_size);
if (ret != C_KZG_OK) goto out;
ret = c_kzg_calloc((void **)&p_affine, len, sizeof(blst_p1_affine));
if (ret != C_KZG_OK) goto out;
ret = c_kzg_calloc((void **)&scalars, len, sizeof(blst_scalar));
if (ret != C_KZG_OK) goto out;
/* Transform the points to affine representation */
const blst_p1 *p_arg[2] = {p, NULL};
blst_p1s_to_affine(p_affine, p_arg, len);
/* Transform the field elements to 256-bit scalars */
for (uint64_t i = 0; i < len; i++) {
blst_scalar_from_fr(&scalars[i], &coeffs[i]);
}
/* Call the Pippenger implementation */
const byte *scalars_arg[2] = {(byte *)scalars, NULL};
const blst_p1_affine *points_arg[2] = {p_affine, NULL};
blst_p1s_mult_pippenger(
out, points_arg, len, scalars_arg, 255, scratch
);
}
ret = C_KZG_OK;
out:
c_kzg_free(scratch);
c_kzg_free(p_affine);
c_kzg_free(scalars);
return ret;
}
So some of the allocated memory overlapped, resulting in UB. |
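A minimal sketch of a shim with the two-argument C signature, assuming a Rust-side allocator backs the C library. `guest_calloc` and `SHIM_ALIGN` are hypothetical names (the real symbol would have to be exported as `calloc`), and the 16-byte alignment is an assumption, not what raiko uses. The point is that the shim must allocate `nobj * size` bytes, with the multiplication checked.

```rust
use std::alloc::{alloc_zeroed, Layout};
use std::ffi::c_void;

/// Alignment handed back to C callers (assumption for this sketch).
pub const SHIM_ALIGN: usize = 16;

/// Hypothetical calloc shim with the two-argument C signature.
/// Returns null on zero size, multiplication overflow, or allocation failure.
pub unsafe extern "C" fn guest_calloc(nobj: usize, size: usize) -> *mut c_void {
    // Allocate nobj * size bytes, not just nobj (the mismatch described above).
    let total = match nobj.checked_mul(size) {
        Some(t) if t > 0 => t,
        _ => return std::ptr::null_mut(),
    };
    match Layout::from_size_align(total, SHIM_ALIGN) {
        // alloc_zeroed returns zero-initialized memory, as calloc requires.
        Ok(layout) => unsafe { alloc_zeroed(layout) as *mut c_void },
        Err(_) => std::ptr::null_mut(),
    }
}
```

With the one-argument version, a call like `calloc(4096, sizeof(blst_p1_affine))` would only get 4096 bytes, so later allocations can land inside memory the caller believes it owns. A real shim also needs a matching `free` that can recover the allocation size; that is left out here.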
Reopening because we still hit unaligned access in the c-kzg lib in SP1. |
Fixed by smtmfft/c-kzg-4844#2 Related update in blst: supranational/blst@master...dyxushuai:blst:master |
Possible, but I actually tried that and it does not work in my env; maybe I missed something. Will test later.
|
Adding the flag is not enough; another reason is #291. Since SP1 requires strictly aligned access, we will potentially meet more issues later, and the unsafe memory operations in Rust seem inevitable. Hopefully there is a rustflag to make all Vec allocations aligned; I will explore that later (ref. global aligned alloc: https://github.com/Brechtpd/cap/blob/a63589670397f85745daa1a22d2de4d46fd742aa/src/lib.rs#L183). VM debugging is hard: no debug info, no stack frames, and no libc support (cannot print in a native C library). Is there any tool that can make it easier? |
Closing this issue as non-reproducible for now. |
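One way to get "all Vec allocations aligned" without a rustflag is a global allocator wrapper that raises every requested alignment to a minimum. The sketch below is not the linked `cap` code and not SP1's allocator; `MinAlignAlloc` and `MIN_ALIGN` are hypothetical names, the value 8 is an assumption, and it wraps `std::alloc::System` only so that it runs on a host.

```rust
use std::alloc::{GlobalAlloc, Layout, System};

/// Minimum alignment enforced for every allocation (assumption).
pub const MIN_ALIGN: usize = 8;

/// Hypothetical allocator wrapper that over-aligns all requests.
pub struct MinAlignAlloc;

/// Raises the layout's alignment to at least MIN_ALIGN, keeping its size.
/// Falls back to the original layout if the raised one cannot be built.
fn raise(layout: Layout) -> Layout {
    layout.align_to(MIN_ALIGN).unwrap_or(layout)
}

unsafe impl GlobalAlloc for MinAlignAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(raise(layout)) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Must use the same raised layout that alloc used.
        unsafe { System.dealloc(ptr, raise(layout)) }
    }
}

#[global_allocator]
static GLOBAL: MinAlignAlloc = MinAlignAlloc;
```

This only covers heap allocations made through the Rust allocator. Pointers into the middle of a buffer (for example a `&[u8]` sliced at an odd offset and then cast to a wider type) can still be misaligned, so it reduces but does not remove the need to audit unsafe casts.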
Describe the feature request
Following
#148
#157
#162
We still have to fix kzg compilation in SP1. I've included an example in pipeline/examples/sp1:
https://github.com/taikoxyz/raiko/blob/12826f2c820346b2c42deb757b359568e7d0f793/pipeline/examples/sp1/src/kzg.rs
The goal is to get the proof generated, although it seems like with extern C overloading it does prove and verify:
Possibly a memory or stack limit issue.