Memory corruption observed when using Moka v0.9.6 #281
Details of the issue: https://github.com/MaterializeInc/database-issues/issues/5879. It seems easy to reproduce if we can run the following test job:
- id: zippy-user-tables-large
  label: "Long Zippy w/ user tables"
  timeout_in_minutes: 2880
  agents:
    queue: linux-x86_64-large
  artifact_paths: junit_*.xml
  plugins:
    - ./ci/plugins/mzcompose:
        composition: zippy
        args: [--scenario=UserTablesLarge, --actions=200000]
I ran a short version of the "Zippy w/ user tables" test on Linux x86_64. I used the same scenario but a smaller number of actions: 2,000 instead of 200,000. I could not reproduce the issue. I will try a longer run. Environment:
Terminal Log:
$ git clone git@github.com:MaterializeInc/materialize.git
$ cd materialize
$ git branch moka-segfault 07386e427f
$ git checkout moka-segfault
$ rg moka
Cargo.lock
2903: "moka",
3241:name = "moka"
4380: "moka",
8629: "moka",
src/persist-client/Cargo.toml
41:moka = { version = "0.9.6", default-features = false, features = ["sync"] }
src/persist-client/src/internal/cache.rs
16:use moka::notification::RemovalCause;
17:use moka::sync::Cache;
100: // This could maybe use moka's async cache to unify any concurrent
src/workspace-hack/Cargo.toml
63:moka = { version = "0.9.6", default-features = false, features = ["atomic64", "sync"] }
164:moka = { version = "0.9.6", default-features = false, features = ["atomic64", "sync"] }

$ ./bin/mzcompose --find zippy run default --scenario UserTablesLarge --actions 2000
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-WBP7JFLWFQYUIHLXR5JSOAJ74VVKI6O7
materialize/clusterd:mzbuild-Q77JA6BIOATFUZAWANSIJFOCUXV3GOGX
materialize/materialized:mzbuild-IIUOBQEI7JVK3YSGYYV6FP3EP3JGFF5I
materialize/test-certs:mzbuild-6JWZ2MGLSYOCL3MLHIEIJMMLQLEOLCNY
materialize/postgres:mzbuild-ENTVWBFMO762DEEKAMURDMVMQ52GIO63
materialize/testdrive:mzbuild-ANAR5HEUWWDMTQUJ3ZA7XJWEY64H2ZTB
...
==> Running test case workflow-default
==> Running workflow default
...
[+] Running 3/3
✔ Container zippy-cockroach-1 Healthy
✔ Container zippy-minio-1 Healthy
✔ Container zippy-materialized-1 Healthy
> ALTER SYSTEM SET enable_unmanaged_cluster_replicas = true;
> CREATE CLUSTER storaged REPLICAS (r2 (
STORAGECTL ADDRESSES ['storaged:2100'],
STORAGE ADDRESSES ['storaged:2103'],
COMPUTECTL ADDRESSES ['storaged:2101'],
COMPUTE ADDRESSES ['storaged:2102'],
WORKERS 4
))
[+] Stopping 1/1
✔ Container zippy-materialized-1 Stopped
Going to remove zippy-materialized-1
[+] Removing 1/0
✔ Container zippy-materialized-1 Removed
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-testdrive-1 Healthy
Generating test...
Running test...
--- #1: KafkaStart
[+] Building 0.0s (0/0)
[+] Running 2/2
✔ Container zippy-zookeeper-1 Healthy
✔ Container zippy-kafka-1 Healthy
--- #2: CockroachStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-cockroach-1 Healthy
--- #3: MinioStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-minio-1 Healthy
--- #4: MzStart
[+] Building 0.0s (0/0)
[+] Running 3/3
✔ Container zippy-cockroach-1 Healthy
✔ Container zippy-minio-1 Healthy
✔ Container zippy-materialized-1 Healthy
--- #5: StoragedStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-storaged-1 Healthy
--- #5: CreateTable
> CREATE TABLE table_0 (f1 INTEGER);
rows match; continuing at ts 1687305573.2167296
> INSERT INTO table_0 VALUES (0);
rows match; continuing at ts 1687305573.2269592
...
> INSERT INTO table_0 SELECT * FROM generate_series(568924, 572569);
rows match; continuing at ts 1687306589.1175394
==> mzcompose: test case workflow-default succeeded
There are almost 300 commits between when Moka was added and when it was removed; I need to know which commit last failed.
The following commits were mentioned in https://github.com/MaterializeInc/database-issues/issues/5879.
Ran the test (with 5,000 actions) several times on my home PC, but I have not reproduced the issue.
My PC (Windows 11 + WSL2) does not have enough RAM to run a longer Zippy test. It has only 16GB of RAM, and Docker containers like Kafka sometimes get killed while the test is running. I will expand the RAM to 32GB; I hope a longer run will help to reproduce the issue. Another aspect could be the number of available cores/hyper-threads: my PC has 16 cores and 20 threads, but their EC2 instance has only 8 vCPUs.
I will probably adjust the number of hyper-threads available to Docker.
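For example, a .wslconfig along these lines caps the CPU cores and memory available to WSL2 (and therefore to Docker). This is only a sketch; the 8-processor / 24GB values are illustrative and simply match the allocation mentioned in a later comment.

# %UserProfile%\.wslconfig  (illustrative values, adjust to the machine)
[wsl2]
processors=8
memory=24GB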
Also, I will add some tracing logs to both Moka and Materialize to get a better understanding of the access pattern.
@tatsuya6502 I have 32 threads / 128GB RAM and can leave the test running overnight if that helps.
@0xdeafbeef Thanks for the offer. Wow, you have a great machine! I think I am going to be okay. I already ordered the RAM modules and am waiting for them to arrive. I think 32GB should be enough for now.
FYI, I got the RAM modules for my PC and confirmed it can run the test with at least 50,000 actions. I allocated 24GiB of memory and only 8 logical CPU cores to WSL2. The reason I did not try 200,000 actions is that it would take a long time (estimated ~30 hours); 50,000 actions took 7 hours and 14 minutes. The memory utilization was fine, so I have a feeling it could run 200,000 actions without issue.

Also, before I got the RAM modules, I was able to run 80,000 actions. I found that reducing the number of logical CPU cores for WSL2/Docker reduces the memory utilization of the test, so I allocated only 8 cores (instead of 20) and 10GiB of memory (instead of 8GiB) to WSL2, and it was able to run 80,000 actions.
I talked to a Materialize engineer. They will try to reproduce the issue using an EC2 instance and give me the full set of instructions. MaterializeInc/materialize#20136 (comment)
With some help from Materialize engineers, I made some progress. I reproduced the segmentation fault with Moka v0.9.6 in Materialize and got a core dump to analyze. NOTE: I had to use an Amazon EC2 instance to do this. Here is the back trace of the crashed thread: it crashed inside Moka.

Materialize uses Moka v0.9.6, and this problem may have been fixed by #259 in v0.11.0. I will back-port the fix to v0.9.x and run the same test to see if the problem goes away.

Idea of steps to reproduce: If this is related to #259, the reproducing steps could be the following. However, the end result depends largely on how the compiler optimizes reading an immutable reference:
188: pub(crate) unsafe fn move_to_back(&mut self, mut node: NonNull<DeqNode<T>>) {
...
201: match node.prev {
202:     Some(prev) if node.next.is_some() => (*prev.as_ptr()).next = node.next,
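For reference, below is a minimal, self-contained sketch of the kind of doubly-linked-list relinking that move_to_back performs. It is an assumed simplification based only on the excerpt above, not Moka's actual implementation (the struct fields and the rest of the function body are guesses). It mainly illustrates why the operation is unsafe: if node no longer belongs to the deque, for example because it was already unlinked and freed, the raw-pointer writes land on freed or unrelated memory, which is the kind of corruption being chased in this issue.

use std::ptr::NonNull;

// Assumed sketch only, not Moka's real deque. It mirrors the shape of the
// excerpt above: nodes linked through raw `NonNull` pointers.
struct DeqNode<T> {
    prev: Option<NonNull<DeqNode<T>>>,
    next: Option<NonNull<DeqNode<T>>>,
    element: T,
}

struct Deque<T> {
    head: Option<NonNull<DeqNode<T>>>,
    tail: Option<NonNull<DeqNode<T>>>,
}

impl<T> Deque<T> {
    // Detach `node` from its current position and re-attach it at the back.
    //
    // Safety: `node` must point to a live node that is currently linked into
    // `self`. If it does not (e.g. the node was already removed and dropped),
    // the writes through `prev`/`next` below corrupt memory.
    unsafe fn move_to_back(&mut self, mut node: NonNull<DeqNode<T>>) {
        if self.tail == Some(node) {
            return; // Already at the back; nothing to do.
        }
        let n = node.as_mut();
        // Unlink `node`: make its neighbours point past it.
        match n.prev {
            Some(prev) => (*prev.as_ptr()).next = n.next,
            None => self.head = n.next,
        }
        if let Some(next) = n.next {
            (*next.as_ptr()).prev = n.prev;
        }
        // Re-link `node` at the back of the deque.
        n.prev = self.tail;
        n.next = None;
        if let Some(old_tail) = self.tail {
            (*old_tail.as_ptr()).next = Some(node);
        }
        self.tail = Some(node);
    }
}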
I reviewed the code and realized that the above will not reproduce the problem.
I backported the fix to v0.9.x. I am running "Long Zippy w/ user tables" on an EC2 instance. I will release Moka v0.9.8 with #283 after I finish running the pre-release tests, and then I will let the Materialize engineers know that the problem may have been fixed.
I will backport #259 to Moka v0.10.x as well. I will keep this issue open until I am done with it.
Closing as completed.
- Moka v0.9.x
- Moka v0.10.x
Split from MaterializeInc/database-issues#93.
- Exact conditions to reproduce are yet unknown.
- Platform => Linux x86_64
- moka::sync::Cache instances are used in async contexts. (An instance per blob URL. Do not know how many.)

I am trying to reproduce the issue using mokabench but no luck: #279 (comment)
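For context, here is a minimal sketch of the usage pattern described above: a moka::sync::Cache (v0.9.x, "sync" feature) built with an eviction listener, as the RemovalCause import in the rg output suggests, and shared across threads by cloning. This is an assumed illustration, not Materialize's persist-client code; the key/value types, the capacity, and the thread loop are made up.

// Assumed illustration only -- not Materialize's persist-client code.
// Requires: moka = { version = "0.9", default-features = false, features = ["sync"] }
use moka::notification::RemovalCause;
use moka::sync::Cache;
use std::sync::Arc;
use std::thread;

fn main() {
    // The listener runs when an entry is evicted or invalidated.
    let listener = |key: Arc<String>, value: Vec<u8>, cause: RemovalCause| {
        println!("removed key={} ({} bytes), cause={:?}", key, value.len(), cause);
    };

    // Hypothetical key/value types and capacity.
    let cache: Cache<String, Vec<u8>> = Cache::builder()
        .max_capacity(10_000)
        .eviction_listener(listener)
        .build();

    // `Cache` clones share the same underlying storage, so each thread
    // (or async task) can own its own clone.
    let handles: Vec<_> = (0..4)
        .map(|t| {
            let cache = cache.clone();
            thread::spawn(move || {
                for i in 0..1_000u32 {
                    let key = format!("blob-{}-{}", t, i);
                    cache.insert(key.clone(), vec![0u8; 64]);
                    let _ = cache.get(&key);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}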
CC: @ParkMyCar