
Debug assert fails in routing_filter_add(), "(index_no / addrs_per_page < pages_per_extent)": Causes large inserts workload to fail. #560

Open
gapisback opened this issue Mar 29, 2023 · 4 comments
Labels: bug (Something isn't working), critical

gapisback (Collaborator) commented on Mar 29, 2023:

This problem was encountered while benchmarking a single client inserting 20M rows (in the PG-SplinterDB integration code base).


Update (agurajada, 4/2023): After deeper investigation of variations of this repro, the basic issue appears to be instability in the library on /main when using small key-value pairs. Experimentation shows that we can stably insert 20+M rows, using one or multiple threads, when key = 4 bytes and value >= 20 bytes. Smaller k/v pair sizes exhibit different forms of instability. We really need a comprehensive test suite that exercises these basic insert workloads across different numbers of rows inserted, numbers of clients, and varying combinations of k/v pair sizes.


This has been reproduced using a standalone test script off of /main @ SHA 89f09b3.

Branch: agurajada/560-rf-add-bug (This branch has one additional commit, which fixes a separate issue partially tracked by the failed repros attempted for issue #458. You need that fix in order to get far enough to repro this failure.)

The failure is seen when running this test from that branch: large_inserts_stress_test --num-inserts 20000000 test_560_seq_htobe32_key_random_6byte_values_inserts

Running 1 CTests, suite name 'large_inserts_stress', test case 'test_560_seq_htobe32_key_random_6byte_values_inserts'.
TEST 1/1 large_inserts_stress:test_seq_htobe32_key_random_6byte_values_inserts OS-pid=426343, OS-tid=426343, Thread-ID=0, Assertion failed at src/routing_filter.c:586:routing_filter_add(): "(index_no / addrs_per_page < pages_per_extent)". index_no=16384, addrs_per_page=512, (index_no / addrs_per_page)=32, pages_per_extent=32
Aborted (core dumped)

With a release build, you get a segmentation fault a few lines later instead.

The test case test_560_seq_htobe32_key_random_6byte_values_inserts() sets up the exact conditions required to trigger this bug. Other test cases in this test file exercise different combinations of sequential / random key inserts, and they all seem to succeed.

@gapisback gapisback added bug Something isn't working critical labels Mar 29, 2023
gapisback added a commit that referenced this issue Mar 29, 2023
…outing_filter_add(). Test FAILs.

This appears to be the first commit where things start to break:

sdb-fdb-build:[39] $ VERBOSE=6 ./build/debug/bin/unit/large_inserts_stress_test --num-inserts 20000000 --verbose-progress test_560_seq_htobe32_key_random_6byte_values_inserts
Running 1 CTests, suite name 'large_inserts_stress', test case 'test_560_seq_htobe32_key_random_6byte_values_inserts'.
TEST 1/1 large_inserts_stress:test_560_seq_htobe32_key_random_6byte_values_inserts Fingerprint size 29 too large, max value size is 5, setting to 27
fingerprint_size: 27
filter-index-size: 256 is too small, setting to 512
exec_worker_thread()::293:Thread 0  inserts 20000000 (20 million), sequential key, random value, KV-pairs starting from 0 (0) ...
OS-pid=430835, Thread-ID=0, Insert random value of fixed-length=6 bytes.
exec_worker_thread()::385:Thread-0 Inserted 1 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 2 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 3 million KV-pairs ...
[...]
exec_worker_thread()::385:Thread-0 Inserted 17 million KV-pairs ...
OS-pid=430835, OS-tid=430835, Thread-ID=0, Assertion failed at src/routing_filter.c:591:routing_filter_add(): "(index_no / addrs_per_page < pages_per_extent)". index_no=16384, addrs_per_page=512, (index_no / addrs_per_page)=32, pages_per_extent=32
Aborted (core dumped)
gapisback added a commit that referenced this issue Mar 29, 2023
…outing_filter_add(). Test works

This appears to be the last commit where things still work:

sdb-fdb-build:[30] $ VERBOSE=6 ./build/debug/bin/unit/large_inserts_stress_test --num-inserts 20000000 --verbose-progress test_560_seq_htobe32_key_random_6byte_values_inserts
Running 1 CTests, suite name 'large_inserts_stress', test case 'test_560_seq_htobe32_key_random_6byte_values_inserts'.
TEST 1/1 large_inserts_stress:test_560_seq_htobe32_key_random_6byte_values_inserts Fingerprint size 29 too large, max value size is 5, setting to 27
fingerprint_size: 27
filter-index-size: 256 is too small, setting to 512
exec_worker_thread()::293:Thread 0  inserts 20000000 (20 million), sequential key, random value, KV-pairs starting from 0 (0) ...
OS-pid=429442, Thread-ID=0, Insert random value of fixed-length=6 bytes.
exec_worker_thread()::385:Thread-0 Inserted 1 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 2 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 3 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 4 million KV-pairs ...
[...]
exec_worker_thread()::385:Thread-0 Inserted 19 million KV-pairs ...
exec_worker_thread()::385:Thread-0 Inserted 20 million KV-pairs ...
exec_worker_thread()::400:Thread-0 Inserted 20 million KV-pairs in 140 s, 142857 rows/s
Allocated at unmount: 880 MiB
[OK]
RESULTS: 1 tests (1 ok, 0 failed, 0 skipped) ran in 141608 ms
gapisback (Collaborator, Author) commented on Mar 29, 2023:

I have narrowed down the repro as follows, by re-applying the relevant commits from agurajada/560-rf-add-bug onto commits off of /main:

  1. agurajada/rf-add-f0af570-baseline : Branched off of /main @ SHA f0af570 - Repro succeeds
  2. agurajada/rf-add-fc897e4c-fail : Branched off of /main @ SHA fc897e4 - Repro fails

Debugging ...

Updated: (4/1/2023):

I pushed branch agurajada/rf-add-fc897e4c-fail, which adds a couple more variations of the basic test case:

  • test_560_seq_host_endian32_key_random_5byte_values_inserts: Fails
  • test_560_seq_host_endian32_key_random_8byte_values_inserts: Passes

You have to run these with a large number of inserts, e.g. 20M:

$ ./build/release/bin/unit/large_inserts_stress_test --verbose-progress --num-inserts 20000000 test_560_seq_host_endian32_key_random_5byte_values_inserts

rosenhouse (Member) commented on Apr 3, 2023:

Here's a really easy diff against main to reproduce this:

$ git diff
diff --git a/tests/unit/splinterdb_stress_test.c b/tests/unit/splinterdb_stress_test.c
index 87cf659..32d7470 100644
--- a/tests/unit/splinterdb_stress_test.c
+++ b/tests/unit/splinterdb_stress_test.c
@@ -18,8 +18,8 @@
 #include "../functional/random.h"
 #include "ctest.h" // This is required for all test-case files.

-#define TEST_KEY_SIZE   20
-#define TEST_VALUE_SIZE 116
+#define TEST_KEY_SIZE   4
+#define TEST_VALUE_SIZE 6

 // Function Prototypes
 static void *
@@ -73,7 +73,7 @@ CTEST2(splinterdb_stress, test_random_inserts_concurrent)
    ASSERT_TRUE(random_data >= 0);

    worker_config wcfg = {
-      .num_inserts = 1000 * 1000,
+      .num_inserts = 20 * 1000 * 1000,
       .random_data = random_data,
       .kvsb        = data->kvsb,
    };

rosenhouse (Member) commented:
We did some more experimentation today, and it appears that tuples with a 4-byte key and 20-byte value pass, but smaller sizes do not.

gapisback (Collaborator, Author) commented:
Here is some more evidence of the behaviour seen for this existing stress test when run off of /main @ SHA 89f09b3:

For a single-thread execution:

#define TEST_KEY_SIZE   4
#define TEST_VALUE_SIZE 16
#define num_threads     1

data->cfg = (splinterdb_config){.filename   = TEST_DB_NAME,
                                .cache_size = 1000 * Mega,
                                .disk_size  = 30000 * Mega,
                                .data_cfg   = &data->default_data_config,
                                .queue_scale_percent = 100};

The test progresses to about "Thread 140002820638272 has completed 10200000 inserts" and then hangs.
The stack looks like this:

Samples: 1M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 16012963972 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
  46.09%  libsplinterdb.so      [.] clockcache_try_get_read.constprop.0
  43.10%  libsplinterdb.so      [.] clockcache_get
   5.85%  ld-linux-x86-64.so.2  [.] __tls_get_addr
   1.99%  libsplinterdb.so      [.] 0x000000000000d150
   1.02%  libsplinterdb.so      [.] 0x000000000000cf60
   0.99%  libsplinterdb.so      [.] rc_allocator_get_config
   0.96%  libsplinterdb.so      [.] rc_allocator_get_config_virtual
   0.00%  [kernel]              [k] exit_to_user_mode_loop

The same test with the same config above passes for this combo (again single-threaded): TEST_KEY_SIZE 4; TEST_VALUE_SIZE 20.

--- Re-run with 4 threads, key=4 bytes, value=16 bytes:

This runs for a while, up to this point:

sdb-fdb-build:[43] $ VERBOSE=6 ./build/release/bin/unit/splinterdb_stress_test test_random_inserts_concurrent
Running 1 CTests, suite name 'splinterdb_stress', test case 'test_random_inserts_concurrent'.
TEST 1/1 splinterdb_stress:test_random_inserts_concurrent Waiting for 4 worker threads ...
  Thread[0] ID=140504240404032
  Thread[1] ID=140504232011328
  Thread[2] ID=140504223618624
  Thread[3] ID=140504215225920
Writing lots of data from thread 140504232011328
[...]
Thread 140504240404032 has completed 3300000 inserts
Thread 140504215225920 has completed 3300000 inserts
Thread 140504223618624 has completed 3600000 inserts

And then hangs with this stack:

Samples: 93K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 20538871086 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
  73.76%  libsplinterdb.so      [.] memtable_maybe_rotate_and_get_insert_lock
  11.95%  libsplinterdb.so      [.] clockcache_try_get_read.constprop.0
  10.88%  libsplinterdb.so      [.] clockcache_get
   1.59%  ld-linux-x86-64.so.2  [.] __tls_get_addr
   0.53%  libsplinterdb.so      [.] 0x000000000000d150
   0.38%  libsplinterdb.so      [.] clockcache_unget
   0.29%  libsplinterdb.so      [.] rc_allocator_get_config_virtual
   0.27%  libsplinterdb.so      [.] 0x000000000000cf60
   0.26%  libsplinterdb.so      [.] rc_allocator_get_config

3 participants