Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vnode hangs in {eleveldb,write,3} #273

Closed
systream opened this issue Apr 12, 2023 · 8 comments
Closed

vnode hangs in {eleveldb,write,3} #273

systream opened this issue Apr 12, 2023 · 8 comments

Comments

@systream
Copy link

systream commented Apr 12, 2023

In RIAK 3.0.12, vnodes started to hang in leveldb:write/3 and the process mailbox could grow up to 10K messages.
After a while whole node starts to behave unstable. The queue size of the vnode does not decrease, even without a load.

[{current_function,{eleveldb,write,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,10001},
 {links,[<0.1687.0>,<0.2146.0>,<0.258.0>]},
 {dictionary,[{rand_seed,{#{bits => 58,jump => #Fun<rand.13.8986388>,
                            next => #Fun<rand.10.8986388>,type => exsss,
                            uniform => #Fun<rand.11.8986388>,
                            uniform_n => #Fun<rand.12.8986388>},
                          [112506433278832578|159401657947672872]}},
              {#Ref<0.2265291839.1457520643.98734>,
               {bc_state,"/var/lib/riak/bitcask/11417981541647679048466287755595961091061972992",
                         fresh,undefined,[],2147483648,
                         [{expiry_secs,-1},
                          {io_mode,erlang},
                          {expiry_grace_time,0},
                          {small_file_threshold,10485760},
                          {dead_bytes_threshold,134217728},
                          {frag_threshold,40},
                          {dead_bytes_merge_trigger,536870912},
                          {frag_merge_trigger,60},
                          {max_file_size,2147483648},
                          {open_timeout,4},
                          {data_root,"/var/lib/riak/bitcask"},
                          {sync_strategy,none},
                          {merge_window,always},
                          {max_fold_age,-1},
                          {max_fold_puts,0},
                          {expiry_secs,-1},
                          {require_hint_crc,true},
                          {key_transform,#Fun<riak_kv_bitcask_backend.1.43379991>},
                          {read_write,true}], 
                         #Fun<riak_kv_bitcask_backend.1.43379991>,
                         #Ref<0.2265291839.1457651715.98733>,1,2}},
              {'$initial_call',{riak_core_vnode,init,1}},
              {'$ancestors',[riak_core_vnode_sup,riak_core_sup,<0.254.0>]},
              {bitcask_time_fudge,no_testing},
              {hashtree_tokens,47},
              {#Ref<0.2265291839.1457520644.95496>,
               {bc_state,"/var/lib/riak/bitcask/11417981541647679048466287755595961091061972992",
                         fresh,undefined,[],2147483648,
                         [{expiry_secs,-1},
                          {io_mode,erlang},
                          {expiry_grace_time,0},
                          {small_file_threshold,10485760},
                          {dead_bytes_threshold,134217728},
                          {frag_threshold,40},
                          {dead_bytes_merge_trigger,536870912},
                          {frag_merge_trigger,60},
                          {max_file_size,2147483648},
                          {open_timeout,4},
                          {data_root,"/var/lib/riak/bitcask"},
                          {sync_strategy,none},
                          {merge_window,always},
                          {max_fold_age,-1},
                          {max_fold_puts,0},
                          {expiry_secs,-1},
                          {require_hint_crc,true},
                          {key_transform,#Fun<riak_kv_bitcask_backend.1.43379991>},
                          {read_write,true}],
                         #Fun<riak_kv_bitcask_backend.1.43379991>,
                         #Ref<0.2265291839.1457651719.94834>,1,2}},
              {bitcask_file_mod,bitcask_file}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.253.0>},
 {total_heap_size,1542816},
 {heap_size,10958},
 {stack_size,70},
 {reductions,29125337},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,0},
                      {minor_gcs,0}]},
 {suspending,[]}]
{backtrace,<<"Program counter: 0x00007f9e9f962d90 (eleveldb:write/3 + 120)\nCP: 0x0000000000000000 (invalid)\n\n0x00007f9d1db7de20 Return addr 0x00007fa3f6e51e28 (riak_kv_eleveldb_backend:do_put/6 + 608)\ny(0)     []\ny(1)     []\ny(2)     #Ref<0.2265291839.1457782787.52859>\n\n0x00007f9d1db7de40 Return addr 0x00007fa3f719fa00 (riak_kv_multi_backend:put/5 + 200)\ny(0)     [{sync,false}]\ny(1)     {state,#Ref<0.2265291839.1457651712.94836>,\"/var/lib/riak/leveldb/11417981541647679048466287755595961091061972992\",[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{whole_file_expiry,true},{write_buffer_size,16662690}],[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{data_root,\"/var/lib/riak/leveldb\"},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{sync,false},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{verify_checksums,true},{verify_compaction,true},{whole_file_expiry,true},{write_buffer_size,16662690},{write_buffer_size_max,31457280},{write_buffer_size_min,15728640}],[{verify_checksums,true}],[{sync,false}],[{verify_checksums,true},{fill_cache,false}],false,false}\ny(2)     []\ny(3)     []\ny(4)     []\ny(5)     []\ny(6)     []\n\n0x00007f9d1db7de80 Return addr 0x00007f9e6c21e490 (riak_kv_vnode:encode_and_put_no_sib_check/8 + 2104)\ny(0)     riak_kv_eleveldb_backend\ny(1)     {<<\"leveldb_mult\">>,riak_kv_eleveldb_backend,{state,#Ref<0.2265291839.1457651712.94836>,\"/var/lib/riak/leveldb/11417981541647679048466287755595961091061972992\",[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{whole_file_expiry,true},{write_buffer_size,16662690}],[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{data_root,\"/var/lib/riak/leveldb\"},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{sync,false},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{verify_checksums,true},{verify_compaction,true},{whole_file_expiry,true},{write_buffer_size,16662690},{write_buffer_size_max,31457280},{write_buffer_size_min,15728640}],[{verify_checksums,true}],[{sync,false}],[{verify_checksums,true},{fill_cache,false}],false,false}}\ny(2)     {state,[{<<\"leveldb\">>,riak_kv_bitcask_backend,{state,#Ref<0.2265291839.1457520643.98734>,\"11417981541647679048466287755595961091061972992\",[{io_mode,erlang},{expiry_grace_time,0},{small_file_threshold,10485760},{dead_bytes_threshold,134217728},{frag_threshold,40},{dead_bytes_merge_trigger,536870912},{frag_merge_trigger,60},{max_file_size,2147483648},{open_timeout,4},{data_root,\"/var/lib/riak/bitcask\"},{sync_strategy,none},{merge_window,always},{max_fold_age,-1},{max_fold_puts,0},{expiry_secs,-1},{require_hint_crc,true},{key_transform,#Fun<riak_kv_bitcask_backend.1.43379991>},{read_write,true}],11417981541647679048466287755595961091061972992,\"/var/lib/riak/bitcask\",1}},{<<\"memory_mult\">>,riak_kv_memory_backend,{state,#Ref<0.2265291839.1457651713.96426>,#Ref<0.2265291839.1457651713.96425>,#Ref<0.2265291839.1457651713.96424>,134217728,0,0,129600}},{<<\"leveldb_mult\">>,riak_kv_eleveldb_backend,{state,#Ref<0.2265291839.1457651712.94836>,\"/var/lib/riak/leveldb/11417981541647679048466287755595961091061972992\",[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{whole_file_expiry,true},{write_buffer_size,16662690}],[{block_cache_threshold,33554432},{block_restart_interval,16},{block_size_steps,16},{cache_object_warming,true},{compression,snappy},{create_if_missing,true},{data_root,\"/var/lib/riak/leveldb\"},{delete_threshold,1000},{eleveldb_threads,71},{expiry_enabled,false},{expiry_minutes,0},{fadvise_willneed,false},{limited_developer_mem,false},{sst_block_size,4096},{sync,false},{tiered_slow_level,0},{total_leveldb_mem_percent,35},{use_bloomfilter,true},{verify_checksums,true},{verify_compaction,true},{whole_file_expiry,true},{write_buffer_size,16662690},{write_buffer_size_max,31457280},{write_buffer_size_min,15728640}],[{verify_checksums,true}],[{sync,false}],[{verify_checksums,true},{fill_cache,false}],false,false}},{<<\"bitcask_mult\">>,riak_kv_bitcask_backend,{state,#Ref<0.2265291839.1457520644.95496>,\"11417981541647679048466287755595961091061972992\",[{io_mode,erlang},{expiry_grace_time,0},{small_file_threshold,10485760},{dead_bytes_threshold,134217728},{frag_threshold,40},{dead_bytes_merge_trigger,536870912},{frag_merge_trigger,60},{max_file_size,2147483648},{open_timeout,4},{data_root,\"/var/lib/riak/bitcask\"},{sync_strategy,none},{merge_window,always},{max_fold_age,-1},{max_fold_puts,0},{expiry_secs,-1},{require_hint_crc,true},{key_transform,#Fun<riak_kv_bitcask_backend.1.43379991>},{read_write,true}],11417981541647679048466287755595961091061972992,\"/var/lib/riak/bitcask\",1}}],<<\"bitcask_mult\">>}\ny(3)     []\ny(4)     []\ny(5)     []\ny(6)     []\n\n0x00007f9d1db7dec0 Return addr 0x00007f9e6c217110 (riak_kv_vnode:actual_put/8 + 264)\ny(0)   

Another interesting thing that i noticed, that in the vnode state the compression is set to snappy, but in the config it is defined to lz4.

 {<<"leveldb_mult">>,riak_kv_eleveldb_backend,
                               {state,#Ref<0.1786542404.3883008000.67404>,
                                      "/var/lib/riak/leveldb/451010270895083322414418366346040463096947933184",
                                      [{block_cache_threshold,33554432},
                                       {block_restart_interval,16},
                                       {block_size_steps,16},
                                       {cache_object_warming,true},
                                       {compression,snappy},
                                       {create_if_missing,true},
                                       {delete_threshold,1000},
                                       {eleveldb_threads,71},
                                       {expiry_enabled,false},
                                       {expiry_minutes,0},
                                       {fadvise_willneed,false},
                                       {limited_developer_mem,false},
                                       {sst_block_size,4096},
                                       {tiered_slow_level,0},
                                       {total_leveldb_mem_percent,35},
                                       {use_bloomfilter,true},
                                       {whole_file_expiry,true},
                                       {write_buffer_size,22791803}],
                                      [{block_cache_threshold,33554432},
                                       {block_restart_interval,16},
                                       {block_size_steps,16},
                                       {cache_object_warming,true},
                                       {compression,snappy},
                                       {create_if_missing,true},
                                       {data_root,"/var/lib/riak/leveldb"},
                                       {delete_threshold,1000},
                                       {eleveldb_threads,71},
                                       {expiry_enabled,false},
                                       {expiry_minutes,0},
                                       {fadvise_willneed,false},
                                       {limited_developer_mem,false},
                                       {sst_block_size,4096},
                                       {sync,false},
                                       {tiered_slow_level,0},
                                       {total_leveldb_mem_percent,35},
                                       {use_bloomfilter,true},
                                       {verify_checksums,true},
                                       {verify_compaction,true},
                                       {whole_file_expiry,true},
                                       {write_buffer_size,22791803},
                                       {write_buffer_size_max,31457280},
                                       {write_buffer_size_min,15728640}],
                                      [{verify_checksums,true}],
                                      [{sync,false}],
                                      [{verify_checksums,true},{fill_cache,false}],
                                      true,false}},

It's not fully confirmed, but t seems that in 3.0.9 doesn't have the issue.
Are there any incompatibility between snappy versions?

@systream
Copy link
Author

I forgot to mention that we did "in place" upgrade. Data files kept, but riak upgraded to 3.0.12.
It is a test environment, but we plan to upgrade in prod too.

@martinsumner
Copy link
Contributor

martinsumner commented Apr 12, 2023

So there was an update done to snappy as part of 3.0.12 - https://github.com/basho/eleveldb/releases/tag/riak_kv-3.0.12. The upgrade was from 1.0.4 to 1.1.9 - 6ef9202.

There should be LOG files within the leveldb backend folder for the problem partition (i.e. "/var/lib/riak/leveldb/11417981541647679048466287755595961091061972992"), which are normally quite verbose - so it would be worth looking through those logs at the time of the issue.

@systream
Copy link
Author

Thx, it was a good hint.
Previously id did a repair, beaus when vnodes stuck impossible to stop riak gracefully.

It is definitely something with the compression.

I'll keep digging.

2023/04/13-08:47:52.157605 7f7d3a7a7700 Level-0 table #25200: started
2023/04/13-08:47:52.157609 7f7d2ef90700 running 2...
2023/04/13-08:47:52.157735 7f7d2ef90700 waiting 2...
2023/04/13-08:47:52.359207 7f7d3a7a7700 Level-0 table #25200: 9598847 bytes, 17832 keys Corruption: corrupted compressed block contents
2023/04/13-08:47:52.359259 7f7d3a7a7700 Waiting after background imm compaction error: Corruption: corrupted compressed block contents
2023/04/13-08:47:52.359275 7f7d2ef90700 running 2...
2023/04/13-08:47:52.359325 7f7d2ef90700 waiting 2..

@martinsumner
Copy link
Contributor

I went back to look at the basic upgrade test https://github.com/basho/riak_test/blob/develop-3.0/tests/verify_basic_upgrade.erl - going from 3.0.9 to 3.0.12. This passed with the eleveldb backend, but that is because the default compression algorithm is lz4. When I switched the test to force snappy compression:

2023-04-14 11:05:17 =ERROR REPORT====
** State machine <0.2253.0> terminating 
** Last event in was timeout
** When State == started
**      Data  == {state,730750818665451459101842416358141509827966271488,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}
** Reason for termination ==
** {function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{error,"Corruption: corrupted compressed block contents"}}},undefined] ....

So it does look like there is a fairly clear issue of incompatibility, that we could have picked up in test.

I haven't been able to dig into snappy history in any meaningful way to see if there is an interim version that can safely bridge between.

I'll have a think about workarounds that may help. One way would be to do the upgrade via rolling transfer rather than rolling restart (so you data transfer into the upgraded nodes - rolling one node in/out at a time), but this is a lot more time consuming (and it may in that case be worth waiting for 3.0.16 to get the final improvements to transfers).

@martinsumner
Copy link
Contributor

Also with regards to the configuration of lz4. Looking at the cuttlefish logic for compression in the eleveldb schema there is a translation operation:

https://github.com/basho/eleveldb/blob/develop/priv/eleveldb.schema#L174-L183

However, this translation operation isn't present in the eleveldb_multi.schema:

https://github.com/basho/eleveldb/blob/develop/priv/eleveldb_multi.schema#L115-L129

This might mean that the compression_algorithm is ignored in multi-backends, as it doesn't get translated into the compression setting - and hence it gets defaulted to snappy.

@systream
Copy link
Author

Thanks.
In our case, the rolling transfer can take weeks. :/
Is there a tweak where I can simply swap 2 nodes?
I mean +1 node, -1 node, so I would expect all partitions from nodeA to move to nodeB.

@martinsumner
Copy link
Contributor

Using https://docs.riak.com/riak/kv/latest/using/admin/commands/index.html#replace - is what I meant by rolling transfer. This allows you to setup a new node with the new code, and then using cluster replace to replace an existing node, then update that node, and replace another node etc.

@systream
Copy link
Author

Thx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants