
Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] #218

Open
bashoshaun opened this issue Dec 12, 2016 · 1 comment

@bashoshaun

A customer occasionally runs into a problem where their node will not start back up after it's been shutdown to perform a backup.

On startup, they see this kind of output in console.log:

2016-12-11 01:14:47.674 [error] <0.1611.0>@riak_kv_vnode:init:512 Failed to start riak_kv_eleveldb_backend backend for index 1381575766539369164864420818427111292018498732032 error: {db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}
2016-12-11 01:14:47.678 [notice] <0.1611.0>@riak:stop:43 "backend module failed to start."
2016-12-11 01:14:47.678 [error] <0.1611.0> gen_fsm <0.1611.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184
2016-12-11 01:14:47.678 [error] <0.1611.0> CRASH REPORT Process <0.1611.0> with 1 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in gen_fsm:terminate/7 line 600
2016-12-11 01:14:47.679 [error] <0.163.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1611.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in context child_terminated
2016-12-11 01:14:47.679 [error] <0.160.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1585.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1184}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,908}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.1611.0>,wait_for_init,...]}} in context child_terminated
2016-12-11 01:14:47.683 [info] <0.500.0>@riak_kv_app:prep_stop:232 Stopping application riak_kv - marked service down.

I've attached the full console.log file for context: console.log.txt

The customer has also implemented some basic debugging in their startup script, as it was initially thought a process might have been holding a lock on these files. This is an example of what the debugging output looked like while the node was up and running normally:

lsof:

...
beam.smp  177432           riak  mem       REG              253,8      32526336 1744834583 /data/riak/leveldb/1267395951122892374379757940871151681107879002112/000369.log
beam.smp  177432           riak  mem       REG              253,8      27365376 1879049262 /data/riak/leveldb/68507889249886074290797726533575766546371837952/000366.log
beam.smp  177432           riak  mem       REG              253,8      27631616  805307404 /data/riak/leveldb/125597796958124469533129165311555572001681702912/000367.log
beam.smp  177432           riak  mem       REG              253,8      32116736 1476398089 /data/riak/leveldb/1438665674247607560106752257205091097473808596992/000382.log
beam.smp  177432           riak  mem       REG              253,8      32694272 1476398103 /data/riak/leveldb/1039036320289938793410432185759232459286639542272/000382.log
...

After the stop, there are no leveldb entries in the output of lsof. Similarly, there are no beam processes running.
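One caveat with the lsof check: lsof lists open file descriptors, not advisory locks, and the standard LevelDB LOCK scheme uses fcntl-based advisory locks. A more direct post-stop probe, sketched below in Python under the assumption that eleveldb uses that standard scheme (the /data/riak/leveldb layout is taken from the logs above), would be to try taking and immediately releasing each lock:

import fcntl
import glob
import os
import sys

# Try to take (and immediately release) the advisory lock on each vnode's
# LOCK file; failure means some process still holds it, and /proc/locks
# will name the owning PID.
def lock_is_free(path):
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except (BlockingIOError, PermissionError):
        return False
    finally:
        os.close(fd)

for lock in glob.glob("/data/riak/leveldb/*/LOCK"):
    if not lock_is_free(lock):
        print("still locked:", lock, file=sys.stderr)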

So when Riak restarts, there's no obvious reason why there should be a failure. When the customer then manually attempts a restart after realising the node is down, it comes up fine.

The customer is running RHEL 7.
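For what it's worth, in stock LevelDB the exact wording "already held by process" appears to come from the library's in-process table of locked files, not from a failed OS-level fcntl call (which would surface an errno string such as "Resource temporarily unavailable" instead). If Basho's eleveldb behaves the same way, that would point at the restarting beam.smp process itself, e.g. a stale entry in that table or two vnodes opening the same directory, rather than an external lock holder. A minimal sketch of the assumed two-layer scheme:

import fcntl
import os

# fcntl locks are per-process, so LevelDB-style code also keeps a
# process-wide set of locked paths to catch a second open of the same DB
# inside one OS process; that set is where "already held by process"
# would come from.
_locked_paths = set()

def lock_db(lock_path):
    if lock_path in _locked_paths:
        raise OSError("lock %s: already held by process" % lock_path)
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        raise OSError("lock %s: %s" % (lock_path, e.strerror))
    _locked_paths.add(lock_path)
    return fd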

@Basho-JIRA Basho-JIRA changed the title Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] Dec 12, 2016
@bashoshaun
Author

The customer would like to add the following:

Perhaps it could also mention that while no process has the LOCK files open after riak stop, they do persist in the file system, and also that the file system is XFS on a thinly provisioned LVM logical volume, which is used to create a snapshot during the node stop.
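On the first point: the LOCK file persisting on disk after a clean stop is expected on its own, since LevelDB creates it but never deletes it; only the kernel-held advisory lock has to go away with the process. A small demonstration, using a hypothetical /tmp path:

import fcntl
import os

# The advisory lock is dropped by the kernel when the owning process
# exits, but the LOCK file itself remains on disk.
path = "/tmp/LOCK-demo"

pid = os.fork()
if pid == 0:
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os._exit(0)                    # lock dies with this child process
os.waitpid(pid, 0)

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # succeeds: lock was freed
print("file still exists:", os.path.exists(path))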
