Deadlock in NFS server #56

Closed
kofemann opened this issue May 25, 2018 · 0 comments

kofemann commented May 25, 2018

Today we found a deadlock in our nfs4j-0.15.3-based deployment:

Found one Java-level deadlock:
=============================
"OncRpcSvc Worker(31)":
  waiting for ownable synchronizer 0x00000005c0ee2278, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "OncRpcSvc Worker(13)"
"OncRpcSvc Worker(13)":
  waiting to lock monitor 0x00007fcc8404dd98 (object 0x00000005c276d0e0, a org.dcache.nfs.v4.NFS4Client),
  which is held by "OncRpcSvc Worker(31)"

Java stack information for the threads listed above:
===================================================
"OncRpcSvc Worker(31)":
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000005c0ee2278> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.dcache.nfs.v4.FileTracker.removeOpen(FileTracker.java:200)
	at org.dcache.nfs.v4.FileTracker.lambda$addOpen$3(FileTracker.java:141)
	at org.dcache.nfs.v4.FileTracker$$Lambda$152/1462883798.notifyDisposed(Unknown Source)
	at org.dcache.nfs.v4.NFS4State.disposeIgnoreFailures(NFS4State.java:119)
	- locked <0x00000005c27ca228> (a org.dcache.nfs.v4.NFS4State)
	at org.dcache.nfs.v4.NFS4Client.drainStates(NFS4Client.java:485)
	at org.dcache.nfs.v4.NFS4Client.updateLeaseTime(NFS4Client.java:277)
	- locked <0x00000005c276d0e0> (a org.dcache.nfs.v4.NFS4Client)
	at org.dcache.nfs.v4.OperationSEQUENCE.process(OperationSEQUENCE.java:60)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1.lambda$process$0(ProxyIoMdsOpFactory.java:53)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1$$Lambda$127/2020621766.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:360)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1.process(ProxyIoMdsOpFactory.java:47)
	at org.dcache.nfs.v4.NFSServerV41.NFSPROC4_COMPOUND_4(NFSServerV41.java:173)
	at org.dcache.nfs.v4.xdr.nfs4_prot_NFS4_PROGRAM_ServerStub.dispatchOncRpcCall(nfs4_prot_NFS4_PROGRAM_ServerStub.java:48)
	at org.dcache.xdr.RpcDispatcher$1.run(RpcDispatcher.java:110)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
	at java.lang.Thread.run(Thread.java:748)
"OncRpcSvc Worker(13)":
	at org.dcache.nfs.v4.NFS4Client.isLeaseValid(NFS4Client.java:263)
	- waiting to lock <0x00000005c276d0e0> (a org.dcache.nfs.v4.NFS4Client)
	at org.dcache.nfs.v4.FileTracker.lambda$addOpen$1(FileTracker.java:119)
	at org.dcache.nfs.v4.FileTracker$$Lambda$150/1052688442.test(Unknown Source)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
	at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1351)
	at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
	at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
	at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.anyMatch(ReferencePipeline.java:449)
	at org.dcache.nfs.v4.FileTracker.addOpen(FileTracker.java:120)
	at org.dcache.nfs.v4.OperationOPEN.process(OperationOPEN.java:262)
	at org.dcache.chimera.nfsv41.door.AccessLogAwareOperationFactory$OpOpen.process(AccessLogAwareOperationFactory.java:251)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1.lambda$process$0(ProxyIoMdsOpFactory.java:53)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1$$Lambda$127/2020621766.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:360)
	at org.dcache.chimera.nfsv41.door.proxy.ProxyIoMdsOpFactory$1.process(ProxyIoMdsOpFactory.java:47)
	at org.dcache.nfs.v4.NFSServerV41.NFSPROC4_COMPOUND_4(NFSServerV41.java:173)
	at org.dcache.nfs.v4.xdr.nfs4_prot_NFS4_PROGRAM_ServerStub.dispatchOncRpcCall(nfs4_prot_NFS4_PROGRAM_ServerStub.java:48)
	at org.dcache.xdr.RpcDispatcher$1.run(RpcDispatcher.java:110)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:591)
	at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:571)
	at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
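The cycle in the dump is a lock-ordering inversion between a `ReentrantLock` (the `FileTracker` lock held in `addOpen`) and the `NFS4Client` object monitor (held in `updateLeaseTime`). The sketch below is a minimal, self-contained reproduction of that pattern, not nfs4j code: the class, lock and thread names are hypothetical, and the two threads only mimic Worker(13) and Worker(31). Each thread takes its first lock, waits until the other has done the same, and then requests the second lock. The JVM is then asked for deadlocked threads via `ThreadMXBean#findDeadlockedThreads`, which considers both object monitors and ownable synchronizers, the same two kinds of locks that appear in the report above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderDeadlockDemo {

    /** Starts two threads that take the same two locks in opposite order and
     *  returns true if the JVM reports both of them as deadlocked. */
    static boolean reproduce() {
        // Hypothetical stand-ins for the FileTracker lock and the NFS4Client monitor.
        ReentrantLock openStatesLock = new ReentrantLock();
        Object clientMonitor = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Mimics Worker(13): FileTracker lock first, then the client monitor.
        Thread opener = new Thread(() -> {
            openStatesLock.lock();
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (clientMonitor) {
                    // isLeaseValid() would run here
                }
            } finally {
                openStatesLock.unlock();
            }
        }, "opener");

        // Mimics Worker(31): client monitor first, then the FileTracker lock.
        Thread leaseUpdater = new Thread(() -> {
            synchronized (clientMonitor) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                openStatesLock.lock(); // removeOpen() would run here
                openStatesLock.unlock();
            }
        }, "lease-updater");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        opener.setDaemon(true);
        leaseUpdater.setDaemon(true);
        opener.start();
        leaseUpdater.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = threads.findDeadlockedThreads(); // null when no cycle exists
            if (ids != null && contains(ids, opener.getId()) && contains(ids, leaseUpdater.getId())) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return false;
    }

    private static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduce());
    }
}
```

Because each thread holds one lock while blocking on the other, neither `unlock()` is ever reached; the same check can be used in a unit test to guard against reintroducing this lock order.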
kofemann added the bug label May 25, 2018
dkocher pushed a commit to iterate-ch/nfs4j that referenced this issue Jun 20, 2018
Motivation:
migrate to the next major version of oncrpc4j-3.0.x. The highlights:

  - java9 ready
  - package name changes
  - update of external dependencies

Full changelog for oncrpc4j-2.7.0..oncrpc4j-3.0.1
    * [5f174e5] [maven-release-plugin] prepare for next development iteration
    * [09f1858] fixed stack overflow for recursive constant definitions
    * [d14a7df] Added input for service name while creating OncRpcClients. This makes it easy to identify threads created on behalf of a client.
    * [48a98be] src: make code base JDK9 ready
    * [64213d2] svc: explicitly specify which address to bind during tests
    * [2ffe42a] svc: use java8 stream to filter local end-point address
    * [d91499e] svc: add OncRpcSvc#toString()
    * [ca0539a] pom: update external libs and maven plugins
    * [b2d8247] src: more java8 cleanups
    * [3e2dd59] Test case for issue dCache#56 dCache/oncrpc4j#56
    * [39f51cc] Bad maven module for issue test file
    * [6915c33] Add the missing owner parameter to mapping and rpcb operations
    * [b7b1691] Correct an NPE when dumping an empty rpcbind registry
    * [b1ba74f] Add version properties for plugin/dependencies
    * [e452d65] Describe maven-jar-plugin in top level pom.xml
    * [f17cbd2] Remove unnecessary null check in generated code
    * [b185da2] pom: fix typo in plugin version property
    * [7aaca8e] utils: drop Bytes#to/fromHexString methods
    * [e0282c6] xdr: rename org.dcache.utils.Opaque into org.dcache.xdr.XdrOpaque
    * [68aa383] pom: bump project major number
    * [aac35af] src: split org.dcache.xdr into org.dcache.oncrpc4j.{xdr,net,rpc,util}
    * [766e70b] docs: update readme to describe new changes
    * [691ec88] pom: remove outdated java.net maven repos
    * [2680903] pom: update guava version to 24
    * [3c7f0bb] xdr: drop XdrBuffer
    * [7cb1fa0] xdr: introduce Xdr#getBytes method
    * [182521d] xdr: implement AutoCloseable interface
    * [fc41b2b] pom: add stable automatic module name into jar
    * [c30b87b] gss: use try-with-resource when Xdr is used
    * [b7eb9b6] rpc: rename GrizzlyXdrTransport to GrizzlyRpcTransport
    * [80c1b52] src: update copyright years
    * [b848ea0] libs: update to grizzly-2.4.3
    * [0ea08fe] [maven-release-plugin] prepare branch 3.0
    * [50f9543] [maven-release-plugin] prepare release oncrpc4j-3.0.0
    * [49bbec1] [maven-release-plugin] prepare for next development iteration
    * [24c87f2] xdr: do not flip byte buffer in Xdr#xdrEncodeByteBuffer
    * [085756e] [maven-release-plugin] prepare release oncrpc4j-3.0.1

Modification:
update pom file. Adjust to new package names:

org.dcache.xdr =>  org.dcache.oncrpc4j.rpc and org.dcache.oncrpc4j.xdr

Result:
up-to-date oncrpc4j

Acked-by: Paul Millar
Target: master
kofemann added a commit to kofemann/nfs4j that referenced this issue Jul 18, 2018
Motivation:
In a situation where one thread tries to clean up an expired client while another
thread checks client validity, a deadlock may happen:

  T1 -> takes the lock on open states in FileTracker#addOpen
  T2 -> takes the lock on the client object in NFS4Client#updateLeaseTime
  T2 -> waits for the lock on open states in FileTracker#removeOpen
  T1 -> waits for the lock on the client object in NFS4Client#isLeaseValid

Modification:
use a volatile field to make NFS4Client#isLeaseValid non-blocking. Remove draining
of states from the error path in NFS4Client#updateLeaseTime, as draining is done
in the dead-client cleanup phase. Add a test to ensure that state disposal is
called on client disposal.

Result:
deadlock is resolved.

Fixes: dCache#56
Acked-by: Albert Rossi
Target: master, 0.17, 0.16, 0.15
(cherry picked from commit d40d475)
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
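The modification in the commit above can be illustrated with a simplified sketch. `LeaseClient` and `LeaseDemo` are hypothetical stand-ins, not the real `NFS4Client`: the field names, the millisecond lease arithmetic and the `IllegalStateException` thrown on expiry are assumptions made for illustration. What it shows is that once the renewal timestamp is `volatile`, `isLeaseValid` no longer needs the object monitor, so a thread holding the `FileTracker` lock can check the lease even while another thread is inside a `synchronized` method of the same client.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical, simplified stand-in for NFS4Client (not the nfs4j code). */
class LeaseClient {
    private final long leaseMillis;
    // volatile: readers see the latest renewal without taking the monitor.
    private volatile long lastRenewedMillis;

    LeaseClient(long leaseMillis, long nowMillis) {
        this.leaseMillis = leaseMillis;
        this.lastRenewedMillis = nowMillis;
    }

    /** Non-blocking check: reads a volatile field and takes no lock. */
    boolean isLeaseValid(long nowMillis) {
        return nowMillis - lastRenewedMillis <= leaseMillis;
    }

    /** Renewals are still serialized, but an expired lease only raises an
     *  error here; draining states is left to the dead-client cleanup. */
    synchronized void updateLeaseTime(long nowMillis) {
        if (!isLeaseValid(nowMillis)) {
            throw new IllegalStateException("lease expired"); // assumed error type
        }
        lastRenewedMillis = nowMillis;
    }
}

class LeaseDemo {
    /** Calls isLeaseValid() while another thread holds the client's monitor. */
    static boolean leaseCheckDoesNotBlock() {
        LeaseClient client = new LeaseClient(90_000, 0);
        CountDownLatch monitorHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Simulates a thread stuck inside a synchronized method of the client.
        Thread holder = new Thread(() -> {
            synchronized (client) {
                monitorHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        try {
            monitorHeld.await();
            // Had isLeaseValid() been synchronized, this call would block here.
            return client.isLeaseValid(1_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("lease valid while monitor held: " + leaseCheckDoesNotBlock());
    }
}
```

Keeping `updateLeaseTime` synchronized still serializes concurrent renewals; what changes is that the lease check made from `FileTracker#addOpen` never waits on the client monitor, which removes the T1 edge of the T1/T2 cycle.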