seqid mismatch during CLOSE #96

Closed
sraillard opened this issue Nov 23, 2020 · 5 comments

Comments

@sraillard

When multiple OPEN and CLOSE operations are interleaved, it looks like the wrong seqid is used in this check:

Stateids.checkStateId(nfsState.stateid(), stateid);

Before this check, we added a log statement to inspect the different values:
_log.info("[contextCurrentStateId = " + context.currentStateid() + " , _args.opclose.open_stateid = " + _args.opclose.open_stateid + " , nfsState.stateid() = " + nfsState.stateid() + " , stateid = " + stateid + "]");

The seqid returned by nfsState.stateid() seems to be the one from the last OPEN operation, so the check fails.
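
For reference, the check essentially compares the stateid presented by the client with the one stored in the server-side open state. A simplified sketch of what such a comparison looks like (hypothetical names and types, not the actual nfs4j Stateids code):

// Simplified illustration (hypothetical, not the nfs4j source): a CLOSE is rejected
// when the stateid presented by the client does not match the server-side open state.
final class SeqidCheckSketch {

    static final class StateidSketch {
        final byte[] other;  // 12-byte opaque part, e.g. 5fbc217e000000020000007b
        final int seqid;     // sequence number, bumped on every OPEN of the file

        StateidSketch(byte[] other, int seqid) {
            this.other = other.clone();
            this.seqid = seqid;
        }
    }

    static void checkStateId(StateidSketch serverState, StateidSketch fromClient) {
        if (!java.util.Arrays.equals(serverState.other, fromClient.other)
                || serverState.seqid != fromClient.seqid) {
            // surfaces to the client as a state error, as in the log below
            throw new IllegalStateException("stateid mismatch: client seq "
                    + fromClient.seqid + " vs server seq " + serverState.seqid);
        }
    }
}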

Here is the capture and the log:

2020-11-23 22:32:53.007 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (6)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007a, seq: 1] , nfsState.stateid() = [5fbc217e000000020000007a, seq: 1] , stateid = [5fbc217e000000020000007a, seq: 1]]
2020-11-23 22:32:53.010 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (3)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 2] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 3] , stateid = [5fbc217e000000020000007b, seq: 2]]
2020-11-23 22:32:53.011 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (10)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 5] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 7] , stateid = [5fbc217e000000020000007b, seq: 5]]
2020-11-23 22:32:53.012 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (12)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 7] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 7] , stateid = [5fbc217e000000020000007b, seq: 7]]
2020-11-23 22:32:53.012 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (14)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 7] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 7]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (22)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 8] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 8]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (13)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 8] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 8]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (13)] INFO  o.d.n.v.AbstractOperationExecutor - Lost client state: op: CLOSE : NFS4ERR_BAD_STATEID : State not known to the client: [5fbc217e000000020000007b, seq: 8]

bug-nfs-multiple-open-close.zip

In the log, you'll notice three cases where the seqid isn't the one expected. And in the capture, we can see that it is the seqid of the previous OPEN. The NFS client is a Linux VM (minikube).
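
One way to end up with this pattern is if the server hands out a reference to a shared, mutable stateid object instead of a snapshot: a concurrent OPEN of the same file can then bump the seqid after one value has been recorded for a request but before that request's reply is encoded, so the client's view of the seqid diverges from the server's. A minimal sketch of that hazard (hypothetical names, not nfs4j code):

// Hypothetical sketch: two OPEN requests share one mutable stateid, so the seqid read
// when encoding the first reply may already reflect the second OPEN.
import java.util.concurrent.atomic.AtomicInteger;

final class SharedStateidHazard {

    public static void main(String[] args) {
        AtomicInteger sharedSeqid = new AtomicInteger(0);       // seqid of one open file

        int recordedForOpen1 = sharedSeqid.incrementAndGet();   // OPEN #1 -> seq 1
        // ... before the OPEN #1 reply is encoded, another thread handles OPEN #2 ...
        sharedSeqid.incrementAndGet();                           // OPEN #2 -> seq 2

        // If the reply encoder reads the live object instead of the value recorded for
        // OPEN #1, client and server end up with different ideas of the seqid for that
        // open, and a later CLOSE presents a seqid the server does not expect.
        System.out.println("recorded for OPEN #1: " + recordedForOpen1
                + ", live value at encode time: " + sharedSeqid.get());
    }
}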

@kofemann
Member

Thanks for the info. I will try to reproduce it.

@kofemann
Member

kofemann commented Dec 8, 2020

I'm not able to reproduce this. Do you run any special test, or do you see errors on the application side?

@sraillard
Author

As we have experienced a lot of issues, I was able to write a small piece of PHP code to reproduce it. It's quite straightforward: just do some reads or writes from different processes on the same file at the same time and you can reproduce the issue.

The code forks 4 processes, and each process tries to write a line to the same log file 4 times.
So that is a total of 16 write operations with concurrent access to the same file.

Here is the test code (it can do either a write or a read test):

<?php

	echo ("Start of PARENT\n");

	// Reading or writing test
	$WriteTest=TRUE;

	// Test file management
	$Filename="test-file.txt";
	if ($WriteTest) {
		// Write test, remove the file before testing
		if (file_exists($Filename)) unlink($Filename);
	} else {
		// Read test, create a test file
		file_put_contents($Filename,"Reading test");
	}

	// Loop for forking child processes
	$pid=-1;
	for ($i=0; $i<4; $i++) {
		$pid=pcntl_fork();
		if ($pid == -1) {
			die("ERROR fork #".($i+1)." !\n");
		} else if ($pid==0) {
			// pcntl_fork() returned 0: we are in the newly forked child,
			// which keeps the "PARENT" role and continues the forking loop
			echo("Still in PARENT\n");
		} else {
			// Non-zero return value: the original process takes the worker role,
			// jump to the child section ($pid holds the forked child's PID)
			goto child;
		}
	}
	die("End of PARENT\n");

	// Child operations
child:
	echo("Start of CHILD #".($i+1)." PID=".$pid."\n");

	$ts1=microtime(TRUE);
	for ($j=0; $j<4; $j++) {
		$ts2=microtime(TRUE);
		if ($WriteTest) {
			// Concurrent write test
			file_put_contents($Filename,"LOG - ".date("H:i:s.v")." - ".intval(($ts2-$ts1)*1000)." ms - CHILD #".($i+1).".".($j+1)." PID=".$pid."\n",FILE_APPEND );
		} else {
			// Concurrent read test
			$contents=file_get_contents($Filename);
		}
		$ts1=$ts2;
	}

	die("End of CHILD #".($i+1)." PID=".$pid."\n");

And the results:

time php fork.php
Start of PARENT
Start of CHILD #1 PID=77
Still in PARENT
Start of CHILD #2 PID=78
Still in PARENT
Start of CHILD #3 PID=79
Still in PARENT
Still in PARENT
End of PARENT
Start of CHILD #4 PID=80
End of CHILD #4 PID=80
End of CHILD #3 PID=79
End of CHILD #1 PID=77
End of CHILD #2 PID=78

real    0m16.801s
user    0m0.008s
sys     0m0.007s

Of course, the total time of 16 seconds shows there is a blocking issue here!

And the content of the log file written:

LOG - 10:36:04.000 - 0 ms - CHILD #1.1 PID=77
LOG - 10:36:04.000 - 0 ms - CHILD #3.1 PID=79
LOG - 10:36:04.000 - 0 ms - CHILD #4.1 PID=80
LOG - 10:36:04.000 - 0 ms - CHILD #2.1 PID=78
LOG - 10:36:09.000 - 5590 ms - CHILD #4.2 PID=80
LOG - 10:36:09.000 - 5591 ms - CHILD #2.2 PID=78
LOG - 10:36:09.000 - 5441 ms - CHILD #3.2 PID=79
LOG - 10:36:09.000 - 5416 ms - CHILD #1.2 PID=77
LOG - 10:36:15.000 - 5551 ms - CHILD #4.3 PID=80
LOG - 10:36:15.000 - 5875 ms - CHILD #3.3 PID=79
LOG - 10:36:15.000 - 5637 ms - CHILD #2.3 PID=78
LOG - 10:36:15.000 - 323 ms - CHILD #4.4 PID=80
LOG - 10:36:15.000 - 149 ms - CHILD #3.4 PID=79
LOG - 10:36:15.000 - 6050 ms - CHILD #1.3 PID=77
LOG - 10:36:21.000 - 5312 ms - CHILD #1.4 PID=77
LOG - 10:36:20.000 - 5451 ms - CHILD #2.4 PID=78

You can see blocking times of around 5 seconds on some operations.

@dkocher
Contributor

dkocher commented Mar 28, 2023

Is this possibly related to #120?

@kofemann
Member

@dkocher Sounds like it. The reproducer on my side was:

fio --name 'read-test' --description 'Simple NFS read test' \
        --rw=read  --numjobs 8 --thread --blocksize 1M --filename 10GB

kofemann added a commit to kofemann/nfs4j that referenced this issue Mar 29, 2023
Motivation:
As parallel threads modify the open-stateid sequence number, the
FileTracker should always return a copy to avoid the sequence being
modified before the reply is sent.

Modification:
Update FileTracker#addOpen and FileTracker#downgradeOpen to return a
copy of the stateid.

Result:
Spec-compliant behaviour in a concurrent environment.

Fixes: dCache#96
Acked-by: Albert Rossi
Target: master, 0.24, 0.23
(cherry picked from commit 73d73af)
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
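
The change boils down to handing back a snapshot of the stateid instead of the shared live object, so a concurrent OPEN can no longer alter the seqid of a reply that is still being built. A minimal sketch of the idea (hypothetical names, not the actual FileTracker code):

// Hypothetical sketch of the defensive-copy idea behind the fix: return a copy of the
// stateid, frozen at this point in time, instead of the shared mutable instance.
final class FileTrackerSketch {

    static final class Stateid {
        final byte[] other;  // opaque part of the stateid
        int seqid;           // bumped on every OPEN / OPEN_DOWNGRADE

        Stateid(byte[] other, int seqid) {
            this.other = other.clone();
            this.seqid = seqid;
        }

        Stateid copy() {
            return new Stateid(other, seqid);  // snapshot; later bumps don't affect it
        }
    }

    private final Stateid openState = new Stateid(new byte[12], 0);

    synchronized Stateid addOpen() {
        openState.seqid++;        // record the new sequence number for this OPEN
        return openState.copy();  // hand out a snapshot, not the live object
    }
}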
kofemann added the bug label and removed the info needed label Mar 29, 2023