seqid mismatch during CLOSE #96

Closed
sraillard opened this issue Nov 23, 2020 · 5 comments

Comments

@sraillard

When multiple OPEN and CLOSE operations are interleaved, it looks like the wrong seqid is used in this check:

Stateids.checkStateId(nfsState.stateid(), stateid);

Before this check, we added a log statement to inspect the different values:
_log.info("[contextCurrentStateId = " + context.currentStateid() + " , _args.opclose.open_stateid = " + _args.opclose.open_stateid + " , nfsState.stateid() = " + nfsState.stateid() + " , stateid = " + stateid + "]");

The seqid returned by nfsState.stateid() seems to be the one from the last OPEN operation, so the check fails.
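
For reference, the check essentially compares the stateid presented by the client with the one stored in the server-side open state. A simplified sketch of what such a comparison looks like (hypothetical names and types, not the actual nfs4j Stateids code):

// Simplified illustration (hypothetical, not the nfs4j source): a CLOSE is rejected
// when the stateid presented by the client does not match the server-side open state.
final class SeqidCheckSketch {

    static final class StateidSketch {
        final byte[] other;  // 12-byte opaque part, e.g. 5fbc217e000000020000007b
        final int seqid;     // sequence number, bumped on every OPEN of the file

        StateidSketch(byte[] other, int seqid) {
            this.other = other.clone();
            this.seqid = seqid;
        }
    }

    static void checkStateId(StateidSketch serverState, StateidSketch fromClient) {
        if (!java.util.Arrays.equals(serverState.other, fromClient.other)
                || serverState.seqid != fromClient.seqid) {
            // surfaces to the client as a state error, as in the log below
            throw new IllegalStateException("stateid mismatch: client seq "
                    + fromClient.seqid + " vs server seq " + serverState.seqid);
        }
    }
}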

Here is the capture and the log:

2020-11-23 22:32:53.007 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (6)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007a, seq: 1] , nfsState.stateid() = [5fbc217e000000020000007a, seq: 1] , stateid = [5fbc217e000000020000007a, seq: 1]]
2020-11-23 22:32:53.010 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (3)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 2] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 3] , stateid = [5fbc217e000000020000007b, seq: 2]]
2020-11-23 22:32:53.011 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (10)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 5] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 7] , stateid = [5fbc217e000000020000007b, seq: 5]]
2020-11-23 22:32:53.012 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (12)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 7] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 7] , stateid = [5fbc217e000000020000007b, seq: 7]]
2020-11-23 22:32:53.012 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (14)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 7] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 7]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (22)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 8] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 8]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (13)] INFO  o.d.n.v.OperationCLOSE - [contextCurrentStateId = [000000000000000000000000, seq: 0] , _args.opclose.open_stateid = [5fbc217e000000020000007b, seq: 8] , nfsState.stateid() = [5fbc217e000000020000007b, seq: 8] , stateid = [5fbc217e000000020000007b, seq: 8]]
2020-11-23 22:32:53.013 [minikube.mshome.net/172.24.45.130:735] [nfs4j@2049 (13)] INFO  o.d.n.v.AbstractOperationExecutor - Lost client state: op: CLOSE : NFS4ERR_BAD_STATEID : State not known to the client: [5fbc217e000000020000007b, seq: 8]

bug-nfs-multiple-open-close.zip

In the log, you'll notice three cases where the seqid isn't the one expected. And in the capture, we can see that it is the seqid of the previous OPEN. The NFS client is a Linux VM (minikube).
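
One way to end up with this pattern is if the server hands out a reference to a shared, mutable stateid object instead of a snapshot: a concurrent OPEN of the same file can then bump the seqid after one value has been recorded for a request but before that request's reply is encoded, so the client's view of the seqid diverges from the server's. A minimal sketch of that hazard (hypothetical names, not nfs4j code):

// Hypothetical sketch: two OPEN requests share one mutable stateid, so the seqid read
// when encoding the first reply may already reflect the second OPEN.
import java.util.concurrent.atomic.AtomicInteger;

final class SharedStateidHazard {

    public static void main(String[] args) {
        AtomicInteger sharedSeqid = new AtomicInteger(0);       // seqid of one open file

        int recordedForOpen1 = sharedSeqid.incrementAndGet();   // OPEN #1 -> seq 1
        // ... before the OPEN #1 reply is encoded, another thread handles OPEN #2 ...
        sharedSeqid.incrementAndGet();                           // OPEN #2 -> seq 2

        // If the reply encoder reads the live object instead of the value recorded for
        // OPEN #1, client and server end up with different ideas of the seqid for that
        // open, and a later CLOSE presents a seqid the server does not expect.
        System.out.println("recorded for OPEN #1: " + recordedForOpen1
                + ", live value at encode time: " + sharedSeqid.get());
    }
}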

@kofemann
Member

Thanks for the info. I will try to reproduce it.

@kofemann
Member

kofemann commented Dec 8, 2020

I'm not able to reproduce this. Do you run any special test, or do you see errors on the application side?

@sraillard
Author

As we have experienced a lot of issues, I was able to write a small piece of PHP code to reproduce it. It's quite straightforward: just do some reads or writes from different processes on the same file at the same time and you can reproduce the issue.

The code forks 4 processes, and each process tries to write a line to the same log file 4 times.
So that is a total of 16 write operations with concurrent access to the same file.

Here is the test code (it can do either a write or a read test):

<?php

	echo ("Start of PARENT\n");

	// Reading or writing test
	$WriteTest=TRUE;

	// Test file management
	$Filename="test-file.txt";
	if ($WriteTest) {
		// Write test, remove the file before testing
		if (file_exists($Filename)) unlink($Filename);
	} else {
		// Read test, create a test file
		file_put_contents($Filename,"Reading test");
	}

	// Loop for forking child processes
	$pid=-1;
	for ($i=0; $i<4; $i++) {
		$pid=pcntl_fork();
		if ($pid == -1) {
			die("ERROR fork #".($i+1)." !\n");
		} else if ($pid==0) {
			// pcntl_fork() returned 0: we are in the newly forked child,
			// which keeps the "PARENT" role and continues the forking loop
			echo("Still in PARENT\n");
		} else {
			// Non-zero return value: the original process takes the worker role,
			// jump to the child section ($pid holds the forked child's PID)
			goto child;
		}
	}
	die("End of PARENT\n");

	// Child operations
child:
	echo("Start of CHILD #".($i+1)." PID=".$pid."\n");

	$ts1=microtime(TRUE);
	for ($j=0; $j<4; $j++) {
		$ts2=microtime(TRUE);
		if ($WriteTest) {
			// Concurrent write test
			file_put_contents($Filename,"LOG - ".date("H:i:s.v")." - ".intval(($ts2-$ts1)*1000)." ms - CHILD #".($i+1).".".($j+1)." PID=".$pid."\n",FILE_APPEND );
		} else {
			// Concurrent read test
			$contents=file_get_contents($Filename);
		}
		$ts1=$ts2;
	}

	die("End of CHILD #".($i+1)." PID=".$pid."\n");

And the results:

time php fork.php
Start of PARENT
Start of CHILD #1 PID=77
Still in PARENT
Start of CHILD #2 PID=78
Still in PARENT
Start of CHILD #3 PID=79
Still in PARENT
Still in PARENT
End of PARENT
Start of CHILD #4 PID=80
End of CHILD #4 PID=80
End of CHILD #3 PID=79
End of CHILD #1 PID=77
End of CHILD #2 PID=78

real    0m16.801s
user    0m0.008s
sys     0m0.007s

Of course, the total time of 16 seconds shows there is a blocking issue here!

And the content of the log file written:

LOG - 10:36:04.000 - 0 ms - CHILD #1.1 PID=77
LOG - 10:36:04.000 - 0 ms - CHILD #3.1 PID=79
LOG - 10:36:04.000 - 0 ms - CHILD #4.1 PID=80
LOG - 10:36:04.000 - 0 ms - CHILD #2.1 PID=78
LOG - 10:36:09.000 - 5590 ms - CHILD #4.2 PID=80
LOG - 10:36:09.000 - 5591 ms - CHILD #2.2 PID=78
LOG - 10:36:09.000 - 5441 ms - CHILD #3.2 PID=79
LOG - 10:36:09.000 - 5416 ms - CHILD #1.2 PID=77
LOG - 10:36:15.000 - 5551 ms - CHILD #4.3 PID=80
LOG - 10:36:15.000 - 5875 ms - CHILD #3.3 PID=79
LOG - 10:36:15.000 - 5637 ms - CHILD #2.3 PID=78
LOG - 10:36:15.000 - 323 ms - CHILD #4.4 PID=80
LOG - 10:36:15.000 - 149 ms - CHILD #3.4 PID=79
LOG - 10:36:15.000 - 6050 ms - CHILD #1.3 PID=77
LOG - 10:36:21.000 - 5312 ms - CHILD #1.4 PID=77
LOG - 10:36:20.000 - 5451 ms - CHILD #2.4 PID=78

You can see blocking times of around 5 seconds on some operations.

@dkocher
Contributor

dkocher commented Mar 28, 2023

Is this possibly related to #120?

@kofemann
Member

@dkocher Sounds like it. The reproducer on my side was:

fio --name 'read-test' --description 'Simple NFS read test' \
        --rw=read  --numjobs 8 --thread --blocksize 1M --filename 10GB

kofemann added a commit to kofemann/nfs4j that referenced this issue Mar 29, 2023
Motivation:
As parallel threads modify the open-stateid sequence number, the
FileTracker should always return a copy to avoid the sequence being
modified before the reply is sent.

Modification:
Update FileTracker#addOpen and FileTracker#downgradeOpen to return a
copy of the stateid.

Result:
Spec-compliant behaviour in a concurrent environment.

Fixes: dCache#96
Acked-by: Albert Rossi
Target: master, 0.24, 0.23
(cherry picked from commit 73d73af)
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
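
The change boils down to handing back a snapshot of the stateid instead of the shared live object, so a concurrent OPEN can no longer alter the seqid of a reply that is still being built. A minimal sketch of the idea (hypothetical names, not the actual FileTracker code):

// Hypothetical sketch of the defensive-copy idea behind the fix: return a copy of the
// stateid, frozen at this point in time, instead of the shared mutable instance.
final class FileTrackerSketch {

    static final class Stateid {
        final byte[] other;  // opaque part of the stateid
        int seqid;           // bumped on every OPEN / OPEN_DOWNGRADE

        Stateid(byte[] other, int seqid) {
            this.other = other.clone();
            this.seqid = seqid;
        }

        Stateid copy() {
            return new Stateid(other, seqid);  // snapshot; later bumps don't affect it
        }
    }

    private final Stateid openState = new Stateid(new byte[12], 0);

    synchronized Stateid addOpen() {
        openState.seqid++;        // record the new sequence number for this OPEN
        return openState.copy();  // hand out a snapshot, not the live object
    }
}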
kofemann added the bug label and removed the info needed label Mar 29, 2023