Forced interruption of a pipeline prevents staged files from being re-downloaded #1552
Comments
I think a fix for this problem could be to copy the file to a temporary file name and, when the download is complete, rename it to the expected file name. This should prevent the problem caused by interrupted downloads.
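A minimal sketch of that temp-file-and-rename idea in Groovy (the helper name and paths are illustrative, not Nextflow's actual staging code): the download goes to a temporary sibling file, and only a completed copy is renamed to the final name, so an interrupted run never leaves a partial file under the expected name.

```groovy
import java.nio.file.*

/**
 * Hypothetical helper: copy a source file into the stage directory via a
 * temporary name, renaming it to the final name only once the copy completes.
 * If the JVM is killed mid-copy, the partial data stays under the .tmp name
 * and is never mistaken for a fully staged file on the next run.
 */
Path stageWithTempFile(Path source, Path target) {
    Path tmp = target.resolveSibling(target.fileName.toString() + '.tmp')
    try {
        Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING)
        // ATOMIC_MOVE works on POSIX file systems; S3 has no atomic rename
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
    }
    finally {
        Files.deleteIfExists(tmp)   // clean up leftovers from an interrupted copy
    }
}
```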
Yes, I think so. Related to staged files, I also found another problem that is somewhat related: I have updated a gist containing a CSV file online, but since the file is staged, when the pipeline is resumed it continues using the old staged copy.
Proposed patch [6ddf0d8] fails when using S3 files.
@wleepang are you aware if it's possible to rename an S3 file without having to copy it?
There is https://docs.aws.amazon.com/cli/latest/reference/s3/mv.html, but it does not appear to be atomic (i.e. it copies behind the scenes).
Indeed, there's no move operation in the API: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/examples-s3-objects.html#copy-object
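For illustration, this is roughly what a "move" looks like with the AWS SDK for Java v1 linked above: two separate, non-atomic calls (the bucket and key names here are made up).

```groovy
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// There is no rename/move call in the S3 API; a "move" is copy + delete,
// and a reader can observe the intermediate state between the two calls.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient()
s3.copyObject('my-bucket', 'stage/file.bam.tmp', 'my-bucket', 'stage/file.bam')
s3.deleteObject('my-bucket', 'stage/file.bam.tmp')
```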
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@pditommaso: Is there any working solution or workaround for this? Unfortunately, if the staging process is interrupted, downstream processes fail for us due to truncated files. Also, is there a way to stage S3 data during the process step, as opposed to preemptively via staging? Using the Slurm executor, I could gain a huge increase in cumulative bandwidth by parallelizing the staging step across many machines.
Any suggestion on how to improve it? |
Hi @pditommaso: My colleague and I thought of a few solutions. I have very limited experience with Groovy and the code base, but I am happy to dig deeper and learn.

Sidecar file denoting completion
For each S3 file, could we create an additional sidecar file that marks a transferred file's completion? When a transfer is completed to S3 as part of the staging process, Nextflow could write a file alongside it, e.g. XXXXXXfile.bam and XXXXXXfile.bam.staged, to indicate that the transfer completed successfully. I believe this would be compatible with both POSIX-compliant storage and S3. The PR above seemed to fail because move operations are not atomic in S3 and are thus a copy-and-delete operation.

Source and destination file size comparison
Similar to the rsync algorithm, could we compare source and destination logical file sizes to confirm that the files are the same logical size? This way, if a transfer was interrupted, Nextflow would realize during the subsequent process that the file size is out of whack, delete the staged file, and re-transfer it.

(Expensive) Checksum comparisons
I am not a huge fan of this method since it can be quite expensive for large files, but another approach could be comparing the source and destination file checksums. Perhaps this could be an optional approach for workflows that are mission critical and sensitive to file corruption, etc.
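A sketch of the size-comparison idea (a hypothetical helper, not a patch against the actual staging code): if a file already exists at the stage path but its size differs from the source, treat it as truncated and stage it again.

```groovy
import java.nio.file.*

/**
 * Hypothetical check: decide whether a previously staged file can be reused.
 * A size mismatch indicates an interrupted or incomplete transfer.
 */
boolean isStagedCopyValid(Path source, Path staged) {
    return Files.exists(staged) && Files.size(staged) == Files.size(source)
}

// usage sketch:
// if (!isStagedCopyValid(remote, local)) { /* delete `local` and stage it again */ }
```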
I think the source and dest file size comparison should work, and it would add only a little overhead. Nice idea.
I've implemented the file size comparison; this is just enough to solve the problem. For reference, it would also be possible to implement another strategy: recompute the local staging path, which is determined using source file metadata (i.e. file path, last-modified timestamp and size), but replacing the source file size with the local file size; this should result in the same target file path only when the staged copy is complete. I leave this as a future enhancement. Solved by 847789f.
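The alternative strategy mentioned above could look roughly like this; the hashing scheme below is a simplified stand-in for Nextflow's actual stage-path computation, for illustration only. A truncated download, recomputed with its local size, hashes to a different directory than the one computed from the source size, so it is ignored on resume.

```groovy
import java.nio.file.*
import java.security.MessageDigest

/**
 * Simplified stand-in: derive a stage path from a hash of the source path,
 * its last-modified time and a file size (not the real Nextflow algorithm).
 */
Path stagePathFor(Path workDir, Path source, long size) {
    def key  = "${source.toUri()}|${Files.getLastModifiedTime(source)}|${size}"
    def hash = MessageDigest.getInstance('MD5').digest(key.bytes).encodeHex().toString()
    return workDir.resolve("stage/${hash}/${source.fileName}")
}

// Recomputing the path with the *local* file size maps back onto the same
// directory only if the local copy has the same size as the source file.
```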
Bug report
Expected behavior and actual behavior
The interruption (keyboard interruption `CTRL+C`) of a pipeline while a file is being staged results in a truncated file being placed in `NXF_WORK/stage/hashToFile/file.foo`. When the pipeline is run again, since the file is already present in the stage path, Nextflow does not try to stage it again; thus, any step that needs this file afterwards will either use the truncated file or fail, in the case that the truncated file causes a process to exit.
Steps to reproduce the problem
1. Run this script `main.nf` (a minimal example of such a script is sketched after these steps): `nextflow run main.nf`
2. Manually interrupt the run with the `CTRL+C` keyboard combination when the message `Staging foreign file` appears on the terminal.
3. Run `main.nf` again: `nextflow run main.nf`
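For context, a minimal `main.nf` of the kind described (the S3 path and process are placeholders; the reporter's actual script is not shown in the issue). Any remote input file triggers the "Staging foreign file" step before the first task runs.

```groovy
// Hypothetical reproduction script (DSL1 style); the S3 path is a placeholder.
remote_file = file('s3://my-bucket/some/large-file.bam')

process count_lines {
    input:
    file x from remote_file

    // interrupting the run while this input is being staged leaves a
    // truncated copy under NXF_WORK/stage/...
    """
    wc -l ${x}
    """
}
```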
Program output
Environment
Additional context
Any