Forcing interruption of pipeline makes staged files not to be re-downloaded #1552

Closed
JoseEspinosa opened this issue Apr 3, 2020 · 12 comments · Fixed by #1713

@JoseEspinosa
Contributor

Bug report

Expected behavior and actual behavior

The interruption (keyboard interruption, CTRL+C) of a pipeline while a file is being staged results in a truncated file being left in NXF_WORK/stage/hashToFile/file.foo. When the pipeline is run again, the file is already present in the stage path, so Nextflow does not try to stage it again; as a result, any later step that needs this file will either use the truncated file or fail, if the truncated file causes a process to exit with an error.

Steps to reproduce the problem

$ cat main.nf

#!/usr/bin/env nextflow

c = Channel.fromSRA( 'SRR1013514' )

process test {

    input:
    set val(id), file(fastq) from c

    """
    gunzip -f *.fastq.gz
    """
}
  1. Run this script main.nf:
    nextflow run main.nf

  2. Interrupt the run with the CTRL+C keyboard combination when the message Staging foreign file appears on the terminal:
    CTRL+C

  3. Run main.nf again:
    nextflow run main.nf

Program output

Apr-03 18:13:14.723 [main] DEBUG nextflow.cli.Launcher - $> nextflow run test_fromSra2.nf
Apr-03 18:13:14.905 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 20.01.0
Apr-03 18:13:14.913 [main] INFO  nextflow.cli.CmdRun - Launching `test_fromSra2.nf` [infallible_kare] - revision: 6090bfff4d
Apr-03 18:13:14.931 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /home/kadomu/git/mein_artic/nextflow.config
Apr-03 18:13:14.932 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/kadomu/git/mein_artic/nextflow.config
Apr-03 18:13:14.950 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-03 18:13:16.266 [main] DEBUG nextflow.Session - Session uuid: 156a9a84-bf24-42f2-8782-6b843ae98a51
Apr-03 18:13:16.266 [main] DEBUG nextflow.Session - Run name: infallible_kare
Apr-03 18:13:16.267 [main] DEBUG nextflow.Session - Executor pool size: 8
Apr-03 18:13:16.274 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 20.01.0 build 5264
  Created: 12-02-2020 10:14 UTC (11:14 CEST)
  System: Linux 5.3.0-42-generic
  Runtime: Groovy 2.5.8 on OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
  Encoding: UTF-8 (UTF-8)
  Process: 16378@kadomu [127.0.1.1]
  CPUs: 8 - Mem: 15.3 GB (3.2 GB) - Swap: 2 GB (869.9 MB)
Apr-03 18:13:16.288 [main] DEBUG nextflow.Session - Work-dir: /home/kadomu/nxf_scratch [ext2/ext3]
Apr-03 18:13:16.310 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Apr-03 18:13:16.312 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-03 18:13:16.456 [main] DEBUG nextflow.Session - Session start invoked
Apr-03 18:13:16.459 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /home/kadomu/git/mein_artic/results/pipeline_info/execution_trace.txt
Apr-03 18:13:16.559 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-03 18:13:16.576 [main] DEBUG nextflow.Session - Workflow process names [dsl1]: test
Apr-03 18:13:16.598 [main] WARN  nextflow.Session - There's no process matching config selector: get_software_versions
Apr-03 18:13:16.637 [ForkJoinPool.commonPool-worker-1] WARN  nextflow.datasource.SraExplorer - Define the NCBI_API_KEY env variable to use NCBI search service -- Read more https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Apr-03 18:13:16.640 [ForkJoinPool.commonPool-worker-1] DEBUG nextflow.datasource.SraExplorer - SRA search url=https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=sra&usehistory=y&retmode=json&term=SRR1013514
Apr-03 18:13:16.709 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
Apr-03 18:13:16.709 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
Apr-03 18:13:16.715 [main] DEBUG nextflow.executor.Executor - [warm up] executor > local
Apr-03 18:13:16.720 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=8; memory=15.3 GB; capacity=8; pollInterval=100ms; dumpInterval=5m
Apr-03 18:13:16.788 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > test -- maxForks: 8; blocking: false
Apr-03 18:13:16.813 [main] DEBUG nextflow.script.ScriptRunner - > Await termination 
Apr-03 18:13:16.813 [main] DEBUG nextflow.Session - Session await
Apr-03 18:13:17.670 [ForkJoinPool.commonPool-worker-1] DEBUG nextflow.datasource.SraExplorer - SRA data request url=https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=sra&retmode=json&query_key=1&WebEnv=NCID_1_71554959_130.14.22.33_9001_1585930397_973352404_0MetA0_S_MegaStore&retstart=0&retmax=1000
Apr-03 18:13:18.533 [Actor Thread 3] DEBUG n.util.BlockingThreadExecutorFactory - Thread pool name=FileTransfer; maxThreads=24; maxQueueSize=72; keepAlive=1m
Apr-03 18:13:18.539 [FileTransfer-thread-1] DEBUG nextflow.file.FilePorter - Local cache found for foreign file ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/004/SRR1013514/SRR1013514.fastq.gz at /home/kadomu/nxf_scratch/stage/8a/233e3d63939e9d5ce2093e6f58b719/SRR1013514.fastq.gz
Apr-03 18:13:18.625 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Apr-03 18:13:18.630 [Task submitter] INFO  nextflow.Session - [b6/0bc8bc] Submitted process > test (1)
Apr-03 18:13:18.717 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: test (1); status: COMPLETED; exit: 1; error: -; workDir: /home/kadomu/nxf_scratch/b6/0bc8bc47b3542ad04ea8f8e43860b9]
Apr-03 18:13:18.741 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'test (1)'

Caused by:
  Process `test (1)` terminated with an error exit status (1)

Command executed:

  gunzip -f *.fastq.gz

Command exit status:
  1

Command output:
  (empty)

Command error:
  
  gzip: SRR1013514.fastq.gz: unexpected end of file

Work dir:
  /home/kadomu/nxf_scratch/b6/0bc8bc47b3542ad04ea8f8e43860b9

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Apr-03 18:13:18.749 [main] DEBUG nextflow.Session - Session await > all process finished
Apr-03 18:13:18.764 [Task monitor] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Apr-03 18:13:18.774 [main] DEBUG nextflow.Session - Session await > all barriers passed
Apr-03 18:13:18.783 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=8ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=7 GB; ]
Apr-03 18:13:18.783 [main] DEBUG nextflow.trace.TraceFileObserver - Flow completing -- flushing trace file
Apr-03 18:13:18.786 [main] DEBUG nextflow.trace.ReportObserver - Flow completing -- rendering html report
Apr-03 18:13:18.958 [main] DEBUG nextflow.trace.ReportObserver - Execution report summary data:
  [{"cpuUsage":null,"process":"test","mem":null,"memUsage":null,"timeUsage":null,"vmem":null,"reads":null,"cpu":null,"time":{"mean":8,"min":8,"q1":8,"q2":8,"q3":8,"max":8,"minLabel":"test (1)","maxLabel":"test (1)","q1Label":"test (1)","q2Label":"test (1)","q3Label":"test (1)"},"writes":null}]
Apr-03 18:13:19.409 [main] DEBUG nextflow.trace.TimelineObserver - Flow completing -- rendering html timeline
Apr-03 18:13:19.423 [main] WARN  nextflow.dag.GraphvizRenderer - To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Apr-03 18:13:19.557 [main] DEBUG nextflow.CacheDB - Closing CacheDB done
Apr-03 18:13:19.560 [main] DEBUG nextflow.Session - AWS S3 uploader shutdown
Apr-03 18:13:19.598 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Environment

  • Nextflow version: [20.01.0 build 5264]
  • Java version: [1.8.0_242]
  • Operating system: [ubuntu 18.04]
  • Bash version: [GNU bash, version 4.4.20(1)]

Additional context

Any

@pditommaso
Member

I think a fix for this problem could be to copy the file to a temporary file name and, once the download is complete, rename it to the expected file name. This should prevent the problem caused by interrupted downloads.

return FileHelper.copyPath(source, target)
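
For illustration, a minimal sketch of that approach using plain java.nio.file calls rather than Nextflow's own helpers; the method name and the '.tmp' suffix are assumptions, not the actual implementation:

import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical helper: copy into a '.tmp' sibling first, then rename into place.
// An interrupted copy leaves only the temp file behind, never a truncated
// file under the expected staging name.
Path stageViaTempFile(Path source, Path target) {
    Path tmp = target.resolveSibling(target.fileName.toString() + '.tmp')
    Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING)
    // ATOMIC_MOVE works on POSIX file systems; object stores such as S3
    // generally cannot rename without copying.
    return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
}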

@JoseEspinosa
Contributor Author

JoseEspinosa commented Apr 6, 2020

Yes, I think so. Regarding staged files, I also found another, somewhat related problem. I updated a gist containing a CSV file online; however, since the file is already staged, the resumed pipeline keeps using the old staged copy.

@pditommaso
Member

Proposed patch [6ddf0d8] fails when using S3 files.

Aug-21 11:10:11.055 [Actor Thread 9] ERROR nextflow.processor.TaskProcessor - Error executing process > 'MULTIQC'

Caused by:
  java.lang.UnsupportedOperationException

java.lang.UnsupportedOperationException: null
	at com.upplication.s3fs.S3FileSystemProvider.move(S3FileSystemProvider.java:558)
	at nextflow.file.FileHelper.movePath(FileHelper.groovy:846)
	at nextflow.file.FilePorter$FileTransfer.stageForeignFile0(FilePorter.groovy:297)
	at nextflow.file.FilePorter$FileTransfer.stageForeignFile(FilePorter.groovy:258)
	at nextflow.file.FilePorter$FileTransfer.run(FilePorter.groovy:244)

@pditommaso pditommaso reopened this Aug 21, 2020
@pditommaso
Member

@wleepang do you know whether it's possible to rename an S3 file without having to copy it?

@maxbates

There is https://docs.aws.amazon.com/cli/latest/reference/s3/mv.html, but it does not appear to be atomic (i.e. it copies behind the scenes).


@pditommaso pditommaso modified the milestones: v20.10.0, v21.01.0 Nov 1, 2020
@stale

stale bot commented Mar 31, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 31, 2021
@pditommaso pditommaso removed this from the v21.04.0 milestone Apr 25, 2021
@stale stale bot removed the stale label Apr 25, 2021
@colingarvey

@pditommaso: Is there any working solution or workaround for this? Unfortunately, if the staging processes are interrupted, downstream processes fail for us due to truncated files.

Also, is there a way to stage S3 data during the process step rather than preemptively via staging? With the Slurm executor we could gain a huge increase in cumulative bandwidth by parallelizing the staging step across many machines.

@pditommaso
Member

Any suggestion on how to improve it?

@colingarvey

colingarvey commented Aug 24, 2021

Hi @pditommaso: My colleague and I thought of a few solutions. I have very limited experience with Groovy and the code base but I am happy to dig deeper and learn.

Sidecar file denoting completion

For each S3 file, could we create an additional sidecar file that marks a transferred file as complete? When a transfer to the staging area finishes, Nextflow could write a file alongside it, e.g. XXXXXXfile.bam and XXXXXXfile.bam.staged, to indicate that the transfer completed successfully. I believe this would be compatible with both POSIX-compliant storage and S3. The PR above seemed to fail because move operations are not atomic on S3, where a move is really a copy followed by a delete. A rough sketch is shown below.
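
For illustration, a hypothetical sketch of the sidecar idea; the marker suffix and method names are invented:

import java.nio.file.Files
import java.nio.file.Path

// Write an empty '<name>.staged' marker only after the transfer has finished.
void markStageComplete(Path staged) {
    Path marker = staged.resolveSibling(staged.fileName.toString() + '.staged')
    if( !Files.exists(marker) )
        Files.createFile(marker)
}

// A previously staged copy is trusted only if its marker also exists.
boolean isStageComplete(Path staged) {
    Path marker = staged.resolveSibling(staged.fileName.toString() + '.staged')
    return Files.exists(staged) && Files.exists(marker)
}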

Source and destination file size comparison

Similar to the rsync algorithm, could we compare the source and destination file sizes to confirm they match? If a transfer was interrupted, Nextflow would then notice on the next run that the staged file's size does not match the source, delete the staged copy, and re-transfer it. See the sketch below.
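
For illustration only, a minimal sketch of such a check; the download(source, target) helper is hypothetical and stands in for whatever transfer routine is actually used:

import java.nio.file.Files
import java.nio.file.Path

// Reuse the cached copy only when its size matches the source; otherwise
// treat it as truncated, delete it and stage it again.
Path stageWithSizeCheck(Path source, Path target) {
    if( Files.exists(target) && Files.size(target) == Files.size(source) )
        return target
    Files.deleteIfExists(target)
    return download(source, target)   // hypothetical transfer routine
}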

(Expensive) Checksum comparisons

I am not a huge fan of this method, since it can be quite expensive for large files, but another approach would be to compare the source and destination file checksums. Perhaps this could be an optional mode for workflows that are mission-critical and sensitive to file corruption.

@pditommaso
Member

I think the source and destination file size comparison should work, and it would only add a little overhead. Nice idea.

@pditommaso
Member

pditommaso commented Dec 9, 2021

I've implemented the file size comparison. This is just enough to solve the problem.

For reference, it would also be possible to implement another strategy: recompute the local staging path, which is derived from the source file metadata (i.e. file path, size and last-modified timestamp), but substitute the local file size for the source file size; if the download completed, this yields the same target path. I leave this as a future enhancement.
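
For illustration, a rough sketch of that alternative; stagingPathFor is a made-up stand-in for the real path derivation, whose actual hashing scheme differs:

import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// Illustrative stand-in: derive the staging path from source URI, size and
// last-modified time.
Path stagingPathFor(Path stageDir, String sourceUri, long size, long lastModified) {
    def hash = MessageDigest.getInstance('MD5')
            .digest("$sourceUri:$size:$lastModified".toString().bytes)
            .encodeHex().toString()
    return stageDir.resolve(hash).resolve(sourceUri.tokenize('/').last())
}

// The staged copy is complete only if recomputing the path with the local
// file size gives back the same location it was originally staged to.
boolean isComplete(Path stageDir, String sourceUri, long lastModified, Path staged, long sourceSize) {
    def expected = stagingPathFor(stageDir, sourceUri, sourceSize, lastModified)
    def actual   = stagingPathFor(stageDir, sourceUri, Files.size(staged), lastModified)
    return actual == expected
}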

Solved by 847789f.

@pditommaso pditommaso added this to the 22.04.0 milestone Dec 10, 2021