Forcing interruption of pipeline makes staged files not to be re-downloaded #1552

Closed
JoseEspinosa opened this issue Apr 3, 2020 · 12 comments · Fixed by #1713

@JoseEspinosa
Contributor

Bug report

Expected behavior and actual behavior

The interruption (keyboard interruption, CTRL+C) of a pipeline while a file is being staged results in a truncated file being left in NXF_WORK/stage/hashToFile/file.foo. When the pipeline is run again, the file is already present in the stage path, so Nextflow does not try to stage it again; as a result, any later step that needs this file will either use the truncated file or fail, if the truncated file causes a process to exit with an error.

Steps to reproduce the problem

$ cat main.nf

#!/usr/bin/env nextflow

c = Channel.fromSRA( 'SRR1013514' )

process test {

    input:
    set val(id), file(fastq) from c

    """
    gunzip -f *.fastq.gz
    """
}
  1. Run this script main.nf:
    nextflow run main.nf

  2. Interrupt the run with the CTRL+C keyboard combination when the message Staging foreign file appears on the terminal:
    CTRL+C

  3. Run main.nf again:
    nextflow run main.nf

Program output

Apr-03 18:13:14.723 [main] DEBUG nextflow.cli.Launcher - $> nextflow run test_fromSra2.nf
Apr-03 18:13:14.905 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 20.01.0
Apr-03 18:13:14.913 [main] INFO  nextflow.cli.CmdRun - Launching `test_fromSra2.nf` [infallible_kare] - revision: 6090bfff4d
Apr-03 18:13:14.931 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /home/kadomu/git/mein_artic/nextflow.config
Apr-03 18:13:14.932 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/kadomu/git/mein_artic/nextflow.config
Apr-03 18:13:14.950 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-03 18:13:16.266 [main] DEBUG nextflow.Session - Session uuid: 156a9a84-bf24-42f2-8782-6b843ae98a51
Apr-03 18:13:16.266 [main] DEBUG nextflow.Session - Run name: infallible_kare
Apr-03 18:13:16.267 [main] DEBUG nextflow.Session - Executor pool size: 8
Apr-03 18:13:16.274 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 20.01.0 build 5264
  Created: 12-02-2020 10:14 UTC (11:14 CEST)
  System: Linux 5.3.0-42-generic
  Runtime: Groovy 2.5.8 on OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
  Encoding: UTF-8 (UTF-8)
  Process: 16378@kadomu [127.0.1.1]
  CPUs: 8 - Mem: 15.3 GB (3.2 GB) - Swap: 2 GB (869.9 MB)
Apr-03 18:13:16.288 [main] DEBUG nextflow.Session - Work-dir: /home/kadomu/nxf_scratch [ext2/ext3]
Apr-03 18:13:16.310 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Apr-03 18:13:16.312 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-03 18:13:16.456 [main] DEBUG nextflow.Session - Session start invoked
Apr-03 18:13:16.459 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /home/kadomu/git/mein_artic/results/pipeline_info/execution_trace.txt
Apr-03 18:13:16.559 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-03 18:13:16.576 [main] DEBUG nextflow.Session - Workflow process names [dsl1]: test
Apr-03 18:13:16.598 [main] WARN  nextflow.Session - There's no process matching config selector: get_software_versions
Apr-03 18:13:16.637 [ForkJoinPool.commonPool-worker-1] WARN  nextflow.datasource.SraExplorer - Define the NCBI_API_KEY env variable to use NCBI search service -- Read more https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Apr-03 18:13:16.640 [ForkJoinPool.commonPool-worker-1] DEBUG nextflow.datasource.SraExplorer - SRA search url=https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=sra&usehistory=y&retmode=json&term=SRR1013514
Apr-03 18:13:16.709 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
Apr-03 18:13:16.709 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
Apr-03 18:13:16.715 [main] DEBUG nextflow.executor.Executor - [warm up] executor > local
Apr-03 18:13:16.720 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=8; memory=15.3 GB; capacity=8; pollInterval=100ms; dumpInterval=5m
Apr-03 18:13:16.788 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > test -- maxForks: 8; blocking: false
Apr-03 18:13:16.813 [main] DEBUG nextflow.script.ScriptRunner - > Await termination 
Apr-03 18:13:16.813 [main] DEBUG nextflow.Session - Session await
Apr-03 18:13:17.670 [ForkJoinPool.commonPool-worker-1] DEBUG nextflow.datasource.SraExplorer - SRA data request url=https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=sra&retmode=json&query_key=1&WebEnv=NCID_1_71554959_130.14.22.33_9001_1585930397_973352404_0MetA0_S_MegaStore&retstart=0&retmax=1000
Apr-03 18:13:18.533 [Actor Thread 3] DEBUG n.util.BlockingThreadExecutorFactory - Thread pool name=FileTransfer; maxThreads=24; maxQueueSize=72; keepAlive=1m
Apr-03 18:13:18.539 [FileTransfer-thread-1] DEBUG nextflow.file.FilePorter - Local cache found for foreign file ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/004/SRR1013514/SRR1013514.fastq.gz at /home/kadomu/nxf_scratch/stage/8a/233e3d63939e9d5ce2093e6f58b719/SRR1013514.fastq.gz
Apr-03 18:13:18.625 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Apr-03 18:13:18.630 [Task submitter] INFO  nextflow.Session - [b6/0bc8bc] Submitted process > test (1)
Apr-03 18:13:18.717 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: test (1); status: COMPLETED; exit: 1; error: -; workDir: /home/kadomu/nxf_scratch/b6/0bc8bc47b3542ad04ea8f8e43860b9]
Apr-03 18:13:18.741 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'test (1)'

Caused by:
  Process `test (1)` terminated with an error exit status (1)

Command executed:

  gunzip -f *.fastq.gz

Command exit status:
  1

Command output:
  (empty)

Command error:
  
  gzip: SRR1013514.fastq.gz: unexpected end of file

Work dir:
  /home/kadomu/nxf_scratch/b6/0bc8bc47b3542ad04ea8f8e43860b9

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Apr-03 18:13:18.749 [main] DEBUG nextflow.Session - Session await > all process finished
Apr-03 18:13:18.764 [Task monitor] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Apr-03 18:13:18.774 [main] DEBUG nextflow.Session - Session await > all barriers passed
Apr-03 18:13:18.783 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=8ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=7 GB; ]
Apr-03 18:13:18.783 [main] DEBUG nextflow.trace.TraceFileObserver - Flow completing -- flushing trace file
Apr-03 18:13:18.786 [main] DEBUG nextflow.trace.ReportObserver - Flow completing -- rendering html report
Apr-03 18:13:18.958 [main] DEBUG nextflow.trace.ReportObserver - Execution report summary data:
  [{"cpuUsage":null,"process":"test","mem":null,"memUsage":null,"timeUsage":null,"vmem":null,"reads":null,"cpu":null,"time":{"mean":8,"min":8,"q1":8,"q2":8,"q3":8,"max":8,"minLabel":"test (1)","maxLabel":"test (1)","q1Label":"test (1)","q2Label":"test (1)","q3Label":"test (1)"},"writes":null}]
Apr-03 18:13:19.409 [main] DEBUG nextflow.trace.TimelineObserver - Flow completing -- rendering html timeline
Apr-03 18:13:19.423 [main] WARN  nextflow.dag.GraphvizRenderer - To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Apr-03 18:13:19.557 [main] DEBUG nextflow.CacheDB - Closing CacheDB done
Apr-03 18:13:19.560 [main] DEBUG nextflow.Session - AWS S3 uploader shutdown
Apr-03 18:13:19.598 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Environment

  • Nextflow version: [20.01.0 build 5264]
  • Java version: [1.8.0_242]
  • Operating system: [ubuntu 18.04]
  • Bash version: [GNU bash, version 4.4.20(1)]

Additional context

Any

@pditommaso
Member

I think a fix for this problem could be to copy the file to a temporary file name and, once the download is complete, rename it to the expected file name. This should prevent the problem caused by interrupted downloads.

return FileHelper.copyPath(source, target)
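
For illustration, a minimal sketch of that approach using plain java.nio.file calls rather than Nextflow's own helpers; the method name and the '.tmp' suffix are assumptions, not the actual implementation:

import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical helper: copy into a '.tmp' sibling first, then rename into place.
// An interrupted copy leaves only the temp file behind, never a truncated
// file under the expected staging name.
Path stageViaTempFile(Path source, Path target) {
    Path tmp = target.resolveSibling(target.fileName.toString() + '.tmp')
    Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING)
    // ATOMIC_MOVE works on POSIX file systems; object stores such as S3
    // generally cannot rename without copying.
    return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
}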

@JoseEspinosa
Contributor Author

JoseEspinosa commented Apr 6, 2020

Yes, I think so. Regarding staged files, I also found another, somewhat related problem. I updated a gist containing a CSV file online; however, since the file is already staged, the resumed pipeline keeps using the old staged copy.

@pditommaso
Member

Proposed patch [6ddf0d8] fails when using S3 files.

Aug-21 11:10:11.055 [Actor Thread 9] ERROR nextflow.processor.TaskProcessor - Error executing process > 'MULTIQC'

Caused by:
  java.lang.UnsupportedOperationException

java.lang.UnsupportedOperationException: null
	at com.upplication.s3fs.S3FileSystemProvider.move(S3FileSystemProvider.java:558)
	at nextflow.file.FileHelper.movePath(FileHelper.groovy:846)
	at nextflow.file.FilePorter$FileTransfer.stageForeignFile0(FilePorter.groovy:297)
	at nextflow.file.FilePorter$FileTransfer.stageForeignFile(FilePorter.groovy:258)
	at nextflow.file.FilePorter$FileTransfer.run(FilePorter.groovy:244)

@pditommaso pditommaso reopened this Aug 21, 2020
@pditommaso
Member

@wleepang do you know whether it's possible to rename an S3 file without having to copy it?

@maxbates

There is https://docs.aws.amazon.com/cli/latest/reference/s3/mv.html, but it does not appear to be atomic (i.e. it copies behind the scenes).


@pditommaso pditommaso modified the milestones: v20.10.0, v21.01.0 Nov 1, 2020
@stale

stale bot commented Mar 31, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 31, 2021
@pditommaso pditommaso removed this from the v21.04.0 milestone Apr 25, 2021
@stale stale bot removed the stale label Apr 25, 2021
@colingarvey

@pditommaso: Is there any working solution or workaround for this? Unfortunately, if the staging processes are interrupted, downstream processes fail for us due to truncated files.

Also, is there a way to stage S3 data during the process step rather than preemptively via staging? With the Slurm executor we could gain a huge increase in cumulative bandwidth by parallelizing the staging step across many machines.

@pditommaso
Member

Any suggestion on how to improve it?

@colingarvey

colingarvey commented Aug 24, 2021

Hi @pditommaso: My colleague and I thought of a few solutions. I have very limited experience with Groovy and the code base but I am happy to dig deeper and learn.

Sidecar file denoting completion

For each S3 file, could we create an additional sidecar file that marks a transferred file as complete? When a transfer to the staging area finishes, Nextflow could write a file alongside it, e.g. XXXXXXfile.bam and XXXXXXfile.bam.staged, to indicate that the transfer completed successfully. I believe this would be compatible with both POSIX-compliant storage and S3. The PR above seemed to fail because move operations are not atomic on S3, where a move is really a copy followed by a delete. A rough sketch is shown below.
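
For illustration, a hypothetical sketch of the sidecar idea; the marker suffix and method names are invented:

import java.nio.file.Files
import java.nio.file.Path

// Write an empty '<name>.staged' marker only after the transfer has finished.
void markStageComplete(Path staged) {
    Path marker = staged.resolveSibling(staged.fileName.toString() + '.staged')
    if( !Files.exists(marker) )
        Files.createFile(marker)
}

// A previously staged copy is trusted only if its marker also exists.
boolean isStageComplete(Path staged) {
    Path marker = staged.resolveSibling(staged.fileName.toString() + '.staged')
    return Files.exists(staged) && Files.exists(marker)
}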

Source and destination file size comparison

Similar to the rsync algorithm, could we compare the source and destination file sizes to confirm they match? If a transfer was interrupted, Nextflow would then notice on the next run that the staged file's size does not match the source, delete the staged copy, and re-transfer it. See the sketch below.
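
For illustration only, a minimal sketch of such a check; the download(source, target) helper is hypothetical and stands in for whatever transfer routine is actually used:

import java.nio.file.Files
import java.nio.file.Path

// Reuse the cached copy only when its size matches the source; otherwise
// treat it as truncated, delete it and stage it again.
Path stageWithSizeCheck(Path source, Path target) {
    if( Files.exists(target) && Files.size(target) == Files.size(source) )
        return target
    Files.deleteIfExists(target)
    return download(source, target)   // hypothetical transfer routine
}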

(Expensive) Checksum comparisons

I am not a huge fan of this method, since it can be quite expensive for large files, but another approach would be to compare the source and destination file checksums. Perhaps this could be an optional mode for workflows that are mission-critical and sensitive to file corruption.

@pditommaso
Member

I think the source and destination file size comparison should work, and it would only add a little overhead. Nice idea.

@pditommaso
Member

pditommaso commented Dec 9, 2021

I've implemented the file size comparison. This is just enough to solve the problem.

For reference, it would also be possible to implement another strategy: recompute the local staging path, which is derived from the source file metadata (i.e. file path, size and last-modified timestamp), but substitute the local file size for the source file size; if the download completed, this yields the same target path. I leave this as a future enhancement.
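
For illustration, a rough sketch of that alternative; stagingPathFor is a made-up stand-in for the real path derivation, whose actual hashing scheme differs:

import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// Illustrative stand-in: derive the staging path from source URI, size and
// last-modified time.
Path stagingPathFor(Path stageDir, String sourceUri, long size, long lastModified) {
    def hash = MessageDigest.getInstance('MD5')
            .digest("$sourceUri:$size:$lastModified".toString().bytes)
            .encodeHex().toString()
    return stageDir.resolve(hash).resolve(sourceUri.tokenize('/').last())
}

// The staged copy is complete only if recomputing the path with the local
// file size gives back the same location it was originally staged to.
boolean isComplete(Path stageDir, String sourceUri, long lastModified, Path staged, long sourceSize) {
    def expected = stagingPathFor(stageDir, sourceUri, sourceSize, lastModified)
    def actual   = stagingPathFor(stageDir, sourceUri, Files.size(staged), lastModified)
    return actual == expected
}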

Solved by 847789f.

@pditommaso pditommaso added this to the 22.04.0 milestone Dec 10, 2021