Fix transcriptome staging issues on DNAnexus for rsem/prepareference #727

drejom · 2021-11-20T02:51:03Z

Description of the bug

Steps to reproduce

Steps to reproduce the behaviour:

When using the app v1.0.0-beta.6, the test profile runs successfully (-profile test,docker -r tar --skip_bbsplit), but I haven't managed a successful run otherwise. It fails fairly quietly somewhere around STAR_ALIGN.

The log from that step shows some output files missing:

dfda3e01f2b6: Verifying Checksum
dfda3e01f2b6: Download complete
7ff999a2256f: Download complete
3aaade50789a: Pull complete
00cf8b9f3d2a: Pull complete
7ff999a2256f: Pull complete
10c3bb32200b: Verifying Checksum
10c3bb32200b: Download complete
1721f154786d: Verifying Checksum
1721f154786d: Download complete
d2ba336f2e44: Pull complete
dfda3e01f2b6: Pull complete
10c3bb32200b: Pull complete
1721f154786d: Pull complete
Digest: sha256:e33a844c7244068c6bf252f4b94e34500be4a62719eeb59dcab260a9da1fcd1d
Status: Downloaded newer image for quay.io/biocontainers/star:2.6.1d--0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Nov 15 22:38:28 ..... started STAR run
Nov 15 22:38:28 ..... loading genome
Nov 15 22:38:41 ..... processing annotations GTF
Nov 15 22:38:48 ..... inserting junctions into the genome indices
Nov 15 22:40:13 ..... started 1st pass mapping
Nov 15 22:41:02 ..... finished 1st pass mapping
Nov 15 22:41:03 ..... inserting junctions into the genome indices
Nov 15 22:42:35 ..... started mapping
CPU: 16% (16 cores) * Memory: 36836/112707MB * Storage: 35/515GB * Net: 55↓/1↑MBps
Nov 15 22:46:46 ..... finished successfully
file-G69K6f09V6kb7Z246qzPYbZx
file-G69K6g09V6kYfPBJ7p18zyZp
file-G69K6j89V6kXV1xB5QjK3kpY
ls: cannot access '*sortedByCoord.out.bam': No such file or directory
ls: cannot access '*Aligned.unsort.out.bam': No such file or directory
ls: cannot access '*fastq.gz': No such file or directory
file-G69K6k89V6kxQxG757xpj0QJ
file-G69K6k89V6kzX6pj7p4qj3p6
file-G69K6pQ9V6kYK0qZ5KpyPqP1
file-G69K6pQ9V6kbjj0f77VfGBj9
file-G69K6k89V6kyG6F562BpxP9g
file-G69K6k89V6kx8f596ZgKZGJ4
file-G69K6qQ9V6kV6f6f6Z3jjZk1
file-G69K6vQ9V6kZp15b6qxGZK5Y
file-G69K6yQ9V6ky0B6J8zq10BbV

However, the log from the run shows an issue with SALMON_QUANT:

Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (AGO1_SCR_rep1)'
Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT` input file name collision -- There are multiple input files for each of the following file names: genome.transcripts.fa
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Execution cancelled -- Finishing pending tasks before exit
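
For readers unfamiliar with this error: as far as I understand it, Nextflow stages each input into the task directory under its base name, so two different source files that share a base name cannot both be staged. Below is a minimal shell sketch of that kind of check. The paths are hypothetical and only loosely modelled on this thread, and `duplicate_basenames` is an illustrative helper, not part of Nextflow.

```shell
#!/bin/sh
# duplicate_basenames: print each base name that appears more than once
# among the given paths (illustrative helper, not a Nextflow function).
duplicate_basenames() (
    for path in "$@"; do
        basename "$path"
    done | sort | uniq -d
)

# Hypothetical staged inputs: two distinct source files share one name.
duplicate_basenames \
    "work/aa/rsem/genome.transcripts.fa" \
    "work/bb/rsem/genome.transcripts.fa" \
    "work/aa/rsem/genome.seq"
```

Run as written, this prints `genome.transcripts.fa`, the one name that two different inputs are competing for.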

The nextflow.log is stuck in an 'open' state, so I can't read, download, or attach it.

System

Hardware: DNAnexus
Version 1.0.0-beta.6

drpatelh · 2021-11-22T11:55:10Z

I am trying to reproduce this on AWS Batch at the moment but I suspect it will work there because our full-sized AWS tests work. Is this something you can help us to debug too please @GHAStVHenry? I'm sure @drejom would be happy to provide you with any info you need.

drpatelh · 2021-11-22T12:00:14Z

@drejom would you mind dumping the contents of .command.sh and .command.run here please (redacting whatever is required), if you have access to them?

pditommaso · 2021-11-23T10:39:52Z

One way to debug this is to copy the .command.sh and .command.run and run them locally (provided you have the dx tool installed).

drejom · 2021-11-24T00:19:08Z

I'm a bit stumped because the process that causes the error (eg SALMON_QUANT in the attached logs) only appears in the error message; there's no record of the job being submitted, so I can't retrieve the run folder or its contents. Not sure how to proceed?

GHAStVHenry · 2021-11-29T00:11:25Z

I found the problem...
There are 2 genome.transcripts.fa files saved in the rsem folder of the work folder.

genome.chrlist 
genome.fa
genome.grp
genome.idx.fa
genome.n2g.idx.fa
genome.seq
genome.ti
genome.transcripts.fa : file-G6X08jj0BGgQy5q2BqvyFk2X
genome.transcripts.fa : file-G6X08Y00BGgV5q1gKgf5zqXP

I'm not familiar with rsem-prepare-reference, but I'm guessing that STAR --runMode genomeGenerate and rsem-prepare-reference both write the file. The container that holds the work folders is blob storage, so the second write creates a second file with a unique file ID; normally the second write would overwrite the first and no one would be the wiser. They have the same md5sum, so they are identical, just created/modified at different times:

Created               Sun Nov 28 16:35:56 2021
Last modified         Sun Nov 28 16:36:01 2021

and

Created               Sun Nov 28 16:36:19 2021
Last modified         Sun Nov 28 16:36:22 2021
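
The checksum comparison described above can be sketched as follows, assuming both copies have been downloaded to local disk and GNU `md5sum` is available. The `same_md5` helper and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# same_md5 FILE1 FILE2: succeed (exit 0) when both files have the same MD5
# digest. Assumes GNU coreutils md5sum is on PATH.
same_md5() (
    # Refuse to compare if either file is unreadable; otherwise two empty
    # digests would wrongly compare as equal.
    [ -r "$1" ] && [ -r "$2" ] || exit 2
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
)

# Usage with hypothetical local copies of the two DNAnexus files:
#   same_md5 copy1/genome.transcripts.fa copy2/genome.transcripts.fa \
#       && echo "identical content"
```

Matching digests only show that the contents are the same; the two objects still have separate file IDs on the platform.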

Is there a way to add a cleanup step to the --star branch of the RSEM_PREPAREREFERENCE process, to clear out the unnecessary copy after STAR --runMode genomeGenerate?

The reason the test profile works is that it takes the fasta as a user input and doesn't use the RSEM_PREPAREREFERENCE process to create it.
The reason I didn't come across this issue when I successfully ran rnaseq 3 months ago on DNAnexus with real data is that my aligner of choice is HISAT2; when you don't choose STAR, an alternative version of the process runs without STAR --runMode genomeGenerate, and therefore there is no duplicate file.

@drpatelh can you add that rm of genome.transcripts.fa after the STAR --runMode genomeGenerate command?

drpatelh · 2021-11-29T16:39:37Z

Hi @GHAStVHenry ! Thanks for troubleshooting this!

In principle, this sounds like a plausible explanation but I am a little confused as to how it is happening with the default parameters used by the pipeline:

STAR and rsem-prepare-reference should only be run sequentially if you use --aligner star_rsem but the default is --aligner star_salmon:

rnaseq/modules/nf-core/modules/rsem/preparereference/main.nf

Lines 36 to 56 in 964425e

    
    STAR \\
        --runMode genomeGenerate \\
        --genomeDir rsem/ \\
        --genomeFastaFiles $fasta \\
        --sjdbGTFfile $gtf \\
        --runThreadN $task.cpus \\
        $memory \\
        $options.args2

    rsem-prepare-reference \\
        --gtf $gtf \\
        --num-threads $task.cpus \\
        ${args.join(' ')} \\
        $fasta \\
        rsem/genome

    cat <<-END_VERSIONS > versions.yml
    ${getProcessName(task.process)}:
        ${getSoftwareName(task.process)}: \$(rsem-calculate-expression --version | sed -e "s/Current version: RSEM v//g")
        star: \$(STAR --version | sed -e "s/STAR_//g")
    END_VERSIONS

I tried running with --aligner star_rsem and changed this line to rsem/genome.test:

rnaseq/modules/nf-core/modules/rsem/preparereference/main.nf

Line 50 in 964425e

rsem/genome

The file listing is below. You will see that we now don't have any genome.transcripts.fa at all, which indicates that STAR isn't creating one beforehand.

Genome
Log.out
SA
SAindex
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
genome.fa
genome.test.chrlist
genome.test.grp
genome.test.idx.fa
genome.test.n2g.idx.fa
genome.test.seq
genome.test.ti
genome.test.transcripts.fa
genomeParameters.txt
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab

So if I had to narrow it down based on your observation, I suspect that when rsem-prepare-reference is used in isolation to create the transcriptome here:

rnaseq/modules/nf-core/modules/rsem/preparereference/main.nf

Lines 60 to 65 in 964425e

    
    rsem-prepare-reference \\
        --gtf $gtf \\
        --num-threads $task.cpus \\
        $options.args \\
        $fasta \\
        rsem/genome

it is somehow writing a file with the same name internally, but I still need to confirm this.

drpatelh · 2021-11-29T16:43:20Z

@GHAStVHenry are you able to upload those files here, or check whether they differ at all, along with any timestamps?

genome.transcripts.fa : file-G6X08jj0BGgQy5q2BqvyFk2X
genome.transcripts.fa : file-G6X08Y00BGgV5q1gKgf5zqXP

GHAStVHenry · 2021-11-29T16:51:21Z

The files have the same md5sums, so they'll be the same; let me know if you want me to upload them. There is a sequential write-time difference; the actual times are above.

Hmmm, you are right re- default --aligner star_salmon... the .command.sh shows only the rsem-prepare-reference not the sequential STAR.

Actually my first post firmly put the blame in the rsem-prepare-reference and then I read the conditional for the sequential STAR and changed my mind, forgetting that that isn't even what happened with my test. I'm not familiar with rsem-prepare-reference but it seems weird that it would write the same file twice... but that seems to be what's happening.

drpatelh · 2021-11-29T16:55:02Z

Ok. Thanks! If they are the same then that may be less problematic otherwise we would have no way of picking which one to take (or telling NF to anyway).

I have pushed a quick fix to the tar branch for this in d98e7a2

This just takes a hard-named file called genome.transcripts.fa instead of a glob which was causing the original staging issue. Hopefully, this means any one of the above files will be used downstream in the pipeline and should solve our issue!

Would you mind giving it a go with -r tar in the command?

GHAStVHenry · 2021-11-29T18:15:06Z

Alright, tried it... it kinda sorta worked... the
Process NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT input file name collision -- There are multiple input files for each of the following file names: genome.transcripts.fa
is resolved... but now
SALMON_QUANT started but had this error...

============
Exception : [The provided transcript file: "genome.transcripts.fa" does not exist!
]
============

SALMON_QUANT's work folder didn't have the fasta in there... actually no inputs are there, but looking at other process work folders, it doesn't appear that inputs are saved/retained, so I can't confirm this.

GHAStVHenry · 2021-11-29T20:49:04Z

Alright, I think I understand the problem... it isn't rsem-prepare-reference

rnaseq/modules/nf-core/modules/rsem/preparereference/main.nf

Lines 25 to 28 in 964425e

    
    output:
    path "rsem"                , emit: index
    path "rsem/*transcripts.fa", emit: transcript_fasta
    path "versions.yml"        , emit: versions

The process outputs the transcripts fasta twice: once inside the index output (the rsem folder) and once explicitly as transcript_fasta. Does index need the transcripts fasta in there?

I'm not sure which, but DNAnexus/Nextflow is capable of handling the multiple files of the same name as is evidenced by my test from above:
WARN: Multiple files matching path: 'dx://container-G6XQbfQ02QXpYQxB5211v0Vz:/scratch/d0/06fbf7008a445d76956859f0e94be7/rsem' -- picking: file-G6XQx8j02QXkZy4z59fF1XQK
Each glob based output sees multiple files and chooses one, but then because there are 2 inputs containing the same file, 2 copies get written anyway, which causes the conflict.
...but for some reason with your modification no file is getting sent now... I can't confirm that it really isn't getting there, inputs aren't uploaded/retained in the work folders so I don't know if it was there.
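
To illustrate the overlap described above: any file matched by `rsem/*transcripts.fa` also lives inside the `rsem` directory, so it is reachable through both output declarations. The sketch below builds a throwaway mock directory with placeholder files and lists what would be emitted through both routes. `emitted_twice` is a hypothetical helper, not a Nextflow or DNAnexus command, and the sketch says nothing about how the platform stores the files.

```shell
#!/bin/sh
# emitted_twice DIR PATTERN: list files that sit inside DIR (the directory
# output) and also match DIR/PATTERN (the glob output).
emitted_twice() (
    dir=$1
    pattern=$2
    for f in "$dir"/$pattern; do
        # An unmatched glob is left as a literal string, so test for existence.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
)

# Mock work directory with empty placeholder files.
demo=$(mktemp -d)
mkdir "$demo/rsem"
touch "$demo/rsem/genome.seq" "$demo/rsem/genome.transcripts.fa"
( cd "$demo" && emitted_twice rsem '*transcripts.fa' )
rm -rf "$demo"
```

Run as written, this prints `rsem/genome.transcripts.fa`: the same file is captured by the `index` directory and by the `transcript_fasta` glob.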

I set up some tests, but I accidentally pushed my commits to the wrong remote, I sent it to NFCore repo (tar branch) instead of my fork, but if it doesn't work, or if you have a better way of fixing it, you can revert it. Will update with result of test...

GHAStVHenry · 2021-11-29T23:44:12Z

Ok, after fighting with the glob output of index to try and exclude the transcripts fasta finally got it working... it FINALLY got past SALMON_QUANT... now waiting to see if it gets through the end to see if index needs transcripts fasta in it

EDIT: WORKED!!!

drpatelh · 2021-11-30T11:58:06Z

Awesome! Great work! 🥳

Ok I pushed a last commit based on what you found. This is just so we don't mess with the default files generated by RSEM and pass them all along in the index.

Would you mind running a last test using the -r tar branch?

cbae500

GHAStVHenry · 2021-11-30T14:43:40Z

The first test I tried was similar to that and didn't work... yours did though, at least it got past SALMON_QUANT, will update once it gets to the end!

EDIT: WORKED!!!

drejom · 2021-11-30T20:15:30Z

Worked for me too! Rippa!! Thanks @GHAStVHenry @drpatelh

drpatelh · 2021-11-30T21:51:20Z

Rippa <- 🤣

Ok. Will leave this open until we properly push the fixes into the main pipeline. Thanks guys.

drejom added the bug Something isn't working label Nov 20, 2021

drejom mentioned this issue Nov 20, 2021

Untar needs --no-same-owner on DNAnexus #725

Closed

4 tasks

drpatelh added this to the 3.5 milestone Dec 13, 2021

drpatelh mentioned this issue Dec 13, 2021

Fix transcriptome staging issues on DNAnexus for rsem/prepareference nf-core/modules#1163

Merged

drpatelh closed this as completed in nf-core/modules#1163 Dec 13, 2021

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 13, 2021

Fix nf-core#727

f4c86b6

drpatelh changed the title ~~NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN fails on dnanexus~~ Fix transcriptome staging issues on DNAnexus for rsem/prepareference Dec 13, 2021

drpatelh mentioned this issue Dec 13, 2021

Pre-release issue fixes #732

Merged

drpatelh mentioned this issue Mar 2, 2023

Wrap params string in file method nf-core/viralrecon#358

Closed

10 tasks
