Update SortMeRNA to use SilvaDB 138 (for commercial use) #570

Closed
nh13 opened this issue Feb 16, 2021 · 14 comments

Comments

@nh13
Member

nh13 commented Feb 16, 2021

SilvaDB release 138 is now available for commercial use! See: https://www.arb-silva.de/silva-license-information/

@nh13 nh13 changed the title Update SortMeRNA to use SilvaDB 138 (for commercial use)000000000 Update SortMeRNA to use SilvaDB 138 (for commercial use) Feb 16, 2021
@drpatelh
Member

Hi @nh13! Hope you are well! SortMeRNA is one of those tools for which I would like to plead ignorance because I have never used it 😅 How can we accommodate this information into the pipeline? I am aware of issues with run-times as highlighted here but that's off topic.

We do have a parameter that allows you to override the default databases you provide to the pipeline i.e. --ribo_database_manifest but I suspect that's off topic too?

So based on my deductions I am assuming you mean we change the sentences here and here?
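
For readers unfamiliar with the override mentioned above, here is a minimal sketch of preparing a manifest for `--ribo_database_manifest`. It assumes the manifest is a plain text file listing one rRNA FASTA path per line; the FASTA file names and the `write_manifest` helper are hypothetical, so the pipeline's parameter documentation should be checked before relying on this format.

```python
from pathlib import Path
import tempfile


def write_manifest(fasta_paths, manifest_path):
    """Write one FASTA path per line (assumed manifest format)."""
    lines = [str(p) for p in fasta_paths]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical SILVA 138 FASTA files; replace with real local paths.
example_fastas = [
    "rrna_db/silva_138_ssu.fasta",
    "rrna_db/silva_138_lsu.fasta",
]

manifest = Path(tempfile.gettempdir()) / "rrna_manifest.txt"
written = write_manifest(example_fastas, manifest)
print(manifest.read_text(), end="")
```

The resulting file path would then be supplied to the pipeline via `--ribo_database_manifest`.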

@nh13
Member Author

nh13 commented Feb 17, 2021

@drpatelh

It'd be great if SortMeRNA could update them (see this issue), but for nf-core I'd expect to be able to use them for commercial use by default. Also, the SortMeRNA databases are very old (29/11/2014), but like you, I "neither have the time nor the inclination" to update them 😆 !

So why not just align to the full SilvaDB release 138, which allows for both commercial and non-commercial use by default? It is more comprehensive than the set up there? Perhaps some RNA-Seq analysis experts could weigh in?

@drpatelh
Member

I am fairly well versed on the dark side of RNA-seq analysis but I fear this issue falls into the even darker realm of classify my DNA/RNA-type voodoo magic. @apeltzer what do we need to sacrifice here?

@drejom !! Been a while!

@drpatelh
Member

I just saw that you edited the issue @drejom 😂 Fate...hope you are well!

@drejom
Contributor

drejom commented Feb 17, 2021

I am! Just a pandemic and an insurrection between drinks! Looking forward to a UK visit….one day!

@drpatelh drpatelh added this to the 3.1 milestone Apr 11, 2021
@drpatelh
Member

Ping @d4straub @apeltzer. Any ideas how we can incorporate this information into the pipeline? I am planning on getting a release together over the next couple of weeks. Can include this if it's an easy fix. Thanks!

@apeltzer
Member

@d4straub is the person to ask - not too much experience on SortMeRNA / SILVA either, sorry :-(

@d4straub
Contributor

Updating to v4.3.1 would improve runtime, see https://github.com/biocore/sortmerna/releases/tag/v4.3.1
The SILVA database might also have been updated to v138 in v4.3.1, since it was mentioned earlier for 4.2 that the "next release" would come with SILVA v138. Will investigate this next week.

@drpatelh
Member

So I made a concerted effort to use the latest Biocontainer, thinking I could just swap out the container and put my feet up because everything else in the process would just work. No no... A couple of hours later, after hitting segmentation faults and various issues where downstream processes in the pipeline were failing because corrupt FASTQ files were being generated, I gave up to do something else. I also tried to get it to generate uncompressed FASTQ files that I could zip after the process, using the --zip-out parameter. The inline help comments are here but the value evaluation takes completely different types of parameters as defined here. I tried all of those values but had no success. I may be missing something stupendously obvious here, but it appears that bumping the version is going to be more hassle than it's worth. It would be great if someone else could confirm!

The module file is here

@nh13
Member Author

nh13 commented Apr 13, 2021

It may be a better solution to just use bowtie/bwa/etc to align to the rRNA sequences directly and remove those that have any valid mappings. SortMeRNA is still quite slow.
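
A rough sketch of that filtering idea, assuming the reads have already been aligned to an rRNA reference by some aligner that writes SAM: collect the names of reads with any mapped record (SAM flag bit 0x4 not set), then drop those reads from the FASTQ. The helper names and the toy data are hypothetical, and it assumes the first whitespace-delimited token of each FASTQ header matches the SAM read name.

```python
def mapped_read_names(sam_lines):
    """Collect names of reads with at least one mapped alignment record."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            names.add(fields[0])
    return names


def drop_rrna_reads(fastq_lines, rrna_names):
    """Yield 4-line FASTQ records whose read name is not in rrna_names."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:].split()[0]  # strip '@', keep first token
            if name not in rrna_names:
                yield tuple(record)
            record = []


# Toy data: read1 maps to a hypothetical rRNA reference, read2 does not.
sam = [
    "@HD\tVN:1.6",
    "read1\t0\trRNA_18S\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
fastq = ["@read1", "ACGT", "+", "IIII", "@read2", "TTTT", "+", "IIII"]

kept = list(drop_rrna_reads(fastq, mapped_read_names(sam)))
print(kept)
```

Because both mates of a pair share one read name in SAM, a pair would be dropped here if either mate maps, which matches the "any valid mappings" criterion.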

@drpatelh
Member

drpatelh commented Apr 13, 2021

Yup. The newer releases were supposed to address this but it appears that we are now just seeing a different set of issues😅

A metagenomics classifier type approach using Kraken2 would be quite cool too which would bypass the mapping and generate filtered fastqs directly - maybe not as sensitive as mapping if done loosely but would do the trick I think.

I used to run RNA-SeQC for the longest time to get rRNA estimates as a QC metric and then to deal with the counts appropriately downstream if required, before the differential analysis. This pipeline also generates a feature biotypes plot with this info in the MultiQC report. Personally, I think that is the best way and bypasses the need to do any FastQ filtering at all. It appears the links are broken on the RNA-SeQC website too - not doing very well. Time to shut the lid!
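
As an illustration of the QC-metric approach, the rRNA estimate can be as simple as the share of biotype-assigned reads that land on rRNA features. This is only a sketch: the `rrna_percent` helper, the biotype labels, and the counts are hypothetical and are not taken from the pipeline or from RNA-SeQC.

```python
def rrna_percent(biotype_counts, rrna_biotypes=("rRNA",)):
    """Percentage of counted reads assigned to rRNA biotypes."""
    total = sum(biotype_counts.values())
    if total == 0:
        return 0.0
    rrna = sum(biotype_counts.get(b, 0) for b in rrna_biotypes)
    return 100.0 * rrna / total


# Made-up per-biotype read counts for a single sample.
counts = {"protein_coding": 9000, "lncRNA": 500, "rRNA": 500}
pct = rrna_percent(counts)
print(f"rRNA: {pct:.1f}%")
```

A sample with a high value could then be flagged or handled downstream without touching the FASTQ files.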

Have a good evening!

@drpatelh drpatelh removed this from the 3.1 milestone Apr 14, 2021
@d4straub
Contributor

d4straub commented Apr 15, 2021

> It may be a better solution to just use bowtie/bwa/etc to align to the rRNA sequences directly and remove those that have any valid mappings. SortMeRNA is still quite slow.

This might work more or less for an isolate but not for environmental samples (i.e. a mixture of organisms with previously unknown rRNA sequences); here SortMeRNA has advantages. That was my intention when adding SortMeRNA: to make this pipeline fit for metatranscriptomics.

Your tests @drpatelh suggest that it might be better to just stay with version 4.2.0 (despite being slow, but at least not breaking the pipeline, correct?) and attempt to just change the database to silva 138 to allow commercial use. Would that sound fine to you?

@drpatelh
Member

> Your tests @drpatelh suggest that it might be better to just stay with version 4.2.0 (despite being slow, but at least not breaking the pipeline, correct?)

I think this may be the path of least resistance given that the latest release still seems quite buggy and most people aren't using this option when running the pipeline. It would be great if you have some time to confirm this is the case. Bumping the version in the SortMeRNA module and running nextflow run nf-core/rnaseq ..... -r dev should reproduce the errors. Don't worry if you don't have time.

Yup, if we can't update the software version maybe it is worth updating the SILVA databases which I assume are independent and won't break anything with the current tool version in the pipeline (or make it even sloooooooower)?

@drpatelh
Member

drpatelh commented Oct 5, 2021

The latest version of SortMeRNA (v4.3.4) is now working smoothly via a simple update of the existing nf-core/module. It now also supports native compression of output files which is nice. I believe the databases have also been updated as of >4.2.0 as mentioned here so will close this issue!

@drpatelh drpatelh closed this as completed Oct 5, 2021