Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fromSRA input ? #382

Closed
apeltzer opened this issue Apr 10, 2020 · 20 comments
Closed

fromSRA input ? #382

apeltzer opened this issue Apr 10, 2020 · 20 comments
Labels
enhancement New feature or request

Comments

@apeltzer
Copy link
Member

Nextflow supports this - maybe worth a try to allow users to also specify SRA input files?

@apeltzer apeltzer added the enhancement New feature or request label Apr 10, 2020
@apeltzer apeltzer added this to the Unclear Topics / Feature Requests milestone Apr 10, 2020
@ktmeaton
Copy link

I would be very interested in this! I've been trying to figure out what is the best approach to connect SRA input into EAGER. I've unfortunately had limited success using the fromSRA method when trying to download larger files (ex. 5M+ clusters). I often run into downstream errors about unexpected EOF or truncated sequences. My understanding so far is that fromSRA serves up the FTP URL which is used for download. I'd be interested to know if anyone else experiences this issue or if it's just me.

My preferred route for SRA data download has previously been the SRA toolkit using prefetch and fasterq-dump. I know very little about transfer protocols, but maybe since prefetch uses HTTP or Aspera, that is why I find it more robust than FTP.

For EAGER integration, I'm currently playing with including SRA accessions in the tsv-input file. Perhaps by providing an alternate flag --input-sra metadata.tsv ? And the SRA accession can go in the R1 column, and R2 marked as NA:
metadata_mammoth_sra.txt

I'm not sure if this is good way to go about it, but this is what I'm trying for the moment!

@drpatelh
Copy link
Member

drpatelh commented Apr 16, 2020

Myself and @JoseEspinosa have been working on this for the nf-core/viralrecon pipeline. Mainly because this will allow users to be able to download and compare existing COVID-19 data along with their own.

I implemented a simple 3 column design whereby you can provide local/remote paths to SE/PE data or an SRA id can be added in the first column.
https://github.com/nf-core/viralrecon/blob/dev/docs/usage.md#--input

This script then validates the samplesheet and fetches the metadata for the SRA id:
https://github.com/nf-core/viralrecon/blob/dev/bin/check_samplesheet.py

This is important because it allows us to provide any SR* or PR* id as input because they get expanded out to SRR ids here. Also allows us to figure out which samples are SE or PE and whether they were sequenced on Illumina (currently the only supported platform by the pipeline).

This then creates a valid samplesheet that is used by the pipeline:
https://github.com/nf-core/viralrecon/blob/de7b6a96d4cf1b8a5e66cd837d9bf80edb9f6c67/main.nf#L373-L424

Seems to be working pretty well. So far....

@jfy133
Copy link
Member

jfy133 commented Apr 16, 2020

I would be very interested in this! I've been trying to figure out what is the best approach to connect SRA input into EAGER. I've unfortunately had limited success using the fromSRA method when trying to download larger files (ex. 5M+ clusters). I often run into downstream errors about unexpected EOF or truncated sequences. My understanding so far is that fromSRA serves up the FTP URL which is used for download. I'd be interested to know if anyone else experiences this issue or if it's just me.

My preferred route for SRA data download has previously been the SRA toolkit using prefetch and fasterq-dump. I know very little about transfer protocols, but maybe since prefetch uses HTTP or Aspera, that is why I find it more robust than FTP.

For EAGER integration, I'm currently playing with including SRA accessions in the tsv-input file. Perhaps by providing an alternate flag --input-sra metadata.tsv ? And the SRA accession can go in the R1 column, and R2 marked as NA:
metadata_mammoth_sra.txt

I'm not sure if this is good way to go about it, but this is what I'm trying for the moment!

Glad to hear there is more interest in this, and thank you for bringing this up. With the EOFs etc, it's good to know that happens.

Some more notes/thoughts on my half: Actually when @apeltzer first suggested this I had some other reservations but related: I know that a lot of aDNA people upload completely insane stuff (already merged files, stuff still with adapters, only mapped BAM files and no raw-reads), which would also complicate 'guestimating' via metadata what format the data is in (as @drpatelh does). I've also not tried out fromSRA myself yet so I don't have any experiences with the truncated sequences etc. @drpatelh - have you had issues like that in the past?

For the limitations for the FTP transfer, I don't see why we couldn't include sratoolkit in our container environment and have a downloading process, but that wouldn't work for direct ENA stuff as cleanly, and would potentiall add a lot of complexity. I know some sysadmins don't really like Aspera for wierd reasons, so that would have to be another optional functioanlity if we included this. But on the otherhand it seems redundent to make an alternative to the fromSRA nextflow function?

@jfy133
Copy link
Member

jfy133 commented Apr 16, 2020

@ktmeaton Also - I don't think we would need to have a separate input TSV for SRA stuff, just an extra (maybe optional?) column, which anything in those columns would be read with fromSRA rather than fromCSV. So that would save us some work!

@apeltzer
Copy link
Member Author

@ktmeaton Also - I don't think we would need to have a separate input TSV for SRA stuff, just an extra (maybe optional?) column, which anything in those columns would be read with fromSRA rather than fromCSV. So that would save us some work!

Yes, that would definitely work. I'm going to check out Harshils code ( thanks again @drpatelh ) and also thanks @ktmeaton to give us some feedback on use cases here - this is highly appreciated and super helpful to understand what users really need 👍 👏

@drpatelh
Copy link
Member

drpatelh commented Apr 17, 2020

The fromSRA function may not work as well as expected in some cases:
nextflow-io/nextflow#1552

We have chosen to use parallel-fastq-dump instead in the pipeline to download and create the fastq files we need for the pipeline which seems to be behaving so far:
https://github.com/nf-core/viralrecon/blob/460a29f962edc31d7818d63b26b6fad5b98d0cbc/main.nf#L443-L474

Some more notes/thoughts on my half: Actually when @apeltzer first suggested this I had some other reservations but related: I know that a lot of aDNA people upload completely insane stuff (already merged files, stuff still with adapters, only mapped BAM files and no raw-reads), which would also complicate 'guestimating' via metadata what format the data is in (as @drpatelh does). I've also not tried out fromSRA myself yet so I don't have any experiences with the truncated sequences etc. @drpatelh - have you had issues like that in the past?

I intentionally am making very few assumptions about the metadata i.e. only whether its SE/PE and sequenced on ILLUMINA. I would be hesitant to go much further than that to be honest for the reasons you mentioned @jfy133. For example, if you use SRA ids the pipeline fetches and saves the metadata for those ids in the results folder. Below is an example output and you can quite clearly see how the data is filled in inconsistently:
https://gist.github.com/JoseEspinosa/ff437797788366853f2c8a36389cf3eb

@ktmeaton Also - I don't think we would need to have a separate input TSV for SRA stuff, just an extra (maybe optional?) column, which anything in those columns would be read with fromSRA rather than fromCSV. So that would save us some work!

Yes, this can all be done via the same input design providing you have a validation script to check and reformat everything as required to read it into the pipeline. Personally, I think a design file input should be as simple as possible and so adding more columns should only really be done if absolutely necessary! Its tough enough getting users to fill in a 3 column design!

@ktmeaton
Copy link

The fromSRA function may not work as well as expected in some cases:
nextflow-io/nextflow#1552

I have also experienced that issue while doing FTP transfers of bacterial genomes from NCBI's website (which usually work really well in NextFlow). It's very frustrating, so I'm glad someone documented it!

I intentionally am making very few assumptions about the metadata i.e. only whether its SE/PE and sequenced on ILLUMINA. I would be hesitant to go much further than that to be honest for the reasons you mentioned @jfy133. For example, if you use SRA ids the pipeline fetches and saves the metadata for those ids in the results folder. Below is an example output and you can quite clearly see how the data is filled in inconsistently:
https://gist.github.com/JoseEspinosa/ff437797788366853f2c8a36389cf3eb

Similarly, I find that working with the SRA/other people's data requires making fewer assumptions. I have to spend time collecting the relevant Run metadata and reading the source pub for methods to figure out how to process it. For me, blindly putting SRA accessions into Nextflow pipelines has been a recipe for insta-crash or suspicious results 😝 Eliminating guesswork prior to analysis is key.

Yes, this can all be done via the same input design providing you have a validation script to check and reformat everything as required to read it into the pipeline. Personally, I think a design file input should be as simple as possible and so adding more columns should only really be done if absolutely necessary! Its tough enough getting users to fill in a 3 column design!

Simplificity is the ideal, but I confess I would really like to use the sample merging feature that is available in the tsv-input branch of EAGER. For a use case here, I want to analyze a neat Bronze Age plague genome that is split over 19 SRA experiments. So a few extra column about library ID/Lane is very very helpful. I previously managed files like this manually... hence my excitement about this thread.

@apeltzer
Copy link
Member Author

Simplificity is the ideal, but I confess I would really like to use the sample merging feature that is available in the tsv-input branch of EAGER. For a use case here, I want to analyze a neat Bronze Age plague genome that is split over 19 SRA experiments. So a few extra column about library ID/Lane is very very helpful. I previously managed files like this manually... hence my excitement about this thread.

I'm 100% in line with this. That was exactly the reason for having the fromSRA- if that is not really stable at the moment in Nextflow, we can add a separate process, e.g. that handles SRA data more appropriately and generates FastQ files out of given SRA IDs & add the same for ENA in the very same way, for example, using this: https://github.com/wwood/ena-fast-download

That way people could have either SE/PE data, both from SRA + ENA & local data mixed.
If we'd add this - people could simply have either SRA or ENA IDs in a separate column, define what the content of that file is & then be happy about it :-)

@jfy133
Copy link
Member

jfy133 commented Apr 17, 2020

@ktmeaton, I think this would require more input from the wider aDNA community on what their experiences on how often the come across this need, and what issues they find with the data (like my twitter poll from a couple months ago). For exactly this resaon you say here:

For me, blindly putting SRA accessions into Nextflow pipelines has been a recipe for insta-crash or suspicious results

So I propose we won't integrate this into the intial release of the TSV-input (v2.2), but come next release.

The current TSV-input code is starting to mature (only one more process to add for the strip-FASTQ functionality, and reintegrating --reads and converting that to a fake TSV file), so adding a whole new column now would lead to a bit more complexity for getting beta testers to try it out (have you tried the tsv-input branch yet btw? have you got any feedback?). But I think after a survey/another poll we could gauge how much interest there is, and prioritise for v2.3 or not.

@ktmeaton
Copy link

@ktmeaton, I think this would require more input from the wider aDNA community on what their experiences on how often the come across this need, and what issues they find with the data (like my twitter poll from a couple months ago).

That sounds like a good plan! Perhaps this is not a widely needed feature, in that case I can just continue with my local solution.

(have you tried the tsv-input branch yet btw? have you got any feedback?).

The tsv-input branch works great! I just pulled the commit from b2b411b and actually got perfect FTP transfer and integration of SRA data today with this input: metadata_BronzeAge.txt.

So this method works sometimes(?) if you already have pre-parsed out the SRA URL for download.

@jfy133
Copy link
Member

jfy133 commented Apr 21, 2020

@ktmeaton Good to hear! It would be a good test to see which is more stable - the fromSRA vs just a normal FTP link. I see the fromSRA needs the NCBI_API_KEY etc (https://www.nextflow.io/docs/latest/channel.html#fromsra), which would require more documentation etc.

@drpatelh
Copy link
Member

This was bothering me so I spent most of yesterday trying to come up with a more robust solution to deal with SRA, ENA and GEO ids. Still testing before I officially add it to the viralrecon pipeline but based on your example @ktmeaton and an input samplesheet like below:

sample,fastq_1,fastq_2
ERS848976,,

here is the output from the samplesheet validation script I have written in Python.

sample_id,single_end,is_sra,is_ftp,fastq_1,fastq_2,md5_1,md5_2
ERX1097904_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/006/ERR1018966/ERR1018966.fastq.gz,,ad4c8a60c5f37d9ac729af4642043101,
ERX1097905_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/007/ERR1018967/ERR1018967.fastq.gz,,cfde499929357b9e589c4063a2aa5340,
ERX1097906_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/008/ERR1018968/ERR1018968.fastq.gz,,de6a35ae1aa0ccd69cc47b8ec31cc5f7,
ERX1097907_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/009/ERR1018969/ERR1018969.fastq.gz,,545512307384cd5d693ec21763d969c0,
ERX1097908_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/000/ERR1018970/ERR1018970.fastq.gz,,e599480b1cf38d3c65422b95f5ad7b32,
ERX1097909_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/001/ERR1018971/ERR1018971.fastq.gz,,b66ce9be43f84366d34eed1ebb794e7d,
ERX1097910_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/002/ERR1018972/ERR1018972.fastq.gz,,159a355462610093453c57ac939e38cc,
ERX1097911_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/003/ERR1018973/ERR1018973.fastq.gz,,6ccce5a3cb877598bd61436960d9c805,
ERX1097912_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/004/ERR1018974/ERR1018974.fastq.gz,,666d471c79b2ab21a35a4e484e07e1ee,
ERX1097913_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/005/ERR1018975/ERR1018975.fastq.gz,,123531badcf33ca00435a19d23c12c30,
ERX1097914_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/006/ERR1018976/ERR1018976.fastq.gz,,438f88acad4f764cfa5cf1715632b5e2,
ERX1097915_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/007/ERR1018977/ERR1018977.fastq.gz,,3ff320e2e5597b3bb161a3c5bafadd96,
ERX1097916_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/008/ERR1018978/ERR1018978.fastq.gz,,a86a02b8f6df8f5903d378d32280400b,
ERX1097917_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/009/ERR1018979/ERR1018979.fastq.gz,,21e32629ee698683075c61b74d59847b,
ERX1097918_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/000/ERR1018980/ERR1018980.fastq.gz,,ec2e3fab2379f17e6d587149b16dbecb,
ERX1097919_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/001/ERR1018981/ERR1018981.fastq.gz,,c1ccea70936b0363ca0ea72d6909fca3,
ERX1097920_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/002/ERR1018982/ERR1018982.fastq.gz,,b8f8f3c9b94d5f80014b276b28b5a423,
ERX1097921_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/003/ERR1018983/ERR1018983.fastq.gz,,77913696bb20aac5df7cfd929929c3d1,
ERX1097922_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/004/ERR1018984/ERR1018984.fastq.gz,,362807e093e0c5811235e4fdbb76b342,

I also wanted to be able to merge experiments that had multiple runs of the same library. This turned out to be the trickiest bit but I think I have it working now 😅

@apeltzer
Copy link
Member Author

That sounds like what we want ....?! Awesome!

@apeltzer
Copy link
Member Author

apeltzer commented Jun 5, 2020

@ktmeaton
Copy link

https://github.com/nf-core/viralrecon/blob/master/bin/fetch_sra_runinfo.py

That is a beautiful fetch and validation script! For an alternative approach, I'm starting with a SQL database of SRA metadata, and use this script to automatically generate the tsv input file for eager:

https://github.com/ktmeaton/plague-phylogeography/blob/master/scripts/sqlite_EAGER_tsv.py

But I think this is rather atypical usage (starting from a database), and I'd imagine most people using the tsv input option are constructing it manually?

I wish there was a 'compliments' section on GitHub. The tsv input + library merging is working so well for SRA projects 🎉

@apeltzer
Copy link
Member Author

Thank you for all the kind comments on here - I bet the team really appreciates it and are all as happy as I am that this here helps people analyzing their data in a nice and tidy way 😉

Can we add your comment to a new "Testimonials" section in our readme? It would be the first of such testimonials, but we would really appreciate it! If you tweet about your experiences, the nf-core webpage also has a testimonials section, too 😄

There is a preprint btw on the paper for this, we've added you to the acknowledgments section as you were providing so much feedback on using the pipeline - please have a look if you want: https://www.biorxiv.org/content/10.1101/2020.06.11.145615v1

@ktmeaton
Copy link

Congratulations on the preprint! It's exciting to read about all the new changes and tools that have been added. That's very kind of you to add me to the acknowledgements, I'm glad these conversations haven't been too much of burden :) I noticed that my name has a small typo (Katerine -> Katherine), but if you're passed edits that's totally fine (just very excited to be mentioned!).

Certainly, you can add any of my comments to testimonials. Twitter still (admittedly) intimidates me, but I can use this as motivation to learn more!

@ktmeaton
Copy link

Also as an update about SRA input to eager, I've found network stability to be crucial. So far I've found downloading SRA files with nextflow greatly benefits from being on an HPC network (less crashes). Since this method doesn't work smoothly on local workstations (for me), it might be a less desired feature by the community for accessibility reasons?

That is to say, feel free to close this issue whenever you deem appropriate! Unless you have other users advocating for the fromSRA() feature, my current preference is to prep SRA files before (and outside) of nf-core/eager.

@apeltzer
Copy link
Member Author

Sorry for the typo, we'll make sure that the next update contains a corrected reference 😓

@jfy133
Copy link
Member

jfy133 commented Jul 21, 2020

I've pinged you in the correction!

For the SRA stuff - If you're finding it is not stable, I think it is probably better to not offer this, in lieu of other custom scripts that people can use to build it. Maybe we could have a section of the README of 'complementary external helper tools' or something (@apeltzer would that be OK with nf-core guidelines do you think? )

So far I've found the direct standard FTP downloads from the ENA to work 99.99% of the time to be honest. So maybe the fromSRA() function isn't as important either, given FTP/HTTPS links are already recognised by Nextflow and will automatically download..

Also because I've come across more cases recently of people not uploading/labelling their data properly in the ENA (orginally described here #382 (comment)), forcing making people to make their own TSV will hopefully also make them think to check more carefully the nature of the data they are trying to use is (if that makes sense)...

I will close this for now in this case. I will make a new issue though about the 'complementary external helper tools'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants