-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fromSRA input ? #382
Comments
I would be very interested in this! I've been trying to figure out what is the best approach to connect SRA input into EAGER. I've unfortunately had limited success using the My preferred route for SRA data download has previously been the SRA toolkit using For EAGER integration, I'm currently playing with including SRA accessions in the tsv-input file. Perhaps by providing an alternate flag I'm not sure if this is good way to go about it, but this is what I'm trying for the moment! |
Myself and @JoseEspinosa have been working on this for the nf-core/viralrecon pipeline. Mainly because this will allow users to be able to download and compare existing COVID-19 data along with their own. I implemented a simple 3 column design whereby you can provide local/remote paths to SE/PE data or an SRA id can be added in the first column. This script then validates the samplesheet and fetches the metadata for the SRA id: This is important because it allows us to provide any SR* or PR* id as input because they get expanded out to SRR ids here. Also allows us to figure out which samples are SE or PE and whether they were sequenced on Illumina (currently the only supported platform by the pipeline). This then creates a valid samplesheet that is used by the pipeline: Seems to be working pretty well. So far.... |
Glad to hear there is more interest in this, and thank you for bringing this up. With the EOFs etc, it's good to know that happens. Some more notes/thoughts on my half: Actually when @apeltzer first suggested this I had some other reservations but related: I know that a lot of aDNA people upload completely insane stuff (already merged files, stuff still with adapters, only mapped BAM files and no raw-reads), which would also complicate 'guestimating' via metadata what format the data is in (as @drpatelh does). I've also not tried out For the limitations for the FTP transfer, I don't see why we couldn't include |
@ktmeaton Also - I don't think we would need to have a separate input TSV for SRA stuff, just an extra (maybe optional?) column, which anything in those columns would be read with |
Yes, that would definitely work. I'm going to check out Harshils code ( thanks again @drpatelh ) and also thanks @ktmeaton to give us some feedback on use cases here - this is highly appreciated and super helpful to understand what users really need 👍 👏 |
The We have chosen to use
I intentionally am making very few assumptions about the metadata i.e. only whether its SE/PE and sequenced on ILLUMINA. I would be hesitant to go much further than that to be honest for the reasons you mentioned @jfy133. For example, if you use SRA ids the pipeline fetches and saves the metadata for those ids in the results folder. Below is an example output and you can quite clearly see how the data is filled in inconsistently:
Yes, this can all be done via the same input design providing you have a validation script to check and reformat everything as required to read it into the pipeline. Personally, I think a design file input should be as simple as possible and so adding more columns should only really be done if absolutely necessary! Its tough enough getting users to fill in a 3 column design! |
I have also experienced that issue while doing FTP transfers of bacterial genomes from NCBI's website (which usually work really well in NextFlow). It's very frustrating, so I'm glad someone documented it!
Similarly, I find that working with the SRA/other people's data requires making fewer assumptions. I have to spend time collecting the relevant Run metadata and reading the source pub for methods to figure out how to process it. For me, blindly putting SRA accessions into Nextflow pipelines has been a recipe for insta-crash or suspicious results 😝 Eliminating guesswork prior to analysis is key.
Simplificity is the ideal, but I confess I would really like to use the sample merging feature that is available in the tsv-input branch of EAGER. For a use case here, I want to analyze a neat Bronze Age plague genome that is split over 19 SRA experiments. So a few extra column about library ID/Lane is very very helpful. I previously managed files like this manually... hence my excitement about this thread. |
I'm 100% in line with this. That was exactly the reason for having the That way people could have either SE/PE data, both from SRA + ENA & local data mixed. |
@ktmeaton, I think this would require more input from the wider aDNA community on what their experiences on how often the come across this need, and what issues they find with the data (like my twitter poll from a couple months ago). For exactly this resaon you say here:
So I propose we won't integrate this into the intial release of the TSV-input (v2.2), but come next release. The current TSV-input code is starting to mature (only one more process to add for the strip-FASTQ functionality, and reintegrating |
That sounds like a good plan! Perhaps this is not a widely needed feature, in that case I can just continue with my local solution.
The tsv-input branch works great! I just pulled the commit from b2b411b and actually got perfect FTP transfer and integration of SRA data today with this input: metadata_BronzeAge.txt. So this method works sometimes(?) if you already have pre-parsed out the SRA URL for download. |
@ktmeaton Good to hear! It would be a good test to see which is more stable - the fromSRA vs just a normal FTP link. I see the fromSRA needs the NCBI_API_KEY etc (https://www.nextflow.io/docs/latest/channel.html#fromsra), which would require more documentation etc. |
This was bothering me so I spent most of yesterday trying to come up with a more robust solution to deal with sample,fastq_1,fastq_2
ERS848976,, here is the output from the samplesheet validation script I have written in Python. sample_id,single_end,is_sra,is_ftp,fastq_1,fastq_2,md5_1,md5_2
ERX1097904_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/006/ERR1018966/ERR1018966.fastq.gz,,ad4c8a60c5f37d9ac729af4642043101,
ERX1097905_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/007/ERR1018967/ERR1018967.fastq.gz,,cfde499929357b9e589c4063a2aa5340,
ERX1097906_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/008/ERR1018968/ERR1018968.fastq.gz,,de6a35ae1aa0ccd69cc47b8ec31cc5f7,
ERX1097907_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/009/ERR1018969/ERR1018969.fastq.gz,,545512307384cd5d693ec21763d969c0,
ERX1097908_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/000/ERR1018970/ERR1018970.fastq.gz,,e599480b1cf38d3c65422b95f5ad7b32,
ERX1097909_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/001/ERR1018971/ERR1018971.fastq.gz,,b66ce9be43f84366d34eed1ebb794e7d,
ERX1097910_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/002/ERR1018972/ERR1018972.fastq.gz,,159a355462610093453c57ac939e38cc,
ERX1097911_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/003/ERR1018973/ERR1018973.fastq.gz,,6ccce5a3cb877598bd61436960d9c805,
ERX1097912_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/004/ERR1018974/ERR1018974.fastq.gz,,666d471c79b2ab21a35a4e484e07e1ee,
ERX1097913_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/005/ERR1018975/ERR1018975.fastq.gz,,123531badcf33ca00435a19d23c12c30,
ERX1097914_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/006/ERR1018976/ERR1018976.fastq.gz,,438f88acad4f764cfa5cf1715632b5e2,
ERX1097915_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/007/ERR1018977/ERR1018977.fastq.gz,,3ff320e2e5597b3bb161a3c5bafadd96,
ERX1097916_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/008/ERR1018978/ERR1018978.fastq.gz,,a86a02b8f6df8f5903d378d32280400b,
ERX1097917_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/009/ERR1018979/ERR1018979.fastq.gz,,21e32629ee698683075c61b74d59847b,
ERX1097918_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/000/ERR1018980/ERR1018980.fastq.gz,,ec2e3fab2379f17e6d587149b16dbecb,
ERX1097919_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/001/ERR1018981/ERR1018981.fastq.gz,,c1ccea70936b0363ca0ea72d6909fca3,
ERX1097920_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/002/ERR1018982/ERR1018982.fastq.gz,,b8f8f3c9b94d5f80014b276b28b5a423,
ERX1097921_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/003/ERR1018983/ERR1018983.fastq.gz,,77913696bb20aac5df7cfd929929c3d1,
ERX1097922_T1,1,1,1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/004/ERR1018984/ERR1018984.fastq.gz,,362807e093e0c5811235e4fdbb76b342, I also wanted to be able to merge experiments that had multiple runs of the same library. This turned out to be the trickiest bit but I think I have it working now 😅 |
That sounds like what we want ....?! Awesome! |
https://github.com/nf-core/viralrecon/blob/master/bin/fetch_sra_runinfo.py Just to keep this nice here :_) |
That is a beautiful fetch and validation script! For an alternative approach, I'm starting with a SQL database of SRA metadata, and use this script to automatically generate the tsv input file for eager: https://github.com/ktmeaton/plague-phylogeography/blob/master/scripts/sqlite_EAGER_tsv.py But I think this is rather atypical usage (starting from a database), and I'd imagine most people using the tsv input option are constructing it manually? I wish there was a 'compliments' section on GitHub. The tsv input + library merging is working so well for SRA projects 🎉 |
Thank you for all the kind comments on here - I bet the team really appreciates it and are all as happy as I am that this here helps people analyzing their data in a nice and tidy way 😉 Can we add your comment to a new "Testimonials" section in our readme? It would be the first of such testimonials, but we would really appreciate it! If you tweet about your experiences, the nf-core webpage also has a testimonials section, too 😄 There is a preprint btw on the paper for this, we've added you to the acknowledgments section as you were providing so much feedback on using the pipeline - please have a look if you want: https://www.biorxiv.org/content/10.1101/2020.06.11.145615v1 |
Congratulations on the preprint! It's exciting to read about all the new changes and tools that have been added. That's very kind of you to add me to the acknowledgements, I'm glad these conversations haven't been too much of burden :) I noticed that my name has a small typo (Katerine -> Katherine), but if you're passed edits that's totally fine (just very excited to be mentioned!). Certainly, you can add any of my comments to testimonials. Twitter still (admittedly) intimidates me, but I can use this as motivation to learn more! |
Also as an update about SRA input to eager, I've found network stability to be crucial. So far I've found downloading SRA files with nextflow greatly benefits from being on an HPC network (less crashes). Since this method doesn't work smoothly on local workstations (for me), it might be a less desired feature by the community for accessibility reasons? That is to say, feel free to close this issue whenever you deem appropriate! Unless you have other users advocating for the |
Sorry for the typo, we'll make sure that the next update contains a corrected reference 😓 |
I've pinged you in the correction! For the SRA stuff - If you're finding it is not stable, I think it is probably better to not offer this, in lieu of other custom scripts that people can use to build it. Maybe we could have a section of the README of 'complementary external helper tools' or something (@apeltzer would that be OK with nf-core guidelines do you think? ) So far I've found the direct standard FTP downloads from the ENA to work 99.99% of the time to be honest. So maybe the Also because I've come across more cases recently of people not uploading/labelling their data properly in the ENA (orginally described here #382 (comment)), forcing making people to make their own TSV will hopefully also make them think to check more carefully the nature of the data they are trying to use is (if that makes sense)... I will close this for now in this case. I will make a new issue though about the 'complementary external helper tools'. |
Nextflow supports this - maybe worth a try to allow users to also specify SRA input files?
The text was updated successfully, but these errors were encountered: