Infer strand-specificity from sub-sampled BAM #197

Closed
drpatelh opened this issue Apr 28, 2019 · 5 comments

@drpatelh
Member

Strand-specificity is not always known beforehand, especially for public datasets where this information may not have been provided. Using the wrong strand information will mean that the results from HiSat2 and featureCounts will not be correct.

It should be possible to sub-sample fastq files, map these with STAR, and then use infer_experiment.py to work out the strand-specificity before running the steps mentioned above. For example, a cut-off of 70% can be used to infer the actual strandedness. This can then be passed to the rest of the pipeline.

At the very least, with the current set-up it would be good to check that the strand-specificity parameter matches the results of infer_experiment.py, maybe with some logging that returns a warning if this isn't the case. That said, I'm not sure how well this would work with HiSat2 because this information would already have been used to perform the alignments.
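
For illustration, a minimal sketch of what the cut-off logic could look like (the 70% threshold, the fallback to unstranded and the parsing of the standard infer_experiment.py output lines are all assumptions here, not agreed pipeline behaviour):

```python
import re

def guess_strandedness(infer_experiment_output: str, cutoff: float = 0.7) -> str:
    """Call strandedness from infer_experiment.py stdout using a simple cutoff.

    Assumes the usual two 'Fraction of reads explained by ...' lines, with the
    sense fraction reported first and the antisense fraction second.
    """
    fractions = [
        float(m.group(1))
        for m in re.finditer(r'explained by "[^"]+": ([0-9.]+)', infer_experiment_output)
    ]
    if len(fractions) != 2:
        raise ValueError("unexpected infer_experiment.py output")
    forward, reverse = fractions
    if forward >= cutoff:
        return "forward"
    if reverse >= cutoff:
        return "reverse"
    # illustrative fallback when neither fraction clears the cutoff
    return "unstranded"
```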

@ewels
Member

ewels commented Apr 28, 2019

How about just having a sub-sample option? I've thought about doing automatic strandedness detection before but avoided it because (a) it's quite complex for something that won't be used that often and (b) it has the potential to do bad things if it goes wrong (e.g. degraded datasets?).

A sub-sample option would be a lot simpler and would still allow manual specification pretty quickly and easily. We could also throw bigger, scarier warnings if the infer_experiment.py results look wrong... ☠️
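
For what it's worth, a minimal sketch of the sub-sampling step itself, assuming we just take the first 200K records of a gzipped FastQ file (a real implementation might use a dedicated tool or random sampling with a fixed seed instead):

```python
import gzip
import itertools

def subsample_fastq(fastq_gz: str, out_gz: str, n_reads: int = 200_000) -> None:
    """Write the first n_reads records of a gzipped FastQ file to a new file.

    Taking reads from the head of the file keeps things simple; a production
    pipeline would probably want random sampling instead.
    """
    with gzip.open(fastq_gz, "rt") as fin, gzip.open(out_gz, "wt") as fout:
        # a FastQ record is exactly four lines
        for line in itertools.islice(fin, n_reads * 4):
            fout.write(line)
```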

@nservant

Hi. We implemented the auto-detection in our production pipeline, for the reason mentioned by @drpatelh, by simply running a fast bowtie2 mapping on 200K reads, running infer_experiment.py, and writing a small parser that returns the strandedness (yes/no).
My feeling is that it makes sense for production purposes, when you have many samples per day and don't always have the strandedness information.
However, from our experience, infer_experiment.py can sometimes return an "unknown datatype", especially on old public datasets ... so it's good to keep a way to force the parameter.
Finally, we now infer strand-specificity on-the-fly, only if the stranded parameter is not set (the default behaviour). If the parameter is set, we directly use the value provided by the user.
Note that this could also be useful for processing samples with different strandedness in a single run.
In that case, however, it might be good to have a warning somewhere! See the sketch below.
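
As a rough sketch of the "only infer when the parameter is not set" behaviour described above (the parameter names, the "unknown" sentinel and the fallback to unstranded are illustrative, not actual pipeline options):

```python
import logging

log = logging.getLogger("strandedness")

def resolve_strandedness(user_value, inferred, sample):
    """Prefer an explicit user setting; otherwise fall back to the inferred value.

    'inferred' is whatever the infer_experiment.py parser returned; 'unknown'
    stands in for the ambiguous case mentioned above. All names and the
    fallback value are illustrative, not actual pipeline behaviour.
    """
    if user_value is not None:
        return user_value
    if inferred == "unknown":
        log.warning("%s: strandedness could not be inferred, falling back to 'unstranded'", sample)
        return "unstranded"
    log.info("%s: using inferred strandedness '%s'", sample, inferred)
    return inferred
```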

@ewels
Member

ewels commented Apr 28, 2019

Yes, we already have forward / reverse / unstranded, so we could add auto and make that the default, I guess. That should be pretty clear and wouldn't hurt set-ups such as ours, where we prep all the libraries and know what the strandedness should be. 👍

@drpatelh
Member Author

We have a similar implementation with STAR in our local pipeline too. It's something I suggested after being caught out a couple of times because we didn't have the appropriate kit information. Access to the appropriate metadata is key here, and this isn't always easy to come by...

I have seen the unknown datatype issue too, but in most instances it's spot on, and I agree that if the strandedness is calculated per sample then you aren't limited by a single setting. However, I've been thinking about this a lot lately, and I would argue that we should be explicit in suggesting how our pipelines are run i.e. per experiment/across experiments/it doesn't matter. I say this because downstream analyses like differential analysis can only work if the pipeline is run for samples that are relevant for a particular biological interpretation...

Anyway, this would all be much easier if the pipeline was just using STAR, because we might have been able to get away with checking that infer_experiment.py agrees and spitting out an appropriate message. However, a full-on sub-sampling implementation would be required for HiSat2 and other tools such as RSEM.

@apeltzer apeltzer added this to the 1.4 milestone Jul 11, 2019
@apeltzer apeltzer removed this from the 1.4 milestone Sep 22, 2019
@drpatelh drpatelh added this to the 1.5 milestone Sep 16, 2020
@drpatelh
Member Author

For now, instead of using a full-blown sub-sampling and auto-detection approach, I have decided to add a WARNING to the top of the MultiQC report that flags up any samples where the strandedness provided in the samplesheet isn't the same as that calculated by RSeQC infer_experiment.py. I think this should be fine, given that infer_experiment.py can sometimes report a large fraction of undetermined reads, for example if you have heavily contaminated samples:

[screenshot of the MultiQC report showing the strandedness warning]
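
A sketch of the kind of per-sample comparison such a warning could be built on (the input structure and message format are illustrative only):

```python
def strandedness_warnings(samples):
    """Compare provided vs inferred strandedness for each sample.

    'samples' maps sample name -> (samplesheet value, value inferred from
    infer_experiment.py); the structure is illustrative. Returns one warning
    line per mismatch, suitable for surfacing at the top of a report.
    """
    warnings = []
    for name, (given, inferred) in samples.items():
        if given != inferred:
            warnings.append(
                f"WARNING: {name}: samplesheet says '{given}' but "
                f"infer_experiment.py suggests '{inferred}'"
            )
    return warnings
```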
