Infer strand-specificity from sub-sampled BAM #197
Comments
How about just having a sub-sample option? I've thought about doing auto-strand detection before but avoided it because (a) it's quite complex for something that won't be used loads and (b) it has the potential to do bad things if it goes wrong (e.g. degraded datasets?). A sub-sample option would be a lot simpler and would allow manual specification pretty quickly and easily. Could also throw bigger scarier warnings if the infer-experiment results look wrong... ☠️
Hi. We implemented the auto-detection on our production pipeline, for the reason mentioned by @drpatelh, by simply running a bowtie2 fast mapping on 200K reads, running infer_experiment.py and writing a small parser that returns stranded: yes/no.
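A minimal sketch of what such a parser could look like, assuming the standard report format written by RSeQC's `infer_experiment.py`; the function name and the 0.8 cut-off below are illustrative choices, not values from the comment:

```python
# Sketch of a small parser for infer_experiment.py (RSeQC) output.
# Assumes the usual two 'Fraction of reads explained by "...": <frac>'
# lines; the 0.8 cut-off is an illustrative assumption.
import re
import sys

def is_stranded(report_path, cutoff=0.8):
    fractions = []
    with open(report_path) as fh:
        for line in fh:
            match = re.search(r'Fraction of reads explained by "[^"]+": ([\d.]+)', line)
            if match:
                fractions.append(float(match.group(1)))
    if len(fractions) != 2:
        return "unknown"  # unexpected report, e.g. a degraded dataset
    return "yes" if max(fractions) >= cutoff else "no"

if __name__ == "__main__":
    print(is_stranded(sys.argv[1]))
```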
Yes, we already have …
We have a similar implementation with … I have seen the unknown datatype issue too, but for most instances it's spot on, and I agree that if the strandedness is calculated per sample then you aren't limited by a single setting. However, I've been thinking about this a lot lately, and I would argue that we should be explicit in suggesting how our pipelines are run, i.e. per experiment/across experiments/it doesn't matter. I say this because downstream analyses like differential analysis can only work if the pipeline is run for samples that are relevant for a particular biological interpretation... Anywho, this would all be much easier if the pipeline was just using …
For now, instead of using a full-blown sub-sampling and auto-detection approach, I have decided to add a WARNING to the top of the MultiQC report that flags up any samples where the strandedness provided in the samplesheet isn't the same as that calculated by RSeQC.
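For illustration only (this is not the pipeline's actual code), such a check could look roughly like the sketch below, assuming strandedness labels of `forward`/`reverse`/`unstranded` per sample:

```python
# Hypothetical check: flag samples where the samplesheet strandedness
# disagrees with the value inferred by RSeQC. Function and label names
# are assumptions for this sketch.
def strandedness_warnings(samplesheet, inferred):
    """Both arguments map sample id -> 'forward' / 'reverse' / 'unstranded'."""
    warnings = []
    for sample, provided in samplesheet.items():
        detected = inferred.get(sample)
        if detected is not None and detected != provided:
            warnings.append(
                f"{sample}: samplesheet says '{provided}' "
                f"but RSeQC inferred '{detected}'"
            )
    return warnings

# Example:
# strandedness_warnings({"s1": "forward"}, {"s1": "reverse"})
# -> ["s1: samplesheet says 'forward' but RSeQC inferred 'reverse'"]
```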
Strand-specificity is not always known beforehand, especially for public datasets where this information may not have been provided. Using the wrong strand information will mean that the results from `HiSat2` and `featureCounts` will not be correct.

It should be possible to sub-sample fastq files, map these with `STAR`, and then use `infer_experiment.py` to work out the strand-specificity before running the steps mentioned above. For example, a cut-off of 70% can be used to infer the actual strandedness. This can then be passed to the rest of the pipeline.

At the very least, with the current set-up it would be good to check that the strand-specificity parameter matches the results of `infer_experiment.py`, maybe with some logging info that returns a warning if this isn't the case. Although, I'm not sure how well this will work with `HiSat2` because this information would already have been used to perform the alignments.