Infer strand-specificity from sub-sampled BAM #197

Closed
drpatelh opened this issue Apr 28, 2019 · 5 comments

@drpatelh
Member

Strand-specificity is not always known beforehand, especially for public datasets where this information may not have been provided. Using the wrong strand information will mean that the results from HiSat2 and featureCounts will not be correct.

It should be possible to sub-sample fastq files, map these with STAR, and then use infer_experiment.py to work out the strand-specificity before running the steps mentioned above. For example, a cut-off of 70% can be used to infer the actual strandedness. This can then be passed to the rest of the pipeline.

At the very least, with the current set-up it would be good to check that the strand-specificity parameter matches the results of infer_experiment.py, maybe with some logging that returns a warning if this isn't the case. That said, I'm not sure how well this would work with HiSat2 because this information would already have been used to perform the alignments.
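
For illustration, a minimal sketch of what the cut-off logic could look like (the 70% threshold, the fallback to unstranded and the parsing of the standard infer_experiment.py output lines are all assumptions here, not agreed pipeline behaviour):

```python
import re

def guess_strandedness(infer_experiment_output: str, cutoff: float = 0.7) -> str:
    """Call strandedness from infer_experiment.py stdout using a simple cutoff.

    Assumes the usual two 'Fraction of reads explained by ...' lines, with the
    sense fraction reported first and the antisense fraction second.
    """
    fractions = [
        float(m.group(1))
        for m in re.finditer(r'explained by "[^"]+": ([0-9.]+)', infer_experiment_output)
    ]
    if len(fractions) != 2:
        raise ValueError("unexpected infer_experiment.py output")
    forward, reverse = fractions
    if forward >= cutoff:
        return "forward"
    if reverse >= cutoff:
        return "reverse"
    # illustrative fallback when neither fraction clears the cutoff
    return "unstranded"
```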

@ewels
Member

ewels commented Apr 28, 2019

How about just having a sub-sample option? I've thought about doing automatic strandedness detection before but avoided it because (a) it's quite complex for something that won't be used that often and (b) it has the potential to do bad things if it goes wrong (e.g. degraded datasets?).

A sub-sample option would be a lot simpler and would still allow manual specification pretty quickly and easily. We could also throw bigger, scarier warnings if the infer_experiment.py results look wrong... ☠️
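
For what it's worth, a minimal sketch of the sub-sampling step itself, assuming we just take the first 200K records of a gzipped FastQ file (a real implementation might use a dedicated tool or random sampling with a fixed seed instead):

```python
import gzip
import itertools

def subsample_fastq(fastq_gz: str, out_gz: str, n_reads: int = 200_000) -> None:
    """Write the first n_reads records of a gzipped FastQ file to a new file.

    Taking reads from the head of the file keeps things simple; a production
    pipeline would probably want random sampling instead.
    """
    with gzip.open(fastq_gz, "rt") as fin, gzip.open(out_gz, "wt") as fout:
        # a FastQ record is exactly four lines
        for line in itertools.islice(fin, n_reads * 4):
            fout.write(line)
```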

@nservant

Hi. We implemented the auto-detection in our production pipeline, for the reason mentioned by @drpatelh, by simply running a fast bowtie2 mapping on 200K reads, running infer_experiment.py, and writing a small parser that returns the strandedness (yes/no).
My feeling is that it makes sense for production purposes, when you have many samples per day and don't always have the strandedness information.
However, from our experience, infer_experiment.py can sometimes return an "unknown datatype", especially on old public datasets ... so it's good to keep a way to force the parameter.
Finally, we now infer strand-specificity on-the-fly, only if the stranded parameter is not set (the default behaviour). If the parameter is set, we directly use the value provided by the user.
Note that this could also be useful for processing samples with different strandedness in a single run.
In that case, however, it might be good to have a warning somewhere! See the sketch below.
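
As a rough sketch of the "only infer when the parameter is not set" behaviour described above (the parameter names, the "unknown" sentinel and the fallback to unstranded are illustrative, not actual pipeline options):

```python
import logging

log = logging.getLogger("strandedness")

def resolve_strandedness(user_value, inferred, sample):
    """Prefer an explicit user setting; otherwise fall back to the inferred value.

    'inferred' is whatever the infer_experiment.py parser returned; 'unknown'
    stands in for the ambiguous case mentioned above. All names and the
    fallback value are illustrative, not actual pipeline behaviour.
    """
    if user_value is not None:
        return user_value
    if inferred == "unknown":
        log.warning("%s: strandedness could not be inferred, falling back to 'unstranded'", sample)
        return "unstranded"
    log.info("%s: using inferred strandedness '%s'", sample, inferred)
    return inferred
```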

@ewels
Member

ewels commented Apr 28, 2019

Yes, we already have forward / reverse / unstranded, so we could add auto and make that the default, I guess. That should be pretty clear and wouldn't hurt set-ups such as ours, where we prep all the libraries and know what the strandedness should be. 👍

@drpatelh
Member Author

We have a similar implementation with STAR in our local pipeline too. It's something I suggested after being caught out a couple of times because we didn't have the appropriate kit information. Access to the appropriate metadata is key here, and this isn't always easy to come by...

I have seen the unknown datatype issue too, but in most instances it's spot on, and I agree that if the strandedness is calculated per sample then you aren't limited by a single setting. However, I've been thinking about this a lot lately, and I would argue that we should be explicit in suggesting how our pipelines are run i.e. per experiment/across experiments/it doesn't matter. I say this because downstream analyses like differential analysis can only work if the pipeline is run for samples that are relevant for a particular biological interpretation...

Anyway, this would all be much easier if the pipeline was just using STAR, because we might have been able to get away with checking that infer_experiment.py agrees and spitting out an appropriate message. However, a full-on sub-sampling implementation would be required for HiSat2 and other tools such as RSEM.

@apeltzer apeltzer added this to the 1.4 milestone Jul 11, 2019
@apeltzer apeltzer removed this from the 1.4 milestone Sep 22, 2019
@drpatelh drpatelh added this to the 1.5 milestone Sep 16, 2020
@drpatelh
Member Author

For now, instead of using a full-blown sub-sampling and auto-detection approach, I have decided to add a WARNING to the top of the MultiQC report that flags up any samples where the strandedness provided in the samplesheet isn't the same as that calculated by RSeQC infer_experiment.py. I think this should be fine, given that infer_experiment.py can sometimes report a large fraction of undetermined reads, for example if you have heavily contaminated samples:

[screenshot of the MultiQC report showing the strandedness warning]
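
A sketch of the kind of per-sample comparison such a warning could be built on (the input structure and message format are illustrative only):

```python
def strandedness_warnings(samples):
    """Compare provided vs inferred strandedness for each sample.

    'samples' maps sample name -> (samplesheet value, value inferred from
    infer_experiment.py); the structure is illustrative. Returns one warning
    line per mismatch, suitable for surfacing at the top of a report.
    """
    warnings = []
    for name, (given, inferred) in samples.items():
        if given != inferred:
            warnings.append(
                f"WARNING: {name}: samplesheet says '{given}' but "
                f"infer_experiment.py suggests '{inferred}'"
            )
    return warnings
```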
