Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User specified splicing sites #197

Closed
wants to merge 8 commits into from
Closed

User specified splicing sites #197

wants to merge 8 commits into from

Conversation

Piezoid
Copy link
Contributor

@Piezoid Piezoid commented Jul 6, 2018

This PR is mostly a feature request with a PoC implementation.

The bioinformatics team at the french sequencing center (Genoscope), working on MinION RNA-seq data, have expressed the need of specifying the splicing sites from an external source. In their de novo workflow, they infer splicing sites by aligning cDNA SR on genomics contigs. They'd like to leverage this information for improving the alignment of cDNA and directRNA ONT reads on their assembly.
With high error rate, false positives of inferred splicing sites may cause exons to be misaligned, yielding the wrong isoform. There is also the rare cases of non canonical sites.

This PR loads splicing sites from a BED containing introns intervals or a TSV file listing unpaired splicing sites. The file is indicated on the command line with the '-u' flag. The BED format is compatible with files from UCSC. The TSV format is more proprietary and use the minimap2's internal coordinates (-1bp compared to BED's introns intervals). The TSV file has a column indicating wether the site is gap opening of closing.

Sites with contig name not in the reference index are discarded. The splicing coordinates are stored and sorted in one vector per reference sequence. The MSB bit of the coordinate is set if the site is gap opening (donor).
Multi-part indexes, are untested. In verbose mode it will complain about sites on contigs not found in the current part of the index.

Before each alignment, splicing sites for the region are retrieved and the corresponding nucleotides of the extracted reference sequence are tagged in their high bits part. This allows ksw_exts2_sse to remain unaware of sequence offsets. ksw_exts2_sse constructs it's donor and acceptor arrays by reading the tags and then untag the reference.
I'm not quite happy with all these contortions. There is a 5/10% loss in performance even without using specified sites.
The fact that ksw_exts2_sse uses contiguous 16-bytes aligned pool for its arrays and is unaware of sequence offsets complicates the parameter passing without additional copying and allocating.

It also support an "hybrid mode" where splicing sites are both inferred and specified. Each specified site score +noncan/2, while canonical inferred sites score's remains 0 and non-canonical -noncan.
I don't know if it's ok to use a positive score. My reasoning is that we add information by using externally supplied sites.

Initially in hybrid mode, the alignments with the sites inferred from the opposite strand of the transcript scored higher since they get more sites by adding the specified sites that may be on the opposite strand. The solution implemented is to remove the canonical sites on the opposite alignment strand from the list of specified sites.

I may add real world results when my colleagues are done with the evaluation.
Meanwhile, I'm open for comments and reviews.
Thanks !

Edit: Here is a pile-up of a small exon that got misaligned without knowledge of the gene model.
This is mouse brain sample direct RNA ONT read mapped on mm10. Top pile-up is generated with -ax splice -u introns.bed and bottom with -ax splice (original behaviour, the exon is missed). The region shown is chr9:37,544,827-37,546,222, an excerpt of the neurogranin gene.

@armintoepfer
Copy link
Contributor

To me, it sounds too specific for a generic aligner.

@Piezoid
Copy link
Contributor Author

Piezoid commented Jul 8, 2018

I tend to agree, but the heuristic for detecting splicing sites seems less generic than loading them from an external source.

@lh3
Copy link
Owner

lh3 commented Jul 16, 2018

I like the idea of user-defined splice sites. However, I may need to reimplement some part of this PR. I will probably remove the support of custom TSV, too. I will come back to this issue later. Thank you.

Piezoid and others added 8 commits November 9, 2018 16:23
In hybrid mode (-ub -uintrons.bed) user supplied splicing sites are added to list of canonical sites detected on one strand for the first alignement and then another alignement is made with canonical sites on the opposite strand. The best alignement is conserved.

When user sites for a transcript on one strand are added to the cannonical sites on the other strand this produces more splicing combinations and thus a better scoring but meaningless alignement.

This commit add logic discarding user provided sites that are canonical on the opposite strand from which the other cannonical sites are detected. This should help restore the balance and the detection of the correct transcript strand.
@lh3
Copy link
Owner

lh3 commented Apr 28, 2019

I have implemented the basic idea at HEAD. New option --junc-bed specifies introns (i.e. junctions). This file can be generated paftools.js gff2bed -j. Option --junc-bonus specifies the score bonus for a donor/acceptor site in --junc-bed.

Thanks for the suggestion. I am closing this PR. If you find bugs, please create a separate issue.

@lh3 lh3 closed this Apr 28, 2019
@lh3 lh3 added this to the 2.17 milestone Apr 29, 2019
@leleory
Copy link

leleory commented Apr 7, 2020

Hi Heng,
--junc-bed requires a BED12 file representing known annotation, but for de-novo assemblies such a file may not be available. Nevertheless, if RNA-seq data is available for the species that can also provide splice junction information. E.g. STAR would create an SJ.out.tab file with such information (or it would be possible to generate such file from the mapped RNA-seq reads). Would it be possible to modify the --junc-bed parameter so it would accept a simple bedfile representing possible introns as well?
Thank you,
Lel

@lh3
Copy link
Owner

lh3 commented Apr 7, 2020

--junc-bed accepts a list of introns in BED.

@lh3
Copy link
Owner

lh3 commented Apr 8, 2020

PS: see the manpage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants