How to bring in custom annotations (BSgenome, TxDb)? #19

Open
mirax87 opened this issue Dec 10, 2020 · 4 comments

Comments

mirax87 commented Dec 10, 2020

Hi,

thanks for this interesting tool. I am currently trying to get Ularcirc to run with some of my data.

Unfortunately, the reference genome I used for the alignments doesn't follow the UCSC chromosome naming conventions, so I thought of creating my own BSgenome and TxDb. I have already forged the BSgenome; the TxDb is yet to come.

For now, with the BSgenome loaded into the namespace, I tried to find it in the Shiny app under Setup configuration. My custom BSgenome was not listed. I imagine this is due to the missing TxDb (yet to be produced).

My question for you:
Is it already possible to bring in a custom genome + annotation and, if so, how can I achieve that?

best,
-Michael

Collaborator

davhum commented Dec 11, 2020

In theory it should be possible to bring in a custom genome + annotation. However, it requires that an annotation database is available: Ularcirc first searches for annotation database libraries that are named as follows:

org.<two-letter code>.eg.db

so for humans this is

org.Hs.eg.db

The two-letter code is then used to identify matching genome and transcript databases.

If an annotation database library exists for your organism, then it sounds like you are very close to having all the required items.
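
To make the naming convention above concrete, here is a minimal Python sketch that pulls the two-letter code out of a package name of the form org.XX.eg.db. It only illustrates the convention described in this comment; the function name and the regular expression are assumptions made for illustration and are not taken from Ularcirc's own code.

```python
import re

# Assumed pattern for the naming convention described above, e.g. "org.Hs.eg.db".
ORG_DB_PATTERN = re.compile(r"^org\.([A-Za-z]{2})\.eg\.db$")


def organism_code(package_name):
    """Return the two-letter organism code, or None if the name does not match."""
    match = ORG_DB_PATTERN.match(package_name)
    return match.group(1) if match else None


print(organism_code("org.Hs.eg.db"))          # Hs
print(organism_code("BSgenome.dm6.ensembl"))  # None
```

A package whose name does not follow the pattern returns None, which is one way to see why a custom genome package alone would not be picked up by a lookup keyed on this code.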

Author

mirax87 commented Feb 3, 2021

What about the BSgenome and TxDb? They seem to be mandatory as well. Also, where does the annotation database need to be located? It's checking somewhere online, right?

If a local installation of the database is possible, it would be great to have a wrapper where the user provides the genome FASTA, the genome annotation file (e.g. GTF), and whatever else might be necessary, to bring in custom annotations suitable for Ularcirc.
Would that be feasible?

Collaborator

davhum commented Feb 3, 2021

Agreed, having a wrapper is a good idea, but I am unsure of what is involved for some of those files. I have experience in making a TxDb from a GTF, but have not generated a genome or annotation database. You mentioned you had generated the genome file; was that easy to do? I suspect the annotation database is the most involved.

Perhaps another solution to your problem is to convert your alignment coordinates to UCSC coordinates. I could make a wrapper for that. If you could generate a small test dataset, I could write a simple method to convert it to a format that is compatible with existing databases.
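
As a rough idea of what such a conversion could look like, here is a minimal Python sketch that renames chromosome fields in tab-separated lines to UCSC-style names. Everything in it is an assumption for illustration: the function names, the override table, and the choice of which columns hold chromosome names. Names that differ by more than the "chr" prefix (unplaced scaffolds, for example) would need explicit entries in the override table.

```python
# Hypothetical overrides for names where adding "chr" is not enough.
# Check these against the actual sequence names of both genome builds.
OVERRIDES = {"MT": "chrM", "mitochondrion_genome": "chrM"}


def to_ucsc_name(name, overrides=OVERRIDES):
    """Map one sequence name to a UCSC-style name."""
    if name in overrides:
        return overrides[name]
    if name.startswith("chr"):
        return name  # already UCSC-style, leave untouched
    return "chr" + name


def convert_line(line, chrom_columns=(0,)):
    """Rename the chromosome fields of one tab-separated line.

    chrom_columns lists the zero-based columns that hold chromosome names;
    which columns apply depends on the file format and is assumed here.
    """
    fields = line.rstrip("\n").split("\t")
    for idx in chrom_columns:
        fields[idx] = to_ucsc_name(fields[idx])
    return "\t".join(fields)


print(convert_line("2L\t100\t200\n"))
```

Only the names change; the coordinates are left as they are, which assumes both references contain the same underlying sequences.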

Author

mirax87 commented Feb 3, 2021

I thought about converting the alignments, or even remapping, but the downstream effects of the conversion would be too costly for me, as I am using more tools for circRNA prediction and quantification (mostly from the CIRI world). Thank you for the offer, though.

Regarding the BSgenome, I think it's not too tricky and believe it can be automated (in a wrapper). BSgenome has some documentation on how to forge a new one. In brief, you create a sort of dictionary (seed.dcf) with all relevant BSgenome information and compile it with BSgenome::forgeBSgenomeDataPkg. There are more forums and discussions around that can be of help. Here is the BSgenome documentation, check for 'How to forge a BSgenome data package'.

This is what the seed.dcf file looks like in my case, but I cannot guarantee that these are the minimum specs:

Package: BSgenome.dm6.ensembl
Title: "dm6 from local repository"
Description: "compatible with snakePipes alignments"
Version: 0.999                                            # random number
organism: Drosophila_melanogaster
common_name: Fruitfly
provider: FlyBase
provider_version: dm6
release_name: dm6
release_date: 2018_03
source_url: <path to fasta directory>
organism_biocview: dm6_ensembl
BSgenomeObjname: dm6_ensembl
seqs_srcdir: <path to fasta directory>
seqfile_name: genome.2bit                                  # genome in 2bit

Genome fasta to 2bit conversion
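
Since a wrapper was discussed above, here is a minimal Python sketch of how the seed file could be generated from a set of fields. The field names and values are copied from the listing above; the function name render_seed is hypothetical, the path placeholders are left unfilled on purpose, and the annotations after # in the listing are dropped on the assumption that they were notes for the reader rather than part of the values.

```python
def render_seed(fields):
    """Render a dict of seed fields as 'Key: value' lines, one per field."""
    lines = []
    for key, value in fields.items():
        text = str(value)
        if "\n" in text:
            raise ValueError(f"field {key!r} must be a single line")
        lines.append(f"{key}: {text}")
    return "\n".join(lines) + "\n"


# Values taken from the seed.dcf shown above; placeholders are kept as-is.
seed_fields = {
    "Package": "BSgenome.dm6.ensembl",
    "Title": '"dm6 from local repository"',
    "Description": '"compatible with snakePipes alignments"',
    "Version": "0.999",
    "organism": "Drosophila_melanogaster",
    "common_name": "Fruitfly",
    "provider": "FlyBase",
    "provider_version": "dm6",
    "release_name": "dm6",
    "release_date": "2018_03",
    "source_url": "<path to fasta directory>",
    "organism_biocview": "dm6_ensembl",
    "BSgenomeObjname": "dm6_ensembl",
    "seqs_srcdir": "<path to fasta directory>",
    "seqfile_name": "genome.2bit",
}

seed_text = render_seed(seed_fields)
print(seed_text)

# To save the result, one could write seed_text to a file named seed.dcf.
```

The resulting file would then be passed to BSgenome::forgeBSgenomeDataPkg in R, as described above.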
