Classify element for TIR and TSD #2

Open
hyphaltip opened this issue Jul 12, 2011 · 1 comment

@hyphaltip
Owner

Identify TSDs (target site duplications) and TIRs (terminal inverted repeats) for a putative element that passed the screen in the previous analysis step of the pipeline.

@ghost ghost assigned arensburger Jul 12, 2011
@arensburger
Collaborator

Uploaded a first draft of a script to address this issue: id_TIR_in_FASTA.pl. The script takes a FASTA sequence as input and returns a GFF-like file with all possible TIR locations in each sequence that are compatible with a specified set of constraints. It is purposely designed to find lots of hits; these will be narrowed down later by comparing the possible TSD/TIRs from different branches of the tree.

Here's the basic concept behind this script.

The script expects at least one FASTA file as input. This FASTA file is assumed 1) to be a section of a genome assembly, 2) to contain the sequence of a putative TE transposase, and 3) to include some sequence upstream and downstream of the transposase sequence, where TIRs and TSDs will be searched for. Optionally, the start and end of the transposase sequence can be specified in the FASTA title line; otherwise the script splits the sequence into two equal halves and looks for TIRs and TSDs in each half (this might be useful when dealing with MITEs later).
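To make the flank-splitting rule concrete, here is a minimal Python sketch (not the Perl script itself). The `start=... end=...` title convention and the function name `split_flanks` are assumptions made for illustration; the actual title format used by id_TIR_in_FASTA.pl is not described in this thread.

```python
import re

def split_flanks(title, seq):
    """Return (left_flank, right_flank) for one FASTA record.

    Assumed (hypothetical) title convention: ">name start=120 end=980",
    with 1-based, inclusive transposase coordinates.
    """
    m = re.search(r"start=(\d+)\s+end=(\d+)", title)
    if m:
        start, end = int(m.group(1)), int(m.group(2))
        # Everything before the transposase, and everything after it.
        return seq[:start - 1], seq[end:]
    # No coordinates given: fall back to two equal halves.
    mid = len(seq) // 2
    return seq[:mid], seq[mid:]
```

With coordinates the transposase itself is excluded from both flanks; without them the two halves together cover the whole sequence.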

The basic workflow is:

  1. Run a local BLAST between the two sequences flanking the transposase to identify possible TIRs.
  2. Look at the sequences directly adjacent to the TIRs as possible TSDs. If TSDs are allowed to include indels, generate sequences with all allowed combinations of insertions and deletions in the TSDs.
  3. Compare all observed and all possible TSDs, and select those that are similar enough given the allowed number of substitutions in the TSD sequence.
  4. Write the positions in GFF-like format.
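Steps 1 and 4 can be illustrated with a small Python sketch. Note that this swaps the local BLAST for a much simpler technique, exact k-mer matching between the left flank and the reverse complement of the right flank, so it only finds perfect seed matches. The function names, the `k` parameter, and the `sketch` source label in the output are all assumptions for illustration, not part of the actual script.

```python
def revcomp(s):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return s.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def find_tir_seeds(left, right, k=8):
    """Exact inverted-repeat seeds of length k (a stand-in for BLAST).

    Returns (left_start, right_start, kmer) tuples with 0-based starts:
    left[left_start:left_start + k] == kmer, and the reverse complement
    of right[right_start:right_start + k] == kmer.
    """
    rc = revcomp(right)
    index = {}
    for j in range(len(rc) - k + 1):
        index.setdefault(rc[j:j + k], []).append(j)
    hits = []
    for i in range(len(left) - k + 1):
        kmer = left[i:i + k]
        for j in index.get(kmer, []):
            # Map the position in the reverse complement back onto `right`.
            hits.append((i, len(right) - j - k, kmer))
    return hits

def gff_line(seqid, start0, end0, strand, feature="TIR"):
    """One GFF-like line; converts a 0-based half-open span to 1-based inclusive."""
    fields = [seqid, "sketch", feature, str(start0 + 1), str(end0), ".", strand, ".", "."]
    return "\t".join(fields)
```

Positions returned for the right flank are relative to that flank, so its offset within the full sequence would need to be added before writing the line.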

The TSD part is not very elegant, but given how few bases there are in a TSD, I don't see another way of dealing with indels than just brute force.
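A rough Python sketch of what the brute-force approach could look like: enumerate every sequence reachable from one observed TSD with a limited number of single-base insertions or deletions, then accept a pair if any variant is within the allowed number of substitutions of the other TSD. The names `indel_variants` and `tsd_match` and the default limits are hypothetical; the Perl script may organize this differently.

```python
def indel_variants(tsd, max_indels=1, alphabet="ACGT"):
    """All sequences reachable from `tsd` with at most `max_indels`
    single-base insertions or deletions (the original is included)."""
    seen = {tsd}
    frontier = {tsd}
    for _ in range(max_indels):
        nxt = set()
        for s in frontier:
            for i in range(len(s)):
                nxt.add(s[:i] + s[i + 1:])          # delete base i
            for i in range(len(s) + 1):
                for b in alphabet:
                    nxt.add(s[:i] + b + s[i:])      # insert b before position i
        frontier = nxt - seen
        seen |= nxt
    return seen

def hamming(a, b):
    """Number of mismatches, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))

def tsd_match(left_tsd, right_tsd, max_indels=1, max_subs=1):
    """True if some indel variant of left_tsd is within max_subs
    substitutions of right_tsd."""
    for v in indel_variants(left_tsd, max_indels):
        d = hamming(v, right_tsd)
        if d is not None and d <= max_subs:
            return True
    return False
```

Because TSDs are only a few bases long, the variant set stays small even though it grows quickly with `max_indels`.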

The next step is to take the TIR locations from different FASTA files and determine 1) which FASTA files have the same or similar TSDs, and 2) which have dissimilar TSDs. FASTA files with similar TIRs and dissimilar TSDs should be scored as having a high probability of being active.
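One possible shape for that comparison, as a hedged Python sketch: compare every pair of FASTA files and flag those whose TIRs are similar but whose TSDs differ. The similarity measure (`difflib.SequenceMatcher.ratio`), the thresholds `tir_min` and `tsd_max`, and the function name `likely_active` are placeholders chosen for illustration, not decisions made in this thread.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Crude 0..1 similarity between two strings (placeholder metric)."""
    return SequenceMatcher(None, a, b).ratio()

def likely_active(calls, tir_min=0.9, tsd_max=0.5):
    """calls maps a FASTA file name to its (tir_seq, tsd_seq) pair.

    Returns the names involved in at least one pair with similar TIRs
    (>= tir_min) and dissimilar TSDs (<= tsd_max).
    """
    active = set()
    for (n1, (tir1, tsd1)), (n2, (tir2, tsd2)) in combinations(calls.items(), 2):
        if similarity(tir1, tir2) >= tir_min and similarity(tsd1, tsd2) <= tsd_max:
            active.update((n1, n2))
    return active
```

A real scoring scheme would probably use an alignment-based identity and produce a graded score rather than a yes/no set.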
