This project is designed to take an amino acid sequence and convert it into a nucleotide sequence. The higher-level purpose of this project is to create a different view of codon bias and how we model it. Here, we attempt to model codon bias as a natural language processing problem, specifically, translation. Generating a language model that can translate an amino acid sequence string into a nucleotide sequence string provides a method for generating sequences that are optimized for the codon bias of a particular organism and can be inserted into vectors.
We use recurrent neural networks (RNNs) to accomplish this. Specifically, we use Long Short-Term Memory (LSTM) networks to learn nucleotide sequence encodings of arbitrary-length amino acid sequences. LSTMs provide a way to model the long-term dependencies that occur within sequences. Network training is best done on GPUs, which provide a significant speedup.
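As a rough illustration of the kind of model involved, here is a minimal encoder-decoder LSTM sketch in Keras. This is not the architecture used by detrans_train.py; the layer sizes, vocabulary sizes, and fixed padded length are illustrative assumptions. Note that each amino acid is encoded by one codon, so the nucleotide output is three times the length of the protein input:

```python
# Minimal encoder-decoder sketch in Keras (illustrative only; not the
# architecture used by detrans_train.py -- sizes and lengths are assumptions).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

MAX_AA = 500          # assumed maximum protein length (padded)
N_AA = 22             # 20 amino acids + padding + stop
N_NT = 5              # A, C, G, T + padding

model = Sequential()
# Encoder: read the amino acid sequence into a fixed-size state.
model.add(Embedding(N_AA, 64, input_length=MAX_AA))
model.add(LSTM(256))
# Each amino acid maps to one codon, so the output is three times as long.
model.add(RepeatVector(3 * MAX_AA))
# Decoder: emit a distribution over nucleotides at each output position.
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(N_NT, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```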
- Gather coding DNA sequences (CDS) for many species related to your vector to use as training data
- Train the network on these sequences
- Select a specific species (the vector you will be using) and keep only its highly expressed genes
- Fine-tune the network trained on many species using this subset of highly expressed genes for your vector
See below for a more comprehensive explanation of the steps, including the scripts to use for each part of the workflow.
- Fetch and prepare CDS data for training
    - Create a text file that contains the NCBI genome IDs for the species you will train with (see the example list after this section)
    - Use [entrez_fetch_genome.py](scripts/entrez_fetch_genome.py) to fetch genomes from NCBI.
    - Use [entrez_fetch_ft.py](scripts/entrez_fetch_ft.py) to fetch the feature tables for the specified species from NCBI.
    - Use [extract_cds_from_fasta.py](scripts/extract_cds_from_fasta.py) to extract the CDS from genomes using the feature tables. This also creates a fasta file containing the translated sequences. Note that this script removes any sequences that contain ambiguity codes; ambiguity codes are not supported by this application at this time (see the filtering sketch after this section).
    - Use [fasta_nlp.py](scripts/fasta_nlp.py) to convert the fasta CDS and amino acid sequence files to the correct format for training.
- Train the network
    - Use [detrans_train.py](networks/detrans_train.py) to train the general network
- Prepare data for one-shot learning
    - Select the species of interest (most likely the vector you're using).
    - Extract the CDS for the selected species.
    - Filter the CDS, keeping only highly expressed (or otherwise characteristic) sequences.
- One-shot learning
    - Use [detrans_one_shot.py](networks/detrans_one_shot.py) to fine-tune your network for your specific vector
- Detranslate proteins
    - Use [detrans_classify.py](networks/detrans_classify.py) to generate a nucleotide sequence from a polypeptide.
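As an example, a genome ID file plausibly holds one ID per line; the RefSeq accessions below (E. coli K-12 MG1655, B. subtilis 168, and P. putida KT2440) are illustrative only, so check entrez_fetch_genome.py for the exact ID format it expects:

```
NC_000913.3
NC_000964.3
NC_002947.4
```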
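The ambiguity-code filtering performed by extract_cds_from_fasta.py amounts to dropping any CDS whose nucleotides fall outside A/C/G/T. A standalone sketch of the same idea (an illustration, not the script itself):

```python
# Illustrative sketch: drop CDS entries containing IUPAC ambiguity codes
# (anything other than A, C, G, T). Not the actual extract_cds_from_fasta.py.
UNAMBIGUOUS = set('ACGT')

def filter_ambiguous(records):
    """Yield (header, seq) pairs whose sequences contain only A/C/G/T."""
    for header, seq in records:
        if set(seq.upper()) <= UNAMBIGUOUS:
            yield header, seq

records = [('>geneA', 'ATGGCTTAA'), ('>geneB', 'ATGNCTTAA')]  # N is ambiguous
print(list(filter_ambiguous(records)))  # only geneA survives
```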
Use the following steps for an end-to-end example of how to run detrans.
```bash
# Install dependencies; it is suggested you use virtualenv
pip install -r requirements.txt

# Create a list of genomes to use for training

# Fetch and prepare CDS data for training
scripts/entrez_fetch_genome.py args...
scripts/entrez_fetch_ft.py args...
scripts/extract_cds_from_fasta.py args...

# Format data for training
scripts/fasta_nlp.py args...

# Train the model
networks/detrans_train.py args...

# Prepare data for one-shot learning
scripts/fasta_nlp.py args...

# One-shot learning
networks/detrans_train.py --one_shot --load_model model_prefix args...

# Detranslate sequences of interest
networks/detrans_classify.py args...
```
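If you are curious what the fetch step does under the hood, an NCBI fetch like the one entrez_fetch_genome.py performs can be written in a few lines with Biopython. This is an illustration under assumptions, not the bundled script; the file name is hypothetical and Biopython is not in the dependency list below:

```python
# Illustrative Biopython sketch of an NCBI genome fetch; not the bundled
# entrez_fetch_genome.py script. Requires biopython (pip install biopython).
from Bio import Entrez

Entrez.email = 'you@example.com'  # NCBI requires a contact email

with open('genome_ids.txt') as f:          # hypothetical ID list file
    genome_ids = [line.strip() for line in f if line.strip()]

for gid in genome_ids:
    handle = Entrez.efetch(db='nuccore', id=gid, rettype='fasta', retmode='text')
    with open('{}.fasta'.format(gid), 'w') as out:
        out.write(handle.read())
    handle.close()
```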
- Consider training with the --gru option to use GRUs instead of LSTMs, as GRUs may train faster (see the sketch after this list)
- During one-shot training, use one (or a few) training epochs so that you don't overtrain your model
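In Keras, a GRU is a drop-in replacement for an LSTM layer, which is plausibly what the --gru flag toggles internally (an assumption; the snippet is illustrative, not detrans_train.py's code):

```python
# GRUs accept the same arguments as LSTMs in Keras, so swapping is one line.
from keras.layers import GRU, LSTM

use_gru = True  # what the --gru flag plausibly controls
RNNLayer = GRU if use_gru else LSTM
encoder = RNNLayer(256, return_sequences=False)
```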
- [keras](https://github.com/fchollet/keras)
    - Make sure that you're running the newest version of keras: `pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps`
- [theano](https://github.com/Theano/Theano)
- [scikit-learn](https://github.com/scikit-learn/scikit-learn)
- h5py
The authors would like to thank the following individuals and groups for their support in developing this project:
- BYU Deep-Learning Study Group
- Mike Brodie
- Aaron Dennis
- Derrall Heath
- Logan Mitchell
- Christopher Tensmeyer
- Alexander Lemon
- BYU Computational Science Laboratory
@masakistan (sfujimoto@gmail.com)