
NEvolve Readme

About NEvolve

The advent of inexpensive next-generation genome sequencing (a set of DNA sequencing methodologies that capture the majority of information present in an individual's genetic code) has enabled rapid advances in our ability to estimate species' historical and contemporary population trends.

This is timely, as human activities threaten biodiversity on a global scale, which necessitates unprecedented levels of monitoring and intervention to mitigate a catastrophic number of extinctions.

To manage the health of at-risk species and allocate funding effectively, we need a robust understanding of species' current and historical population trends. For example, a recent decline in a species with a history of large population swings would likely be less concerning than a continuous decline in a species that historically had a large population. See example plots below.

Primary functionality

NEvolve is a neural network project that aims to classify and elucidate species' demographic histories using genomic data alone. Our code generates simulated genomic data under candidate demographic scenarios, uses supervised learning to train a convolutional neural network on those simulations, and then attempts to classify real-world seabird genomic data.
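As a rough illustration of the simulation step, the sketch below builds ms command lines for three demographic scenarios. The scenario names, parameter values, and the `build_ms_command` helper are assumptions made for this example and are not taken from the NEvolve code. It relies only on standard ms options: `-t` for the mutation parameter and `-eN t x`, which sets the population size to x times the present size at time t in the past.

```python
import shlex

# Hypothetical scenario labels mapped to extra ms flags (illustrative values).
# In ms, "-eN t x" sets the population size to x * N0 at time t, where time
# runs backward from the present in units of 4*N0 generations.
SCENARIOS = {
    "constant": [],
    "decline": ["-eN", "0.1", "10.0"],   # ancestral population 10x larger
    "expansion": ["-eN", "0.1", "0.1"],  # ancestral population 10x smaller
}


def build_ms_command(scenario, n_samples=20, n_reps=1000, theta=50.0):
    """Return the ms argument list for one labeled demographic scenario."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    base = ["ms", str(n_samples), str(n_reps), "-t", str(theta)]
    return base + SCENARIOS[scenario]


if __name__ == "__main__":
    for name in SCENARIOS:
        # Print the command; pass the list to subprocess.run to execute it.
        print(name, "->", shlex.join(build_ms_command(name)))
```

Because each command is tied to a scenario name, the name can serve as the class label for the simulations that command produces.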

We are particularly interested in assessing the history and current prospects of seabirds (see two example images below, black-legged kittiwake and Leach's storm-petrel). However, we hope to make our method generalizable to any species.

Conceptual flow chart

Current progress

Our pipeline consists of:

Simulation generation
- Using ms, a program written in C
- Outputs simulation results to a text file
Conversion
- Converts the text output of ms to HDF5
A preliminary convolutional neural network
- Uses PyTorch
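The conversion step can be sketched as follows. This is a minimal example, not the converter in `sim_conversions`: the function names, the dataset name `haplotypes`, and the zero-padding scheme are all assumptions. It assumes the standard ms text layout, in which each replicate starts with a `//` line followed by `segsites:`, `positions:`, and one row of 0/1 characters per sampled chromosome.

```python
def parse_ms(text):
    """Parse ms text output into a list of (positions, haplotypes) replicates."""
    replicates = []
    positions, haplotypes = [], []
    in_replicate = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("//"):
            # A "//" line opens a new replicate; store the previous one first.
            if in_replicate:
                replicates.append((positions, haplotypes))
            positions, haplotypes = [], []
            in_replicate = True
        elif not in_replicate or line.startswith("segsites:"):
            continue  # header lines (command, seeds) and segsites counts
        elif line.startswith("positions:"):
            positions = [float(x) for x in line.split()[1:]]
        elif line and set(line) <= {"0", "1"}:
            haplotypes.append([int(c) for c in line])
    if in_replicate:
        replicates.append((positions, haplotypes))
    return replicates


def pad_haplotypes(haplotypes, n_samples, width):
    """Truncate or zero-pad a replicate to a fixed n_samples x width matrix."""
    rows = [row[:width] + [0] * (width - len(row[:width])) for row in haplotypes]
    rows += [[0] * width for _ in range(n_samples - len(rows))]
    return rows[:n_samples]


def write_hdf5(path, replicates, n_samples, width, label):
    """Write padded replicates to one HDF5 dataset (requires h5py)."""
    import h5py  # third-party; imported lazily so parsing works without it

    data = [pad_haplotypes(h, n_samples, width) for _, h in replicates]
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("haplotypes", data=data, dtype="uint8",
                                compression="gzip")
        dset.attrs["label"] = label  # hypothetical scenario label


if __name__ == "__main__":
    example = "ms 2 1 -t 2.0\n1 2 3\n\n//\nsegsites: 2\npositions: 0.1 0.7\n01\n10\n"
    for pos, haps in parse_ms(example):
        print(len(pos), "segregating sites,", len(haps), "haplotypes")
```

Padding every replicate to the same shape is one simple way to get a rectangular array that HDF5 and a convolutional network can both consume; other choices, such as sorting or subsampling sites, are equally possible.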

Goals and future directions

Though all steps in our pipeline function, we have not yet streamlined the simulation conversion and network training steps. To train our network effectively, we will need to process at least a million simulations, and likely more for more complicated demographic inference. We hope to achieve a fast and efficient training pipeline by the end of the hackathon. We would specifically like to:

- Efficiently convert our simulations to conveniently batched HDF5 files
- Optimize data loading onto the GPU
- Improve network training speed and accuracy (i.e., hyperparameter tuning)
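One common way to approach the data-loading goal in PyTorch is to batch with a `DataLoader`, pin host memory when a GPU is present, and copy batches with `non_blocking=True`. The sketch below is an assumption about how this could look, not the project's loader: random in-memory tensors stand in for the HDF5 files, and the tensor shapes and the names `make_loader` and `batches_on_device` are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(haplotypes, labels, batch_size=64):
    """Wrap tensors in a DataLoader; pin memory only if CUDA is available."""
    dataset = TensorDataset(haplotypes, labels)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,  # raise this to load batches in background processes
        pin_memory=torch.cuda.is_available(),
    )


def batches_on_device(loader, device):
    """Yield batches moved to the target device."""
    for x, y in loader:
        # non_blocking only has an effect for pinned memory and a CUDA device.
        yield x.to(device, non_blocking=True), y.to(device, non_blocking=True)


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Stand-in data: 256 simulations, 1 channel, 20 haplotypes x 128 sites.
    x = torch.randint(0, 2, (256, 1, 20, 128)).float()
    y = torch.randint(0, 3, (256,))
    for xb, yb in batches_on_device(make_loader(x, y), device):
        print(xb.shape, yb.shape, xb.device)
        break
```

Whether pinned memory and extra workers help depends on the storage format and hardware, so the settings here are starting points to measure rather than recommendations.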

Future

NEvolve is currently designed to assess RADseq datasets, which are derived from reduced-representation sequencing and capture perhaps a couple of percent of the information in the genome. We would like to scale up to analyzing whole-genome sequence data. We also plan to shift from classifying simulations into discrete categories to predicting continuous values with a continuous output layer.
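To show what the move from categories to a continuous output involves, the sketch below uses a deliberately tiny network whose only difference between the two modes is the size of the final layer and the loss function. The architecture, the class name `DemographyCNN`, and the input shape are assumptions for illustration and do not describe the network in the `CNN` folder.

```python
import torch
from torch import nn


class DemographyCNN(nn.Module):
    """Toy CNN: n_outputs is the class count, or 1 for a continuous target."""

    def __init__(self, n_outputs):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse haplotype and site dimensions
            nn.Flatten(),
        )
        self.head = nn.Linear(8, n_outputs)

    def forward(self, x):
        return self.head(self.features(x))


if __name__ == "__main__":
    x = torch.randint(0, 2, (4, 1, 20, 128)).float()  # stand-in batch

    # Categorical mode: one logit per scenario, cross-entropy loss.
    classifier = DemographyCNN(n_outputs=3)
    class_loss = nn.CrossEntropyLoss()(classifier(x), torch.tensor([0, 1, 2, 0]))

    # Continuous mode: a single real-valued output, mean squared error loss.
    regressor = DemographyCNN(n_outputs=1)
    target = torch.rand(4, 1)  # e.g. a scaled demographic parameter
    reg_loss = nn.MSELoss()(regressor(x), target)

    print(class_loss.item(), reg_loss.item())
```

In the continuous mode the training labels would be the simulation parameters themselves rather than scenario names, so the simulation step would need to record them.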
