pyVAE

Dataframe dimensionality reduction with a VAE
Summary

pyVAE is a modification of popVAE (manuscript; GitHub) designed to fit a variational autoencoder (VAE) to general multi-dimensional data (e.g., transcriptome expression data) and output the latent space.

This repository is forked from popVAE and can also be used to install an archived version of popVAE (v0.1). If you use pyVAE, please cite the manuscript describing popVAE (see Citation below).


What is a VAE?

A VAE is a machine-learning method that uses neural networks to learn a latent representation of data. It involves two steps, an encoding step and a decoding step (see image below). For the purposes of pyVAE, we use the majority of our data to train a model and -- once trained -- we can re-input our data into the model to view it in reduced-dimension latent space.
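For intuition only, below is a minimal sketch of the encoder / reparameterization / decoder idea in tf.keras. It is not pyVAE's actual architecture; the input dimension, layer sizes, and loss weighting are arbitrary assumptions for illustration.

import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, latent_dim = 100, 2   # arbitrary example sizes, not pyVAE defaults

# Encoder: compress each sample to the mean and log-variance of a latent distribution
inputs = layers.Input(shape=(input_dim,))
h = layers.Dense(64, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

# Reparameterization trick: sample z from N(z_mean, exp(z_log_var))
def sample_z(args):
    mean, log_var = args
    eps = tf.keras.backend.random_normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: reconstruct the original features from the latent coordinates
h_dec = layers.Dense(64, activation="relu")(z)
outputs = layers.Dense(input_dim)(h_dec)

vae = Model(inputs, outputs)

# Loss = reconstruction error + KL divergence pulling the latent space toward a standard normal
recon = tf.reduce_mean(tf.reduce_sum(tf.square(inputs - outputs), axis=-1))
kl = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
vae.add_loss(recon + kl)
vae.compile(optimizer="adam")

# After training (vae.fit(x, epochs=...)), the encoder's mean output is the
# reduced-dimension latent view described above
encoder = Model(inputs, z_mean)

Once such a model is fit, passing the data back through the encoder yields the reduced-dimension coordinates used for visualization.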

Importantly, VAEs maintain what is known as "global structure", meaning that the distances between points in latent space remain meaningful. Other machine-learning methods for dimensionality reduction that have become common in fields such as single-cell RNAseq, notably UMAP and t-SNE, are useful for data visualization but fail to maintain global structure. They are therefore not useful for downstream analyses and can even distort the data to produce erroneous results (Chari et al. 2022).

Install

pyVAE requires python 3.7 and tensorflow 1.15. We recommend first installing anaconda3, then installing pyVAE in a new conda environment.

Clone this repo and install with:

conda create --name pyVAE python=3.7.7
conda activate pyVAE
git clone https://github.com/rhettrautsaw/pyVAE.git
cd pyVAE
python setup.py install

Run

pyVAE requires input in tab-delimited txt format.
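As a rough sketch of preparing such a file with pandas (the layout of samples as rows and features as columns, with a header and sample IDs, is my assumption for illustration rather than documented pyVAE behavior):

import numpy as np
import pandas as pd

# Hypothetical expression matrix: rows = samples, columns = features.
# The exact layout pyVAE expects is an assumption here, not documented behavior.
rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(20, 100)),
    index=[f"sample_{i}" for i in range(20)],
    columns=[f"gene_{j}" for j in range(100)],
)

# Write a tab-delimited txt file
df.to_csv("expression_matrix.txt", sep="\t")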

SECTION IN PROGRESS

Working on making a small test dataset. Generally you will fit a model with:

pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf --out out/pabu_test --seed 42

It should fit in less than a minute on a regular laptop CPU. For running on larger datasets we recommend using a CUDA-enabled GPU.

Output

At default settings pyVAE will output 4 files:
pabu_test_latent_coords.txt -- best-fit latent space coordinates by sample.
pabu_test_history.txt -- training and validation loss by epoch.
pabu_test_history.pdf -- a plot of training and validation loss by epoch.
pabu_test_training_preds.txt -- latent coordinates output during model training, stored every --prediction_freq epochs.
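A quick way to inspect these outputs in Python (assuming the txt files are tab-delimited with a header row, which I have not verified against pyVAE's exact output format):

import pandas as pd

# Training and validation loss by epoch
history = pd.read_csv("out/pabu_test_history.txt", sep="\t")
print(history.tail())

# Best-fit latent coordinates per sample (column names not verified)
latent = pd.read_csv("out/pabu_test_latent_coords.txt", sep="\t")
print(latent.head())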

Parameters

Many hyperparameters and filtering options can be adjusted at the command line. Run pyVAE.py --h to see all parameters.

Default settings work well on most datasets, but validation loss can usually be improved by tuning hyperparameters. We've seen the largest effects from changing three settings: network size, early stopping patience, and the proportion of samples used for model training versus validation.

--search_network_sizes runs short optimizations for a range of network sizes and selects the network with the lowest validation loss. Alternatively, --depth and --width set the number of layers and the number of hidden units per layer in the network. If you're running low on GPU memory, reducing --width will help.

--patience sets the number of epochs the optimizer will run after the last improvement in validation loss -- we've found that increasing this value (to, say, 300) sometimes helps with small datasets.

--train_prop sets the proportion of samples used for model training, with the rest used for validation.

To run a grid search over a specific set of network sizes with increased patience and a larger validation set on the test data, use:

pyVAE.py --infile data/pabu/pabu_test_genotypes.vcf \
--out out/pabu_test --seed 42 --patience 300 \
--search_network_sizes --width_range 32,256,512 \
--depth_range 3,5,8 --train_prop 0.75

Plotting

I recommend using ggpubr (an R package) for plotting the results of pyVAE.

SECTION IN PROGRESS

pabu_test_latent_coords.txt
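Until this section is finished, here is a hedged Python/matplotlib alternative to ggpubr for a basic latent-space scatter. The column names LD1 and LD2 are assumptions about the header of pabu_test_latent_coords.txt and may need to be changed to match the actual output.

import pandas as pd
import matplotlib.pyplot as plt

# Load latent coordinates; "LD1"/"LD2" column names are assumed, not verified
latent = pd.read_csv("out/pabu_test_latent_coords.txt", sep="\t")

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(latent["LD1"], latent["LD2"], s=20)
ax.set_xlabel("LD1")
ax.set_ylabel("LD2")
ax.set_title("pyVAE latent space")
fig.tight_layout()
fig.savefig("pabu_test_latent_space.pdf")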

Citation

If you use pyVAE, please cite popVAE:

Battey CJ, Coffing GC, Kern AD. 2021. Visualizing population structure with variational autoencoders. G3 Genes|Genomes|Genetics. 11(1):jkaa036.

as well as my own paper:

IN PREP
