
tCNN a deep learning R package

Li Chen edited this page Apr 11, 2019 · 9 revisions

Background

With the development and decreasing cost of next-generation sequencing technologies, the study of the human microbiome has become an important research field with great potential for clinical applications such as drug response prediction, patient stratification, and disease diagnosis. It is therefore desirable to build predictors of clinical outcomes from microbiome profiles, which consist of microbial abundances and a phylogenetic tree. One important characteristic of microbiome data is that the number of microbial taxa p is far larger than the sample size n (p >> n), resulting in a high-dimensional problem. Another is that microbial species are related through the phylogenetic tree and cluster at different phylogenetic depths. The phylogenetic tree thus describes the structure of the microbiome, which may be important prior information for prediction. Sparse regression models such as Lasso[1] and Elastic Net[2], and popular machine learning models such as randomForest[3], have been widely used for high-dimensional prediction. However, the sparse models rely on a strong sparsity assumption, and none of these methods considers the correlations among microbial taxa. Prediction methods that account for both the phylogenetic tree and high dimensionality in a flexible way are therefore underdeveloped.

With the development of artificial intelligence, convolutional neural networks (CNNs) offer a potential solution to the curse of dimensionality. Compared with a typical neural network (NN), a CNN can incorporate feature correlations through the convolutional kernels in its convolutional layers. We develop a deep learning prediction method, "Tree-regularized Convolutional Neural Network" (tCNN), for microbiome-based prediction. The advantage of tCNN is that it uses convolutional kernels to capture the signals of microbial species with a close evolutionary relationship within a local receptive field. Moreover, tCNN uses different convolutional layers to capture different taxonomic ranks (e.g. species, genus, family). Together, the convolutional layers and their kernels capture microbiome signals at different taxonomic levels while encouraging local smoothing induced by the phylogenetic tree. ‘tCNN’ is implemented as a user-friendly R package based on TensorFlow. We envision that this work will benefit the machine learning and statistics communities in bioinformatics research.
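As a rough illustration of the local-receptive-field idea described above, the following sketch slides a one-dimensional kernel over abundances and then pools the result to mimic moving to a coarser taxonomic rank. It is written in plain Python for brevity and is not the tCNN implementation. The ordering of taxa (assumed to follow a traversal of the phylogenetic tree, so that neighbors are close relatives), the abundance values, the kernel weights, and the helper names `conv1d` and `max_pool1d` are all illustrative assumptions.

```python
def conv1d(values, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation of values with kernel."""
    k = len(kernel)
    return [
        sum(values[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(values) - k + 1, stride)
    ]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Hypothetical relative abundances for 8 taxa, assumed to be ordered by a
# traversal of the phylogenetic tree so that neighbors are close relatives.
abundance = [0.30, 0.25, 0.05, 0.00, 0.00, 0.10, 0.20, 0.10]

# An averaging kernel of width 2 smooths over adjacent (related) taxa.
kernel = [0.5, 0.5]

fine = conv1d(abundance, kernel, stride=2)  # one value per pair of taxa
coarse = max_pool1d(fine, size=2)           # one value per group of four taxa

print(fine)
print(coarse)
```

In a trained network the kernel weights would be learned rather than fixed, and several kernels (channels) would typically be stacked per layer; the fixed averaging kernel here only makes the smoothing effect easy to see.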

Related work

glmnet, an R package that implements sparse regression models such as Lasso and Elastic Net, does not consider the correlations among microbial taxa in the tree and relies on a strong sparsity assumption[4]. glmgraph[5], an R package for microbiome-based prediction that accounts for correlated taxa in the phylogenetic tree, was developed with support from Google Summer of Code 2014 R-projects. However, glmgraph suffers from dramatically increased computation time as the number of microbial species grows, and from convergence issues when the outcome is binary. randomForest, another R package used for microbiome-based prediction, does not consider the tree structure either.

Though R has interfaces to Keras[6] for implementing CNNs productively through the high-level Keras and Estimator APIs, these APIs are limited when fine-grained operations are needed. In contrast, TensorFlow[7] offers a lower-level, more basic approach that is more flexible for designing a CNN for a specific research problem. In this project, we develop an R package, ‘tCNN’, based on TensorFlow, which incorporates the tree structure into the prediction task without the sparsity assumption. ‘tCNN’ aims to outperform existing methods in both prediction accuracy and computation time on both simulated and real datasets.

Details of your coding project

The purpose of this work is to provide R users with a package for high-dimensional prediction tasks that accounts for correlated features in a pre-defined tree or graph, with applications in microbiome research. The R package will be implemented on top of the TensorFlow API.

  1. Design the convolutional layers to incorporate the tree structure, and design the convolutional kernels to capture the local correlations of microbial species in the tree. Optimize the training process of the CNN using different optimization algorithms.
  2. Test the prediction performance of the CNN against competing methods on simulated datasets.
  3. Test the prediction performance of the CNN against competing methods on real datasets[8,9].
  4. Write the R package with detailed documentation in vignettes and submit it to CRAN.
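One possible shape for the simulated datasets in step 2 is sketched below, in plain Python for illustration. It is an assumed design, not the project's actual simulation protocol: the outcome is driven by the total abundance of one contiguous block of taxa (standing in for a clade of related species), the number of taxa exceeds the number of samples, and the function name `simulate` and all parameter values are hypothetical.

```python
import random


def simulate(n_samples=20, n_taxa=200, clade=range(40, 60), effect=1.0, seed=0):
    """Simulate relative abundances and a binary outcome driven by one clade."""
    rng = random.Random(seed)
    X, scores = [], []
    for _ in range(n_samples):
        raw = [rng.expovariate(1.0) for _ in range(n_taxa)]
        total = sum(raw)
        row = [v / total for v in raw]  # relative abundances sum to 1
        # Signal: total abundance of the clade, rescaled by n_taxa so that
        # it is comparable in size to the unit-variance noise added below.
        signal = effect * n_taxa * sum(row[j] for j in clade)
        X.append(row)
        scores.append(signal + rng.gauss(0.0, 1.0))
    # Split at the upper median so the two classes are balanced.
    cutoff = sorted(scores)[n_samples // 2]
    labels = [1 if s >= cutoff else 0 for s in scores]
    return X, labels


X, labels = simulate()
print(len(X), len(X[0]), sum(labels))
```

Because the signal sits in adjacent taxa, a method that exploits the tree ordering should, under this assumed design, have an advantage over one that treats taxa as exchangeable; varying `clade` width and `effect` gives a range of difficulty levels.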

Expected impact

Deep learning has been widely implemented in Python frameworks, which remains a barrier for R users who are not familiar with Python programming. Moreover, few R packages for deep learning have been developed, especially for high-dimensional prediction tasks in microbiome research. This proposed work aims to provide a user-friendly R package that will be of interest to statisticians and computer scientists. Once the R package is developed, we will try to submit the work to an academic journal such as Oxford Bioinformatics.

Mentors

  • Jun Chen Chen.Jun2@mayo.edu (https://www.mayo.edu/research/faculty/chen-jun-ph-d/bio-20126134) is an Associate Professor in the Division of Biomedical Statistics and Informatics at Mayo Clinic. His research concerns the development and application of powerful and robust statistical methods for high-dimensional "omics" data arising from modern high-throughput technologies such as microarrays and next-generation sequencing. He has successfully mentored students in previous Google Summer of Code programs. He has developed CRAN R packages such as CpGFilter, glmgraph, structFDR, GUniFrac, and SmartSVA.
  • Li Chen lzc0061@auburn.edu (https://lichen-lab.github.io/) is an Assistant Professor in Pharmacy and Computer Science at Auburn University. His research focuses on developing statistical and informatics methods for analyzing large-scale omics and population-based epidemiological data. He has successfully participated in previous Google Summer of Code programs. He has developed CRAN R packages such as glmgraph and R/Bioconductor packages such as ChIPComp and traseR.

Tests for potential students

Easy:

  • Can you explain what a neural network is?
  • Why do we need to use an activation function in a neural network?
  • What is the convolution operation?
  • Can you install TensorFlow on your machine and implement a simple CNN on MNIST using the TF Estimator API, following the official documentation?
  • What is linear regression?
  • Can you explain convolution kernel, channel, stride, window size, and padding?
  • Can you explain pooling?
  • What is softmax?

Medium:

  • What are overfitting and underfitting?
  • What are L1 and L2 regularization?
  • What should we do if the loss doesn’t converge?
  • Can you implement a simple CNN without the Estimator API?
  • What is the purpose of the learning rate?
  • Can you explain SGD?

Hard:

  • How can you design simulations to test if the model works?
  • What should be the main component of our package?

Solutions of tests

Students, please post a link to your test results here.

Students

Ye Wang, 4th-year PhD student in Computer Science at the Department of Computer Science and Software Engineering, Auburn University, AL, USA.

Reference

  1. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
  2. Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301-320.
  3. Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, 1, 278-282.
  4. Hastie, T. and Qian, J. (2014). glmnet vignette. Retrieved from http://www.web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf. Accessed September 20, 2016.
  5. Chen, L., Liu, H., Kocher, J.P.A., Li, H. and Chen, J. (2015). glmgraph: an R package for variable selection and predictive modeling of structured genomic data. Bioinformatics, 31, 3991-3993.
  6. Chollet, F. et al. (2015). Keras. https://keras.io
  7. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ..., Kudlur, M. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, 265-283.
  8. Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., ..., Heath, A. C. (2012). Human gut microbiome viewed across age and geography. Nature, 486, 222.
  9. Smith, M. I., Yatsunenko, T., Manary, M. J., Trehan, I., Mkakosya, R., Cheng, J., ..., Liu, J. (2013). Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science, 339, 548-554.