Drug discovery and high-throughput screening sub-section outline (#174)

This build is based on 855a2cb.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/205610680
https://travis-ci.org/greenelab/deep-review/jobs/205610681

[ci skip]

The full commit message that triggered this build is copied below:

Drug discovery and high-throughput screening sub-section outline (#174)

Outline high throughput screening sub-section in the Treat section and add tags for related references
greenelab · Feb 26, 2017 · 30efb95
1 parent 2cc9ab9
commit 30efb95

Showing 5 changed files with 486 additions and 14 deletions.
diff --git a/all-sections.md b/all-sections.md
@@ -730,23 +730,49 @@ feature selection and construction.*
 
 ### Ligand-Based Prediction of Bioactivity
 
-*Deep learning has been applied in some fashion to structure-based,
-compound-protein interaction-based, and ligand-based prediction problems
-with the overall goal of finding chemical compounds that impair protein
-activity.  AtomNet [@ref_2] is worth including as an example
-of how 3D convolutions can be used for structure-based modeling.  The main
-emphasis should be on ligand-based approaches.  There was substantial hype
-after the Merck Kaggle competition, which we can comment on.  Multitask
-networks have been impactful here.  There are also creative approaches
-for unsupervised and supervised learning with chemical compounds.  Neural
-networks have opened new avenues for representation learning.  These
-approaches may not be dominant on supervised learning
-tasks yet, but they are uniquely tailored to neural networks and have
-great future potential.*
+**TODO: expand outline**
+
+- Short introduction to problem, related reviews, use vHTS definition from
+[@ref_120] (vHTS doesn't fit neatly into classic classification,
+regression, or ranking)
+- Introduce ligand-based approaches, hype and excitement surrounding
+performance of a "high-parameter" network on the Merck Kaggle challenge,
+cover other neural networks trained on fingerprints or descriptors as features
+that followed, Tox21 Data Challenge
+- Multitask networks related to the above point
+- Realistic view of where things stand today, high-parameter networks struggle
+with overfitting, cross validation needs to be done carefully because of temporal
+structure [@ref_119], low parameter networks based on chemical
+similarity (IRV) work very well, especially well-suited for the domain in which
+training data can be limited and contains few positive instances, may touch on
+BACE example here and other discussions of training data limitations (e.g.
+[@ref_118])
+- "Creative experimentation" phase of the field, new ideas for representation
+learning and novel approaches including graph convolutions, autoencoders,
+one shot learning, and generative models
+- These "creative" approaches are definitely interesting but aren't necessarily
+outperforming existing methods, improvements on the software and
+reusability side could be important to help establish more rigorous
+benchmarking, DeepChem as example of this
+- Future outlook, what would need to happen for the "creative" approaches
+to overtake the current state of the art, can representation learning be
+improved by incorporating more information about chemical properties or
+even more "tasks" during training, how much will future growth depend on
+data versus algorithms
+- Future outlook part 2, how the above approaches relate to traditional
+methods like docking (note neural networks that include docking scores as
+features), deep learning efforts in this direction that use structure (e.g.
+[@ref_2 @ref_117]), "zero-shot learning",
+analogies to other domains where deep learning can capture the behavior
+of complex physics (e.g. quantum physics example), maybe briefly mention
+other compound-protein interaction-based networks although that doesn't seem
+to fit here and is somewhat out of scope
+- Future outlook part 3 (most speculative), what would successful generative
+networks mean for the HTS field?
 
 ### Modeling Metabolism and Chemical Reactivity
 
-Add a reveiw here of metabolism and chemical reactivity.
+*Add a review here of metabolism and chemical reactivity.*
 
 
 ## Discussion

diff --git a/bibliography.bib b/bibliography.bib
@@ -778,3 +778,79 @@ @article{ref_116
  year = {2012}
 }
 
+
+@article{ref_117,
+ abstract = {Computational approaches to drug discovery can reduce the time and cost
+associated with experimental assays and enable the screening of novel
+chemotypes. Structure-based drug design methods rely on scoring functions to
+rank and predict binding affinities and poses. The ever-expanding amount of
+protein-ligand binding and structural data enables the use of deep machine
+learning techniques for protein-ligand scoring.
+We describe convolutional neural network (CNN) scoring functions that take as
+input a comprehensive 3D representation of a protein-ligand interaction. A CNN
+scoring function automatically learns the key features of protein-ligand
+interactions that correlate with binding. We train and optimize our CNN scoring
+functions to discriminate between correct and incorrect binding poses and known
+binders and non-binders. We find that our CNN scoring function outperforms the
+AutoDock Vina scoring function when ranking poses both for pose prediction and
+virtual screening.},
+ archiveprefix = {arXiv},
+ author = {Matthew Ragoza and Joshua Hochuli and Elisa Idrobo and Jocelyn Sunseri and David Ryan Koes},
+ eprint = {1612.02751v1},
+ file = {1612.02751v1.pdf},
+ link = {http://arxiv.org/abs/1612.02751v1},
+ month = {12},
+ primaryclass = {stat.ML},
+ title = {Protein-Ligand Scoring with Convolutional Neural Networks},
+ year = {2016}
+}
+
+
+@article{ref_118,
+ abstract = {Recent advances in machine learning have made significant contributions to
+drug discovery. Deep neural networks in particular have been demonstrated to
+provide significant boosts in predictive power when inferring the properties
+and activities of small-molecule compounds. However, the applicability of these
+techniques has been limited by the requirement for large amounts of training
+data. In this work, we demonstrate how one-shot learning can be used to
+significantly lower the amounts of data required to make meaningful predictions
+in drug discovery applications. We introduce a new architecture, the residual
+LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over
+small-molecules. We open source all models introduced in this work as part of
+DeepChem, an open-source framework for deep-learning in drug discovery.},
+ archiveprefix = {arXiv},
+ author = {Han Altae-Tran and Bharath Ramsundar and Aneesh S. Pappu and Vijay Pande},
+ eprint = {1611.03199v1},
+ file = {1611.03199v1.pdf},
+ link = {http://arxiv.org/abs/1611.03199v1},
+ month = {Dec},
+ primaryclass = {cs.LG},
+ title = {Low Data Drug Discovery with One-shot Learning},
+ year = {2016}
+}
+
+
+@article{ref_119,
+ abstract = {Deep learning methods such as multitask neural networks have recently been
+applied to ligand-based virtual screening and other drug discovery
+applications. Using a set of industrial ADMET datasets, we compare neural
+networks to standard baseline models and analyze multitask learning effects
+with both random cross-validation and a more relevant temporal validation
+scheme. We confirm that multitask learning can provide modest benefits over
+single-task models and show that smaller datasets tend to benefit more than
+larger datasets from multitask learning. Additionally, we find that adding
+massive amounts of side information is not guaranteed to improve performance
+relative to simpler multitask learning. Our results emphasize that multitask
+effects are highly dataset-dependent, suggesting the use of dataset-specific
+models to maximize overall performance.},
+ archiveprefix = {arXiv},
+ author = {Steven Kearnes and Brian Goldman and Vijay Pande},
+ eprint = {1606.08793v3},
+ file = {1606.08793v3.pdf},
+ link = {http://arxiv.org/abs/1606.08793v3},
+ month = {Jun},
+ primaryclass = {stat.ML},
+ title = {Modeling Industrial ADMET Data with Multitask Networks},
+ year = {2016}
+}
+
diff --git a/bibliography.json b/bibliography.json
@@ -9499,6 +9499,127 @@
     "type": "webpage",
     "id": "ref_115"
   },
+  {
+    "indexed": {
+      "date-parts": [
+        [
+          2016,
+          11,
+          27
+        ]
+      ],
+      "date-time": "2016-11-27T14:43:44Z",
+      "timestamp": 1480257824618
+    },
+    "reference-count": 0,
+    "publisher": "American Chemical Society (ACS)",
+    "issue": "4",
+    "content-domain": {
+      "domain": [],
+      "crossmark-restriction": false
+    },
+    "short-container-title": [
+      "J. Chem. Inf. Model."
+    ],
+    "cited-count": 0,
+    "published-print": {
+      "date-parts": [
+        [
+          2009,
+          4,
+          27
+        ]
+      ]
+    },
+    "DOI": "10.1021/ci8004379",
+    "type": "article-journal",
+    "created": {
+      "date-parts": [
+        [
+          2009,
+          3,
+          26
+        ]
+      ],
+      "date-time": "2009-03-26T12:05:55Z",
+      "timestamp": 1238069155000
+    },
+    "page": "756-766",
+    "source": "CrossRef",
+    "title": "Influence Relevance Voting: An Accurate And Interpretable Virtual High Throughput Screening Method",
+    "prefix": "http://id.crossref.org/prefix/10.1021",
+    "volume": "49",
+    "author": [
+      {
+        "given": "S. Joshua",
+        "family": "Swamidass",
+        "affiliation": []
+      },
+      {
+        "given": "Chloé-Agathe",
+        "family": "Azencott",
+        "affiliation": []
+      },
+      {
+        "given": "Ting-Wan",
+        "family": "Lin",
+        "affiliation": []
+      },
+      {
+        "given": "Hugo",
+        "family": "Gramajo",
+        "affiliation": []
+      },
+      {
+        "given": "Shiou-Chuan",
+        "family": "Tsai",
+        "affiliation": []
+      },
+      {
+        "given": "Pierre",
+        "family": "Baldi",
+        "affiliation": []
+      }
+    ],
+    "member": "http://id.crossref.org/member/316",
+    "container-title": "Journal of Chemical Information and Modeling",
+    "original-title": [],
+    "deposited": {
+      "date-parts": [
+        [
+          2016,
+          9,
+          2
+        ]
+      ],
+      "date-time": "2016-09-02T02:55:56Z",
+      "timestamp": 1472784956000
+    },
+    "score": 1.0,
+    "subtitle": [],
+    "short-title": [],
+    "issued": {
+      "date-parts": [
+        [
+          2009,
+          4,
+          27
+        ]
+      ]
+    },
+    "alternative-id": [
+      "10.1021/ci8004379"
+    ],
+    "URL": "http://dx.doi.org/10.1021/ci8004379",
+    "citing-count": 0,
+    "subject": [
+      "Chemistry(all)",
+      "Chemical Engineering(all)",
+      "Library and Information Sciences",
+      "Computer Science Applications"
+    ],
+    "id": "ref_120"
+  },
   {
     "abstract": "The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.",
     "author": [
@@ -10535,5 +10656,101 @@
     },
     "title": "Advances in optimizing recurrent networks",
     "type": "article-journal"
+  },
+  {
+    "abstract": "Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring. We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.",
+    "author": [
+      {
+        "family": "Ragoza",
+        "given": "Matthew"
+      },
+      {
+        "family": "Hochuli",
+        "given": "Joshua"
+      },
+      {
+        "family": "Idrobo",
+        "given": "Elisa"
+      },
+      {
+        "family": "Sunseri",
+        "given": "Jocelyn"
+      },
+      {
+        "family": "Koes",
+        "given": "David Ryan"
+      }
+    ],
+    "id": "ref_117",
+    "issued": {
+      "date-parts": [
+        [
+          2016,
+          12
+        ]
+      ]
+    },
+    "title": "Protein-ligand scoring with convolutional neural networks",
+    "type": "article-journal"
+  },
+  {
+    "abstract": "Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the residual LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery.",
+    "author": [
+      {
+        "family": "Altae-Tran",
+        "given": "Han"
+      },
+      {
+        "family": "Ramsundar",
+        "given": "Bharath"
+      },
+      {
+        "family": "Pappu",
+        "given": "Aneesh S."
+      },
+      {
+        "family": "Pande",
+        "given": "Vijay"
+      }
+    ],
+    "id": "ref_118",
+    "issued": {
+      "date-parts": [
+        [
+          2016,
+          12
+        ]
+      ]
+    },
+    "title": "Low data drug discovery with one-shot learning",
+    "type": "article-journal"
+  },
+  {
+    "abstract": "Deep learning methods such as multitask neural networks have recently been applied to ligand-based virtual screening and other drug discovery applications. Using a set of industrial ADMET datasets, we compare neural networks to standard baseline models and analyze multitask learning effects with both random cross-validation and a more relevant temporal validation scheme. We confirm that multitask learning can provide modest benefits over single-task models and show that smaller datasets tend to benefit more than larger datasets from multitask learning. Additionally, we find that adding massive amounts of side information is not guaranteed to improve performance relative to simpler multitask learning. Our results emphasize that multitask effects are highly dataset-dependent, suggesting the use of dataset-specific models to maximize overall performance.",
+    "author": [
+      {
+        "family": "Kearnes",
+        "given": "Steven"
+      },
+      {
+        "family": "Goldman",
+        "given": "Brian"
+      },
+      {
+        "family": "Pande",
+        "given": "Vijay"
+      }
+    ],
+    "id": "ref_119",
+    "issued": {
+      "date-parts": [
+        [
+          2016,
+          6
+        ]
+      ]
+    },
+    "title": "Modeling industrial admet data with multitask networks",
+    "type": "article-journal"
   }
 ]