Drug discovery and high-throughput screening sub-section outline (#174)
This build is based on 855a2cb.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/205610680
https://travis-ci.org/greenelab/deep-review/jobs/205610681

[ci skip]

The full commit message that triggered this build is copied below:

Drug discovery and high-throughput screening sub-section outline (#174)

Outline high throughput screening sub-section in the Treat section and add tags for related references
agitter committed Feb 26, 2017
1 parent 2cc9ab9 commit 30efb95
Showing 5 changed files with 486 additions and 14 deletions.
54 changes: 40 additions & 14 deletions all-sections.md
@@ -730,23 +730,49 @@ feature selection and construction.*

### Ligand-Based Prediction of Bioactivity

*Deep learning has been applied in some fashion to structure-based,
compound-protein interaction-based, and ligand-based prediction problems
with the overall goal of finding chemical compounds that impair protein
activity. AtomNet [@ref_2] is worth including as an example
of how 3D convolutions can be used for structure-based modeling. The main
emphasis should be on ligand-based approaches. There was substantial hype
after the Merck Kaggle competition, which we can comment on. Multitask
networks have been impactful here. There are also creative approaches
for unsupervised and supervised learning with chemical compounds. Neural
networks have opened new avenues for representation learning. These
approaches may not be dominant on supervised learning
tasks yet, but they are uniquely tailored to neural networks and have
great future potential.*
**TODO: expand outline**

- Short introduction to problem, related reviews, use vHTS definition from
[@ref_120] (vHTS doesn't fit neatly into classic classification,
regression, or ranking)
- Introduce ligand-based approaches, hype and excitement surrounding
performance of a "high-parameter" network on the Merck Kaggle challenge,
cover other neural networks trained on fingerprints or descriptors as features
that followed, Tox21 Data Challenge
- Multitask networks related to the above point
- Realistic view of where things stand today, high-parameter networks struggle
with overfitting, cross validation needs to be done carefully because of temporal
structure [@ref_119], low parameter networks based on chemical
similarity (IRV) work very well, especially well-suited for the domain in which
training data can be limited and contains few positive instances, may touch on
BACE example here and other discussions of training data limitations (e.g.
[@ref_118])
- "Creative experimentation" phase of the field, new ideas for representation
learning and novel approaches including graph convolutions, autoencoders,
one shot learning, and generative models
- These "creative" approaches are definitely interesting but aren't necessarily
outperforming existing methods, improvements on the software and
reusability side could be important to help establish more rigorous
benchmarking, DeepChem as example of this
- Future outlook, what would need to happen for the "creative" approaches
to overtake the current state of the art, can representation learning be
improved by incorporating more information about chemical properties or
even more "tasks" during training, how much will future growth depend on
data versus algorithms
- Future outlook part 2, how the above approaches relate to traditional
methods like docking (note neural networks that include docking scores as
features), deep learning efforts in this direction that use structure (e.g.
[@ref_2 @ref_117]), "zero-shot learning",
analogies to other domains where deep learning can capture the behavior
of complex physics (e.g. quantum physics example), maybe briefly mention
other compound-protein interaction-based networks although that doesn't seem
to fit here and is somewhat out of scope
- Future outlook part 3 (most speculative), what would successful generative
networks mean for the HTS field?

### Modeling Metabolism and Chemical Reactivity

Add a reveiw here of metabolism and chemical reactivity.
*Add a review here of metabolism and chemical reactivity.*


## Discussion
76 changes: 76 additions & 0 deletions bibliography.bib
@@ -778,3 +778,79 @@ @article{ref_116
year = {2012}
}


@article{ref_117,
abstract = {Computational approaches to drug discovery can reduce the time and cost
associated with experimental assays and enable the screening of novel
chemotypes. Structure-based drug design methods rely on scoring functions to
rank and predict binding affinities and poses. The ever-expanding amount of
protein-ligand binding and structural data enables the use of deep machine
learning techniques for protein-ligand scoring.
We describe convolutional neural network (CNN) scoring functions that take as
input a comprehensive 3D representation of a protein-ligand interaction. A CNN
scoring function automatically learns the key features of protein-ligand
interactions that correlate with binding. We train and optimize our CNN scoring
functions to discriminate between correct and incorrect binding poses and known
binders and non-binders. We find that our CNN scoring function outperforms the
AutoDock Vina scoring function when ranking poses both for pose prediction and
virtual screening.},
archiveprefix = {arXiv},
author = {Matthew Ragoza and Joshua Hochuli and Elisa Idrobo and Jocelyn Sunseri and David Ryan Koes},
eprint = {1612.02751v1},
file = {1612.02751v1.pdf},
link = {http://arxiv.org/abs/1612.02751v1},
month = {12},
primaryclass = {stat.ML},
title = {Protein-Ligand Scoring with Convolutional Neural Networks},
year = {2016}
}


@article{ref_118,
abstract = {Recent advances in machine learning have made significant contributions to
drug discovery. Deep neural networks in particular have been demonstrated to
provide significant boosts in predictive power when inferring the properties
and activities of small-molecule compounds. However, the applicability of these
techniques has been limited by the requirement for large amounts of training
data. In this work, we demonstrate how one-shot learning can be used to
significantly lower the amounts of data required to make meaningful predictions
in drug discovery applications. We introduce a new architecture, the residual
LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over
small-molecules. We open source all models introduced in this work as part of
DeepChem, an open-source framework for deep-learning in drug discovery.},
archiveprefix = {arXiv},
author = {Han Altae-Tran and Bharath Ramsundar and Aneesh S. Pappu and Vijay Pande},
eprint = {1611.03199v1},
file = {1611.03199v1.pdf},
link = {http://arxiv.org/abs/1611.03199v1},
month = {Dec},
primaryclass = {cs.LG},
title = {Low Data Drug Discovery with One-shot Learning},
year = {2016}
}


@article{ref_119,
abstract = {Deep learning methods such as multitask neural networks have recently been
applied to ligand-based virtual screening and other drug discovery
applications. Using a set of industrial ADMET datasets, we compare neural
networks to standard baseline models and analyze multitask learning effects
with both random cross-validation and a more relevant temporal validation
scheme. We confirm that multitask learning can provide modest benefits over
single-task models and show that smaller datasets tend to benefit more than
larger datasets from multitask learning. Additionally, we find that adding
massive amounts of side information is not guaranteed to improve performance
relative to simpler multitask learning. Our results emphasize that multitask
effects are highly dataset-dependent, suggesting the use of dataset-specific
models to maximize overall performance.},
archiveprefix = {arXiv},
author = {Steven Kearnes and Brian Goldman and Vijay Pande},
eprint = {1606.08793v3},
file = {1606.08793v3.pdf},
link = {http://arxiv.org/abs/1606.08793v3},
month = {Jun},
primaryclass = {stat.ML},
title = {Modeling Industrial ADMET Data with Multitask Networks},
year = {2016}
}

217 changes: 217 additions & 0 deletions bibliography.json
@@ -9499,6 +9499,127 @@
"type": "webpage",
"id": "ref_115"
},
{
"indexed": {
"date-parts": [
[
2016,
11,
27
]
],
"date-time": "2016-11-27T14:43:44Z",
"timestamp": 1480257824618
},
"reference-count": 0,
"publisher": "American Chemical Society (ACS)",
"issue": "4",
"content-domain": {
"domain": [],
"crossmark-restriction": false
},
"short-container-title": [
"J. Chem. Inf. Model."
],
"cited-count": 0,
"published-print": {
"date-parts": [
[
2009,
4,
27
]
]
},
"DOI": "10.1021/ci8004379",
"type": "article-journal",
"created": {
"date-parts": [
[
2009,
3,
26
]
],
"date-time": "2009-03-26T12:05:55Z",
"timestamp": 1238069155000
},
"page": "756-766",
"source": "CrossRef",
"title": "Influence Relevance Voting: An Accurate And Interpretable Virtual High Throughput Screening Method",
"prefix": "http://id.crossref.org/prefix/10.1021",
"volume": "49",
"author": [
{
"given": "S. Joshua",
"family": "Swamidass",
"affiliation": []
},
{
"given": "Chloé-Agathe",
"family": "Azencott",
"affiliation": []
},
{
"given": "Ting-Wan",
"family": "Lin",
"affiliation": []
},
{
"given": "Hugo",
"family": "Gramajo",
"affiliation": []
},
{
"given": "Shiou-Chuan",
"family": "Tsai",
"affiliation": []
},
{
"given": "Pierre",
"family": "Baldi",
"affiliation": []
}
],
"member": "http://id.crossref.org/member/316",
"container-title": "Journal of Chemical Information and Modeling",
"original-title": [],
"deposited": {
"date-parts": [
[
2016,
9,
2
]
],
"date-time": "2016-09-02T02:55:56Z",
"timestamp": 1472784956000
},
"score": 1.0,
"subtitle": [],
"short-title": [],
"issued": {
"date-parts": [
[
2009,
4,
27
]
]
},
"alternative-id": [
"10.1021/ci8004379"
],
"URL": "http://dx.doi.org/10.1021/ci8004379",
"citing-count": 0,
"subject": [
"Chemistry(all)",
"Chemical Engineering(all)",
"Library and Information Sciences",
"Computer Science Applications"
],
"id": "ref_120"
},
{
"abstract": "The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.",
"author": [
@@ -10535,5 +10656,101 @@
},
"title": "Advances in optimizing recurrent networks",
"type": "article-journal"
},
{
"abstract": "Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring. We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.",
"author": [
{
"family": "Ragoza",
"given": "Matthew"
},
{
"family": "Hochuli",
"given": "Joshua"
},
{
"family": "Idrobo",
"given": "Elisa"
},
{
"family": "Sunseri",
"given": "Jocelyn"
},
{
"family": "Koes",
"given": "David Ryan"
}
],
"id": "ref_117",
"issued": {
"date-parts": [
[
2016,
12
]
]
},
"title": "Protein-ligand scoring with convolutional neural networks",
"type": "article-journal"
},
{
"abstract": "Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the residual LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery.",
"author": [
{
"family": "Altae-Tran",
"given": "Han"
},
{
"family": "Ramsundar",
"given": "Bharath"
},
{
"family": "Pappu",
"given": "Aneesh S."
},
{
"family": "Pande",
"given": "Vijay"
}
],
"id": "ref_118",
"issued": {
"date-parts": [
[
2016,
12
]
]
},
"title": "Low data drug discovery with one-shot learning",
"type": "article-journal"
},
{
"abstract": "Deep learning methods such as multitask neural networks have recently been applied to ligand-based virtual screening and other drug discovery applications. Using a set of industrial ADMET datasets, we compare neural networks to standard baseline models and analyze multitask learning effects with both random cross-validation and a more relevant temporal validation scheme. We confirm that multitask learning can provide modest benefits over single-task models and show that smaller datasets tend to benefit more than larger datasets from multitask learning. Additionally, we find that adding massive amounts of side information is not guaranteed to improve performance relative to simpler multitask learning. Our results emphasize that multitask effects are highly dataset-dependent, suggesting the use of dataset-specific models to maximize overall performance.",
"author": [
{
"family": "Kearnes",
"given": "Steven"
},
{
"family": "Goldman",
"given": "Brian"
},
{
"family": "Pande",
"given": "Vijay"
}
],
"id": "ref_119",
"issued": {
"date-parts": [
[
2016,
6
]
]
},
"title": "Modeling industrial admet data with multitask networks",
"type": "article-journal"
}
]