Drug discovery and high-throughput screening sub-section outline #174
Conversation
I checked citations of some of the main papers in this area to catch up on work I had missed previously. The papers listed above should be a fairly comprehensive list now.
Skipping #179, it is not in scope in my opinion.
Added #185. Papers are appearing faster than I can write about them!
@agitter : I felt the same way writing the categorize/imaging section! It's amazing how quickly the preprint/publication list is growing.
I'm pausing work on this section. @swamidass is an expert in the domain and is going to provide input.
The plan from here is to send you some references, to see what is in/out of scope, and to adjust sections accordingly.
This looks like a good list of drug-discovery related papers. Here are a couple of preliminary thoughts. It might be useful to split this list into ligand-based and structural methods. Most of the papers linked above (except #56 and #175) are ligand-centric for small molecules. The progression in ligand space seems to be moving from hand-crafted featurizations (#54, #55) based on circular fingerprints to learned representations (#52, #53). Might be worth making parallels to the similar movement toward end-to-end approaches elsewhere in deep learning (DNNs for speech started from DNN+HMM models but have started to migrate to end-to-end RNN systems recently). It's probably worth emphasizing the low-data problem as well. In #55, we found that overfitting was a major problem with our networks due to lack of sufficient data, which motivated our work in #148 and #141. It's very common to find experimentalists with low-throughput assays and small datasets (10s-100s) who want to know what deep learning can do for them. Discussions of low data might provide a nice segue into structural techniques. In a way, deep docking is a "zero-shot" method since you can apply docking models to protein-ligand pairs that aren't in the training set. Perhaps worth making an analogy to Google's recent zero-shot NMT (neural machine translation) paper.
So sorry about the delay here. I've been catching up after PSB. Putting aside some time over the weekend to take a pass at this.
@swamidass I've been really busy too so no problem. The narrative @rbharath presented above is in line with my thoughts. Do you have comments on that and where the missing literature you referred to would fit in? There is also a fairly distinct class of methods that focus on compound-protein interactions. In my opinion, these would include any methods that featurize the proteins in some manner.
Another thought: For ligand-based methods, some people also like to talk about 1D, 2D, and 3D descriptors. 1D would be things like molecular weight, 2D would be things based on the connectivity structure of the molecule, and 3D would depend on the spatial conformation of the molecule. Don't know if we would want to structure the section this way though. There are also a few papers on computational antibody design, but this field is very underdeveloped compared to computational small molecule design. I'd imagine the biggest players in this space use Rosetta, but I don't know of much that's deep-learning related yet.
So I want to make a couple of comments. You can consider this draft text, but it needs to be integrated with what has already been written: ========== There has been a recent increase in studies examining deep learning in chemistry. In most cases where they have been applied (except reaction prediction, metabolism, and IRVs), deep learning has NOT reliably produced better performance than fingerprint-based methods in predicting bioactivity. This is a very important result of this early work. There are a couple of exceptions to this rule, and they are important to understand so that a path forward can be charted. In chemical reaction prediction and solubility (Baldi group) and metabolism (my group), deep learning approaches are reliably outperforming other methods. I think part of the reason is that the current methods are good at picking up local features that work well for predicting reactions and physicochemical properties. In metabolism tasks, networks are replicated across the molecule so that, effectively, the training examples are atoms (or bonds) instead of molecules. This dramatically increases the amount of data to train on and reduces the risk of overfitting with complex networks. Notably, in all these tasks local substructures have proven very effective. However, larger-scale "shape" still seems to be a problem for most deep learning methods that use 2D graphs as an input. It is possible that better methods could improve this situation, but the fact that a simple non-parametric method (circular fingerprints) outperforms high-parameter deep learning architectures is not encouraging. Currently, the highest performing architectures for bioactivity prediction (see the two IRV papers) build on top of the success of fingerprints, and are very low-parameter to enable application to the small amounts of data available in many chemical training tasks.
The potency-sensitive IRVs present an important opportunity for deep learning, by leveraging the superior ability of neural networks to fuse information from complex data. While the bioactivity screening task is usually framed as a binary classification, the training data also include potency information that denotes the strength of interactions. A well-designed deep learning architecture can take advantage of this information in the training data to improve the binary classification ability on the test data. This approach reliably outperforms other state-of-the-art methods (including fingerprints). Similar improvements were seen in a metabolism model that made use of fingerprints at the input layer to improve predictions. Likewise, one major advantage of deep learning approaches is that they can exploit information from related tasks by learning to solve them in concert. An example of this is http://pubs.acs.org/doi/full/10.1021/acscentsci.6b00162 , where a multitask network substantially outperformed individually constructed networks. This had a particularly strong effect in improving predictions of protein-conjugation reactions in a dataset of very limited size. From the studies currently published, there are patterns in how deep learning can be utilized in ligand-based chemistry. First, there is value in exploiting the unique strengths of data fusion and multitask learning, as this can yield improvements over standard approaches. By incorporating data or combining tasks, deep learning can make use of otherwise discarded information to make better predictions. Second, in chemistry the number of parameters still matters, and the best methods tune model complexity to the size of the dataset. While deep learning in image analysis can tolerate many more parameters than training examples, it is still advisable to limit the number of parameters in chemistry. Third, hybrid approaches that build low-parameter recursive networks on top of fingerprints (e.g. IRV) appear to be the most accurate at the moment.
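The multitask advantage described in this draft text — a shared representation trained on all tasks, with a small task-specific head per assay — can be sketched in a few lines. This is an illustrative numpy forward pass only; all sizes and assay names are hypothetical, and it is not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 16, 8           # e.g. a small fingerprint input
W_shared = rng.normal(size=(n_features, n_hidden)) * 0.1
task_heads = {                         # one small head per assay/task (names are made up)
    "assay_A": rng.normal(size=(n_hidden, 1)) * 0.1,
    "assay_B": rng.normal(size=(n_hidden, 1)) * 0.1,
}

def predict(x):
    """Shared trunk learns a joint representation from ALL tasks' data;
    the heads remain task-specific."""
    h = np.tanh(x @ W_shared)
    return {t: 1.0 / (1.0 + np.exp(-(h @ W))) for t, W in task_heads.items()}

x = rng.integers(0, 2, size=(1, n_features)).astype(float)  # toy bit vector
preds = predict(x)
```

Because the trunk is fit on every task's examples jointly, the small-data tasks borrow statistical strength from the larger ones, which matches the protein-conjugation result mentioned above.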
And a final comment for tonight. I think the references you have are good, but I will do a final pass on the text to add more. E.g., there are more in metabolism, but I have to dig them up. Also, I did include several from my group (and did disclose this). So I think someone with fewer conflicts of interest should (1) do the first pass on the text and (2) make a fair, independent assessment of whether my work should be included. If, also, you feel I should drop off the author list because of this, please let me know. Thanks. @cgreene thoughts here?
@rbharath do the 1D, 2D, and 3D descriptors have any special relationship with different neural network architectures, or do people mostly just throw them into a feature vector? Some of the work I've seen takes all descriptors from a program like Dragon and treats them as unordered features. If that's the case, we could briefly mention that approach, state how it relates to fingerprints as alternative features, and later contrast it with the graph convolutional approaches. Antibody design is an interesting idea. Even if the area is underdeveloped currently, we'd like to present future deep learning opportunities in this manuscript, and this could be a good one.
Thanks @swamidass for all of the suggestions. I have a lot of catching up to do, so I'm only leaving cursory thoughts for now. If you're proposing this as draft text for the section, we may need a separate pull request so that we can review it like we have for other contributions and your text is attributed to you. Overall, the papers you suggested and narrative are fairly disjoint from the list I had above and initial thoughts from @rbharath. We'll need to work on merging these threads. For example, the sites of metabolism prediction problem is in scope in my opinion but is distinct from the high-throughput screening problem. When thinking about the scope, we ultimately want to relate these methods back to medicine and disease, even if that's in the very long term. For some applications, it would be helpful to guide readers and make the connection between things like chemical reaction prediction and disease more explicit with examples. I don't see a problem with writing about your own methods as long as you do your best to be objective and also include relevant work from other groups. We want experts writing about these topics. Inevitably experts will have their own work in the domain.
To rephrase, instead of "merging these threads" the better goal would be to determine how the various drug- and chemistry-related tasks fit together in the Treat section. It seems like we have a few distinct topics in drug discovery, sites of metabolism, etc. that could become different sub-sections but should flow together.
I do agree that is a good plan. You are right that metabolism is distinct from HTS. Though please do remember that the IRV is for HTS. Regarding metabolism, it is actually quite connected to human disease, specifically drug reactions. I think it is certainly in scope and is good to include as one of those areas where deep learning has been producing superior performance in chemistry over other methods. My observations, though, apply more broadly than just metabolism: DL has an advantage over other methods in its ability to multitask (transfer learn) and to fuse heterogeneous datasets. Regarding using Dragon for the input vector to a NN: this approach was recently revived by a few groups, using an autoencoder to map molecules onto a 2D space. It's a nice visualization, but its practical utility is not clear. More significant is the history here. The generally poor behavior of using NNs directly on fingerprint vectors (like Dragon) is exactly what led many to conclude about a decade ago that neural networks were not useful in chemistry. Networks with very large input spaces and only a small number of training examples are very high-parameter and prone to overfit. They just did not work terribly well. For this reason, fingerprint-based methods (e.g. SVMs and similarity searching) have been dominant, with people also using decision trees and statistical approaches (like Naive Bayes). The built-in regularization of these approaches was particularly important in chemistry. Though neural networks were used by Baldi and myself in the 2000s for HTS, that was not the dominant approach. And we were only able to get it to work by building on top of fingerprints, and using weight replication (a classic DL technique) to dramatically reduce the number of parameters (often to fewer than 10).
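The overfitting argument above is essentially about parameter counts. A quick back-of-the-envelope comparison makes the gap concrete (all sizes here are hypothetical round numbers, not taken from any paper):

```python
# Illustrative parameter counting for the overfitting argument above.
input_bits = 2048            # e.g. a hashed fingerprint input vector
hidden = 512
# One fully connected layer directly on the fingerprint: weights + biases.
dense_params = input_bits * hidden + hidden

# Weight replication: the same small weight set is applied at every atom
# (or neighbor), so the parameter count is independent of input size.
replicated_params = 8        # "often fewer than 10", per the comment above

training_examples = 200      # a typical small assay

print(dense_params)          # 1049088 parameters vs ~200 examples
print(replicated_params)     # 8
```

With roughly a million free parameters against a few hundred labeled compounds, the dense model has vastly more capacity than data, while the replicated model stays in a regime where built-in regularization is almost unnecessary.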
From here, someone with a bigger view of the paper should decide how to organize this all together. From there I can form these comments into some text and make a pull request. What do you think?
@agitter The 1D, 2D, 3D descriptors can be handled in a couple of ways. Sometimes, like the Dragon vectors, they're just flattened and fed into a fully connected network. Other times (mainly for 2D and 3D), the network architecture itself is redesigned to make use of the spatial structure. The graph-convolutional networks are basically 2D architectures, and the AtomNet paper is a 3D architecture (likely a 3D convolutional kernel from a DNN package under the hood). Antibody design is a very cool area :-). I'd love to see more work here, but it might be best posed in a challenges-for-deep-learning subsection. I think one of the big issues is actually GPU memory (fitting protein-protein complexes into today's deep architectures will swamp GPU memory on today's cards). @swamidass I really like the links to your metabolism work. Metabolism is certainly very linked to disease, and it's worth a careful discussion here. Adding discussion of the convolutional/IRV architectures from your/Baldi's groups also adds a lot of value here. I don't know if I'd agree entirely that deep networks haven't been proven better than fingerprint-based methods. For a number of tasks (especially with big data), I think the existing studies do indicate superiority. However, in lower-data regimes, I can certainly see the argument. I think that the jury's still out on low-parameter vs. high-parameter networks. Depending on the dataset (especially on the amount of data available), I think there are certainly cases where either can win.
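One way to picture the 2D graph-convolutional architectures mentioned here: atom features flow along bonds in message-passing rounds. Below is a pure-Python toy of a single round; the fixed weights and the sum-of-neighbors update rule are illustrative assumptions, not the update from any specific paper:

```python
def graph_conv_step(features, bonds, w_self=0.5, w_nbr=0.5):
    """One message-passing round over a molecular graph: each atom mixes
    its own feature vector with the sum of its bonded neighbors' features."""
    n = len(features)
    nbrs = {i: [] for i in range(n)}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    new = []
    for i in range(n):
        # sum neighbor features component-wise
        msg = [sum(features[j][k] for j in nbrs[i]) for k in range(len(features[i]))]
        new.append([w_self * a + w_nbr * m for a, m in zip(features[i], msg)])
    return new

# three-atom chain (e.g. C-C-O) with 2-dimensional toy atom features
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = graph_conv_step(feats, [(0, 1), (1, 2)])
# out[1] now blends the middle atom's feature with both neighbors'
```

In a real graph-convolutional network the weights are learned matrices and several such rounds are stacked before pooling to a molecule-level vector; this sketch only shows why the architecture is "2D" — information moves strictly along the bond graph.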
Regarding our disagreement about deep networks vs. fingerprint networks, I should be clear that I bracket that as a "usually" in "bioactivity prediction." Though, it is possible that I am out of date. Things are moving fast. I think it would be of high value to make a publication-date-sorted table that examines that question.
@agitter wrote, "The progression in ligand space seems to be moving from hand-crafted featurizations (#54, #55) based on circular fingerprints to learned representations (#52, #53)." This is a bit too imprecise for the field. Let me explain some of the back story here. In chemoinformatics, hand-crafted featurizations are called "structural keys," and they do not usually work terribly well, even though (for example) PubChem and a few other software packages still compute them. After that, path fingerprints (e.g. Daylight) were introduced, and they give much better predictions but have some problems. Then came circular fingerprints (e.g. ECFP), which resolved many of the problems with path-based fingerprints, and these work extremely well. To be clear, these are not learned representations (so far), and "fingerprint" has a fairly specific meaning in chemical informatics: a modulus bit vector of one-hot encoded substructures. It refers to this datatype, and there are modifications that improve the compression, the similarity calculations, and extend it into counts. All of these "fingerprints" are just number vectors. Right now, however, there is a bit of confusion about the term "circular fingerprints." It is sometimes (in my view incorrectly) used to refer to several things that are quite distinct...
So with that background, it is not correct (in my opinion) to say that "the move is from hand-coded features to learned circular fingerprints." Many papers on DL for chemistry do not actually benchmark against circular fingerprints. When they do, so far they have usually performed worse (though please show me some examples otherwise if you have them). Even then, they do not usually benchmark against IRVs, which have been shown to be superior to circular fingerprints alone. It would be great to see a solid benchmark of these methods, but I currently do not know of one. I think the current phase of inquiry is "creative experimentation." DL is new in the field and it has great promise. A lot of new people are starting to publish here, which is great. Creativity is being rewarded right now, and, correctly, reviewers are not preventing the publication of interesting architectures that are not yet proven (by the usual standards of the field) to bring strong performance gains. In this phase, I do not think we yet know what will ultimately work best or gain wide adoption. We probably have a few more years of this before we really understand what is working and what should move to wide adoption. Soon, I expect, reviewers will begin raising the bar on new publications so as to expect better benchmarking, more in line with the field before this burst of attention.
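The "modulus bit vector of one-hot encoded substructures" datatype described above can be sketched in a few lines. This is a toy illustration of the datatype only: the hash function, radius scheme, and bit length are hypothetical choices, and this is not a real ECFP implementation (which canonicalizes substructures far more carefully):

```python
from hashlib import md5

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy hashed fingerprint: atom-centered substructure keys are hashed
    modulo n_bits into a fixed-length bit vector.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs
    """
    nbrs = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)

    bits = [0] * n_bits
    for center in range(len(atoms)):
        shell = {center}
        for r in range(radius + 1):
            # crude canonical key for the substructure seen so far
            key = "".join(sorted(atoms[a] for a in shell)) + f"|r{r}"
            h = int(md5(key.encode()).hexdigest(), 16) % n_bits  # modulus hashing
            bits[h] = 1
            # grow the neighborhood by one bond
            shell |= {n for a in shell for n in nbrs[a]}
    return bits

# ethanol heavy-atom skeleton: C-C-O
fp = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The resulting vector is "just a number vector," so similarity searching (e.g. Tanimoto overlap of set bits) and low-parameter models can be built directly on top of it.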
I'm going to pop in here and make a quick comment. @agitter - it might be helpful to reduce the scope of this PR a bit. What do you think about targeting individual subsections with each PR? I've found with
@cgreene Yes, that's a very good idea. I'll work on that, but probably not in the next day or two. I do see some value in keeping this pull request broad as a place to host our ongoing discussion. We're still trying to define which distinct sub-sections are needed and how they fit together. @rbharath @swamidass Lots of great comments, which I'll respond to ASAP. This is going to be a very strong section thanks to your contributions. A quick remark: I think some of the references to "learned fingerprints" come from #52 (Convolutional Networks on Graphs for Learning Molecular Fingerprints). If that is an abuse of terminology, we can clarify or push back in the review.
So, please let me know when a reasonable draft of the chemistry-related sections (HTS and metabolism) is done. I'll edit and add a lot from there. I do not want to be the one writing the first draft, because I am pretty "close" to this area and do not want to bias things too much towards my work.
@swamidass I still plan to write a first draft as soon as I can. But I've delayed the writing so that I can go back and read more first, e.g. about IRV.
I wrote a new outline based on feedback from @rbharath and @swamidass. I'll wait a couple of days for further comments and am happy to make revisions. If this looks okay, I'll work on the first complete draft.
@agitter The outline looks good to me!
Compare bd3cb76 to 9178a88
I changed the scope of this pull request to include the outline only. Now that the outline is finished, I'll create a new pull request when I have a full draft to merge. I'll look at the new issues #250 and #251 before doing that.
@cgreene The outline content has already been discussed above. My review request is primarily to have one more person confirm this will be a clean merge before squashing and merging. |
@agitter looks clean to me. And the build succeeded (https://travis-ci.org/greenelab/deep-review/builds/205601924). |
@dhimmel I was grateful for the integration test as I was cleaning up the references. A lot had changed. Now that you've reviewed this, I'll merge. |
This build is based on 855a2cb. This commit was created by the following Travis CI build and job: https://travis-ci.org/greenelab/deep-review/builds/205610680 https://travis-ci.org/greenelab/deep-review/jobs/205610681 [ci skip] The full commit message that triggered this build is copied below: Drug discovery and high-throughput screening sub-section outline (#174) Outline high throughput screening sub-section in the Treat section and add tags for related references
I'm working on the drug discovery and high-throughput chemical screening sub-section for the Treat topic. This is not ready to merge, but I created the pull request so that others can follow along or contribute. @kumardeep27 are you still interested in writing this section? You were also interested in other types of molecular binding (e.g. RNA), and we definitely need help with that.
This sub-section will cover the following papers, which I am adding to tags.tsv. This list is also a work in progress that I will edit. Definitely:
Skip
Later pull request