Drug discovery and high-throughput screening sub-section outline #174
Conversation
I checked citations of some of the main papers in this area to catch up on work I had missed previously. The papers listed above should be a fairly comprehensive list now.
Skipping #179, it is not in scope in my opinion.
Added #185. Papers are appearing faster than I can write about them!
@agitter : I felt the same way writing the categorize/imaging section! It's amazing how quickly the preprint/publication list is growing.
I'm pausing work on this section. @swamidass is an expert in the domain and is going to provide input.
The plan from here is to send you some references, to see what is in/out of scope, and to adjust sections accordingly.
This looks like a good list of drug-discovery related papers. Here are a couple of preliminary thoughts. It might be useful to split this list into ligand-based and structural methods. Most of the papers linked above (except #56 and #175) are ligand-centric for small molecules. The progression in ligand space seems to be moving from hand-crafted featurizations (#54, #55) based on circular fingerprints to learned representations (#52, #53). Might be worth making parallels to the similar movement toward end-to-end approaches elsewhere in deep learning (DNNs for speech started from DNN+HMM models but have started to migrate to end-to-end RNN systems recently). It's probably worth emphasizing the low-data problem as well. In #55, we found that overfitting was a major problem with our networks due to lack of sufficient data, which motivated our work in #148 and #141. It's very common to find experimentalists with low-throughput assays and small datasets (10s-100s) who want to know what deep learning can do for them. Discussions of low data might provide a nice segue into structural techniques. In a way, deep docking is a "zero-shot" method since you can apply docking models to protein-ligand pairs that aren't in the training set. Perhaps worth making an analogy to Google's recent zero-shot NMT (neural machine translation) paper.
So sorry about the delay here. I've been catching up after PSB. Putting aside some time over the weekend to take a pass at this.
@swamidass I've been really busy too so no problem. The narrative @rbharath presented above is in line with my thoughts. Do you have comments on that and where the missing literature you referred to would fit in? There is also a fairly distinct class of methods that focus on compound-protein interactions. In my opinion, these would include any methods that featurize the proteins in some manner.
Another thought: For ligand-based methods, some people also like to talk about 1D, 2D, and 3D descriptors. 1D would be things like molecular weight, 2D would be things based on the connectivity structure of the molecule, and 3D would depend on the spatial conformation of the molecule. Don't know if we would want to structure the section this way though. There are also a few papers on computational antibody design, but this field is very underdeveloped compared to computational small molecule design. I'd imagine the biggest players in this space use Rosetta, but I don't know of much that's deep-learning related yet.
So I want to make a couple of comments. You can consider this draft text, but it needs to be integrated with what has already been written: ========== There has been a recent increase in studies examining deep learning in chemistry. In most cases where they have been applied (except reaction prediction, metabolism, and IRVs), deep learning has NOT reliably produced better performance than fingerprint-based methods in predicting bioactivity. This is a very important result of this early work. There are a couple of exceptions to this rule, and they are important to understand so that a path forward can be charted. In chemical reaction prediction and solubility (Baldi group) and metabolism (my group), deep learning approaches are reliably outperforming other methods. I think part of the reason is that the current methods are good at picking up local features that work well for predicting reactions and physicochemical properties. In metabolism tasks, networks are replicated across the molecule so that, effectively, the training examples are atoms (or bonds) instead of molecules. This dramatically increases the amount of data to train on and reduces the risk of overfitting with complex networks. Notably, in all these tasks local substructures have proven very effective. However, larger-scale "shape" still seems to be a problem for most deep learning methods that use 2D graphs as an input. It is possible that better methods could improve this situation, but the fact that a simple non-parametric method (circular fingerprints) outperforms high-parameter deep learning architectures is not encouraging. Currently, the highest performing architectures for bioactivity prediction (see the two IRV papers) build on top of the success of fingerprints, and are very low-parameter to enable application to the small amounts of data available in many chemical training tasks.
The potency-sensitive IRVs present an important opportunity for deep learning, by leveraging the superior ability of neural networks to fuse information from complex data. While the bioactivity screening task is usually framed as a binary classification, the training data also include potency information that denotes the strength of interactions. A well-designed deep learning architecture can take advantage of this information in the training data to improve the binary classification ability on the test data. This approach reliably outperforms other state-of-the-art methods (including fingerprints). Similar improvements were seen in a metabolism model that made use of fingerprints at the input layer to improve predictions. Likewise, one major advantage of deep learning approaches is that they can exploit information from related tasks by learning to solve them in concert. An example of this is http://pubs.acs.org/doi/full/10.1021/acscentsci.6b00162 , where a multitask network substantially outperformed individually constructed networks. This had a particularly strong effect in improving predictions of protein-conjugation reactions in a dataset of very limited size. From the studies currently published, there are patterns in how deep learning can be utilized in ligand-based chemistry. First, there is value in exploiting the unique strengths of data fusion and multitask learning, as this can yield improvements over standard approaches. By incorporating data or combining tasks, deep learning can make use of otherwise discarded information to make better predictions. Second, in chemistry the number of parameters still matters, and the best methods tune model complexity to the size of the dataset. While deep learning in image analysis can tolerate many more parameters than training examples, it is still advisable to limit the number of parameters in chemistry. Third, hybrid approaches that build low-parameter recursive networks on top of fingerprints (e.g. IRV) appear to be the most accurate at the moment.
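The multitask advantage described in this draft text — a shared representation trained on all tasks, with a small task-specific head per assay — can be sketched in a few lines. This is an illustrative numpy forward pass only; all sizes and assay names are hypothetical, and it is not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 16, 8           # e.g. a small fingerprint input
W_shared = rng.normal(size=(n_features, n_hidden)) * 0.1
task_heads = {                         # one small head per assay/task (names are made up)
    "assay_A": rng.normal(size=(n_hidden, 1)) * 0.1,
    "assay_B": rng.normal(size=(n_hidden, 1)) * 0.1,
}

def predict(x):
    """Shared trunk learns a joint representation from ALL tasks' data;
    the heads remain task-specific."""
    h = np.tanh(x @ W_shared)
    return {t: 1.0 / (1.0 + np.exp(-(h @ W))) for t, W in task_heads.items()}

x = rng.integers(0, 2, size=(1, n_features)).astype(float)  # toy bit vector
preds = predict(x)
```

Because the trunk is fit on every task's examples jointly, the small-data tasks borrow statistical strength from the larger ones, which matches the protein-conjugation result mentioned above.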
And a final comment for tonight. I think the references you have are good, but I will do a final pass on the text to add more. E.g., there are more in metabolism, but I have to dig them up. Also, I did include several from my group (and did disclose this). So I think someone with fewer conflicts of interest should (1) do the first pass on the text and (2) make a fair, independent assessment of whether my work should be included. If, also, you feel I should drop off the author list because of this, please let me know. Thanks. @cgreene thoughts here?
@rbharath do the 1D, 2D, and 3D descriptors have any special relationship with different neural network architectures, or do people mostly just throw them into a feature vector? Some of the work I've seen takes all descriptors from a program like Dragon and treats them as unordered features. If that's the case, we could briefly mention that approach, state how it relates to fingerprints as alternative features, and later contrast it with the graph convolutional approaches. Antibody design is an interesting idea. Even if the area is underdeveloped currently, we'd like to present future deep learning opportunities in this manuscript, and this could be a good one.
Thanks @swamidass for all of the suggestions. I have a lot of catching up to do, so I'm only leaving cursory thoughts for now. If you're proposing this as draft text for the section, we may need a separate pull request so that we can review it like we have for other contributions and your text is attributed to you. Overall, the papers you suggested and narrative are fairly disjoint from the list I had above and initial thoughts from @rbharath. We'll need to work on merging these threads. For example, the sites of metabolism prediction problem is in scope in my opinion but is distinct from the high-throughput screening problem. When thinking about the scope, we ultimately want to relate these methods back to medicine and disease, even if that's in the very long term. For some applications, it would be helpful to guide readers and make the connection between things like chemical reaction prediction and disease more explicit with examples. I don't see a problem with writing about your own methods as long as you do your best to be objective and also include relevant work from other groups. We want experts writing about these topics. Inevitably experts will have their own work in the domain.
To rephrase, instead of "merging these threads" the better goal would be to determine how the various drug- and chemistry-related tasks fit together in the Treat section. It seems like we have a few distinct topics in drug discovery, sites of metabolism, etc. that could become different sub-sections but should flow together.
I do agree that is a good plan. You are right that metabolism is distinct from HTS. Though please do remember that the IRV is for HTS. Regarding metabolism, it is actually quite connected to human disease, specifically drug reactions. I think it is certainly in scope and is good to include as one of those areas where deep learning has been producing superior performance in chemistry over other methods. My observations, though, apply more broadly than just metabolism: DL has an advantage over other methods in its ability to multitask (transfer learn) and to fuse heterogeneous datasets. Regarding using Dragon for the input vector to a NN: this approach was recently revived by a few groups, using an autoencoder to map molecules onto a 2D space. It's a nice visualization, but its practical utility is not clear. More significant is the history here. The generally poor behavior of using NNs directly on fingerprint vectors (like Dragon) is exactly what led many to conclude about a decade ago that neural networks were not useful in chemistry. Networks with very large input spaces and only a small number of training examples are very high-parameter and prone to overfit. They just did not work terribly well. For this reason, fingerprint-based methods (e.g. SVMs and similarity searching) have been dominant, with people also using decision trees and statistical approaches (like Naive Bayes). The built-in regularization of these approaches was particularly important in chemistry. Though neural networks were used by Baldi and myself in the 2000s for HTS, that was not the dominant approach. And we were only able to get it to work by building on top of fingerprints, and using weight replication (a classic DL technique) to dramatically reduce the number of parameters (often to fewer than 10).
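The overfitting argument above is essentially about parameter counts. A quick back-of-the-envelope comparison makes the gap concrete (all sizes here are hypothetical round numbers, not taken from any paper):

```python
# Illustrative parameter counting for the overfitting argument above.
input_bits = 2048            # e.g. a hashed fingerprint input vector
hidden = 512
# One fully connected layer directly on the fingerprint: weights + biases.
dense_params = input_bits * hidden + hidden

# Weight replication: the same small weight set is applied at every atom
# (or neighbor), so the parameter count is independent of input size.
replicated_params = 8        # "often fewer than 10", per the comment above

training_examples = 200      # a typical small assay

print(dense_params)          # 1049088 parameters vs ~200 examples
print(replicated_params)     # 8
```

With roughly a million free parameters against a few hundred labeled compounds, the dense model has vastly more capacity than data, while the replicated model stays in a regime where built-in regularization is almost unnecessary.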
From here, someone with a bigger view of the paper should decide how to organize this all together. From there I can form these comments into some text and make a pull request. What do you think?
@agitter The 1D, 2D, 3D descriptors can be handled in a couple of ways. Sometimes, like the Dragon vectors, they're just flattened and fed into a fully connected network. Other times (mainly for 2D and 3D), the network architecture itself is redesigned to make use of the spatial structure. The graph-convolutional networks are basically 2D architectures, and the AtomNet paper is a 3D architecture (likely a 3D convolutional kernel from a DNN package under the hood). Antibody design is a very cool area :-). I'd love to see more work here, but it might be best posed in a challenges-for-deep-learning subsection. I think one of the big issues is actually GPU memory (fitting protein-protein complexes into today's deep architectures will swamp GPU memory on today's cards). @swamidass I really like the links to your metabolism work. Metabolism is certainly very linked to disease, and it's worth a careful discussion here. Adding discussion of the convolutional/IRV architectures from your/Baldi's groups also adds a lot of value here. I don't know if I'd agree entirely that deep networks haven't been proven better than fingerprint-based methods. For a number of tasks (especially with big data), I think the existing studies do indicate superiority. However, in lower-data regimes, I can certainly see the argument. I think that the jury's still out on low-parameter vs. high-parameter networks. Depending on the dataset (especially on the amount of data available), I think there are certainly cases where either can win.
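One way to picture the 2D graph-convolutional architectures mentioned here: atom features flow along bonds in message-passing rounds. Below is a pure-Python toy of a single round; the fixed weights and the sum-of-neighbors update rule are illustrative assumptions, not the update from any specific paper:

```python
def graph_conv_step(features, bonds, w_self=0.5, w_nbr=0.5):
    """One message-passing round over a molecular graph: each atom mixes
    its own feature vector with the sum of its bonded neighbors' features."""
    n = len(features)
    nbrs = {i: [] for i in range(n)}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    new = []
    for i in range(n):
        # sum neighbor features component-wise
        msg = [sum(features[j][k] for j in nbrs[i]) for k in range(len(features[i]))]
        new.append([w_self * a + w_nbr * m for a, m in zip(features[i], msg)])
    return new

# three-atom chain (e.g. C-C-O) with 2-dimensional toy atom features
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = graph_conv_step(feats, [(0, 1), (1, 2)])
# out[1] now blends the middle atom's feature with both neighbors'
```

In a real graph-convolutional network the weights are learned matrices and several such rounds are stacked before pooling to a molecule-level vector; this sketch only shows why the architecture is "2D" — information moves strictly along the bond graph.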
Regarding our disagreement about deep networks vs. fingerprint networks, I should be clear that I bracket that as a "usually" in "bioactivity prediction." Though, it is possible that I am out of date. Things are moving fast. I think it would be of high value to make a publication-date-sorted table that examines that question.
@agitter wrote, "The progression in ligand space seems to be moving from hand-crafted featurizations (#54, #55) based on circular fingerprints to learned representations (#52, #53)." This is a bit too imprecise for the field. Let me explain some of the back story here. In chemoinformatics, hand-crafted featurizations are called "structural keys," and they do not usually work terribly well, even though (for example) PubChem and a few other software packages still compute them. After that, path fingerprints (e.g. Daylight) were introduced, and they give much better predictions but have some problems. Then came circular fingerprints (e.g. ECFP), which resolved many of the problems with path-based fingerprints, and these work extremely well. To be clear, these are not learned representations (so far), and "fingerprint" has a fairly specific meaning in chemical informatics: a modulus bit vector of one-hot encoded substructures. It refers to this datatype, and there are modifications that improve the compression, the similarity calculations, and extend it into counts. All of these "fingerprints" are just number vectors. Right now, however, there is a bit of confusion about the term "circular fingerprints." It is sometimes (in my view incorrectly) used to refer to several things that are quite distinct...
So with that background, it is not correct (in my opinion) to say that "the move is from hand-coded features to learned circular fingerprints." Many papers on DL for chemistry do not actually benchmark against circular fingerprints. When they do, so far they have usually performed worse (though please show me some examples otherwise if you have them). Even then, they do not usually benchmark against IRVs, which have been shown to be superior to circular fingerprints alone. It would be great to see a solid benchmark of these methods, but I currently do not know of one. I think the current phase of inquiry is "creative experimentation." DL is new in the field and it has great promise. A lot of new people are starting to publish here, which is great. Creativity is being rewarded right now, and, correctly, reviewers are not preventing the publication of interesting architectures that are not yet proven (by the usual standards of the field) to bring strong performance gains. In this phase, I do not think we yet know what will ultimately work best or gain wide adoption. We probably have a few more years of this before we really understand what is working and what should move to wide adoption. Soon, I expect, reviewers will begin raising the bar on new publications so as to expect better benchmarking, more in line with the field before this burst of attention.
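The "modulus bit vector of one-hot encoded substructures" datatype described above can be sketched in a few lines. This is a toy illustration of the datatype only: the hash function, radius scheme, and bit length are hypothetical choices, and this is not a real ECFP implementation (which canonicalizes substructures far more carefully):

```python
from hashlib import md5

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy hashed fingerprint: atom-centered substructure keys are hashed
    modulo n_bits into a fixed-length bit vector.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs
    """
    nbrs = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)

    bits = [0] * n_bits
    for center in range(len(atoms)):
        shell = {center}
        for r in range(radius + 1):
            # crude canonical key for the substructure seen so far
            key = "".join(sorted(atoms[a] for a in shell)) + f"|r{r}"
            h = int(md5(key.encode()).hexdigest(), 16) % n_bits  # modulus hashing
            bits[h] = 1
            # grow the neighborhood by one bond
            shell |= {n for a in shell for n in nbrs[a]}
    return bits

# ethanol heavy-atom skeleton: C-C-O
fp = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The resulting vector is "just a number vector," so similarity searching (e.g. Tanimoto overlap of set bits) and low-parameter models can be built directly on top of it.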
I'm going to pop in here and make a quick comment. @agitter - it might be helpful to reduce the scope of this PR a bit. What do you think about targeting individual subsections with each PR? I've found with
@cgreene Yes, that's a very good idea. I'll work on that, but probably not in the next day or two. I do see some value in keeping this pull request broad as a place to host our ongoing discussion. We're still trying to define which distinct sub-sections are needed and how they fit together. @rbharath @swamidass Lots of great comments, which I'll respond to ASAP. This is going to be a very strong section thanks to your contributions. A quick remark: I think some of the references to "learned fingerprints" come from #52 (Convolutional Networks on Graphs for Learning Molecular Fingerprints). If that is an abuse of terminology, we can clarify or push back in the review.
So, please let me know when a reasonable draft of the chemistry-related sections (HTS and metabolism) is done. I'll edit and add a lot from there. I do not want to be the one writing the first draft, because I am pretty "close" to this area and do not want to bias things too much towards my work.
@swamidass I still plan to write a first draft as soon as I can. But I've delayed the writing so that I can go back and read more first, e.g. about IRV.
I wrote a new outline based on feedback from @rbharath and @swamidass. I'll wait a couple of days for further comments and am happy to make revisions. If this looks okay, I'll work on the first complete draft.
@agitter The outline looks good to me!
Compare bd3cb76 to 9178a88
I changed the scope of this pull request to include the outline only. Now that the outline is finished, I'll create a new pull request when I have a full draft to merge. I'll look at the new issues #250 and #251 before doing that.
@cgreene The outline content has already been discussed above. My review request is primarily to have one more person confirm this will be a clean merge before squashing and merging. |
@agitter looks clean to me. And the build succeeded (https://travis-ci.org/greenelab/deep-review/builds/205601924). |
@dhimmel I was grateful for the integration test as I was cleaning up the references. A lot had changed. Now that you've reviewed this, I'll merge. |
This build is based on 855a2cb. This commit was created by the following Travis CI build and job: https://travis-ci.org/greenelab/deep-review/builds/205610680 https://travis-ci.org/greenelab/deep-review/jobs/205610681 [ci skip] The full commit message that triggered this build is copied below: Drug discovery and high-throughput screening sub-section outline (#174) Outline high throughput screening sub-section in the Treat section and add tags for related references
I'm working on the drug discovery and high-throughput chemical screening sub-section for the Treat topic. This is not ready to merge, but I created the pull request so that others can follow along or contribute. @kumardeep27 are you still interested in writing this section? You were also interested in other types of molecular binding (e.g. RNA), and we definitely need help with that.
This sub-section will cover the following papers, which I am adding to tags.tsv. This list is also a work in progress that I will edit. Definitely:
Skip
Later pull request