Large language models (LLMs) have showcased remarkable capabilities across a vast array of domains. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4. Our investigation spans a diverse range of scientific areas encompassing:
- Drug discovery
- Biology
- Computational chemistry (density functional theory (DFT) and molecular dynamics (MD))
- Materials design
- Partial differential equations (PDEs)
Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.
We have conducted some preliminary experiments to explore the application prospects of LLMs in scientific discovery (case analyses can be found in our report). However, we recognize that this is still far from enough, and the potential of LLMs in this field remains largely untapped.
Therefore, we hope to advance the development of LLMs in scientific discovery through the joint efforts of our community. We have created this GitHub repository to collect interesting findings and feedback on LLMs' current capabilities from community members during their own explorations.
We invite all researchers interested in advancing the capabilities of large language models for scientific discovery applications to take part. Whether or not you have relevant expertise, you are welcome to join us if you think LLMs may have value in scientific discovery and hope to see this area improve. Every finding and suggestion you share is a useful reference for LLM applications in this field. Let us work together to explore new frontiers of LLM applications in scientific discovery!
Some goals of this ongoing project include:
- Reporting novel findings or use cases uncovered through rigorous testing and experimentation
- Providing feedback to help prioritize model enhancements that address key limitations
- Proposing new evaluation benchmarks and experimental protocols
- Discussing interdisciplinary research opportunities and challenges
- Connecting domain experts with NLP researchers to foster collaboration
To contribute a discovery, please see the Discussions channel for a template and details on how to structure your contribution; a specific case there serves as an example.
We look forward to your participation and sharing!
We summarize what we observed in our report to give you an initial understanding of our evaluation. 🌟 marks a strength; 🔵 marks an ability that needs improvement.
Drug discovery:
- 🌟 Broad Knowledge: GPT-4 demonstrates a wide-ranging understanding of key concepts in drug discovery, including individual drugs, target proteins, general principles for small-molecule drugs, and the challenges faced in various stages of the drug discovery process.
- 🌟 Versatility in Key Tasks: LLMs such as GPT-4 can help with several essential tasks in drug discovery, including molecule manipulation, drug-target binding prediction, molecule property prediction, and retrosynthesis prediction.
- 🌟 Novel Molecule Generation: GPT-4 can generate novel molecules from text instructions. This de novo molecule generation capability can be a valuable tool for identifying new drug candidates with the potential to address unmet medical needs.
- 🌟 Coding Capability: GPT-4 can help with coding for drug discovery, with large benefits for tasks such as data downloading and processing. Its strong coding capability can greatly reduce human effort in the future.
- 🔵 SMILES Sequence Processing Challenges: GPT-4 may struggle with directly processing SMILES sequences. To improve the model's understanding and output, provide the names of drug molecules along with their descriptions where possible.
- 🔵 Limitations in Quantitative Tasks: GPT-4 may face limitations in quantitative tasks, such as predicting numerical values for molecular properties and drug-target binding. Researchers are advised to treat GPT-4's output as a reference and verify it with dedicated AI models or scientific computational tools to ensure reliable conclusions.
- 🔵 Double-Check Generated Molecules: When generating novel molecules with GPT-4, it is essential to verify the validity and chemical properties of the generated structures.
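The advice to double-check generated molecules can be partly automated. The sketch below is a cheap, purely syntactic pre-filter for SMILES strings, written with the standard library only. The function name `smiles_looks_well_formed` is hypothetical, and the check covers only balanced branches, bracket atoms, and paired ring-closure labels. It says nothing about valence or chemical plausibility, so a strings that passes should still go through a dedicated cheminformatics toolkit.

```python
DIGITS = "0123456789"


def smiles_looks_well_formed(smiles: str) -> bool:
    """Rough syntactic pre-check of a SMILES string (hypothetical helper).

    Checks balanced parentheses, closed bracket atoms, and paired
    ring-closure labels. It does NOT check valence or chemistry.
    """
    if not smiles:
        return False
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # True while inside a [...] bracket atom
    open_rings = set()   # ring labels seen an odd number of times so far
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside brackets are isotopes, H counts, or charges.
            if ch == "[":
                return False
            if ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {label}
            i += 2
        elif ch in DIGITS:
            open_rings ^= {ch}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


# Keep only candidates that survive the pre-check before deeper validation.
candidates = ["CC(=O)OC1=CC=CC=C1C(=O)O", "C1CCCCC", "CC(C"]
plausible = [s for s in candidates if smiles_looks_well_formed(s)]
```

A filter like this only removes obviously broken output early; validity and property checks still need proper chemistry software.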
Biology:
- 🌟 Bioinformation Processing: GPT-4 understands how to process information from specialized file formats in biological domains, such as MEME, FASTQ, and VCF. It is also adept at performing bioinformatic analysis with given tasks and data, for example predicting the signal peptides of a provided sequence.
- 🌟 Biological Understanding: GPT-4 demonstrates a broad understanding of various biological topics, encompassing consensus sequences, protein-protein interactions (PPI), signaling pathways, and evolutionary concepts.
- 🌟 Biological Reasoning: GPT-4 can reason about plausible mechanisms from biological observations using its built-in biological knowledge.
- 🌟 Biological Assisting: GPT-4 shows potential as a scientific assistant in protein design tasks, and in wet lab experiments by translating experimental protocols for automation.
- 🔵 FASTA Sequence Understanding: A notable challenge for GPT-4 is the direct processing of FASTA sequences. It is preferable to supply the names of biomolecules together with their sequences when possible.
- 🔵 Inconsistent Results: GPT-4's performance on tasks related to biological entities depends on how much information is available about those entities. Analysis of under-studied entities, such as transcription factors, may yield inconsistent results.
- 🔵 Arabic Numeral Understanding: GPT-4 struggles to directly handle Arabic numerals; converting Arabic numerals to text is recommended.
- 🔵 Quantitative Calculation: While GPT-4 excels in biological language understanding and processing, it encounters limitations in quantitative tasks (Fig. 3.7 in the report). Manual verification or validation with alternative computational tools is advisable to obtain reliable conclusions.
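The recommendation to convert Arabic numerals to text can be applied as a prompt preprocessing step. The sketch below is one possible way to do it with the standard library; `int_to_words` and `spell_out_numbers` are hypothetical helper names, and the sketch assumes plain non-negative integers of at most six digits. Decimals and digits embedded in identifiers (for example gene names) are deliberately left unchanged.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 in English."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + int_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + int_to_words(rest) if rest else ""
    return int_to_words(thousands) + " thousand" + tail


# A standalone integer: not glued to letters, digits, or a decimal point.
_STANDALONE_INT = re.compile(r"(?<![\w.])\d{1,6}(?!\.\d|\w)")


def spell_out_numbers(text: str) -> str:
    """Replace standalone integers in a prompt with English words."""
    return _STANDALONE_INT.sub(lambda m: int_to_words(int(m.group())), text)


prompt = "Mutate residue 25 of p53 and report 3 candidates."
print(spell_out_numbers(prompt))
# prints: Mutate residue twenty-five of p53 and report three candidates.
```

Whether spelled-out numbers actually help on a given task is worth confirming on a few examples before adopting the step in a pipeline.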
Computational chemistry:
- 🌟 Literature Review: GPT-4 possesses extensive knowledge of computational chemistry, covering topics such as density functional theory, Feynman diagrams, and fundamental concepts in electronic structure theory, molecular dynamics simulations, and molecular conformation generation.
- 🌟 Code Development: GPT-4 is able to assist with the implementation of novel algorithms or functionality in existing computational chemistry and physics software packages.
- 🌟 Method Selection: GPT-4 is able to recommend suitable computational methods and software packages for specific research problems, taking into account factors such as system size, timescales, and level of theory.
- 🌟 Simulation Setup: GPT-4 is able to aid in preparing simple molecular input structures and in choosing and suggesting simulation parameters, such as symmetry, density functional, time step, and ensemble.
- 🌟 Experimental, Computational, and Theoretical Guidance: GPT-4 is able to assist researchers by providing experimental, computational, and theoretical guidance.
- 🔵 Hallucinations: GPT-4 may occasionally generate incorrect information, and it may struggle with complex logical reasoning. Researchers need to independently verify and validate outputs and suggestions from GPT-4.
- 🔵 Raw Atomic Coordinates: GPT-4 is not adept at generating or processing raw atomic coordinates of complex molecules or materials. However, with proper prompts that include the molecular formula, name, or other supporting information, GPT-4 may still work for simple systems.
- 🔵 Precise Computation: GPT-4 is not proficient in precise calculations in our evaluated benchmarks and usually ignores physical priors such as symmetry and equivariance/invariance. Currently, the quantitative numbers returned by GPT-4 may come from a literature search or few-shot examples. It is better to combine GPT-4 with specifically designed scientific computation packages or machine learning models, such as Graphormer and DiG.
- 🔵 Hands-on Experience: GPT-4 can only provide guidance and suggestions; it cannot directly perform experiments or run simulations. Researchers need to set up and execute simulations or experiments themselves, or leverage other frameworks built on GPT-4, such as AutoGPT, HuggingGPT, and AutoGen.
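To make the invariance prior concrete: a scalar property such as the energy of an isolated molecule should not change when the whole structure is rotated or translated. The sketch below tests any energy function for this property, using a Lennard-Jones pair potential as a stand-in that is invariant by construction. All function names and the test geometry are illustrative assumptions, not part of the report's evaluation.

```python
import math
from itertools import combinations


def lennard_jones_energy(coords, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy; depends only on pairwise distances."""
    energy = 0.0
    for a, b in combinations(coords, 2):
        r = math.dist(a, b)
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy


def translate(coords, shift):
    """Shift every atom by the same vector."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]


def rotate_z(coords, angle):
    """Rotate every atom about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def is_rigid_motion_invariant(energy_fn, coords, angle=0.7,
                              shift=(0.5, -2.0, 3.0), tol=1e-9):
    """Check that energy_fn gives the same value after a rigid motion."""
    moved = rotate_z(translate(coords, shift), angle)
    return abs(energy_fn(coords) - energy_fn(moved)) < tol


# Illustrative three-atom geometry in reduced units.
atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.3)]
print(is_rigid_motion_invariant(lennard_jones_energy, atoms))
```

The same check can be pointed at any predictor that maps coordinates to a scalar; a result that changes under a rigid motion indicates that the prior has been ignored.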
Materials design:
- 🌟 Information Memorization: GPT-4 excels at memorizing information and suggesting design principles for inorganic crystals and polymers. Its understanding of basic rules for materials design in textual form is remarkable. For instance, when designing solid-state electrolyte materials, it can competently propose ways to increase ionic conductivity and provide accurate examples.
- 🌟 Composition Creation: GPT-4 is proficient in generating feasible chemical compositions for new inorganic materials.
- 🌟 Synthesis Planning: GPT-4 performs satisfactorily in synthesis planning for inorganic materials.
- 🌟 Coding Assistance: GPT-4 provides generally helpful coding assistance for materials tasks. It can generate molecular dynamics and DFT inputs for numerous property calculations, correctly use many computational packages, and construct automatic processing pipelines. Iterative feedback and manual adjustments may be needed to fine-tune the generated code.
- 🔵 Representation: GPT-4 has difficulty representing and proposing organic polymers and metal-organic frameworks (MOFs).
- 🔵 Structure Generation: GPT-4 has limited capability for structure generation, particularly for generating accurate atomic coordinates.
- 🔵 Predictions: GPT-4 falls short of providing precise quantitative property predictions. For instance, when predicting whether a material is metallic or semiconducting, its accuracy is only slightly better than a random guess.
- 🔵 Synthesis Route: Without additional guidance, GPT-4 struggles to propose synthesis routes for organic polymeric materials not present in the training set.
Partial differential equations:
- 🌟 PDE Concepts: GPT-4 demonstrates awareness of fundamental PDE concepts, helping researchers gain a deeper understanding of the PDEs they are working with. It can serve as a helpful resource for teaching or mentoring students, enabling them to better understand and appreciate the importance of PDEs in their academic pursuits and research endeavors.
- 🌟 Concept Relationships: The model is capable of discerning relationships between concepts, which may aid mathematicians in broadening their perspectives and intuitively grasping connections across different subfields.
- 🌟 Solution Recommendations: GPT-4 can recommend appropriate analytical and numerical methods for addressing various types and complexities of PDEs. Depending on the specific problem, it can suggest suitable techniques for obtaining either exact or approximate solutions.
- 🌟 Code Generation: The model is capable of generating code in different programming languages, such as MATLAB and Python, for the numerical solution of PDEs, thus facilitating the implementation of computational solutions.
- 🔵 Output Verification: While GPT-4 exhibits human-like capabilities in solving partial differential equations and providing explicit solutions, its derivations are sometimes incorrect. Researchers should exercise caution and verify the model's output when using GPT-4 to solve PDEs.
- 🔵 Hallucination Awareness: GPT-4 may occasionally cite non-existent references. Researchers should cross-check citations and be aware of this limitation to ensure the accuracy and reliability of the information provided by the model.
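As an illustration of the kind of code involved and of how its output can be verified, the sketch below solves the 1D heat equation u_t = alpha * u_xx on [0, 1] with an explicit forward-time, centered-space scheme and compares the result with the known exact solution exp(-alpha * pi^2 * t) * sin(pi * x). The problem setup, function names, and parameters are assumptions chosen for this example, not taken from the report.

```python
import math


def solve_heat_ftcs(alpha, nx, dt, steps):
    """Explicit FTCS scheme for u_t = alpha * u_xx on [0, 1].

    Boundary conditions: u(0, t) = u(1, t) = 0.
    Initial condition:   u(x, 0) = sin(pi * x).
    """
    dx = 1.0 / (nx - 1)
    r = alpha * dt / dx ** 2
    if r > 0.5:
        # The explicit scheme is only stable for r <= 1/2.
        raise ValueError(f"unstable step: r = {r:.3f} exceeds 0.5")
    xs = [i * dx for i in range(nx)]
    u = [math.sin(math.pi * x) for x in xs]
    u[0] = u[-1] = 0.0
    for _ in range(steps):
        interior = [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
                    for i in range(1, nx - 1)]
        u = [0.0] + interior + [0.0]
    return xs, u


def max_error_vs_exact(alpha, xs, u, t):
    """Largest absolute deviation from the exact solution at time t."""
    decay = math.exp(-alpha * math.pi ** 2 * t)
    return max(abs(ui - decay * math.sin(math.pi * x))
               for x, ui in zip(xs, u))


# 100 steps of dt = 0.001 reach t = 0.1 with r = 0.4 (stable).
xs, u = solve_heat_ftcs(alpha=1.0, nx=21, dt=0.001, steps=100)
print(max_error_vs_exact(1.0, xs, u, t=0.1))
```

Comparing against a case with a known closed-form solution, as done here, is one simple way to catch an incorrect derivation or implementation before applying generated code to a problem without a reference answer.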
- What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
  Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
  NeurIPS Datasets and Benchmarks Track, December 2023.
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
  Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
  arXiv, July 2023.
- SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
  Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, Kai Yu
  arXiv, August 2023.
- Do Large Language Models Understand Chemistry? A Conversation with ChatGPT
  Cayque Monteiro Castro Nascimento, André Silva Pimentel
  J. Chem. Inf. Model., March 2023.
- Language models in molecular discovery
  Nikita Janakarajan, Tim Erdmann, Sarath Swaminathan, T. Laino, Jannis Born
  arXiv, September 2023.
- ChemCrow: Augmenting large-language models with chemistry tools
  Andrés M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, P. Schwaller
  arXiv, April 2023.
- ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis
  Zhiling Zheng, Oufan Zhang, Christian Borgs, Jennifer T. Chayes, Omar M. Yaghi
  J. Am. Chem. Soc., August 2023.
- A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
  Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
  arXiv, October 2023.
- BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
  Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques
  EMNLP, October 2023.
- 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon
  K. Jablonka, Qianxiang Ai, Alexander H Al-Feghali, S. Badhwar, Joshua D. Bocarsly, Andres M Bran, S. Bringuier, L. Brinson, K. Choudhary, Defne Çirci, Sam Cox, W. D. Jong, Matthew L. Evans, Nicolas Gastellu, Jérôme Genzling, M. Gil, Ankur Gupta, Zhi Hong, A. Imran, S. Kruschwitz, A. Labarre, Jakub Lála, Tao Liu, Steven Ma, Sauradeep Majumdar, G. Merz, N. Moitessier, E. Moubarak, B. Mouriño, Brenden G. Pelkie, M. Pieler, Mayk C. Ramos, Bojana Ranković, Samuel G. Rodriques, J. N. Sanders, P. Schwaller, Marcus Schwarting, Jia-Xin Shi, B. Smit, Benn Smith, J. V. Heck, C. Volker, Logan T. Ward, S. Warren, B. Weiser, Sylvester Zhang, Xiaoqi Zhang, Ghezal Ahmad Jan Zia, A. Scourtas, K. Schmidt, Ian T. Foster, Andrew D. White, B. Blaiszik
  Digital Discovery, June 2023.
This repo is MIT-licensed.
If you have any questions or suggestions, please reach out through the Discussions channel or by email at llm4sciencediscovery@microsoft.com.
Contact people: Tao Qin (taoqin@microsoft.com), Lijun Wu (lijuwu@microsoft.com).
@misc{ai4science2023impact,
title={The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4},
author={Microsoft Research AI4Science and Microsoft Azure Quantum},
year={2023},
eprint={2311.07361},
archivePrefix={arXiv},
primaryClass={cs.CL}
}