GlyBot

A Domain-Specific LLM Assistant for Glycans and Glycobiology

Overview

This project aims to develop an AI assistant that can query a bioinformatics knowledgebase and help users navigate and use the available informatics tools. An AI assistant gives users with varying levels of experience and domain knowledge a flexible way to become familiar with a new domain-specific informatics tool.

The assistant is currently a prototype. Current efforts focus on adding API tools that retrieve information from GlyGen, UniProt, and other resources relevant to glycoproteomics, and on integrating that information coherently to assist researchers.
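As an illustration of what one such retrieval tool might look like, here is a minimal sketch that looks up a single protein through the public UniProt REST endpoint. It is not GlyBot's actual implementation: the function names (uniprot_entry_url, summarize_entry, fetch_protein_summary) are hypothetical, and the JSON field names are assumptions about the UniProtKB response layout that should be checked against the UniProt documentation.

```python
import json
import urllib.request

# Public UniProt REST endpoint for a single UniProtKB entry.
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def uniprot_entry_url(accession):
    """Build the entry URL for an accession such as 'P01308'."""
    return UNIPROT_ENTRY_URL.format(accession=accession.strip().upper())


def summarize_entry(entry):
    """Reduce a parsed UniProtKB entry to a few fields an assistant could quote.

    The keys read here are assumed from the UniProtKB JSON layout; any key
    that is absent yields None instead of raising.
    """
    description = entry.get("proteinDescription", {})
    name = description.get("recommendedName", {}).get("fullName", {}).get("value")
    return {
        "accession": entry.get("primaryAccession"),
        "protein_name": name,
        "organism": entry.get("organism", {}).get("scientificName"),
    }


def fetch_protein_summary(accession, timeout=10):
    """Fetch one entry over HTTP and summarize it (requires network access)."""
    url = uniprot_entry_url(accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_entry(json.load(response))
```

Calling fetch_protein_summary("P01308") would return a small dictionary that the assistant could pass to the LLM as context. Keeping the parsing step separate from the HTTP call makes the tool easy to test without network access.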

If you would like to contribute to this project, reach out to me by email at ek990@georgetown.edu.

Run with Docker

To experiment with the prototype assistant yourself, use Docker Compose. This repository is configured to run the application and the vector database as a pair of networked containers. Install Docker Compose for your machine, then follow the steps below.

Steps:

1. Clone the repository.

2. Add your API keys.

   You will need to supply API keys for OpenAI and Google, which you can get from the OpenAI Developer Portal and Google Custom Search.

   Put these into a folder named SENSITIVE as text files:

   GlyBot/SENSITIVE/openai_api_key.txt
   GlyBot/SENSITIVE/google_api_key.txt

3. Run the config file with docker compose:

   docker compose up

   This step builds an image for the application with your API keys as environment variables, installs the Qdrant database, embeds the data, and leaves the application running in your terminal. It may take several minutes the first time you start the application.

4. Access the application by going to http://localhost:8501/

5. When you are done, use docker compose down to terminate the containers.
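
Before running docker compose up, it can help to confirm that both key files exist and are not empty, since a missing key would otherwise only surface after the image build. The following is a minimal sketch, assuming the SENSITIVE folder sits at the repository root as described in the steps above. The helper name missing_key_files is hypothetical and is not part of GlyBot.

```python
from pathlib import Path

# File names taken from the steps above; the helper itself is illustrative.
KEY_FILES = ("openai_api_key.txt", "google_api_key.txt")


def missing_key_files(repo_root):
    """Return the key files that are absent or empty under <repo_root>/SENSITIVE."""
    sensitive = Path(repo_root) / "SENSITIVE"
    missing = []
    for name in KEY_FILES:
        path = sensitive / name
        # Treat a whitespace-only file the same as a missing one.
        if not path.is_file() or not path.read_text().strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    problems = missing_key_files(".")
    if problems:
        print("Missing or empty key files:", ", ".join(problems))
    else:
        print("Both key files found; ready for docker compose up.")
```

Run it from the repository root. It only checks that the files are present and non-empty; it does not validate the keys against OpenAI or Google.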

Initial Feasibility Analysis

Early work evaluating the efficacy of retrieval-augmented generation (RAG) and its effect on LLM output in this domain lives in its own branch of this repository, RAG_feasability_eval.

