
🦉 Athena - Research Companion

Athena is an AI-assistant prototype powered by Cohere and its Embed v3 model to facilitate scientific research. Its key differentiating features include:

  • Advanced Semantic Search: Outperforms traditional keyword search with state-of-the-art embeddings, retrieving results that capture the intent and nuance of scientific queries.
  • Human-AI Collaboration: Eases review of research literature by highlighting key topics and augmenting human understanding.
  • Admin Support: Assists with tasks such as categorizing research articles, drafting e-mails, and generating tweets.
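
For illustration, a minimal sketch of an embedding-based query, assuming the Cohere and Weaviate v3 Python clients and the ArxivArticle class described below (the endpoint and query text are placeholders, not the app's exact code):

```python
import cohere
import weaviate

co = cohere.Client("YOUR_COHERE_API_KEY")
client = weaviate.Client("https://your-weaviate-instance")  # hypothetical endpoint

# Embed the query with Embed v3; 'search_query' pairs with documents
# that were embedded with input_type='search_document'.
query_vector = co.embed(
    texts=["transformer architectures for long-context retrieval"],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]

# Nearest-neighbor search over the indexed articles.
results = (
    client.query.get("ArxivArticle", ["title", "abstract"])
    .with_near_vector({"vector": query_vector})
    .with_limit(5)
    .do()
)
```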

📚 Overview

Data Pipeline

As part of this project we have created two datasets of 50,000 arXiv articles related to AI and NLP using Cohere Embed v3.

Steps:

  1. Retrieve Articles' Metadata from arXiv. See ./data_pipeline/retrieve_arxiv.py
  2. Embed Articles' Title and Abstract using Embedv3. See ./data_pipeline/embed_arxiv.py
  3. Store Articles' Metadata and Embeddings in Weaviate. See ./data_pipeline/index_arxiv.py
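
A condensed sketch of steps 2 and 3, assuming the Cohere and Weaviate v3 Python clients (the actual scripts in ./data_pipeline/ may differ):

```python
import cohere
import weaviate

co = cohere.Client("YOUR_COHERE_API_KEY")
client = weaviate.Client("https://your-weaviate-instance")  # hypothetical endpoint

articles = [
    {"title": "Attention Is All You Need",
     "abstract": "The dominant sequence transduction models are based on..."},
]

# Step 2: embed title and abstract with Embed v3.
embeddings = co.embed(
    texts=[f"{a['title']} {a['abstract']}" for a in articles],
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings

# Step 3: batch-import metadata and vectors into Weaviate.
client.batch.configure(batch_size=100)
with client.batch as batch:
    for article, vector in zip(articles, embeddings):
        batch.add_data_object(
            data_object=article,
            class_name="ArxivArticle",
            vector=vector,
        )
```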

Prompt Templates, Output Formatting, and Validation

Some of our tasks, such as enriching abstracts with Wikipedia links, crafting a glossary, composing e-mails, and generating tweets, rely on a set of prompt templates, output formatting instructions, and validation steps.

These prompts are then composed into a LangChain chain, as in the following code snippet:
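
A minimal sketch of such a chain with LangChain and a Pydantic output parser, using the tweet-generation task as an example (template wording, model settings, and the Tweet schema are illustrative; the repository's actual prompts may differ):

```python
from pydantic import BaseModel, Field
from langchain.chains import LLMChain
from langchain.llms import Cohere
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

class Tweet(BaseModel):
    """Validated output schema for the tweet-generation task."""
    text: str = Field(description="tweet announcing the article, max 280 characters")

parser = PydanticOutputParser(pydantic_object=Tweet)

prompt = PromptTemplate(
    template=(
        "Compose a tweet announcing this research article.\n"
        "{format_instructions}\n"
        "Title: {title}\nAbstract: {abstract}\n"
    ),
    input_variables=["title", "abstract"],
    # The parser injects JSON-schema formatting instructions into the prompt.
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Assumes COHERE_API_KEY is set in the environment.
chain = LLMChain(llm=Cohere(), prompt=prompt)
tweet = parser.parse(chain.run(title="...", abstract="..."))
```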

Weaviate Schema

See the ArxivArticle class.
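
An illustrative approximation of that class definition with the Weaviate v3 Python client (the property list is an assumption; the authoritative schema lives in the repository). Since embeddings are computed client-side with Embed v3, no server-side vectorizer is configured:

```python
import weaviate

client = weaviate.Client("https://your-weaviate-instance")  # hypothetical endpoint

arxiv_article = {
    "class": "ArxivArticle",
    "vectorizer": "none",  # vectors are supplied at import time
    "properties": [
        {"name": "title",    "dataType": ["text"]},
        {"name": "abstract", "dataType": ["text"]},
        {"name": "authors",  "dataType": ["text[]"]},
        {"name": "url",      "dataType": ["text"]},
    ],
}

client.schema.create_class(arxiv_article)
```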

Cohere Engine

The coral.py module provides an abstraction layer over the Cohere endpoints.
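
A sketch of what such an abstraction layer can look like (class and method names are assumptions; see coral.py for the actual implementation):

```python
import cohere

class CohereEngine:
    """Thin wrapper around the Cohere endpoints used by the app."""

    def __init__(self, api_key: str):
        self.client = cohere.Client(api_key)

    def embed(self, texts: list[str], input_type: str = "search_document") -> list[list[float]]:
        """Embed texts with Embed v3 for indexing or querying."""
        response = self.client.embed(
            texts=texts, model="embed-english-v3.0", input_type=input_type
        )
        return response.embeddings

    def chat(self, message: str) -> str:
        """Single-turn chat completion."""
        return self.client.chat(message=message).text
```

Centralizing the endpoint calls in one module keeps model names and API details out of the rest of the codebase.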

Streamlit App

See app.py.
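
A minimal sketch of such an entry point (the real app wires in the search and generation features sketched above; this placeholder only shows the structure):

```python
import streamlit as st

st.set_page_config(page_title="Athena - Research Companion", page_icon="🦉")
st.title("🦉 Athena - Research Companion")

query = st.text_input("Search arXiv articles")
if query:
    # The real app forwards the query to the Cohere/Weaviate layer; here we only echo it.
    st.write(f"Searching for: {query}")
```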

🚀 Quickstart

  1. Clone the repository:

     git clone git@github.com:dcarpintero/athena.git

  2. Create and activate a virtual environment:

     Windows:

     py -m venv .venv
     .venv\scripts\activate

     macOS/Linux:

     python3 -m venv .venv
     source .venv/bin/activate

  3. Install dependencies:

     pip install -r requirements.txt

  4. Run the data pipeline (optional):

     python retrieve_arxiv.py
     python embed_arxiv.py
     python index_arxiv.py

  5. Launch the web application:

     streamlit run ./app.py

🔗 References