Scrape and chat with repositories

This project lets you plug in a GitHub repository URL (like https://github.com/soos3d/chatgpt-plugin-development-quickstart-express) and then use OpenAI's ChatGPT models to gain a better understanding of the repository's codebase.

To handle the 'AI' part, the program utilizes the Langchain framework and Deep Lake following this logic:

The repository is scraped, and each file's content is saved in a txt format.
Langchain is used to load this data, break it down into chunks, and create embedding vectors using the OpenAI embedding model.
Langchain then helps to build a vector database using Deep Lake.
Lastly, Langchain spins up a chatbot with the help of a ChatGPT model.

With this setup, you can 'chat' with any repository.
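
To make the retrieval step concrete, here is a minimal, dependency-free sketch of the idea behind the vector database: each chunk becomes a vector, and the chunks closest to the question are passed to the chat model as context. This is not the project's code. A toy word-count `embed` function stands in for the OpenAI embedding model, a plain list stands in for Deep Lake, and the names `embed`, `cosine`, and `top_chunks` are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "const app = express(); // create the Express application",
    "const PORT = process.env.PORT || 3000;",
    "MIT License. Permission is hereby granted, free of charge",
]
best = top_chunks("Which port does the server use?", chunks, k=1)
print(best)
```

In the real pipeline the vectors come from the embedding model and the similarity search is handled by the vector database, so questions match chunks by meaning and not only by shared words.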

Project structure

This project has only three code files! Langchain really simplifies the logic.

chat-with-repo-langchain
  │
  ├── main.py
  ├── chat.py
  ├── src
  │   └── scraper.py
  └── .env

main.py: This is the entry point and the part responsible for ingesting a repository URL, generating embedding vectors, and indexing it in a vector database.
chat.py: This module starts the chat functionality, accepts user queries, and retrieves context from the vector database; it also keeps the chat history for as long as it is running.
src/scraper.py: This file holds the scraping logic, and this module is called during the execution of main.py.
.env: This is where environment variables are stored; it also holds the configuration of the vector database.
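
As an illustration of what "stores the chat history for the time it's running" can mean, the sketch below keeps question/answer pairs in a list that lives only as long as the process. It is an assumption about the general pattern, not a copy of chat.py: `ChatSession` and `fake_answer` are hypothetical names, and `fake_answer` replaces the real retrieval and chat model call.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


class ChatSession:
    """Keeps (question, answer) pairs in memory for the lifetime of the process."""

    def __init__(self, answer_fn: Callable[[str, History], str]) -> None:
        # answer_fn is a hypothetical stand-in for the retrieval + chat model call.
        self.answer_fn = answer_fn
        self.history: History = []

    def ask(self, question: str) -> str:
        # Pass a copy of the earlier turns so the model can use them as context.
        answer = self.answer_fn(question, list(self.history))
        self.history.append((question, answer))
        return answer


def fake_answer(question: str, history: History) -> str:
    """Placeholder model: reports how many earlier turns it was given."""
    return "({} earlier turns) You asked: {}".format(len(history), question)


session = ChatSession(fake_answer)
print(session.ask("What does main.py do?"))
print(session.ask("And chat.py?"))
```

Because the history is an in-memory list, it disappears when the program exits, which matches the behavior described for chat.py.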

Requirements

Before getting started, ensure you have the following:

Python - Version 3.7 or newer is required.
An active account on OpenAI, along with an OpenAI API key.
A Deep Lake account, complete with a Deep Lake API key.

Getting Started

ℹ️ It's strongly advised to create a new Python virtual environment to run this program. It helps maintain a tidy workspace by keeping dependencies in one place.

Create a Python virtual environment with:

python3 -m venv repo-ai

Then activate it with:

source repo-ai/bin/activate

Clone the repository:

git clone https://github.com/soos3d/chat-with-repo-langchain-openai.git

Then:

cd chat-with-repo-langchain-openai

Install the Python dependencies:

pip install -r requirements.txt

This will install all of the required Langchain, OpenAI, and Deep Lake dependencies.

Edit the .env.sample file with your information, specifically the API keys:

# Scraper config
FILES_TO_IGNORE='"package-lock.json", "LICENSE", ".gitattributes", ".gitignore", "yarn.lock"'
SAVE_PATH="./repos_content"     # Save the scraped data in a directory called repos_content in the root
MAX_ATTEMPTS=3

# Repository to scrape if the hardcoded section is active.
REPO_URL="https://github.com/soos3d/chatgpt-plugin-development-quickstart-express"

# OpenAI 
OPENAI_API_KEY="YOUR_KEY"
EMBEDDINGS_MODEL="text-embedding-ada-002"
LANGUAGE_MODEL="gpt-3.5-turbo" # gpt-4

# Deeplake vector DB
ACTIVELOOP_TOKEN="YOUR_KEY"
DATASET_PATH="./local_vector_db" # "hub://USER_ID/custom_dataset"  # Edit with your user id if you want to use the cloud db.

Here is where you select which ChatGPT model to use (gpt-3.5-turbo is the default) and, if you don't want to store the vector dataset locally, the path to the cloud dataset. The dataset is stored locally by default.
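
The following sketch shows one way such settings could be read with the defaults listed above. How the project itself loads the file is not shown here, so treat this as an assumption: `load_settings` is a hypothetical helper, it reads from an environment-style mapping, and it treats any `DATASET_PATH` starting with `hub://` as a cloud dataset.

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Read the model and dataset settings, falling back to the documented defaults."""
    dataset_path = env.get("DATASET_PATH", "./local_vector_db")
    return {
        "language_model": env.get("LANGUAGE_MODEL", "gpt-3.5-turbo"),
        "embeddings_model": env.get("EMBEDDINGS_MODEL", "text-embedding-ada-002"),
        "dataset_path": dataset_path,
        # Assumption: a hub:// prefix means a cloud dataset, anything else is local.
        "is_cloud": dataset_path.startswith("hub://"),
    }


local = load_settings({})
cloud = load_settings({
    "DATASET_PATH": "hub://USER_ID/custom_dataset",
    "LANGUAGE_MODEL": "gpt-4",
})
print(local)
print(cloud)
```

Passing the mapping in as an argument keeps the helper easy to test without touching the real environment.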

Run the main.py file:

python3 main.py

Input a repository URL:

Input the repository you want to index: https://github.com/soos3d/chatgpt-plugin-development-quickstart-express

This will scrape the repository, load the files, split them into chunks, generate embedding vectors, create a local vector database, and store the embeddings.

You will see the following response:

Scraping the repository...

====================================================================================================
Repository contents written to ./repos_content/soos3d_chatgpt-plugin-development-quickstart-express.
====================================================================================================
List of file paths written to ./repos_content/soos3d_chatgpt-plugin-development-quickstart-express/soos3d_chatgpt-plugin-development-quickstart-express_file_paths.txt.

Time needed to pull the data: 10.23s.
====================================================================================================
Loading docs...
 88%|███████████████████████████████████████████████████████████████████████████████████████████▉             | 7/8 [00:01<00:00,  5.79it/s]
Loaded 7 documents.
====================================================================================================
Splitting documents...
Generated 25 chunks.
====================================================================================================
Creating vector DB...
./local_vector_db loaded successfully.
Evaluating ingest: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00
Dataset(path='./local_vector_db', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (25, 1536)  float32   None   
    ids      text     (25, 1)      str     None   
 metadata    json     (25, 1)      str     None   
   text      text     (25, 1)      str     None   
Vector database updated.
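
The "Splitting documents..." step in the log above turns the loaded files into smaller chunks before they are embedded. The project relies on Langchain for this; the sketch below is a simplified, dependency-free stand-in that only shows the idea of fixed-size chunks with some overlap. The function name `split_text` and the sizes used are assumptions for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_text(sample, chunk_size=12, overlap=4)
print(chunks)
```

The overlap means the end of one chunk is repeated at the start of the next, so a function or sentence cut at a chunk boundary still appears whole in at least one chunk.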

Chat with the repository:

python3 chat.py

This starts the chat model so you can use its full power. I recommend using the GPT-4 model if possible. The following is an example response based on my ChatGPT plugins boilerplate repository using the gpt-4 model:

./local_vector_db loaded successfully.

Deep Lake Dataset in ./local_vector_db already exists, loading from the storage
Dataset(path='./local_vector_db', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (25, 1536)  float32   None   
    ids      text     (25, 1)      str     None   
 metadata    json     (25, 1)      str     None   
   text      text     (25, 1)      str     None   

Please enter your question (or 'quit' to stop): Can you explain how the index.js file in the ChatpGTP plugin quickstart repo works?

Question: Can you explain how the index.js file in the ChatpGTP plugin quickstart repo works?
Answer: Certainly! The `index.js` file in the ChatGPT plugin quickstart repository serves as the entry point for the Express.js server application. Here's a breakdown of its functionality:

1. Import required modules: The necessary modules are imported, including `express`, `path`, `cors`, `fs`, and `body-parser`. Additionally, the custom module `getAirportData` is imported from `./src/app`.

```javascript
const express = require('express');
const path = require('path');
const cors = require('cors');
const fs = require('fs');
const bodyParser = require('body-parser');
require('dotenv').config();

const { getAirportData } = require('./src/app');
```

2. Initialize Express application: The Express application is initialized and stored in the `app` variable.

```javascript
const app = express();
```

3. Set the port number: The port number is set based on the environment variable `PORT` or defaults to 3000 if `PORT` is not set.

```javascript
const PORT = process.env.PORT || 3000;
```

4. Configure Express to parse JSON: The application is configured to parse JSON in the body of incoming requests using `bodyParser.json()`.

```javascript
app.use(bodyParser.json());
```

5. Configure CORS options: CORS (Cross-Origin Resource Sharing) is configured to allow requests from `https://chat.openai.com` and to send a 200 status code for successful preflight requests for compatibility with some older browsers.

```javascript
const corsOptions = {
  origin: 'https://chat.openai.com',
  optionsSuccessStatus: 200
};

app.use(cors(corsOptions));
```

The rest of the `index.js` file sets up the server to listen on the specified port and handles the routes for the plugin. The server starts listening for incoming requests on the specified port, and the plugin is ready to be used with ChatGPT.

Tokens Used: 2214 Prompt Tokens: 1817 Completion Tokens: 397 Successful Requests: 1 Total Cost (USD): $0.07833

> ℹ️ Note that it also prints how many tokens were used and an estimated cost for the OpenAI API.
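
As a rough check on that estimate, the arithmetic can be reproduced by hand. The sketch below assumes a rate of $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens; these two rates are assumptions chosen because they reproduce the total in the example, and actual pricing depends on the model and changes over time. `estimate_cost` is a hypothetical helper, not part of the project.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Estimate the request cost in USD from token counts and per-1K-token rates."""
    prompt_cost = prompt_tokens / 1000 * prompt_rate_per_1k
    completion_cost = completion_tokens / 1000 * completion_rate_per_1k
    return prompt_cost + completion_cost


# Token counts from the example above; the rates are assumed, not taken from the project.
cost = estimate_cost(1817, 397, prompt_rate_per_1k=0.03, completion_rate_per_1k=0.06)
print("Total tokens:", 1817 + 397)
print("Estimated cost (USD): {:.5f}".format(cost))
```

Prompt tokens dominate here because the retrieved chunks are sent along with every question, so longer context means a higher cost per request.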

## Use a cloud vector database

By default, this project creates a local vector database using [Deep Lake](https://app.activeloop.ai/?utm_source=referral&utm_medium=platform&utm_campaign=signup_promo_settings&utm_id=plg), but you can also use a cloud-based DB.

> Note that the local database will be faster.

In `main.py`, uncomment the following section:

```py
    # Enable the following to create a cloud vector DB using Deep Lake
    """
    deeplake_path = os.getenv('DATASET_PATH')
    ds = deeplake.empty(deeplake_path)
    db = DeepLake(dataset_path=deeplake_path, embedding_function=embeddings, overwrite=True, public=True)
    """
```

Remember to edit the `DATASET_PATH` environment variable and add the USER_ID you have in your Deep Lake account:

```
DATASET_PATH="hub://USER_ID/custom_dataset"  # Edit with your user id if you want to use the cloud db.
```

Also remove or comment out the code that creates the local DB:

```py
    # Set the deeplake_path to the repository name
    deeplake_path = os.getenv('DATASET_PATH')
    db = DeepLake(dataset_path=deeplake_path, embedding_function=embeddings, overwrite=True)
```

Configuration

The entire app is configured from the .env file, so you don't have to change the code if you don't want to.

FILES_TO_IGNORE is a list of files that will not be scraped. This is to reduce clutter and save some resources.

# Scraper config
FILES_TO_IGNORE='"package-lock.json", "LICENSE", ".gitattributes", ".gitignore", "yarn.lock"'
SAVE_PATH="./repos_content"     # Save the scraped data in a directory called repos_content in the root
MAX_ATTEMPTS=3

# Repository to scrape if the hardcoded section is active.
REPO_URL="https://github.com/soos3d/chatgpt-plugin-development-quickstart-express"

# OpenAI 
OPENAI_API_KEY="YOUR_KEY"
EMBEDDINGS_MODEL="text-embedding-ada-002"
LANGUAGE_MODEL="gpt-3.5-turbo" # gpt-4

# Deeplake vector DB
ACTIVELOOP_TOKEN="YOUR_KEY"
DATASET_PATH="./local_vector_db" # "hub://USER_ID/custom_dataset"  # Edit with your user id if you want to use the cloud db.
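
Note that `FILES_TO_IGNORE` is a single string of quoted, comma-separated names, so it has to be parsed before it can be used as a list. The sketch below shows one possible way to do that and to filter a list of paths. It assumes files are matched by file name only; whether the scraper matches on names or full paths is not stated here, and `parse_ignore_list` and `keep_files` are hypothetical helpers.

```python
import json
import os
from typing import List


def parse_ignore_list(raw: str) -> List[str]:
    """Turn '"a", "b"' (the FILES_TO_IGNORE format) into ['a', 'b']."""
    # Wrapping the value in brackets makes it a valid JSON array of strings.
    return json.loads("[" + raw + "]") if raw.strip() else []


def keep_files(paths: List[str], ignored: List[str]) -> List[str]:
    """Drop every path whose file name is in the ignore list."""
    ignored_names = set(ignored)
    return [p for p in paths if os.path.basename(p) not in ignored_names]


raw = '"package-lock.json", "LICENSE", ".gitattributes", ".gitignore", "yarn.lock"'
paths = ["index.js", "src/app.js", "package-lock.json", "LICENSE", "docs/.gitignore"]
kept = keep_files(paths, parse_ignore_list(raw))
print(kept)
```

Skipping lock files and similar generated content keeps them out of the embeddings, which saves API calls and reduces noise in the retrieved context.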
