VC_Assistant is an adaptive, conversational chatbot designed to support the creation of enterprise applications around Venture Capital research: it scrapes VC information from the web, embeds and stores it in a vector database, extracts key content as JSON, and performs similarity search. It leverages a tool-using agentic architecture and retrieval-augmented generation (RAG), built on state-of-the-art open-source frameworks such as LangChain and FastAPI. This document outlines the architecture, components, and use cases of VC_Assistant, providing a comprehensive understanding of its functionality and deployment.
The assistant extracts the following information and presents it to the user as a JSON object: the VC's name, contacts, the industries they invest in, and the investment rounds they participate in or lead.
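As an illustration only, the extracted object might look like the example below; the exact field names are not fixed by this README and depend on the extraction prompt you configure:

```python
import json

# Hypothetical example of the extracted JSON object; the field names shown
# here are assumptions, not the project's defined schema.
vc_profile = {
    "vc_name": "Example Ventures",
    "contacts": ["hello@example-ventures.com"],
    "industries": ["fintech", "healthcare"],
    "investment_rounds": ["Seed", "Series A"],
}
print(json.dumps(vc_profile, indent=2))
```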
To use this project, you will need:
- Docker and Docker Compose installed
- Python 3.7+
- An OpenAI API key
To set up the project:
- Clone this repository to your local machine.
- Rename `key.env.example` to `key.env` and add your OpenAI API key.
- Review `config.yml` and choose `openai` or `local` for your `Embedding_Type`.
- In `docker-compose.yml`, update the `volumes` path for `VC_ASSISTANT_QDRANT` to a local folder where you want persistent storage for the vector database.
- Create the needed directories for persistent storage: `mkdir -p .data/qdrant/`
- Build the Docker images: `docker-compose build`
- Start the services: `docker-compose up -d`
The services will now be running.
Once the Docker containers are up and running, you can start interacting with:
- The interactive Swagger docs at http://localhost:8000/docs
- The Qdrant Web Interface at http://localhost:6333/dashboard
- Scrape Documents:
  - Go to the FastAPI server by navigating to the interactive Swagger docs at http://localhost:8000/docs.
  - Use the `scraper` endpoint to scrape content from a specified URL. You will need to provide the URL you want to scrape in the request body.
  - The `scraper` endpoint will return the scraped content, which will be processed in the next step.
- Create a Vector Index:
  - Use the `process-scrape2vector` endpoint to create a vector index from the scraped content.
  - In case the `default` collection does not exist, the `process-scrape2vector` endpoint will create one for you.
  - This endpoint will process the scraped documents, create a vector index, and load it into Qdrant.
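The Swagger UI is the authoritative reference for the exact routes and request schemas. As a rough illustration only, assuming the endpoints are exposed as `/scraper` and `/process-scrape2vector` and that the scraper accepts a JSON body with a `url` field (verify both against http://localhost:8000/docs), the same steps could be scripted with Python's `requests`:

```python
import requests

BASE_URL = "http://localhost:8000"

# Step 1: scrape a page (the endpoint path and payload shape are assumptions --
# check the Swagger docs for the real schema).
scrape_resp = requests.post(f"{BASE_URL}/scraper", json={"url": "http://example.com"})
scrape_resp.raise_for_status()
print("Scrape result:", scrape_resp.json())

# Step 2: build the vector index from the scraped content and load it into Qdrant.
index_resp = requests.post(f"{BASE_URL}/process-scrape2vector")
index_resp.raise_for_status()
print("Indexing result:", index_resp.json())
```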
The VC_Assistant architecture consists of the following key components:
- FastAPI - High performance REST API framework. Handles HTTP requests and routes them to application logic.
- Qdrant - Vector database for storing document embeddings and enabling similarity search.
- AgentHandler - Orchestrates the initialization and execution of the conversational agent.
- Scraper - A tool that scrapes a web page and converts it to markdown.
- Loader - A tool that loads content from the `scraped_data` directory into a VectorStoreIndex.
- Tools - Custom tools that extend the capabilities of the agent.
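The AgentHandler and Tools components are where the agentic behavior lives. As a loose, generic sketch only (using classic LangChain helpers such as `Tool` and `initialize_agent`, which may differ from the project's actual AgentHandler implementation), wiring a custom tool into a conversational agent typically looks like this:

```python
# Generic LangChain sketch; the project's AgentHandler may wire things differently.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI


def scrape_tool(url: str) -> str:
    """Placeholder standing in for the project's scraper tool."""
    return f"markdown content scraped from {url}"


tools = [
    Tool(
        name="web_scraper",
        func=scrape_tool,
        description="Scrape a web page and return its content as markdown.",
    ),
]

llm = ChatOpenAI(temperature=0)  # requires OPENAI_API_KEY in the environment
agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=True)
# agent.run("Scrape https://example.com and summarize the VC's focus areas.")
```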
Let's take a closer look at some of the key VC_Assistant infrastructure components:
FastAPI provides a robust web framework for handling the API routes and HTTP requests/responses.
Some key advantages:
- Built on modern Python standards like type hints and ASGI.
- Extremely fast - performance on par with NodeJS and Go.
- Automatic interactive docs using OpenAPI standards.
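For orientation only, a minimal FastAPI service (not the project's actual routes) looks like the following; the automatic interactive docs mentioned above come for free at `/docs`:

```python
# Minimal FastAPI example for illustration; VC_Assistant's real routes live in the app source.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str


@app.post("/scraper")
def scrape(request: ScrapeRequest) -> dict:
    # A real handler would invoke the scraper tool here.
    return {"message": "Scraping completed successfully", "url": request.url}

# Run with: uvicorn main:app --reload  (assuming this file is saved as main.py)
```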
Qdrant is a vector database optimized for ultra-fast similarity search across large datasets. It is used in this project to store and index document embeddings, enabling the bot to quickly find relevant documents based on a search query or conversation context.
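As a hedged illustration of what such a similarity lookup involves (not the project's actual code), the `qdrant-client` package can query a collection directly; the collection name and vector dimensionality below are assumptions:

```python
# Illustrative qdrant-client usage; collection name and vector size are assumptions.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Search the "default" collection with a query embedding produced elsewhere
# (e.g., by the OpenAI or local embedding model configured in config.yml).
query_vector = [0.0] * 1536  # placeholder embedding of the assumed dimensionality
hits = client.search(collection_name="default", query_vector=query_vector, limit=5)
for hit in hits:
    print(hit.id, hit.score, hit.payload)
```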
The `scraper` module, located in `/app/src/scraper/scraper_main.py`, serves as a robust utility for extracting content from web pages and converting it into structured markdown format. This module enables the framework to access and use information from a wide range of web sources. Below is a concise overview of its core functionality and workflow for developers who want to integrate and leverage this module.
- WebScraper Class:
  - Inherits from the base Scraper class and implements the Singleton pattern to ensure a unique instance.
  - Orchestrates the entire scraping process, from fetching to parsing and saving the content.
  - Leverages `ContentParser` to extract and convert meaningful data from HTML tags into markdown format.
- ContentParser Class:
  - Designed to parse and convert meaningful content from supported HTML tags into markdown format.
  - Supports a variety of HTML tags including paragraphs, headers, list items, links, inline code, and code blocks.
- URL Validation:
  - The provided URL undergoes validation to ensure its correctness and accessibility.
  - If the URL is invalid, the process is terminated, and an error message is logged.
- Content Fetching:
  - Content from the validated URL is fetched using HTTP requests.
  - Utilizes random user agents to mimic genuine user activity and avoid potential blocking by web servers.
  - If the content fetching fails, the process is halted, and an error message is logged.
- Content Parsing:
  - The fetched content is parsed using BeautifulSoup, and the `ContentParser` class is employed to extract meaningful data.
  - The parsed data includes the title, metadata, and the content in markdown format.
- File Saving:
  - The parsed content is saved to a file; the filename is generated using a hash of the URL.
  - The file is stored in a pre-configured data directory.
  - If the file saving fails, an error message is logged.
- Result Return:
  - Upon successful completion of the scraping process, a success message and the filepath of the saved content are returned.
  - If any step in the process fails, an appropriate error message is returned.
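For a concrete picture of this workflow, here is a compressed, illustrative sketch built from standalone `requests`, `BeautifulSoup`, and `hashlib` code. It is not the actual `WebScraper`/`ContentParser` implementation, and the `scraped_data` directory name is an assumption:

```python
# Illustrative sketch of the validate -> fetch -> parse -> save workflow.
# NOT the project's WebScraper/ContentParser code, just the same idea in miniature.
import hashlib
import random
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

USER_AGENTS = ["Mozilla/5.0 (X11; Linux x86_64)", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]
DATA_DIR = Path("scraped_data")  # assumed data directory


def scrape_to_markdown(url: str) -> dict:
    # 1. URL validation
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return {"message": "Invalid URL"}

    # 2. Content fetching with a random user agent
    response = requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)}, timeout=10)
    if response.status_code != 200:
        return {"message": f"Fetch failed with status {response.status_code}"}

    # 3. Content parsing: keep the title and paragraphs as simple markdown
    soup = BeautifulSoup(response.text, "html.parser")
    lines = [f"# {soup.title.string}"] if soup.title and soup.title.string else []
    lines += [p.get_text(strip=True) for p in soup.find_all("p")]

    # 4. File saving: filename derived from a hash of the URL
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    filepath = DATA_DIR / f"{hashlib.md5(url.encode()).hexdigest()}.md"
    filepath.write_text("\n\n".join(lines), encoding="utf-8")

    # 5. Result return
    return {"message": "Scraping completed successfully", "data": str(filepath)}
```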
Developers can initiate the scraping process by invoking the `run_web_scraper(url)` function with the desired URL. This function initializes a `WebScraper` instance and triggers the scraping process; it returns a dictionary with the outcome, including a message indicating success or failure and the location where the scraped data has been saved.
```python
# Assumes run_web_scraper can be imported from the scraper module
# (/app/src/scraper/scraper_main.py); adjust the import path to your layout.
from src.scraper.scraper_main import run_web_scraper

result = run_web_scraper("http://example.com")
if result and result.get("message") == "Scraping completed successfully":
    print(f"Scraping complete! Saved to {result['data']}")
else:
    print(result["message"])
```