VC_Assistant is an adaptive, conversational chatbot designed to support the creation of enterprise applications around Venture Capital research: it scrapes VC information from the web, embeds and stores it in a vector database, extracts key content as JSON, and performs similarity search. It leverages a tool-using agentic architecture and retrieval-augmented generation (RAG), built on state-of-the-art open-source frameworks such as LangChain and FastAPI. This document outlines the architecture, components, and use cases of VC_Assistant, providing a comprehensive understanding of its functionality and deployment.
The assistant extracts the following information and presents it to the user as a JSON object: the VC's name, contacts, the industries they invest in, and the investment rounds they participate in or lead.
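As an illustration only, the extracted object might look like the example below; the exact field names are not fixed by this README and depend on the extraction prompt you configure:

```python
import json

# Hypothetical example of the extracted JSON object; the field names shown
# here are assumptions, not the project's defined schema.
vc_profile = {
    "vc_name": "Example Ventures",
    "contacts": ["hello@example-ventures.com"],
    "industries": ["fintech", "healthcare"],
    "investment_rounds": ["Seed", "Series A"],
}
print(json.dumps(vc_profile, indent=2))
```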
To use this project, you will need:
- Docker and Docker Compose installed
- Python 3.7+
- An OpenAI API key
To set up the project:
- Clone this repository to your local machine.
- Rename `key.env.example` to `key.env` and add your OpenAI API key.
- Review `config.yml` and choose `openai` or `local` for your `Embedding_Type`.
- In `docker-compose.yml`, update the `volumes` path for `VC_ASSISTANT_QDRANT` to a local folder where you want persistent storage for the vector database.
- Create the needed directories for persistent storage: `mkdir -p .data/qdrant/`
- Build the Docker images: `docker-compose build`
- Start the services: `docker-compose up -d`
The services will now be running.
Once the Docker containers are up and running, you can start interacting with:
- The interactive Swagger docs at http://localhost:8000/docs
- The Qdrant Web Interface at http://localhost:6333/dashboard
- Scrape Documents:
  - Go to the FastAPI server by navigating to the interactive Swagger docs at http://localhost:8000/docs.
  - Use the `scraper` endpoint to scrape content from a specified URL. You will need to provide the URL you want to scrape in the request body.
  - The `scraper` endpoint will return the scraped content, which will be processed in the next step.
- Create a Vector Index:
  - Use the `process-scrape2vector` endpoint to create a vector index from the scraped content.
  - In case the `default` collection does not exist, the `process-scrape2vector` endpoint will create one for you.
  - This endpoint will process the scraped documents, create a vector index, and load it into Qdrant.
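The Swagger UI is the authoritative reference for the exact routes and request schemas. As a rough illustration only, assuming the endpoints are exposed as `/scraper` and `/process-scrape2vector` and that the scraper accepts a JSON body with a `url` field (verify both against http://localhost:8000/docs), the same steps could be scripted with Python's `requests`:

```python
import requests

BASE_URL = "http://localhost:8000"

# Step 1: scrape a page (the endpoint path and payload shape are assumptions --
# check the Swagger docs for the real schema).
scrape_resp = requests.post(f"{BASE_URL}/scraper", json={"url": "http://example.com"})
scrape_resp.raise_for_status()
print("Scrape result:", scrape_resp.json())

# Step 2: build the vector index from the scraped content and load it into Qdrant.
index_resp = requests.post(f"{BASE_URL}/process-scrape2vector")
index_resp.raise_for_status()
print("Indexing result:", index_resp.json())
```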
The VC_Assistant architecture consists of the following key components:
- FastAPI - High performance REST API framework. Handles HTTP requests and routes them to application logic.
- Qdrant - Vector database for storing document embeddings and enabling similarity search.
- AgentHandler - Orchestrates the initialization and execution of the conversational agent.
- Scraper - A tool that scrapes a web page and converts it to markdown.
- Loader - A tool that loads content from the `scraped_data` directory into a VectorStoreIndex.
- Tools - Custom tools that extend the capabilities of the agent.
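The AgentHandler and Tools components are where the agentic behavior lives. As a loose, generic sketch only (using classic LangChain helpers such as `Tool` and `initialize_agent`, which may differ from the project's actual AgentHandler implementation), wiring a custom tool into a conversational agent typically looks like this:

```python
# Generic LangChain sketch; the project's AgentHandler may wire things differently.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI


def scrape_tool(url: str) -> str:
    """Placeholder standing in for the project's scraper tool."""
    return f"markdown content scraped from {url}"


tools = [
    Tool(
        name="web_scraper",
        func=scrape_tool,
        description="Scrape a web page and return its content as markdown.",
    ),
]

llm = ChatOpenAI(temperature=0)  # requires OPENAI_API_KEY in the environment
agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=True)
# agent.run("Scrape https://example.com and summarize the VC's focus areas.")
```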
Let's take a closer look at some of the key VC_Assistant infrastructure components:
FastAPI provides a robust web framework for handling the API routes and HTTP requests/responses.
Some key advantages:
- Built on modern Python standards like type hints and ASGI.
- Extremely fast - performance on par with NodeJS and Go.
- Automatic interactive docs using OpenAPI standards.
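For orientation only, a minimal FastAPI service (not the project's actual routes) looks like the following; the automatic interactive docs mentioned above come for free at `/docs`:

```python
# Minimal FastAPI example for illustration; VC_Assistant's real routes live in the app source.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str


@app.post("/scraper")
def scrape(request: ScrapeRequest) -> dict:
    # A real handler would invoke the scraper tool here.
    return {"message": "Scraping completed successfully", "url": request.url}

# Run with: uvicorn main:app --reload  (assuming this file is saved as main.py)
```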
Qdrant is a vector database optimized for ultra-fast similarity search across large datasets. It is used in this project to store and index document embeddings, enabling the bot to quickly find relevant documents based on a search query or conversation context.
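As a hedged illustration of what such a similarity lookup involves (not the project's actual code), the `qdrant-client` package can query a collection directly; the collection name and vector dimensionality below are assumptions:

```python
# Illustrative qdrant-client usage; collection name and vector size are assumptions.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Search the "default" collection with a query embedding produced elsewhere
# (e.g., by the OpenAI or local embedding model configured in config.yml).
query_vector = [0.0] * 1536  # placeholder embedding of the assumed dimensionality
hits = client.search(collection_name="default", query_vector=query_vector, limit=5)
for hit in hits:
    print(hit.id, hit.score, hit.payload)
```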
The `scraper` module, located in `/app/src/scraper/scraper_main.py`, serves as a robust utility for extracting content from web pages and converting it into structured markdown format. This module enables the framework to access and use information from a wide range of web sources. Below is a concise overview of its core functionality and workflow for developers who want to integrate and leverage this module.
- WebScraper Class:
  - Inherits from the base Scraper class and implements the Singleton pattern to ensure a unique instance.
  - Orchestrates the entire scraping process, from fetching to parsing and saving the content.
  - Leverages `ContentParser` to extract and convert meaningful data from HTML tags into markdown format.
- ContentParser Class:
  - Designed to parse and convert meaningful content from supported HTML tags into markdown format.
  - Supports a variety of HTML tags including paragraphs, headers, list items, links, inline code, and code blocks.
- URL Validation:
  - The provided URL undergoes validation to ensure its correctness and accessibility.
  - If the URL is invalid, the process is terminated, and an error message is logged.
- Content Fetching:
  - Content from the validated URL is fetched using HTTP requests.
  - Utilizes random user agents to mimic genuine user activity and avoid potential blocking by web servers.
  - If the content fetching fails, the process is halted, and an error message is logged.
- Content Parsing:
  - The fetched content is parsed using BeautifulSoup, and the `ContentParser` class is employed to extract meaningful data.
  - The parsed data includes the title, metadata, and the content in markdown format.
- File Saving:
  - The parsed content is saved to a file; the filename is generated using a hash of the URL.
  - The file is stored in a pre-configured data directory.
  - If the file saving fails, an error message is logged.
- Result Return:
  - Upon successful completion of the scraping process, a success message and the filepath of the saved content are returned.
  - If any step in the process fails, an appropriate error message is returned.
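For a concrete picture of this workflow, here is a compressed, illustrative sketch built from standalone `requests`, `BeautifulSoup`, and `hashlib` code. It is not the actual `WebScraper`/`ContentParser` implementation, and the `scraped_data` directory name is an assumption:

```python
# Illustrative sketch of the validate -> fetch -> parse -> save workflow.
# NOT the project's WebScraper/ContentParser code, just the same idea in miniature.
import hashlib
import random
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

USER_AGENTS = ["Mozilla/5.0 (X11; Linux x86_64)", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]
DATA_DIR = Path("scraped_data")  # assumed data directory


def scrape_to_markdown(url: str) -> dict:
    # 1. URL validation
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return {"message": "Invalid URL"}

    # 2. Content fetching with a random user agent
    response = requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)}, timeout=10)
    if response.status_code != 200:
        return {"message": f"Fetch failed with status {response.status_code}"}

    # 3. Content parsing: keep the title and paragraphs as simple markdown
    soup = BeautifulSoup(response.text, "html.parser")
    lines = [f"# {soup.title.string}"] if soup.title and soup.title.string else []
    lines += [p.get_text(strip=True) for p in soup.find_all("p")]

    # 4. File saving: filename derived from a hash of the URL
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    filepath = DATA_DIR / f"{hashlib.md5(url.encode()).hexdigest()}.md"
    filepath.write_text("\n\n".join(lines), encoding="utf-8")

    # 5. Result return
    return {"message": "Scraping completed successfully", "data": str(filepath)}
```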
Developers can initiate the scraping process by invoking the `run_web_scraper(url)` function with the desired URL. This function initializes a `WebScraper` instance and triggers the scraping process; it returns a dictionary with the outcome, including a message indicating success or failure and the location where the scraped data has been saved.
```python
# Assumes run_web_scraper can be imported from the scraper module
# (/app/src/scraper/scraper_main.py); adjust the import path to your layout.
from src.scraper.scraper_main import run_web_scraper

result = run_web_scraper("http://example.com")
if result and result.get("message") == "Scraping completed successfully":
    print(f"Scraping complete! Saved to {result['data']}")
else:
    print(result["message"])
```