This project was built as part of the Data-Driven VC Hackathon organized by Red River West & Bivwak! by BNP Paribas.
This project tackles the coarse granularity and static nature of the market segmentations produced by most data providers and VCs.
We propose a dynamic, data-driven approach that segments companies and identifies markets without manual intervention. It analyzes enriched textual data about each company, retrieved from several sources (website, GitHub, PredictLeads, ...), and uses NLP techniques (embedding clustering) to classify companies into topics.
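As a rough illustration of embedding clustering, here is a minimal sketch using only the standard library. It is not the project's pipeline: normalized word counts stand in for a real embedding model, a small k-means stands in for whatever clustering the pipeline uses, and the company names, descriptions, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical company descriptions; the real pipeline uses enriched text
# collected from several sources.
DESCRIPTIONS = {
    "PayFlow": "payments api merchants fraud checkout",
    "LedgerPay": "payments fraud detection merchants invoicing",
    "GeneLoop": "genomics sequencing biotech labs",
    "BioSeq": "genomics sequencing research labs",
}


def embed(text, vocab):
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Small k-means with farthest-first initialization (deterministic)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def segment(descriptions, k):
    """Map each company name to a cluster id."""
    vocab = sorted({w for text in descriptions.values() for w in text.lower().split()})
    names = list(descriptions)
    vectors = [embed(descriptions[name], vocab) for name in names]
    return dict(zip(names, kmeans(vectors, k)))


clusters = segment(DESCRIPTIONS, k=2)
```

With these toy inputs the two payments companies land in one cluster and the two genomics companies in the other. Swapping `embed` for a real sentence-embedding model would keep the rest of the sketch unchanged.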
We then give a brief overview of each market by aggregating data across its companies and representing the market's trends as a whole.
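The aggregation step could look roughly like the sketch below. The field names (`market`, `headcount_prev`, `headcount_now`) and the growth metric are assumptions chosen for illustration, not the project's actual schema (see docs/schema.md for that).

```python
from collections import defaultdict


def market_overview(companies):
    """Roll per-company metrics up into one summary per market (cluster)."""
    totals = defaultdict(lambda: {"companies": 0, "prev": 0, "now": 0})
    for company in companies:
        bucket = totals[company["market"]]
        bucket["companies"] += 1
        bucket["prev"] += company["headcount_prev"]
        bucket["now"] += company["headcount_now"]

    overview = {}
    for market, bucket in totals.items():
        # Growth is undefined when there was no headcount in the earlier period.
        growth = (bucket["now"] - bucket["prev"]) / bucket["prev"] if bucket["prev"] else None
        overview[market] = {
            "companies": bucket["companies"],
            "headcount": bucket["now"],
            "headcount_growth": growth,
        }
    return overview


# Hypothetical rows, one per company.
sample = [
    {"market": "payments", "headcount_prev": 100, "headcount_now": 120},
    {"market": "payments", "headcount_prev": 50, "headcount_now": 60},
    {"market": "genomics", "headcount_prev": 0, "headcount_now": 8},
]
overview = market_overview(sample)
```

Summing before dividing weights the trend by company size, so a large company moves the market figure more than a small one; averaging per-company growth rates would be the alternative design.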
Here is a brief overview of the workflow
brew install llvm
You can use rye to manage your Python environment automatically; otherwise, you need Python 3.12.x.
brew install rye
rye sync
Requires Node.js v20.11.1^.
brew install bun
We're using Postgres v15.8^.
We use Supabase as our database: create a project and get its database URL.
See docs/schema.md
- src/collect: Data collection scripts
- src/nlp_pipeline: NLP pipeline scripts
- src/utils: Utility scripts (e.g. database connection)
- front: Frontend code
Add a .env file in the root of the project based on .env.example, and a front/.env file in the front directory based on front/.env.example.
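If you prefer to script this step, a minimal sketch follows. The helper name `ensure_env_files` is hypothetical and not part of the repository; it only copies each example file when the target does not exist yet, so you still have to fill in your own values afterwards.

```python
import shutil
from pathlib import Path


def ensure_env_files(root="."):
    """Copy .env.example to .env (and the front/ pair) when .env is missing."""
    created = []
    for rel in (".env", "front/.env"):
        target = Path(root) / rel
        example = target.with_name(target.name + ".example")
        # Never overwrite an existing .env: it may already hold real secrets.
        if not target.exists() and example.exists():
            shutil.copyfile(example, target)
            created.append(rel)
    return created
```

Calling `ensure_env_files()` from the project root returns the list of files it created, which is empty when both files are already in place.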
Run the following commands to collect data:
rye run harmonic
rye run predictleads
rye run predictleads_news
rye run pdl_headcount_sales_eng
rye run similarweb
rye run nlp_pipeline
cd front
bun install
bun dev
- Expand the dataset with more companies and improve clustering with additional media sources (e.g., news articles and website content).
- Introduce user-created taxonomies for a personalized and tailored experience.
- Enrich cluster information with more data to deterministically identify whether a segment is trending.
- Implement embedding pipelines in Airflow for optimized orchestration and scalability.
- Deploy the application online for broader accessibility.
- Improve database latency by hosting the database for faster and more reliable performance.