Pulse

This project was built as part of the Data-Driven VC Hackathon organized by Red River West & Bivwak! by BNP Paribas

General idea

This project aims to tackle the lack of granularity and static nature of current market segmentations done by most data providers or VCs.

We propose a dynamic data-driven approach to segment companies and identify markets without the need for manual intervention. This approach is based on the analysis of enriched textual data from companies retrieved from different sources including (Website, Github, PredictLads, ...) and the use of NLP techniques (embedding clustering) to classify them into different topics.

We then give a brief overview of those markets by aggregating the data from multiple companies and reprensenting the market trends as a whole.

Here is a brief overview of the workflow

Prerequisites (assuming Mac environment, with Homebrew installed)

LLVM

brew install llvm

Python/Rye

You can use rye to automatically manage your Python environment otherwise you require Python 3.12.x

brew install rye

rye sync

Node.js/Bun

Need Node.js v20.11.1^

brew install bun

Database

We're using Postgres v15.8^

We're using Supabase as our database, you'll need to create a project and get the database URL.

Table Schema

See docs/schema.md

Code Structure

src/collect: Data collection scripts
src/nlp_pipeline: NLP pipeline scripts
src/utils: Utility scripts (e.g. database connection)
front: Frontend code

Data collection

Add a .env file in the root of the project based on .env.example and a front/.env in the front directory based on front/.env.example

Run the following commands to collect data:

rye run harmonic
rye run predictleads
rye run predictleads_news
rye run pdl_headcount_sales_eng
rye run similarweb

NLP pipeline

See docs/nlp_pipeline.md

rye run nlp_pipeline

Frontend

cd front
bun install
bun dev

Next Steps for a Comprehensive Product

Features

Expand the dataset by including more companies and enhancing clustering with additional media sources (e.g., news articles, website content, etc.).
Introduce user-created taxonomies for a personalized and tailored experience.
Enrich cluster information with more data to deterministically identify whether a segment is trending.

Technical Enhancements

Implement embedding pipelines in Airflow for optimized orchestration and scalability.
Deploy the application online for broader accessibility.
Improve database latency by hosting the database for faster and more reliable performance.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
docs		docs
front		front
src		src
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE.md		LICENSE.md
README.md		README.md
icon.png		icon.png
package.json		package.json
pyproject.toml		pyproject.toml
requirements-dev.lock		requirements-dev.lock
requirements.lock		requirements.lock
workflow.png		workflow.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pulse

General idea

Prerequisites (assuming Mac environment, with Homebrew installed)

LLVM

Python/Rye

Node.js/Bun

Database

Table Schema

Code Structure

Data collection

NLP pipeline

Frontend

Next Steps for a Comprehensive Product

Features

Technical Enhancements

About

Releases

Packages

Contributors 5

Languages

License

heringo/ddvc

Folders and files

Latest commit

History

Repository files navigation

Pulse

General idea

Prerequisites (assuming Mac environment, with Homebrew installed)

LLVM

Python/Rye

Node.js/Bun

Database

Table Schema

Code Structure

Data collection

NLP pipeline

Frontend

Next Steps for a Comprehensive Product

Features

Technical Enhancements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages