YouTube Semantic Search

OpenAI-powered semantic search for any YouTube playlist — featuring the Launch School Capstone Presentations

Intro
How to get started
Example Queries
Screenshots
How It Works
Feedback
Credit
License

Intro

Derived from Travis Fischer's https://github.com/transitive-bullshit/yt-semantic-search, adapted to search the Launch School Capstone Presentations.

Complementary to this is an app David Dickinson put together to view summaries of the Capstone Projects made from the video transcripts with GPT-3.5 and GPT-4. Check it out here.

This project uses OpenAI embedding models to build a semantic search index across every Capstone presentation from 2022 and 2023. It lets you find the exact moments in each video where a topic was discussed, so you can jump straight to the clips you're interested in.

In this case, we're focused on the Launch School Capstone Presentations, but you can easily adapt this project to power advanced search across any YouTube channel or playlist.

How to get started

Clone the repository to your local machine.
Navigate to the root directory of the repository in your terminal.
Run the command npm install to install all the necessary dependencies.
Run the command npx tsx src/bin/resolve-yt-playlist.ts to download the English transcripts for each episode of the target playlist (in this case, the Launch School Capstone Presentations playlist).
Run the command npx tsx src/bin/process-yt-playlist.ts to pre-process the transcripts and fetch embeddings from OpenAI, then insert them into a Pinecone search index.
Run the command npx tsx src/bin/query.ts to query the Pinecone search index.
(Optional) Run the command npx tsx src/bin/generate-thumbnails.ts to generate timestamped thumbnails of each video in the playlist. This step takes ~2 hours and requires a stable internet connection.
The frontend of the project is a Next.js webapp deployed to Vercel that uses the Pinecone index as a primary data store. You can run the command npm run dev to start the development server and view the webapp locally.

Note that a few episodes may not have automated English transcriptions available. The project currently fetches transcripts with a hacky HTML scraping solution, so a better approach would be to use Whisper to transcribe each episode's audio. The project also supports sorting by recency vs relevancy.
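As a rough illustration of sorting by recency vs relevancy, here is a minimal sketch. The `SearchHit` shape, the `SortMode` type, and the `sortHits` helper are hypothetical names made up for this example; the webapp's real types and sorting logic may differ.

```typescript
// Hypothetical shape of a search hit; the real project's types may differ.
interface SearchHit {
  videoId: string;
  score: number; // similarity score, higher means more relevant
  publishedAt: string; // ISO 8601 publish date of the video
}

type SortMode = 'relevancy' | 'recency';

// Returns a sorted copy so the caller's array is left untouched.
function sortHits(hits: SearchHit[], mode: SortMode): SearchHit[] {
  const sorted = [...hits];
  if (mode === 'recency') {
    // Newest video first; ties fall back to the similarity score.
    sorted.sort(
      (a, b) =>
        Date.parse(b.publishedAt) - Date.parse(a.publishedAt) ||
        b.score - a.score
    );
  } else {
    // Most similar match first.
    sorted.sort((a, b) => b.score - a.score);
  }
  return sorted;
}
```

Falling back to the score when two hits come from the same video (or the same date) keeps the best clip of each video on top in recency mode.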

Example Queries

Screenshots

How It Works

Under the hood, it uses:

OpenAI - We're using the brand new text-embedding-ada-002 embedding model, which captures deeper information about text in a latent space with 1536 dimensions. This allows us to go beyond keyword search and search by higher-level topics.
Pinecone - Hosted vector search which enables us to efficiently perform k-NN searches across these embeddings
Vercel - Hosting and API functions
Next.js - React web framework
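To make the embedding step more concrete, here is a minimal sketch of requesting text-embedding-ada-002 embeddings from OpenAI's REST endpoint. The helper names `buildEmbeddingRequest` and `fetchEmbeddings` are assumptions for illustration, not the project's actual code, and the sketch assumes Node 18+ for the global `fetch`.

```typescript
const EMBEDDING_MODEL = 'text-embedding-ada-002';
const EMBEDDING_DIMENSIONS = 1536;

interface EmbeddingRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the HTTP request for OpenAI's embeddings endpoint.
function buildEmbeddingRequest(texts: string[], apiKey: string): EmbeddingRequest {
  return {
    url: 'https://api.openai.com/v1/embeddings',
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`
      },
      body: JSON.stringify({ model: EMBEDDING_MODEL, input: texts })
    }
  };
}

// Hypothetical helper: sends the request and returns one vector per input text.
async function fetchEmbeddings(texts: string[], apiKey: string): Promise<number[][]> {
  const { url, init } = buildEmbeddingRequest(texts, apiKey);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`OpenAI request failed with status ${res.status}`);
  const json = (await res.json()) as {
    data: { index: number; embedding: number[] }[];
  };
  // Order by index so vectors line up with the input texts.
  const vectors = [...json.data]
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
  for (const v of vectors) {
    if (v.length !== EMBEDDING_DIMENSIONS) {
      throw new Error(`Expected ${EMBEDDING_DIMENSIONS} dimensions, got ${v.length}`);
    }
  }
  return vectors;
}
```

Splitting the request builder from the network call keeps the payload easy to inspect without spending API credits.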

We use Node.js and the YouTube API v3 to fetch the videos of our target playlist. In this case, we're focused on the Launch School Capstone Presentations from 2020-2023, which contains 58 videos at the time of writing.

npx tsx src/bin/resolve-yt-playlist.ts
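For readers curious what fetching a playlist involves, the following is a minimal sketch against the YouTube Data API v3 `playlistItems` endpoint. It is not the script's actual code: `buildPlaylistItemsUrl` and `listPlaylistVideos` are hypothetical helpers, and the sketch assumes Node 18+ for the global `fetch` and an API key supplied by the caller.

```typescript
// Hypothetical helper: builds one page request for the playlistItems endpoint.
function buildPlaylistItemsUrl(
  playlistId: string,
  apiKey: string,
  pageToken?: string
): string {
  const params = new URLSearchParams({
    part: 'snippet,contentDetails',
    playlistId,
    maxResults: '50', // the API returns at most 50 items per page
    key: apiKey
  });
  if (pageToken) params.set('pageToken', pageToken);
  return `https://www.googleapis.com/youtube/v3/playlistItems?${params.toString()}`;
}

interface PlaylistPage {
  items: { snippet: { title: string }; contentDetails: { videoId: string } }[];
  nextPageToken?: string;
}

// Hypothetical helper: follows nextPageToken until every video is collected.
async function listPlaylistVideos(
  playlistId: string,
  apiKey: string
): Promise<{ videoId: string; title: string }[]> {
  const videos: { videoId: string; title: string }[] = [];
  let pageToken: string | undefined;
  do {
    const res = await fetch(buildPlaylistItemsUrl(playlistId, apiKey, pageToken));
    if (!res.ok) throw new Error(`YouTube API request failed with status ${res.status}`);
    const page = (await res.json()) as PlaylistPage;
    for (const item of page.items) {
      videos.push({ videoId: item.contentDetails.videoId, title: item.snippet.title });
    }
    pageToken = page.nextPageToken;
  } while (pageToken);
  return videos;
}
```

With 58 videos and a page size of 50, a loop like this would need two requests to cover the playlist.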

We download the English transcripts for each episode using a hacky HTML scraping solution, since the YouTube API doesn't allow non-OAuth access to captions. Note that a few episodes don't have automated English transcriptions available, so we're just skipping them at the moment. A better solution would be to use Whisper to transcribe each episode's audio.

Once we have all of the transcripts and metadata downloaded locally, we pre-process each video's transcript, breaking it up into reasonably sized chunks of ~100 tokens and fetching each chunk's text-embedding-ada-002 embedding from OpenAI. This results in ~200 embeddings per episode.
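The chunking step could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the `TranscriptSegment` and `TranscriptChunk` shapes and the `chunkTranscript` helper are hypothetical, and tokens are approximated by whitespace-separated words rather than counted with a real tokenizer.

```typescript
// Hypothetical shapes; the real transcript format may differ.
interface TranscriptSegment {
  text: string;
  start: number; // offset into the video, in seconds
}

interface TranscriptChunk {
  text: string;
  start: number; // start time of the first segment in the chunk
  approxTokens: number;
}

// Assumption: approximate tokens by counting whitespace-separated words.
function approxTokenCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Greedily groups consecutive segments until adding one more would exceed maxTokens.
function chunkTranscript(
  segments: TranscriptSegment[],
  maxTokens = 100
): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = [];
  let current: TranscriptChunk | null = null;
  for (const seg of segments) {
    const n = approxTokenCount(seg.text);
    if (n === 0) continue; // skip empty caption lines
    if (current && current.approxTokens + n > maxTokens) {
      chunks.push(current);
      current = null;
    }
    if (!current) {
      current = { text: seg.text.trim(), start: seg.start, approxTokens: n };
    } else {
      current.text += ' ' + seg.text.trim();
      current.approxTokens += n;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Keeping the start time of the first segment in each chunk is what makes it possible to link a search hit back to a specific moment in the video.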

All of these embeddings are then upserted into a Pinecone search index with a dimensionality of 1536. There are ~17,575 embeddings in total across ~58 Capstone Presentations.

npx tsx src/bin/process-yt-playlist.ts
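Upserting thousands of embeddings is usually done in batches rather than one request per vector. The sketch below shows the general idea. The `VectorRecord` shape mirrors Pinecone's id/values/metadata vector format, but `toBatches`, `upsertAll`, the injected `upsertBatch` callback, and the batch size of 100 are assumptions for illustration rather than the script's actual code.

```typescript
// Mirrors the id / values / metadata shape of a Pinecone vector.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string | number>;
}

// Splits an array into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical driver: upsertBatch is supplied by the caller and would wrap
// whatever Pinecone client call the project actually uses.
async function upsertAll(
  records: VectorRecord[],
  upsertBatch: (batch: VectorRecord[]) => Promise<void>,
  batchSize = 100
): Promise<void> {
  for (const batch of toBatches(records, batchSize)) {
    for (const record of batch) {
      // The index has a dimensionality of 1536, so reject anything else early.
      if (record.values.length !== 1536) {
        throw new Error(`Vector ${record.id} has ${record.values.length} dimensions`);
      }
    }
    await upsertBatch(batch);
  }
}
```

At a batch size of 100, ~17,575 embeddings would go up in about 176 requests.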

Once our Pinecone search index is set up, we can start querying it either via the webapp or via the example CLI:

npx tsx src/bin/query.ts
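Conceptually, a query embeds the search text and then asks the index for its nearest neighbours. The sketch below is a brute-force version of that k-NN lookup using cosine similarity. It illustrates the idea only: Pinecone's internals and the similarity metric configured for this index are not described here, and `cosineSimilarity` and `topK` are hypothetical helpers.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every item against the query and keeps the k best matches.
function topK<T extends { values: number[] }>(
  query: number[],
  items: T[],
  k: number
): { item: T; score: number }[] {
  return items
    .map((item) => ({ item, score: cosineSimilarity(query, item.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

A hosted vector index exists precisely so this scan over every embedding does not have to happen on each query, but the ranking it returns answers the same question.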

We also support generating timestamp-based thumbnails of every YouTube video in the playlist. Thumbnails are generated using headless Puppeteer and are uploaded to Google Cloud Storage. We also post-process each thumbnail with lqip-modern to generate nice preview placeholder images.

If you want to generate thumbnails (optional), run:

npx tsx src/bin/generate-thumbnails.ts

Note that thumbnail generation takes ~2 hours and requires a pretty stable internet connection.

The frontend is a Next.js webapp deployed to Vercel that uses our Pinecone index as a primary data store.

Feedback

Have an idea on how this webapp could be improved? Find a particularly fun search query?

Feel free to send me feedback, either on GitHub or Twitter. 💯

Credit

Inspired by Riley Tomasek's project for searching the Huberman YouTube Channel
Adapted from Transitive Bullshit's All-In Podcast Semantic Search project
Note that this project is not affiliated with Launch School. It just pulls data from the videos on their YouTube channel via a playlist I set up and processes it using AI.

License

The API and server costs add up over time, so if you can spare it, sponsoring on GitHub is greatly appreciated. 💕
