
sujee/data-prep-kit-examples


Data Prep Kit Examples

About

bit.ly/dpk-examples

Sujee Maniyam (AI Engineer and Developer Advocate)
sujee@node51.com   •   Portfolio

Introducing Data Prep Kit (DPK)

Whether you're building a RAG (Retrieval-Augmented Generation) pipeline or fine-tuning a model, a significant portion of your time will go to cleaning (de-duping, removing markup, etc.) and shaping the data.

Data Prep Kit can help you with wrangling data.

Noteworthy features:

  • de-duping documents (exact dedupe and fuzzy dedupe)
  • handling both documents and code
  • extracting text from PDFs
  • language detection (spoken languages and programming languages)
  • malware detection
  • document quality checking
  • tokenizing and chunking
  • generating embeddings
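
To make the exact dedupe idea from the list above concrete, here is a minimal sketch in plain Python. It does not use the Data Prep Kit API; the `exact_dedupe` helper and the whitespace normalization step are assumptions made for illustration only.

```python
import hashlib


def exact_dedupe(docs):
    """Keep the first occurrence of each distinct document.

    Hypothetical helper for illustration; not part of Data Prep Kit.
    """
    seen = set()
    unique = []
    for text in docs:
        # Assumption: collapse runs of whitespace so trivially
        # different copies produce the same hash.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["Hello world", "Hello   world", "Goodbye"]
print(exact_dedupe(docs))  # ['Hello world', 'Goodbye']
```

Fuzzy dedupe differs in that it groups near-identical documents rather than byte-identical ones, so it needs a similarity measure instead of a single hash comparison.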

Getting Ready

Getting Ready guide

Events

How to Run the Code

Some notebooks can be run on Google Colab.

However, setting up a local Python dev environment is recommended.

Instructions for setting up dev environment

Labs

Data Prep Kit Examples

➡️ Data Prep Kit demos - Get to know Data Prep Kit features

Milvus - Vector Database

Milvus is a popular open-source vector database.

➡️ A quick start of Milvus - Run an embedded Milvus instance

➡️ Vector search of movie plots using Milvus - Load movie data, index it with embeddings, upload the data into Milvus, and run semantic queries
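
The semantic query step in the movie plots lab rests on nearest-neighbor search over embedding vectors. The sketch below shows that idea in plain Python with cosine similarity. It does not call Milvus or an embedding model; the three-dimensional vectors, the titles, and the `search` helper are made-up placeholders for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the top_k titles most similar to query_vec (hypothetical helper)."""
    scored = [(cosine(query_vec, vec), title) for title, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]


# Toy "embeddings"; a real pipeline would get these from an embedding model.
index = {
    "space adventure": [0.9, 0.1, 0.0],
    "romantic comedy": [0.1, 0.9, 0.2],
    "alien invasion": [0.8, 0.0, 0.3],
}

print(search([1.0, 0.0, 0.1], index))  # ['space adventure', 'alien invasion']
```

A vector database such as Milvus performs the same kind of ranking, but over an index built to scale to large collections rather than a linear scan.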

RAG Pipeline

➡️ End to end RAG

About

Examples of using IBM data prep kit
