Sujee Maniyam (AI Engineer and Developer Advocate)
sujee@node51.com • Portfolio
Whether you're doing RAG (Retrieval-Augmented Generation) or fine-tuning a model, a significant portion of your time will go to cleaning the data (de-duping, removing markup, etc.) and shaping it.
Data Prep Kit can help you with wrangling data.
Noteworthy features:
- de-duping documents (exact dedupe and fuzzy dedupe)
- handling both documents and code
- extracting text from PDFs
- language detection (spoken languages and programming languages)
- malware detection
- document quality checking
- tokenizing and chunking
- generating embeddings
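To make the difference between exact and fuzzy de-duping concrete, here is a minimal pure-Python sketch of the idea. It is not Data Prep Kit's API: the function names (`exact_dedupe`, `fuzzy_dedupe`, `shingles`, `jaccard`), the 3-word shingle size, and the 0.5 similarity threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedupe(docs: list[str]) -> list[str]:
    """Keep the first document seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word windows in the text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedupe(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Drop a document if it is too similar to one that was already kept.

    This compares every pair, which is fine for a demo but slow at scale;
    large pipelines typically use approximate methods such as MinHash.
    """
    kept: list[str] = []
    kept_shingles: list[set] = []
    for doc in docs:
        current = shingles(doc)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",   # same text, different case/spacing
    "The quick brown fox jumps over the lazy cat",    # near duplicate
    "Milvus is a vector database",
]
after_exact = exact_dedupe(docs)
after_fuzzy = fuzzy_dedupe(after_exact)
print(len(docs), len(after_exact), len(after_fuzzy))
```

Exact dedupe only catches documents that are identical after normalization, while fuzzy dedupe also removes the near duplicate that differs by a single word.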
Getting Ready guide
- 2024-10-21: Workshop @ IBM Tech XChange, Las Vegas, NV
- 2024-09-21: Hands on RAG workshop @ Data Riders meetup - Hacker Dojo, Mountain View, CA
Some notebooks can be run on Google Colab, but setting up a local Python dev environment is recommended.
Instructions for setting up dev environment
➡️ Data prep kit demos - Get to know data prep kit features
Milvus is a popular open-source vector database.
➡️ A quick start of Milvus - run an embedded Milvus instance
➡️ Vector search of movie plots using Milvus - load movie data, index it with embeddings, upload the data into Milvus, and run semantic queries
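The load, embed, index, and query flow described above can be sketched in a few lines of plain Python. This is only an illustration of the mechanics, not the notebook's code and not the Milvus API: `embed` is a toy bag-of-words stand-in for a real embedding model (so it matches on shared words, not meaning), the in-memory list stands in for a Milvus collection, and the movie titles and plots are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Return the titles of the top_k entries most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), title) for title, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _score, title in scored[:top_k]]


# Hypothetical movie data, invented for this example.
plots = {
    "Space Rescue": "astronauts stranded in space attempt a daring rescue",
    "Heist Night": "a crew of thieves plans a bank heist",
    "Ocean Deep": "divers explore a sunken ship in the deep ocean",
}

# "Upload" step: store each title alongside its vector.
index = [(title, embed(plot)) for title, plot in plots.items()]

print(search("rescue in space", index, top_k=1))
```

A vector database such as Milvus plays the role of `index` and `search` here, adding persistence and approximate nearest-neighbor indexes so that queries stay fast over millions of vectors.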