Welcome to the official repository of RetClean, a tool developed by QCRI for data repair tasks that require world knowledge. RetClean combines large language models (LLMs) with indexed CSV datalakes to clean and repair datasets.
- CSV Upload: Upload CSV files for cleaning.
- LLM-Driven Cleaning: Perform standalone data repair using LLMs.
- Data Lake Support: Upload and index CSV-based datalakes for contextual cleaning.
- Custom Configurations: Choose indices, reasoner models, and re-ranker settings for tailored cleaning workflows.
RetClean is built using modern tools and frameworks:
- Frontend: React
- Backend: FastAPI
- LLM Serving: Ollama
- Indexing and Search: Elasticsearch and Qdrant
- Containerization: Docker
Set up RetClean effortlessly using Docker Compose.
- Clone this repository and navigate to the project root.
- Build the application using:
docker-compose build
This builds the required services and installs the dependencies for the client and server.
The local model example in the application uses a quantized LLaMA 3.1 model pulled from Ollama. You may need to increase Docker's maximum allocated disk space above the default to accommodate it.
To start the application, run:
docker-compose up
Once the containers are running, the application is available locally on port 3000: http://localhost:3000/
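The services can take a while to come up, particularly on a first run. As a minimal sketch (not part of RetClean itself), the following Python snippet polls the frontend URL until it responds. The function names `http_ok` and `wait_for_app` and the default timeout values are illustrative assumptions, not anything the repository defines.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` succeeds (status below 400)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_for_app(url: str, timeout: float = 120.0, interval: float = 2.0,
                 probe=http_ok) -> bool:
    """Poll `url` until `probe` succeeds or `timeout` seconds have elapsed.

    `probe` is injectable so the loop can be exercised without a network.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_app("http://localhost:3000/")` from a script or a Python shell returns `True` once the frontend answers, or `False` if nothing responds within the timeout.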
- Upload Your Data
  - Upload your CSV file.
  - Select the target column to repair.
  - Choose optional pivot/context columns.
- Use a Data Lake for Cleaning
  - Upload a folder of CSVs to create an indexed datalake.
  - Indices are managed using FAISS and Elasticsearch.
- Configure and Start Repair
  - Choose the reasoner model, index, and re-ranker settings.
  - Start a repair job.
- Review and Confirm Changes
  - View the suggested repairs for the target column.
  - Confirm or adjust changes as needed.
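Before uploading, it can help to confirm locally that the target and pivot columns you plan to select actually exist in the CSV header. The sketch below is a standalone pre-check using only the Python standard library; it does not call RetClean. The function name `check_columns` is hypothetical, and treating empty cells as the values in need of repair is an assumption made for illustration, since RetClean may identify repair candidates differently.

```python
import csv
import io


def check_columns(csv_text: str, target: str, pivots=()) -> dict:
    """Validate the chosen columns and count empty cells in the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # Every selected column must appear in the header row.
    missing = [name for name in (target, *pivots) if name not in header]
    if missing:
        raise ValueError(f"columns not found in CSV header: {missing}")

    rows = list(reader)
    # Short rows yield None for absent fields, so normalise to "" first.
    empty = sum(1 for row in rows if not (row[target] or "").strip())
    return {"rows": len(rows), "empty_in_target": empty}


# Illustrative data only; not taken from the RetClean repository.
sample = "name,city,country\nAlice,Doha,Qatar\nBob,Paris,\nCara,Lima,Peru\n"
summary = check_columns(sample, target="country", pivots=["city"])
```

For a file on disk, read its contents first (for example with `pathlib.Path(path).read_text(encoding="utf-8")`) and pass the resulting string as `csv_text`.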
We welcome contributions! Feel free to submit issues or pull requests. For significant changes, please open a discussion to ensure alignment with project goals.
This project is licensed under the MIT License. See the LICENSE file for details.