RetClean

Retrieval-Based Data Cleaning Using Foundation Models and Data Lakes

VLDB 2024

Welcome to the official repository of RetClean, a tool developed by QCRI for data repair tasks that require world knowledge. RetClean combines large language models (LLMs) with indexed CSV data lakes to clean and repair datasets.


Features

  • CSV Upload: Upload CSV files for cleaning.
  • LLM-Driven Cleaning: Perform standalone data repair using LLMs.
  • Data Lake Support: Upload and index CSV-based data lakes for contextual cleaning.
  • Custom Configurations: Choose indices, reasoner models, and re-ranker settings for tailored cleaning workflows.

Tech Stack

RetClean is built using modern tools and frameworks:


Getting Started

Installation

Set up RetClean using Docker Compose.

  1. Clone this repository and navigate to the project root.
  2. Build the application using:
     docker-compose build
     This installs the necessary services as well as the dependencies for the client and server.

Note: The local model example available in the application uses a quantized LLaMA 3.1 model pulled from Ollama. You may need to increase Docker's maximum allocated disk space above the default.

Start

To start the application, run the following command:

docker-compose up

The locally hosted application is served on port 3000: http://localhost:3000/
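Because the containers can take a while to come up (especially if a local model is being pulled), it can help to poll the port before opening the browser. The snippet below is a minimal sketch and is not part of RetClean: the helper name `wait_for_app` and the timeout values are assumptions, and it relies only on the URL given above and the Python standard library.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_app(url: str = "http://localhost:3000/",
                 timeout: float = 60.0,
                 interval: float = 2.0) -> bool:
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass.

    Hypothetical helper for illustration; RetClean does not ship this function.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with a non-2xx status, so it is up.
            return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused or dropped: services are probably still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_app()` in a separate terminal after `docker-compose up` returns `True` once something answers on port 3000, or `False` if nothing responds within the timeout.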


How to Use

  1. Upload Your Data

    • Upload your CSV file.
    • Select the target column to repair.
    • Choose optional pivot/context columns.

    [Image: Before Repair]

  2. Use a Data Lake for Cleaning

    • Upload a folder of CSVs to create an indexed data lake.
    • Indices are managed using FAISS and Elasticsearch.

    [Image: Create Index]

  3. Configure and Start Repair

    • Choose the reasoner model, index, and re-ranker settings.
    • Start a repair job.

    [Image: Loading Repair]

  4. Review and Confirm Changes

    • View the suggested repairs for the target column.
    • Confirm or adjust changes as needed.

    [Image: After Repair]
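To make the roles of the target column, pivot columns, and data lake more concrete, here is a toy sketch of the general retrieve-then-reason pattern that the steps above describe. It is not RetClean's implementation: the function names (`serialize`, `retrieve`, `build_prompt`) and the sample data are hypothetical, plain token overlap stands in for the FAISS and Elasticsearch indices, there is no re-ranker, and the call to a reasoner model is left out, so the sketch stops at the prompt that would be sent to it.

```python
import csv
import io
import re
from collections import Counter


def serialize(row: dict, columns: list) -> str:
    """Flatten the non-empty cells of the chosen columns into 'column: value' text."""
    return "; ".join(f"{c}: {row[c]}" for c in columns if row.get(c))


def tokens(text: str) -> Counter:
    """Lowercased alphanumeric tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, lake_rows: list, k: int = 2) -> list:
    """Rank data-lake rows by token overlap with the query.

    Toy stand-in for an index lookup; a real index would be used instead.
    """
    q = tokens(query)
    scored = []
    for row in lake_rows:
        overlap = sum((q & tokens(serialize(row, list(row)))).values())
        scored.append((overlap, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for score, row in scored[:k] if score > 0]


def build_prompt(row: dict, target: str, pivots: list, evidence: list) -> str:
    """Assemble the text a reasoner model could be asked to complete."""
    context = "\n".join(serialize(e, list(e)) for e in evidence) or "(no evidence found)"
    return (
        f"Record: {serialize(row, pivots)}\n"
        f"Retrieved tuples:\n{context}\n"
        f"What is the value of '{target}'?"
    )


# Hypothetical sample data: one dirty table and a one-file data lake.
dirty_csv = "name,city,country\nAlice,Doha,\nBob,Paris,France\n"
lake_csv = "city,country\nDoha,Qatar\nParis,France\n"

dirty = list(csv.DictReader(io.StringIO(dirty_csv)))
lake = list(csv.DictReader(io.StringIO(lake_csv)))

target, pivots = "country", ["city"]
for row in dirty:
    if not row[target]:  # only rows whose target cell is empty need repair
        evidence = retrieve(serialize(row, pivots), lake, k=1)
        print(build_prompt(row, target, pivots, evidence))
```

In this sketch the pivot columns determine what is searched for, and the target column determines which cells are repaired. Because the retrieved tuples appear in the prompt, a suggested repair can be traced back to its evidence when reviewing changes.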


Contribution

We welcome contributions! Feel free to submit issues or pull requests. For significant changes, please open a discussion to ensure alignment with project goals.


License

This project is licensed under the MIT License. See the LICENSE file for details.
