This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

DVC Sync S3 API is a minimal S3 interface that bridges the gap between data labeling tools and the data science pipeline


swiss-ai-center/s3-api-dvc-sync


DVC Sync S3 API

Bridging the Gap between Annotation Tools and Data Science Pipelines

Data stored in labeling tools such as Label Studio is not directly accessible from the data science pipeline. This project addresses that by providing a minimal S3 API that serves as a bridge between annotation tools and Data Version Control (DVC).

The API is integrated with a labeling tool, such as Label Studio, as cloud storage for annotations. Annotations are then pushed to DVC automatically, making them easily accessible to the data science team in their machine learning operations pipeline.
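As a rough illustration of what an automatic push could involve, the sketch below lists the DVC and Git commands a sync step might run, in order. It is not taken from this project's code: the function name `build_sync_commands`, the dataset path, and the branch name are assumptions made for the example, and the function only builds the command list without executing anything.

```python
from typing import List


def build_sync_commands(
    dataset_path: str, message: str, branch: str = "main"
) -> List[List[str]]:
    """Return the CLI steps of one sync, in order. Nothing is executed here."""
    return [
        # Track the updated annotations with DVC; this writes <dataset_path>.dvc.
        ["dvc", "add", dataset_path],
        # Stage the updated DVC pointer file so Git records the new version.
        ["git", "add", f"{dataset_path}.dvc"],
        ["git", "commit", "-m", message],
        # Upload the data itself to the configured DVC remote.
        ["dvc", "push"],
        # Publish the commit on the (assumed) branch.
        ["git", "push", "origin", branch],
    ]


# Hypothetical dataset path and branch name, for illustration only.
commands = build_sync_commands(
    "data/annotations", "Update annotations", branch="annotations"
)
```

Each entry could then be passed to `subprocess.run(cmd, check=True, cwd=repo_dir)` from inside the cloned repository, stopping at the first failing step.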

Working principles

  • Implements a minimal subset of S3 commands to behave like an S3 API
  • Stores the objects in a local folder
  • Works with the same Git repository as the data science team, with the option to configure a separate branch
  • Project repository is cloned with sparse checkout to include only necessary meta files
  • Behaves like a team member, updating the dataset file and pushing changes to both DVC and Git
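The first two principles can be pictured with a small folder-backed object store. This is a minimal sketch under stated assumptions, not the project's implementation: the class name `LocalObjectStore` and its method names are hypothetical, loosely mirroring the S3 operations for writing, reading, and listing objects.

```python
from pathlib import Path
from typing import List


class LocalObjectStore:
    """Keeps each object as a file under a root folder, keyed like an S3 key."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, key: str) -> Path:
        path = (self.root / key).resolve()
        # Reject keys such as "../secret" that would escape the root folder.
        if self.root not in path.parents:
            raise ValueError(f"invalid key: {key!r}")
        return path

    def put_object(self, key: str, body: bytes) -> None:
        path = self._path_for(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)

    def get_object(self, key: str) -> bytes:
        return self._path_for(key).read_bytes()

    def list_objects(self, prefix: str = "") -> List[str]:
        # Walk the folder and turn file paths back into "/"-separated keys.
        keys = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in keys if k.startswith(prefix))
```

An HTTP layer would map incoming S3-style requests onto these three methods; the key check matters because object keys come from the client and must not be able to point outside the storage folder.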

Limitations

  • The API is not a complete S3 implementation; it provides only the commands needed for Label Studio's sync functionality
  • It is currently designed for single-tenant use and works with one project at a time. However, multiple instances of the API can be run to serve multiple projects.

This project has been tested with Label Studio, whose configuration allows a cloud storage to be set up for storing annotations. That cloud storage can be configured to use this S3 API through a custom endpoint.

Configuration

The configuration is done in the .env file. You can base your configuration on the .env.example file.
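For readers unfamiliar with the format, a .env file is a list of KEY=VALUE lines. The sketch below shows one simple way such a file could be parsed in Python. The key names are hypothetical and chosen only for illustration; the actual keys are those listed in .env.example, and the project may load them differently.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


# Hypothetical keys, for illustration only; see .env.example for the real ones.
example = """
# Git repository shared with the data science team
GIT_REPOSITORY_URL=https://example.com/team/project.git
GIT_BRANCH="annotations"
"""
settings = parse_env(example)
```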

Run in development mode

Set up the virtual environment

Create a virtual environment

# create a virtual environment
python3 -m venv .venv

Activate the virtual environment

Windows

.venv\Scripts\activate.bat

Linux

source .venv/bin/activate

Install the requirements

pip install -r requirements.txt

Run the application

uvicorn main:app --reload --port 8000

Run using Docker Compose

Build the Docker image and start the application

docker compose up
