A cloud-agnostic data science productivity tool that will:
- help streamline the data science experimentation process
- allow data scientists to manage models and experiment data
- seamlessly integrate with different cloud storage providers such as S3, Google Cloud Storage, and Azure Blob Storage
pip install tagr
Tagr uses each cloud provider's Python SDK to handle serialization and retrieval, so any authentication method supported by the respective SDK is compatible with Tagr. The Azure SDK behaves differently from the AWS and GCP SDKs: it does not look up credentials from the environment on its own, so they must be passed to the client constructor. As a result, supplying credentials via env vars is currently the only supported authentication method for Azure (I don't want to set up AD). See .env.sample for the necessary creds.
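For reference, here is a minimal sketch of what constructor-based auth looks like with the Azure SDK. The env var names below are illustrative assumptions, not necessarily the ones in .env.sample:

import os
from azure.storage.blob import BlobServiceClient

# Illustrative env var names (check .env.sample for the actual ones)
account = os.environ["AZURE_STORAGE_ACCOUNT"]
key = os.environ["AZURE_STORAGE_KEY"]

# Unlike boto3 / google-cloud-storage, credentials go straight into the constructor
client = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net",
    credential=key,
)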
- Import tagr
from tagr.tagging.artifacts import Tagr
- Tagr provides a declarative interface. Mark objects for serialization as you instantiate them:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({"a": [1, 2, 3]})  # stand-in for your training data
tag = Tagr()
x = tag.save(artifact=df, obj_name="X_train")
y = tag.save(artifact=2.5, obj_name="float1", dtype="primitive")
model = tag.save(artifact=RandomForestClassifier(max_depth=30), obj_name="model")
y_pred = tag.save(artifact=plt.plot([1, 2, 3, 4]), obj_name="viz", dtype="other")
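As the assignments above suggest, save() returns the artifact it was given, so tagging can wrap object creation without changing how the object is used afterwards.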
- View what artifacts you have tagged so far
tag.inspect()
- Push all your tagged artifacts to a cloud storage solution of your choice
# S3
tag.flush(proj="tagr-dev", experiment="dev/sunrise", storage="aws")
# Google Cloud Storage
tag.flush(proj="tagr-dev", experiment="dev/sunrise", storage="gcp")
# Azure Blob Storage
tag.flush(proj="tagr-dev", experiment="dev/sunrise", storage="azure")
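To sanity-check a flush, you can list what landed in the bucket with the provider's own SDK. A minimal sketch using boto3, assuming proj maps to the bucket name and experiment to the key prefix (that layout is an assumption, not documented behavior):

import boto3

s3 = boto3.client("s3")
# Assumed layout: proj = bucket, experiment = key prefix
resp = s3.list_objects_v2(Bucket="tagr-dev", Prefix="dev/sunrise")
for obj in resp.get("Contents", []):
    print(obj["Key"])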
- Build the container
make build
- Set env vars
source .env.test
- Spin up a Jupyter notebook in the container (for manual debugging)
jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root &
- Test
python -m unittest discover test/