Provides programmatic access to trainML platform.
Python 3.8 or above is required.
pip install trainml
You must have a valid trainML account. On the account settings page click the Create
button in the API Keys
section. This will automatically download a credentials.json
file. This file can only be generated once per API key. Treat this file as a password, as anyone with access to your API key will have the ability to create and control resources in your trainML account. You can deactivate any API key by clicking the Remove
button.
Creating resources on the trainML platform requires a non-zero credit balance. To purchase credits or sign-up for automatic credit top-ups, visit the billing page.
The easiest way to authenticate is to place the credentials file downloaded into the .trainml
folder of your home directory and ensure only you have access to it. From the directory that the credentials.json
file was downloaded, run the following command:
mkdir -p ~/.trainml
mv credentials.json ~/.trainml/credentials.json
chmod 600 ~/.trainml/credentials.json
You can also use environment variables TRAINML_USER
and TRAINML_KEY
and set them to their respective values from the credentials.json
file.
export TRAINML_USER=<'user' field from credentials.json>
export TRAINML_KEY=<'key' field from credentials.json>
python create_job.py
Environment variables will override any credentials stored in ~/.trainml/credentials.json
API credentials can also be passed directly to the TrainML object constructor at runtime.
import trainml
trainml = trainml.TrainML(user="user field from credentials.json",key="key field from credentials.json>")
await trainml.jobs.create(...)
Passing credentials to the TrainML constructor will override all other methods for setting credentials.
By default, all operations using the trainML SDK/CLI will use the Personal project for trainML account the API keys were generated from. To change the active project, run the configure command:
trainml configure
This command will output the currently configured active project (UNSET
defaults to Personal
) and allows you to specify any project you have access to as the new active project.
Current Active Project: Personal
Select Active Project: (My Other Project, Personal, Project Shared With Me) [Personal]:
Once you select a project, it will store the results of your selection in the config.json
file in the TRAINML_CONFIG_DIR
folder (~/.trainml
by default). Once the active project is set, all subsequent operations will use the selected project.
This setting can also be overridden at runtime using the environment variable TRAINML_PROJECT
:
TRAINML_PROJECT=<PROJECT ID> python create_job.py
or by instantiating the trainml client with the project
keyword argument:
import trainml
trainml = trainml.TrainML(project="PROJECT ID")
await trainml.jobs.create(...)
You must specify the project ID (not name) when using the runtime options. The project ID can be found by running
trainml project list
.
The trainML SDK utilizes the asyncio library to ease the concurrent execution of long running tasks. An example of how to create a dataset from an S3 bucket and immediately run a training job on that dataset is the following:
from trainml.trainml import TrainML
import asyncio
trainml_client = TrainML()
# Create the dataset
dataset = asyncio.run(
trainml_client.datasets.create(
name="Example Dataset",
source_type="aws",
source_uri="s3://trainml-examples/data/cifar10",
)
)
print(dataset)
# Watch the log output, attach will return when data transfer is complete
asyncio.run(dataset.attach())
# Create the job
job = asyncio.run(
trainml_client.jobs.create(
name="Example Training Job",
type="training",
gpu_type="GTX 1060",
gpu_count=1,
disk_size=10,
workers=[
"PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024",
],
data=dict(
datasets=[dict(id=dataset.id, type="existing")],
output_uri="s3://trainml-examples/output/resnet_cifar10",
output_type="aws",
),
model=dict(git_uri="git@github.com:trainML/test-private.git"),
)
)
print(job)
# Watch the log output, attach will return when the training job stops
asyncio.run(job.attach())
# Cleanup job and dataset
asyncio.run(job.remove())
asyncio.run(dataset.remove())
See more examples in the examples folder
The command line interface is rooted in the trainml
command. To see the available options, run:
trainml --help
To list all jobs:
trainml job list
To list all datasets:
trainml dataset list
To connect to a job that requires the connection capability:
trainml job connect <job ID or name>
To watch the realtime job logs:
trainml job attach <job ID or name>
To create and open a notebook job:
trainml job create notebook "My Notebook Job"
To create a multi-GPU notebook job on a specific GPU type with larger scratch directory space:
trainml job create notebook --gpu-type "RTX 3090" --gpu-count 4 --disk-size 50 "My Notebook Job"
To run the model training code in the train.py
file in your local ~/model-code
directory on the training data in your local ~/data
directory:
trainml job create training --model-dir ~/model-code --data-dir ~/data "My Training Job" "python train.py"
Stop a job by job ID:
trainml job stop fe52527c-1f4b-468f-b57d-86db864cc089
Stop a job by name:
trainml job stop "My Notebook Job"
Restart a notebook job:
trainml job start "My Notebook Job"
Remove a job by job ID:
trainml job remove fe52527c-1f4b-468f-b57d-86db864cc089