Datasets can easily exceed the available memory, making it impossible to load all the data at once. Training a machine learning model, however, requires only one batch of data at a time, a tiny fraction of the overall dataset, so it is more efficient to load data only when it is needed (see the sketch after the list below). This repository collects notebooks that train a demo machine learning model on a large dataset. We cover the following frameworks:
- PyTorch Lightning 2.4.0
- TensorFlow
- Keras
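For illustration only (this is not code from the notebooks), the sketch below uses PyTorch's `Dataset`/`DataLoader` API, one of the frameworks listed above, to read individual samples lazily from a memory-mapped NumPy file. The file name `training_data.npy` and the batch size are invented for this example.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class LazyArrayDataset(Dataset):
    """Reads one sample at a time from a memory-mapped .npy file."""

    def __init__(self, path):
        # mmap_mode="r" keeps the array on disk; slices are read on demand
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Only this one sample is copied from disk into memory
        return torch.from_numpy(np.asarray(self.data[idx]).copy())


# Placeholder file and batch size; only one batch of samples is ever in memory
loader = DataLoader(LazyArrayDataset("training_data.npy"), batch_size=32, shuffle=True)
```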
In addition, we demonstrate how efficient data loading can be achieved with Zarr data stores. We show how the data needs to be preprocessed and saved to disk in such a way that random access to small chunks of data during training is fast.
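The notebooks contain the actual preprocessing. As a minimal sketch of the idea, the snippet below writes an array to a Zarr store whose chunks are small groups of consecutive samples, so that a random batch read during training touches only the few chunks it overlaps. The path, array shape, and chunk size are invented for this example.

```python
import numpy as np
import zarr

# Hypothetical raw data: 100_000 samples with 64 features each
raw = np.random.rand(100_000, 64).astype("float32")

# Chunk along the sample axis only: one chunk = 256 samples.
# A random batch then maps to a handful of small, contiguous reads.
store = zarr.open(
    "training_data.zarr",
    mode="w",
    shape=raw.shape,
    chunks=(256, 64),
    dtype="float32",
)
store[:] = raw

# During training, reading a slice only loads the chunks it overlaps
batch = store[512:768]  # a single 256-sample chunk is read from disk
```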
Create a virtual environment
module load python3
python3 -m venv .venv
Activate the environment and install the machine learning packages via pip
source .venv/bin/activate
pip install -r requirements.txt
Create a JupyterHub kernel
python -m ipykernel install --user --name tutorial_ml --display-name="Tutorial Machine Learning"
Use this kernel ("Tutorial Machine Learning") to run the notebook corresponding to your framework of choice.