- Contributing to datasetinsights
- Developing datasetinsights
- Codebase structure
- Train Model
- Unit testing
- Style Guide
- Writing documentation
If you are interested in contributing to datasetinsights, your contributions will fall into two categories:
- You want to propose a new model, dataset, or evaluation metric and implement it.
- You want to implement a feature or bug-fix for an outstanding issue.
Here are the steps to set up a datasetinsights virtual environment on your machine:
- Install poetry, git and pre-commit
- Create a virtual environment. We recommend using miniconda
conda create -n dins-dev python=3.7
conda activate dins-dev
- Clone a copy of datasetinsights from source:
git clone https://github.com/Unity-Technologies/datasetinsights.git
cd datasetinsights
Note: until the datasetinsights source is available on public GitHub, clone the repo from
git@gitlab.internal.unity3d.com:machine-learning/thea.git instead.
- Install datasetinsights in develop mode:
poetry install
This will link the Python files from the current local source tree into the virtual environment, so local changes take effect without reinstalling.
The develop mode also includes Python packages such as pytest and black.
- Install pre-commit hook to the .git folder.
pre-commit install
# pre-commit installed at .git/hooks/pre-commit
Add new Python dependencies to the datasetinsights environment using poetry, for example:
poetry add numpy@^1.18.4
Make sure you add only the packages you need directly, not all of their dependencies. Let the package management system resolve the dependencies. See poetry add for detailed instructions.
The datasetinsights package contains the following modules.
datasetinsights
- commands: This module contains the CLI commands.
- configs: This module contains estimator configuration files.
- datasets: This module contains different datasets. The dataset classes contain knowledge of how the dataset should be loaded into memory.
- estimators: This module contains estimators that are used for training and evaluating models on the datasets.
- evaluation_metrics: This module contains metrics used by the different estimators. The metrics are specified in the estimator config file.
- io: This module contains functionality related to writing, downloading, and uploading to and from different sources.
- stats: This module contains code for visualizing and gathering statistics on the dataset.
We use pytest to run tests located under tests/. Run the entire test suite with
pytest
or run an individual test file, for example:
pytest tests/test_visual.py
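As a rough illustration of what pytest collects, the sketch below shows a hypothetical test file. The file name tests/test_example.py and the helper bounding_box_area are assumptions made for this example and are not part of datasetinsights. pytest discovers functions whose names start with test_ and reports a failure when an assert does not hold.

```python
# Hypothetical contents of tests/test_example.py (illustrative names only).


def bounding_box_area(x_min, y_min, x_max, y_max):
    """Return the area of an axis-aligned box, or 0 if it is degenerate."""
    width = max(0, x_max - x_min)
    height = max(0, y_max - y_min)
    return width * height


def test_area_of_regular_box():
    # A 4 x 3 box has area 12.
    assert bounding_box_area(0, 0, 4, 3) == 12


def test_area_of_inverted_box_is_zero():
    # Inverted corners are clamped to zero width and height.
    assert bounding_box_area(5, 5, 1, 1) == 0
```

In a real contribution the helper would live in the package and the test file would import it; it is defined inline here only to keep the sketch self-contained.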
We follow Black code style for this repository. The max line length is set at 80. We enforce this code style using Black to format Python code. In addition to Black, we use isort to sort Python imports.
Before submitting a pull request, run:
pre-commit run --all-files
Fix all issues highlighted by flake8. To skip a check for a legitimate exception, such as a long URL line in a docstring, add # noqa: E501 <describe reason>
to the specific line that violates the rule. See this to learn more about how to ignore flake8 errors.
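A minimal sketch of the noqa marker in use, assuming a line that cannot reasonably be wrapped. The constant name and the URL are placeholders invented for this example. The marker suppresses only E501 (line too long) and only on the line that carries it.

```python
# Hypothetical constant; the URL is a placeholder, not a real location.
EXAMPLE_DATASET_URL = "https://example.com/datasets/synthetic/v1/a-long-path-that-cannot-be-wrapped/archive.zip"  # noqa: E501 URL cannot be split across lines


def describe_source():
    """Return a short description that embeds the placeholder URL."""
    return f"Downloaded from {EXAMPLE_DATASET_URL}"
```

Keeping the reason after the error code helps reviewers see why the exception was made.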
Some editors support automatically formatting on save. For example, in vscode
datasetinsights uses Google style for formatting docstrings. Lines inside docstring blocks must be limited to 80 characters, with exceptions such as long URLs or tables.
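For reference, a Google-style docstring typically has a one-line summary followed by Args, Returns, and Raises sections. The function below is a hypothetical example written for this guide and is not taken from the datasetinsights codebase.

```python
def scale_bounding_box(box, factor):
    """Scale the width and height of a bounding box.

    Args:
        box (tuple): Box as ``(x, y, width, height)`` in pixels.
        factor (float): Multiplier applied to width and height.

    Returns:
        tuple: The scaled box as ``(x, y, width, height)``.

    Raises:
        ValueError: If ``factor`` is negative.
    """
    if factor < 0:
        raise ValueError("factor must be non-negative")
    x, y, width, height = box
    # The top-left corner stays fixed; only the size changes.
    return (x, y, width * factor, height * factor)
```

Every line in the docstring stays within the 80 character limit, so no noqa marker is needed.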
Follow instructions here.