Merge pull request #182 from gAldeia/docker-compose-documentation
Documentation for new docker features. Implementation example for `get_population` and tests for the feature.
lacava authored Sep 25, 2024
2 parents 60f90a0 + e944fa3 commit e817917
Showing 7 changed files with 127 additions and 7 deletions.
11 changes: 8 additions & 3 deletions CONTRIBUTING.md
@@ -29,12 +29,14 @@ You can leverage this code base and previous experimental results to do so.
- If your method uses a random seed, it should have a `random_state` attribute that can be set.
- Methods must have their own folders in the `algorithms` directory (e.g., `algorithms/feat`).
This folder should contain:
1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [submission/feat-example/metadata.yml][metadata].
2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [submission/feat-example/regressor.py][regressor] for complete documentation.
1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [algorithms/feat/metadata.yml][metadata].
2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [algorithms/feat/regressor.py][regressor] for complete documentation.
It should contain:
- `est`: a sklearn-compatible `Regressor` object.
- `model(est, X=None)`: a function that returns a [**sympy-compatible**](https://www.sympy.org) string specifying the final model. It can optionally take the training data as an input argument. See [guidance below](#model-compatibility-with-sympy).
- `eval_kwargs` (optional): a dictionary that can specify method-specific arguments to `evaluate_model.py`.
- `get_population(est) --> List[RegressorMixin]`: a function that returns a list of at most 100 expressions, if your algorithm uses a Pareto front, population-based optimization, beam search, or any other strategy that explores several expressions. If this does not apply to your algorithm, you can simply wrap the estimator in a list (_i.e._, `return [est]`). Every element of the returned list must be a compatible `Regressor`, meaning that calling `predict(X)` must work, as must your custom `model(est, X=None)` function for getting a string representation.
- `get_best_solution(est)`: should provide an easy way of accessing the best solution in the current population, if this applies to your algorithm. If not, return the estimator itself (_i.e._, `return est`). A minimal sketch of both functions follows this list.
3. `LICENSE` *(optional)* A license file
4. `environment.yml` *(optional)*: a [conda environment file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) that specifies dependencies for your submission.
It will be used to update the baseline environment (`environment.yml` in the root directory).
@@ -43,7 +45,8 @@ This folder should contain:
5. `requirements.txt` *(optional)*: a pypi requirements file. The script will run `pip install -r requirements.txt` if this file is found, before proceeding.
5. `install.sh` *(optional)*: a bash script that installs your method.
**Note: scripts should not require sudo permissions. Library and include paths should point to the conda environment; the environment variable `$CONDA_PREFIX` specifies the path to the environment.**
6. **do not include your source code**. use `install.sh` to pull it from a stable source repository.
6. `Dockerfile` *(optional)*: we will try to dockerize all algorithms. You can optionally include a `Dockerfile` inside your `algorithms/your-submission` folder to describe a specific image for running your algorithm. If no file is provided, `alg-Dockerfile` will be used to build your container. You can specify the image as you like, as long as it includes, at minimum, the Python packages described in `base_environment.yml`, since they are used to run the experiment scripts. See [this example](algorithms/tir/Dockerfile) if you want to use a custom image. *Note that there is a workflow to build the docker images and push them to Docker Hub*.
7. **do not include your source code**. use `install.sh` to pull it from a stable source repository.
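
A minimal sketch of the two population functions for a hypothetical population-based method (the `archive_` attribute below is a placeholder for wherever your method stores its individuals, not a real API):

```python
from typing import List

from sklearn.base import RegressorMixin

def get_population(est) -> List[RegressorMixin]:
    # Return at most 100 sklearn-compatible individuals, e.g. a Pareto archive.
    # `est.archive_` is a hypothetical attribute; substitute your method's own storage.
    return list(est.archive_)[:100]

def get_best_solution(est) -> RegressorMixin:
    # If your method keeps no separate "best" object, the estimator itself works.
    return est
```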

### model compatibility with sympy

@@ -63,3 +66,5 @@ def model(est, X):
```

2. The operators/functions in the model are available in [sympy's function set](https://docs.sympy.org/latest/modules/functions/index.html).
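
As a quick self-check, the snippet below (a minimal sketch; the model string is a hypothetical example output) verifies that a returned string parses with sympy:

```python
from sympy import parse_expr

model_str = "2.5*x1 + sin(x2)"   # hypothetical output of model(est, X)
expr = parse_expr(model_str)     # raises an error if the string is not sympy-compatible
print(expr.free_symbols)         # {x1, x2}: feature names map to sympy symbols
```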

### using populations
1 change: 1 addition & 0 deletions algorithms/tir/install.sh
@@ -4,6 +4,7 @@
git clone https://github.com/folivetti/tir.git

cd tir
git checkout fead6fedd139eb5bb3da496d3b1cb2557a2aafda

# WGL NOTE: this is a temp fix until PR https://github.com/folivetti/ITEA/pull/12 is merged
# install ghcup
12 changes: 11 additions & 1 deletion docs/user_guide.md
@@ -118,6 +118,16 @@ done

**Output**: next to each `.json` file, an additional file named `.json.updated` is saved with the symbolic assessment included.

### For docker users

When a new algorithm is submitted to SRBench, a GitHub workflow generates a docker image and pushes it to [Docker Hub](https://hub.docker.com). This means you can simply pull the images instead of dealing with local installations.

To use docker, first run `scripts/make_docker_compose_file.sh`. Then `docker compose up` should create the images.

You can now run arbitrary commands inside a container, _e.g._, `docker compose run feat bash test.sh`, or open an interactive bash session in an image with `docker compose run feat bash`.
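
Putting these steps together, a typical session might look like this (using the `feat` service from the examples above):

```bash
# generate the compose file and build/pull the images
bash scripts/make_docker_compose_file.sh
docker compose up

# run an algorithm's test suite inside its container
docker compose run feat bash test.sh

# or open an interactive shell in the same image
docker compose run feat bash
```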

### Post-processing

Navigate to the [postprocessing](postprocessing) folder to begin postprocessing the experiment results.
@@ -142,5 +152,5 @@ python collate_groundtruth_results.py

To use your own datasets, check out or modify `read_file` in `read_file.py`: https://github.com/cavalab/srbench/blob/4cc90adc9c450dad3cb3f82c93136bc2cb3b1a0a/experiment/read_file.py

If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they are in a pandas DataFrame with the target column labelled "targert", you can call `read_file` directly just passing the filename like you would with any of the PMLB datasets.
If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they load into a pandas DataFrame with the target column labelled "target", you can call `read_file` directly, passing the filename just as you would for any of the PMLB datasets.
The file should be stored and compressed as a `.tsv.gz` file.
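
For example, assuming a conforming dataset at `path/to/my_dataset.tsv.gz` (a hypothetical path), the call mirrors the usage in `experiment/test_population.py`:

```python
from read_file import read_file  # experiment/read_file.py

# loads a PMLB-style .tsv.gz whose target column is labelled "target"
features, labels, feature_names = read_file(
    'path/to/my_dataset.tsv.gz',  # hypothetical path
    use_dataframe=True,
)
```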
36 changes: 33 additions & 3 deletions experiment/methods/feat/regressor.py
@@ -1,15 +1,16 @@
# This example shows the submission of FEAT (cavalab.org/feat).
from feat import FeatRegressor
from sklearn.base import BaseEstimator, RegressorMixin

"""
est: a sklearn-compatible regressor.
if you don't have one they are fairly easy to create.
see https://scikit-learn.org/stable/developers/develop.html
"""
est = FeatRegressor(
est: RegressorMixin = FeatRegressor(
    pop_size=100,
    gens=100,
    max_time=8*60*60, # 8 hrs
    max_time=8*60*60, # 8 hrs. Your algorithm should support this kind of time limit
    max_depth=6,
    verbosity=2,
    batch_size=100,
@@ -18,7 +19,7 @@
)
# want to tune your estimator? wrap it in a sklearn CV class.

def model(est, X=None):
def model(est, X=None) -> str:
"""
Return a sympy-compatible string of the final model.

@@ -66,6 +67,35 @@ def model(est, X):

    return model_str

def get_population(est) -> list[RegressorMixin]:
    """
    Return the final population of the model. This final population should
    be a list with at most 100 individuals. Each of the individuals must
    be compatible with scikit-learn, so they should have a predict method.

    Also, the `model()` function is expected to work with each of them,
    so they should provide a way of getting a sympy string representation.

    Returns
    -------
    A list of scikit-learn compatible estimators
    """

    return [est]
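
# Note: for a genuinely population-based method, get_population might instead
# return a slice of the search archive, e.g. `return list(est.archive_)[:100]`
# (`archive_` is a hypothetical attribute here, not FEAT's documented API).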


def get_best_solution(est) -> RegressorMixin:
    """
    Return the best solution from the final population.

    Returns
    -------
    A scikit-learn compatible estimator
    """

    return est


################################################################################
# Optional Settings
################################################################################
1 change: 1 addition & 0 deletions experiment/test.sh
@@ -1 +1,2 @@
python -m pytest -v test_algorithm.py --ml ${ALGORITHM}
python -m pytest -v test_population.py --ml ${ALGORITHM}
72 changes: 72 additions & 0 deletions experiment/test_population.py
@@ -0,0 +1,72 @@
import sys
import os
import types
import numpy as np
from os.path import dirname as d
from os.path import abspath
from sklearn.model_selection import train_test_split

root_dir = d(abspath(__file__))
sys.path.append(root_dir)
print('appended', root_dir, 'to sys.path')

import importlib
from read_file import read_file

if 'OMP_NUM_THREADS' not in os.environ.keys():
    os.environ['OMP_NUM_THREADS'] = '1'
if 'OPENBLAS_NUM_THREADS' not in os.environ.keys():
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
if 'MKL_NUM_THREADS' not in os.environ.keys():
    os.environ['MKL_NUM_THREADS'] = '1'


def test_population(ml):
    """Test the get_population and get_best_solution interface."""

    dataset = 'test/192_vineyard_small.tsv.gz'
    random_state = 42

    algorithm = importlib.__import__(f'methods.{ml}.regressor', globals(),
                                     locals(),
                                     ['est', 'hyper_params', 'complexity'])

    features, labels, feature_names = read_file(
        dataset,
        use_dataframe=True
    )
    print('feature_names:', feature_names)

    # generate train/test split
    X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                        train_size=0.75,
                                                        test_size=0.25,
                                                        random_state=random_state)

    # use only a few samples to keep the test quick
    sample_idx = np.random.choice(np.arange(len(X_train)), size=10)

    y_train = y_train[sample_idx]
    X_train = X_train.iloc[sample_idx]  # positional indexing; .loc can fail on a shuffled index

    algorithm.est.fit(X_train, y_train)

    # fall back to sensible defaults if the submission does not define these
    if 'get_population' not in dir(algorithm):
        algorithm.get_population = lambda est: [est]
    if 'get_best_solution' not in dir(algorithm):
        algorithm.get_best_solution = lambda est: est

    population = algorithm.get_population(algorithm.est)

    best_model = algorithm.get_best_solution(algorithm.est)
    print(algorithm.model(best_model))
    print(algorithm.est.predict(X_train))

    # assert that population has at least 1 and no more than 100 individuals
    assert 1 <= len(population) <= 100, "Population size is not within the expected range"

    # every individual must support model() and predict()
    for p in population:
        print(algorithm.model(p))
        print(p.predict(X_train))
1 change: 1 addition & 0 deletions local_ci.sh
@@ -59,6 +59,7 @@ conda activate $SUBENV
conda env list
conda info
python -m pytest -v test_algorithm.py --ml $SUBNAME
python -m pytest -v test_population.py --ml $SUBNAME

# Store Competitor
# cd ..
