Merge pull request #182 from gAldeia/docker-compose-documentation
Documentation for new docker features. Implementation example for `get_population` and tests for the feature.
lacava authored Sep 25, 2024
2 parents 60f90a0 + e944fa3 commit e817917
Showing 7 changed files with 127 additions and 7 deletions.
11 changes: 8 additions & 3 deletions CONTRIBUTING.md
@@ -29,12 +29,14 @@ You can leverage this code base and previous experimental results to do so.
- If your method uses a random seed, it should have a `random_state` attribute that can be set.
- Methods must have their own folders in the `algorithms` directory (e.g., `algorithms/feat`).
This folder should contain:
1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [submission/feat-example/metadata.yml][metadata].
2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [submission/feat-example/regressor.py][regressor] for complete documentation.
1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [algorithms/feat/metadata.yml][metadata].
2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [algorithms/feat/regressor.py][regressor] for complete documentation.
It should contain:
- `est`: a sklearn-compatible `Regressor` object.
- `model(est, X=None)`: a function that returns a [**sympy-compatible**](https://www.sympy.org) string specifying the final model. It can optionally take the training data as an input argument. See [guidance below](#model-compatibility-with-sympy).
- `eval_kwargs` (optional): a dictionary that can specify method-specific arguments to `evaluate_model.py`.
- `get_population(est) --> List[RegressorMixin]`: a function that returns a list of at most 100 expressions, if your algorithm uses a Pareto front, population-based optimization, beam search, or any other strategy that explores several expressions. If this does not apply to your algorithm, you can simply wrap the estimator in a list (_i.e._, `return [est]`). Every element of the returned list must be a compatible `Regressor`, meaning that calling `predict(X)` must work, as must your custom `model(est, X=None)` function for getting a string representation.
- `get_best_solution(est)`: should provide an easy way of accessing the best solution in the current population, if this applies to your algorithm. If not, return the estimator itself (_i.e._, `return est`). A minimal sketch of both functions follows this list.
3. `LICENSE` *(optional)* A license file
4. `environment.yml` *(optional)*: a [conda environment file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) that specifies dependencies for your submission.
It will be used to update the baseline environment (`environment.yml` in the root directory).
@@ -43,7 +45,8 @@ This folder should contain:
5. `requirements.txt` *(optional)*: a pypi requirements file. The script will run `pip install -r requirements.txt` if this file is found, before proceeding.
5. `install.sh` *(optional)*: a bash script that installs your method.
**Note: scripts should not require sudo permissions. Library and include paths should point to the conda environment; the environment variable `$CONDA_PREFIX` specifies the path to the environment.**
6. **do not include your source code**. use `install.sh` to pull it from a stable source repository.
6. `Dockerfile` *(optional)*: we will try to dockerize all algorithms. You can optionally include a `Dockerfile` inside your `algorithms/your-submission` folder to describe a specific image for running your algorithm. If no file is provided, `alg-Dockerfile` will be used to build your container. You can specify the image as you like, as long as it includes, at minimum, the Python packages described in `base_environment.yml`, since they are used to run the experiment scripts. See [this example](algorithms/tir/Dockerfile) if you want to use a custom image. *Note that there is a workflow to build the docker images and push them to Docker Hub*.
7. **do not include your source code**. use `install.sh` to pull it from a stable source repository.
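
A minimal sketch of the two population functions for a hypothetical population-based method (the `archive_` attribute below is a placeholder for wherever your method stores its individuals, not a real API):

```python
from typing import List

from sklearn.base import RegressorMixin

def get_population(est) -> List[RegressorMixin]:
    # Return at most 100 sklearn-compatible individuals, e.g. a Pareto archive.
    # `est.archive_` is a hypothetical attribute; substitute your method's own storage.
    return list(est.archive_)[:100]

def get_best_solution(est) -> RegressorMixin:
    # If your method keeps no separate "best" object, the estimator itself works.
    return est
```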

### model compatibility with sympy

@@ -63,3 +66,5 @@ def model(est, X):
```

2. The operators/functions in the model are available in [sympy's function set](https://docs.sympy.org/latest/modules/functions/index.html).
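
As a quick self-check, the snippet below (a minimal sketch; the model string is a hypothetical example output) verifies that a returned string parses with sympy:

```python
from sympy import parse_expr

model_str = "2.5*x1 + sin(x2)"   # hypothetical output of model(est, X)
expr = parse_expr(model_str)     # raises an error if the string is not sympy-compatible
print(expr.free_symbols)         # {x1, x2}: feature names map to sympy symbols
```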

### using populations
1 change: 1 addition & 0 deletions algorithms/tir/install.sh
@@ -4,6 +4,7 @@
git clone https://github.com/folivetti/tir.git

cd tir
git checkout fead6fedd139eb5bb3da496d3b1cb2557a2aafda

# WGL NOTE: this is a temp fix until PR https://github.com/folivetti/ITEA/pull/12 is merged
# install ghcup
12 changes: 11 additions & 1 deletion docs/user_guide.md
@@ -118,6 +118,16 @@ done

**Output**: next to each `.json` file, an additional file named `.json.updated` is saved with the symbolic assessment included.

### For docker users

When a new algorithm is submitted to SRBench, a GitHub workflow generates a docker image and pushes it to [Docker Hub](https://hub.docker.com). This means you can simply pull the images instead of dealing with local installations.

To use docker, first run `scripts/make_docker_compose_file.sh`. Then `docker compose up` should create the images.

You can now run arbitrary commands inside a container, _e.g._, `docker compose run feat bash test.sh`, or open an interactive bash session in an image with `docker compose run feat bash`.
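
Putting these steps together, a typical session might look like this (using the `feat` service from the examples above):

```bash
# generate the compose file and build/pull the images
bash scripts/make_docker_compose_file.sh
docker compose up

# run an algorithm's test suite inside its container
docker compose run feat bash test.sh

# or open an interactive shell in the same image
docker compose run feat bash
```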

### Post-processing

Navigate to the [postprocessing](postprocessing) folder to begin postprocessing the experiment results.
@@ -142,5 +152,5 @@ python collate_groundtruth_results.py

To use your own datasets, check out or modify `read_file` in `read_file.py`: https://github.com/cavalab/srbench/blob/4cc90adc9c450dad3cb3f82c93136bc2cb3b1a0a/experiment/read_file.py

If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they are in a pandas DataFrame with the target column labelled "targert", you can call `read_file` directly just passing the filename like you would with any of the PMLB datasets.
If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they load into a pandas DataFrame with the target column labelled "target", you can call `read_file` directly, passing the filename just as you would for any of the PMLB datasets.
The file should be stored and compressed as a `.tsv.gz` file.
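
For example, assuming a conforming dataset at `path/to/my_dataset.tsv.gz` (a hypothetical path), the call mirrors the usage in `experiment/test_population.py`:

```python
from read_file import read_file  # experiment/read_file.py

# loads a PMLB-style .tsv.gz whose target column is labelled "target"
features, labels, feature_names = read_file(
    'path/to/my_dataset.tsv.gz',  # hypothetical path
    use_dataframe=True,
)
```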
36 changes: 33 additions & 3 deletions experiment/methods/feat/regressor.py
@@ -1,15 +1,16 @@
# This example shows the submission of FEAT (cavalab.org/feat).
from feat import FeatRegressor
from sklearn.base import BaseEstimator, RegressorMixin

"""
est: a sklearn-compatible regressor.
if you don't have one they are fairly easy to create.
see https://scikit-learn.org/stable/developers/develop.html
"""
est = FeatRegressor(
est: RegressorMixin = FeatRegressor(
    pop_size=100,
    gens=100,
    max_time=8*60*60, # 8 hrs
    max_time=8*60*60, # 8 hrs. Your algorithm should support this kind of time limit
    max_depth=6,
    verbosity=2,
    batch_size=100,
@@ -18,7 +19,7 @@
)
# want to tune your estimator? wrap it in a sklearn CV class.

def model(est, X=None):
def model(est, X=None) -> str:
"""
Return a sympy-compatible string of the final model.

@@ -66,6 +67,35 @@ def model(est, X):

    return model_str

def get_population(est) -> list[RegressorMixin]:
    """
    Return the final population of the model. This final population should
    be a list with at most 100 individuals. Each of the individuals must
    be compatible with scikit-learn, so they should have a predict method.

    Also, the `model()` function is expected to work with each of them,
    so they should provide a way of getting a sympy string representation.

    Returns
    -------
    A list of scikit-learn compatible estimators
    """

    return [est]
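
# Note: for a genuinely population-based method, get_population might instead
# return a slice of the search archive, e.g. `return list(est.archive_)[:100]`
# (`archive_` is a hypothetical attribute here, not FEAT's documented API).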


def get_best_solution(est) -> RegressorMixin:
    """
    Return the best solution from the final population.

    Returns
    -------
    A scikit-learn compatible estimator
    """

    return est


################################################################################
# Optional Settings
################################################################################
1 change: 1 addition & 0 deletions experiment/test.sh
@@ -1 +1,2 @@
python -m pytest -v test_algorithm.py --ml ${ALGORITHM}
python -m pytest -v test_population.py --ml ${ALGORITHM}
72 changes: 72 additions & 0 deletions experiment/test_population.py
@@ -0,0 +1,72 @@
import sys
import os
import types
import numpy as np
from os.path import dirname as d
from os.path import abspath
from sklearn.model_selection import train_test_split

root_dir = d(abspath(__file__))
sys.path.append(root_dir)
print('appended', root_dir, 'to sys.path')

import importlib
from read_file import read_file

if 'OMP_NUM_THREADS' not in os.environ.keys():
    os.environ['OMP_NUM_THREADS'] = '1'
if 'OPENBLAS_NUM_THREADS' not in os.environ.keys():
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
if 'MKL_NUM_THREADS' not in os.environ.keys():
    os.environ['MKL_NUM_THREADS'] = '1'


def test_population(ml):
    """Test the get_population and get_best_solution interface."""

    dataset = 'test/192_vineyard_small.tsv.gz'
    random_state = 42

    algorithm = importlib.__import__(f'methods.{ml}.regressor', globals(),
                                     locals(),
                                     ['est', 'hyper_params', 'complexity'])

    features, labels, feature_names = read_file(
        dataset,
        use_dataframe=True
    )
    print('feature_names:', feature_names)

    # generate train/test split
    X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                        train_size=0.75,
                                                        test_size=0.25,
                                                        random_state=random_state)

    # use only a few samples to keep the test quick
    sample_idx = np.random.choice(np.arange(len(X_train)), size=10)

    y_train = y_train[sample_idx]
    X_train = X_train.iloc[sample_idx]  # positional indexing; .loc can fail on a shuffled index

    algorithm.est.fit(X_train, y_train)

    # fall back to sensible defaults if the submission does not define these
    if 'get_population' not in dir(algorithm):
        algorithm.get_population = lambda est: [est]
    if 'get_best_solution' not in dir(algorithm):
        algorithm.get_best_solution = lambda est: est

    population = algorithm.get_population(algorithm.est)

    best_model = algorithm.get_best_solution(algorithm.est)
    print(algorithm.model(best_model))
    print(algorithm.est.predict(X_train))

    # assert that population has at least 1 and no more than 100 individuals
    assert 1 <= len(population) <= 100, "Population size is not within the expected range"

    # every individual must support model() and predict()
    for p in population:
        print(algorithm.model(p))
        print(p.predict(X_train))
1 change: 1 addition & 0 deletions local_ci.sh
@@ -59,6 +59,7 @@ conda activate $SUBENV
conda env list
conda info
python -m pytest -v test_algorithm.py --ml $SUBNAME
python -m pytest -v test_population.py --ml $SUBNAME

# Store Competitor
# cd ..
