diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 5b5591aec..562dbe07f 100755
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,12 +29,14 @@ You can leverage this code base and previous experimental results to do so.
 - If your method uses a random seed, it should have a `random_state` attribute that can be set.
 - Methods must have their own folders in the `algorithms` directory (e.g., `algorithms/feat`).
 This folder should contain:
-    1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [submission/feat-example/metadata.yml][metadata].
-    2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [submission/feat-example/regressor.py][regressor] for complete documentation.
+    1. `metadata.yml` (**required**): A file describing your submission, following the descriptions in [algorithms/feat/metadata.yml][metadata].
+    2. `regressor.py` (**required**): a Python file that defines your method, named appropriately. See [algorithms/feat/regressor.py][regressor] for complete documentation.
 It should contain:
        - `est`: a sklearn-compatible `Regressor` object.
        - `model(est, X=None)`: a function that returns a [**sympy-compatible**](https://www.sympy.org) string specifying the final model. It can optionally take the training data as an input argument. See [guidance below](###-returning-a-sympy-compatible-model-string).
        - `eval_kwargs` (optional): a dictionary that can specify method-specific arguments to `evaluate_model.py`.
+       - `get_population(est) -> List[RegressorMixin]`: a function that returns a list of at most 100 expressions, for methods that keep a Pareto front, use population-based optimization or beam search, or otherwise explore several expressions at once. If this does not apply to your algorithm, simply wrap the estimator in a list (_i.e._, `return [est]`). Every element of the returned list must be a compatible `Regressor`, meaning that calling `predict(X)` must work, as must your custom `model(est, X=None)` function for getting a string representation.
+       - `get_best_solution(est)`: a function that provides easy access to the best solution in the current population, if that notion applies to your algorithm. If not, return the estimator itself (`return est`).
     3. `LICENSE` *(optional)* A license file
     4. `environment.yml` *(optional)*: a [conda environment file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) that specifies dependencies for your submission. It will be used to update the baseline environment (`environment.yml` in the root directory).
@@ -43,7 +45,8 @@ This folder should contain:
     5. `requirements.txt` *(optional)*: a pypi requirements file. The script will run `pip install -r requirements.txt` if this file is found, before proceeding.
     5. `install.sh` *(optional)*: a bash script that installs your method. **Note: scripts should not require sudo permissions. The library and include paths should be directed to conda environment; the environmental variable `$CONDA_PREFIX` specifies the path to the environment.
-    6. **do not include your source code**. use `install.sh` to pull it from a stable source repository.
+    6. `Dockerfile` *(optional)*: we will try to dockerize all algorithms. You can optionally include a `Dockerfile` inside your `algorithms/your-submission` folder to describe the specific image used to run your algorithm. If no file is provided, `alg-Dockerfile` is used for your container. You can specify the image as you like, as long as it has, as minimal dependencies, the Python packages described in `base_environment.yml`, since they are used to run the experiment scripts. See [this example](algorithms/tir/Dockerfile) in case you want to use a custom image. *Note that a workflow builds the docker images and pushes them to Docker Hub.*
+    7. **do not include your source code**. Use `install.sh` to pull it from a stable source repository.
 
 ### model compatibility with sympy
 
@@ -63,3 +66,5 @@ def model(est, X):
 ```
 2. The operators/functions in the model are available in [sympy's function set](https://docs.sympy.org/latest/modules/functions/index.html).
+
+### using populations
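For a method that genuinely maintains an archive of candidate expressions, the two hooks described above might look like the following sketch. This is illustrative only and not part of the patch: `archive_` and `best_estimator_` are hypothetical attribute names standing in for wherever your estimator stores its final population and its champion.

```python
# Illustrative sketch only: `archive_` and `best_estimator_` are hypothetical
# attributes; substitute wherever your estimator stores these objects.
from typing import List
from sklearn.base import RegressorMixin

def get_population(est) -> List[RegressorMixin]:
    # Each individual must itself support predict(X) and work with model().
    # Cap the list at the 100 individuals the contract allows.
    return list(est.archive_)[:100]

def get_best_solution(est) -> RegressorMixin:
    # Fall back to the fitted estimator itself if there is no champion.
    return getattr(est, "best_estimator_", est)
```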
diff --git a/algorithms/tir/install.sh b/algorithms/tir/install.sh
index 9418e7f90..782721de2 100755
--- a/algorithms/tir/install.sh
+++ b/algorithms/tir/install.sh
@@ -4,6 +4,7 @@
 git clone https://github.com/folivetti/tir.git
 cd tir
+git checkout fead6fedd139eb5bb3da496d3b1cb2557a2aafda # WGL NOTE: this is a temp fix until PR https://github.com/folivetti/ITEA/pull/12 is merged
 
 # install ghcup
diff --git a/docs/user_guide.md b/docs/user_guide.md
index e90d5c4d2..e8550a8d2 100644
--- a/docs/user_guide.md
+++ b/docs/user_guide.md
@@ -118,6 +118,16 @@ done
 **Output**: next to each `.json` file, an additional file named `.json.updated` is saved with the symbolic assessment included.
 
+### For docker users
+
+When a new algorithm is submitted to SRBench, a GitHub workflow will generate a docker image and push it to [Docker Hub](https://hub.docker.com). This means that you can also easily pull the images, without having to deal with local installations.
+
+To use docker, first run `scripts/make_docker_compose_file.sh`. Then `docker compose up` should create the images.
+
+You can now run arbitrary commands in an image, _e.g._, `docker compose run feat bash test.sh`.
+
+Or you can open an interactive shell in an image with `docker compose run feat bash`.
+
 ### Post-processing
 
 Navigate to the [postprocessing](postprocessing) folder to begin postprocessing the experiment results.
@@ -142,5 +152,5 @@
 python collate_groundtruth_results.py
 To use your own datasets, you want to check out / modify read_file in read_file.py:
 https://github.com/cavalab/srbench/blob/4cc90adc9c450dad3cb3f82c93136bc2cb3b1a0a/experiment/read_file.py
-If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they are in a pandas DataFrame with the target column labelled "targert", you can call `read_file` directly just passing the filename like you would with any of the PMLB datasets.
+If your datasets follow the convention of https://github.com/EpistasisLab/pmlb/tree/master/datasets, i.e. they are in a pandas DataFrame with the target column labelled "target", you can call `read_file` directly, just passing the filename like you would with any of the PMLB datasets.
 The file should be stored and compressed as a `.tsv.gz` file.
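As a concrete companion to the dataset convention above, the following sketch writes a toy dataset in the expected format. The feature names, the generating function, and the file name are made up for illustration; only the tab separation, gzip compression, and the `target` column name are required by the convention.

```python
# Sketch: writing a custom dataset in the PMLB-style format read_file expects.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x0", "x1"])
df["target"] = 2.0 * df["x0"] + np.sin(df["x1"])  # toy ground truth

# tab-separated, gzip-compressed, target column named "target"
df.to_csv("my_dataset.tsv.gz", sep="\t", index=False, compression="gzip")
```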
diff --git a/experiment/methods/feat/regressor.py b/experiment/methods/feat/regressor.py
index 87402a35a..56fe27f53 100644
--- a/experiment/methods/feat/regressor.py
+++ b/experiment/methods/feat/regressor.py
@@ -1,15 +1,16 @@
 # This example submission shows the submission of FEAT (cavalab.org/feat).
 from feat import FeatRegressor
+from sklearn.base import BaseEstimator, RegressorMixin
 
 """
 est: a sklearn-compatible regressor.
 if you don't have one they are fairly easy to create.
 see https://scikit-learn.org/stable/developers/develop.html
 """
-est = FeatRegressor(
+est: RegressorMixin = FeatRegressor(
     pop_size=100,
     gens=100,
-    max_time=8*60*60, # 8 hrs
+    max_time=8*60*60, # 8 hrs. Your algorithm should support a time limit like this one.
     max_depth=6,
     verbosity=2,
     batch_size=100,
@@ -18,7 +19,7 @@
 )
 # want to tune your estimator? wrap it in a sklearn CV class.
 
-def model(est, X=None):
+def model(est, X=None) -> str:
     """
     Return a sympy-compatible string of the final model.
 
@@ -66,6 +67,35 @@ def model(est, X):
     return model_str
 
+
+def get_population(est) -> list[RegressorMixin]:
+    """
+    Return the final population of the model. This final population should
+    be a list with at most 100 individuals. Each individual must be
+    compatible with scikit-learn, so it should have a `predict` method.
+
+    It is also expected that the `model()` function can operate on the
+    individuals, so each must have a way of producing a sympy string
+    representation.
+
+    Returns
+    -------
+    A list of scikit-learn compatible estimators
+    """
+
+    return [est]
+
+
+def get_best_solution(est) -> RegressorMixin:
+    """
+    Return the best solution from the final population.
+
+    Returns
+    -------
+    A scikit-learn compatible estimator
+    """
+
+    return est
+
+
 ################################################################################
 # Optional Settings
 ################################################################################
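Because `model()` must return a sympy-compatible string, it is worth parsing the output once before submitting. A minimal self-check sketch, where `model_str` is a hypothetical stand-in for the real output of `model(est, X)`:

```python
# Minimal sympy-compatibility check; `model_str` stands in for the
# output of model(est, X).
import sympy

model_str = "2.1*x_0 + sin(x_1)"  # hypothetical model string
symbols = {f"x_{i}": sympy.Symbol(f"x_{i}") for i in range(2)}
expr = sympy.parse_expr(model_str, local_dict=symbols)
print(sympy.simplify(expr))  # 2.1*x_0 + sin(x_1)
```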
diff --git a/experiment/test.sh b/experiment/test.sh
index 86aa9debf..44a17a274 100644
--- a/experiment/test.sh
+++ b/experiment/test.sh
@@ -1 +1,2 @@
 python -m pytest -v test_algorithm.py --ml ${ALGORITHM}
+python -m pytest -v test_population.py --ml ${ALGORITHM}
diff --git a/experiment/test_population.py b/experiment/test_population.py
new file mode 100644
index 000000000..6cba3412e
--- /dev/null
+++ b/experiment/test_population.py
@@ -0,0 +1,72 @@
+import sys
+import os
+import numpy as np
+from os.path import dirname as d
+from os.path import abspath
+from sklearn.model_selection import train_test_split
+
+root_dir = d(abspath(__file__))
+sys.path.append(root_dir)
+print('appended', root_dir, 'to sys.path')
+
+import importlib
+from read_file import read_file
+
+if 'OMP_NUM_THREADS' not in os.environ.keys():
+    os.environ['OMP_NUM_THREADS'] = '1'
+if 'OPENBLAS_NUM_THREADS' not in os.environ.keys():
+    os.environ['OPENBLAS_NUM_THREADS'] = '1'
+if 'MKL_NUM_THREADS' not in os.environ.keys():
+    os.environ['MKL_NUM_THREADS'] = '1'
+
+
+def test_population(ml):
+    """Check that get_population and get_best_solution behave as expected."""
+
+    dataset = 'test/192_vineyard_small.tsv.gz'
+    random_state = 42
+
+    algorithm = importlib.__import__(f'methods.{ml}.regressor', globals(),
+                                     locals(),
+                                     ['est', 'hyper_params', 'complexity'])
+
+    features, labels, feature_names = read_file(
+        dataset,
+        use_dataframe=True
+    )
+    print('feature_names:', feature_names)
+
+    # generate train/test split
+    X_train, X_test, y_train, y_test = train_test_split(features, labels,
+                                                        train_size=0.75,
+                                                        test_size=0.25,
+                                                        random_state=random_state)
+
+    # subsample a few rows (positionally, hence iloc) to keep the test quick
+    rng = np.random.default_rng(random_state)
+    sample_idx = rng.choice(np.arange(len(X_train)), size=10)
+
+    y_train = y_train[sample_idx]
+    X_train = X_train.iloc[sample_idx]
+
+    algorithm.est.fit(X_train, y_train)
+
+    # fall back to single-solution behavior if the submission does not
+    # define the optional hooks
+    if 'get_population' not in dir(algorithm):
+        algorithm.get_population = lambda est: [est]
+    if 'get_best_solution' not in dir(algorithm):
+        algorithm.get_best_solution = lambda est: est
+
+    population = algorithm.get_population(algorithm.est)
+    best_model = algorithm.get_best_solution(algorithm.est)
+
+    print(algorithm.model(best_model))
+    print(algorithm.est.predict(X_train))
+
+    # the population must have at least 1 and at most 100 individuals
+    assert 1 <= len(population) <= 100, "Population size is not within the expected range"
+
+    for p in population:
+        print(algorithm.model(p))
+        print(p.predict(X_train))
diff --git a/local_ci.sh b/local_ci.sh
index fceb405ff..723281088 100644
--- a/local_ci.sh
+++ b/local_ci.sh
@@ -59,6 +59,7 @@ conda activate $SUBENV
 conda env list
 conda info
 python -m pytest -v test_algorithm.py --ml $SUBNAME
+python -m pytest -v test_population.py --ml $SUBNAME
 
 # Store Competitor
 # cd ..
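The same pair of checks run by `test.sh` and `local_ci.sh` can also be driven from Python, which is convenient when iterating on a single submission. A sketch, assuming it is run from the `experiment/` directory and using `feat` as the example method name:

```python
# Mirrors test.sh / local_ci.sh for one method; run from experiment/.
# "feat" is the example method name; substitute your own submission.
import sys
import pytest

code = pytest.main(["-v", "test_algorithm.py", "--ml", "feat"])
if code == 0:
    code = pytest.main(["-v", "test_population.py", "--ml", "feat"])
sys.exit(code)
```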