This notebook demonstrates hyperparameter tuning using the grid search algorithm. We have a dataset consisting of Amazon product reviews and a scikit-learn classifier to classify these reviews. We take advantage of cloud functions to tune this classifier's hyperparameters and show how Lithops can be used for this kind of computation.
In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.
The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.
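To make the discretization step concrete, here is a minimal sketch that turns a real-valued range into a few log-spaced values and then enumerates every combination in the grid. The helper names (logspace, expand_grid) and the parameter names and ranges are illustrative assumptions, not the values used later in this notebook.

```python
from itertools import product


def logspace(low_exp, high_exp, num):
    """Return `num` values evenly spaced on a log10 scale
    between 10**low_exp and 10**high_exp (both included)."""
    if num == 1:
        return [10.0 ** low_exp]
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]


def expand_grid(param_grid):
    """Enumerate every combination of the listed values as a list of dicts."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical grid: a real-valued parameter discretized into 4 values,
# combined with a categorical parameter that has 2 options.
param_grid = {
    "alpha": logspace(-6, -3, 4),
    "penalty": ["l2", "elasticnet"],
}

candidates = expand_grid(param_grid)
print(len(candidates))  # 4 * 2 = 8 candidate configurations
```

The number of candidates is the product of the number of values per parameter, which is why grid search becomes expensive quickly and benefits from running the evaluations in parallel.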
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import joblib
from pprint import pprint
from time import time
import click
import bz2
- AWS account
- Lithops 3.1.0
- scikit-learn 1.3.0
Download the dataset from here and extract the zip folder
The load_data function separates the data into X and Y arrays to prepare them for the classifier.
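As a rough illustration of what such a loading step might look like, the sketch below reads a limited amount of a bz2-compressed text file and splits each line into a label and a review. It is not the notebook's load_data: the function name load_data_sketch, the file path argument, and the line format (a fastText-style `__label__N review text` prefix) are all assumptions made for this example.

```python
import bz2


def load_data_sketch(path, mib):
    """Read about `mib` MiB of decompressed text and return (X, Y).

    Assumes each line looks like '__label__<int> <review text>'
    (a hypothetical format; adapt the parsing to the real dataset).
    """
    limit = mib * 1024 * 1024
    with bz2.open(path, "rb") as f:
        raw = f.read(limit)

    lines = raw.decode("utf-8", errors="ignore").splitlines()
    # If the size limit cut the data mid-line, drop the incomplete last line.
    if len(raw) == limit and not raw.endswith(b"\n") and lines:
        lines = lines[:-1]

    X, Y = [], []
    for line in lines:
        label, sep, text = line.partition(" ")
        if not sep or not label.startswith("__label__"):
            continue  # skip lines that do not match the assumed format
        X.append(text)                            # review text (features)
        Y.append(int(label[len("__label__"):]))   # integer class label
    return X, Y


# Example call (the file name is a placeholder):
# X, Y = load_data_sketch("reviews.txt.bz2", mib=10)
```

Limiting the amount of data read this way keeps the experiment size under control, which matches the purpose of the --mib option described further down.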
First of all, you need to build the runtime from this Dockerfile using:
$ lithops runtime build -f Dockerfile sklearn-aws:01
Your Lithops config file should look similar to:
lithops:
    storage: aws_s3
    backend: aws_lambda
    data_limit: 6
aws:
    access_key_id: <YOUR_AWS_KEY_ID>
    secret_access_key: <YOUR_AWS_ACCESS_KEY>
    runtime: sklearn-aws:01
    runtime_memory: 1024
aws_s3:
    region_name: <AWS_REGION>
    storage_bucket: <S3_BUCKET>
aws_lambda:
    execution_role: <YOUR_AWS_ROLE>
    region_name: <AWS_REGION>
You can list the available options and run the code with the commands below. This application is not meant to be run inside the notebook; you may encounter problems if you try to run it in Jupyter Notebook.
$ python3 gridsearch.py --help
$ python3 gridsearch.py --backend lithops --mib 10
In this block you can edit the options and change the default values of the arguments. To run it using Lithops, you must pass the --backend lithops option.
@click.command()
@click.option('--backend', default='loky', help='Joblib backend to perform grid search '
'(loky | lithops | dask | ray | tune)')
@click.option('--address', default=None, help='Scheduler address (dask) or head node address '
'(ray, ray[tune])')
@click.option('--mib', default=10, type=int, help='Load X MiB from the dataset')
@click.option('--refit', default=False, is_flag=True, help='Fit the final model with the best '
'configuration and print score')
@click.option('--jobs', default=-1, help='Number of jobs to execute the search. -1 means all processors.')
In the main function, grid search is performed using GridSearchCV from the scikit-learn library, with different parameters depending on the chosen backend.
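The general pattern can be sketched as follows: build a pipeline, wrap it in GridSearchCV, and run the fit inside a joblib backend context so that the candidate evaluations are dispatched to the selected backend. This is a simplified sketch, not the notebook's main function; the function name run_grid_search, the parameter grid, and cv=2 are assumptions chosen to keep the example small.

```python
import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def run_grid_search(X, y, backend="loky", n_jobs=-1, refit=False):
    """Run an exhaustive grid search over a small, illustrative grid."""
    pipeline = Pipeline([
        ("vect", HashingVectorizer()),
        ("clf", SGDClassifier(random_state=0)),
    ])

    # Hypothetical grid: keys follow the '<step>__<parameter>' convention.
    param_grid = {
        "vect__norm": ("l1", "l2"),
        "clf__alpha": (1e-4, 1e-3),
    }

    search = GridSearchCV(pipeline, param_grid, cv=2,
                          n_jobs=n_jobs, refit=refit)

    # joblib sends the parallel work to the backend named here.
    # Backends that joblib does not ship (for example 'lithops') must be
    # registered first; Lithops provides a helper for this in
    # lithops.util.joblib (assumption: check the docs of your version).
    with joblib.parallel_backend(backend):
        search.fit(X, y)
    return search


if __name__ == "__main__":
    # Tiny made-up dataset, only to show the call.
    X = ["great product", "love it", "works well", "excellent value",
         "terrible", "broke quickly", "waste of money", "awful quality"]
    y = [1, 1, 1, 1, 0, 0, 0, 0]
    result = run_grid_search(X, y, backend="threading", n_jobs=2)
    print(result.best_params_)
```

Because GridSearchCV delegates its parallelism to joblib, switching between local and remote execution only requires changing the backend name, while the search definition itself stays the same.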