| page_type | languages | products | description |
|---|---|---|---|
| sample | | | Tutorials showing how to build high quality machine learning models using Azure Automated Machine Learning. |
- Automated ML Introduction
- Setup using Compute Instances
- Setup using a Local Conda environment
- Setup using Azure Databricks
- Automated ML SDK Sample Notebooks
- Documentation
- Running using python command
- Troubleshooting
Automated machine learning (automated ML) builds high quality machine learning models for you by automating model and hyperparameter selection. Bring a labelled dataset that you want to build a model for, and automated ML will give you a high quality machine learning model that you can use for predictions.
If you are new to data science, automated ML will help you get started by simplifying machine learning model building. It removes the need to perform model selection and hyperparameter selection yourself, and in one step creates a high quality trained model for you to use.
If you are an experienced data scientist, automated ML will increase your productivity by intelligently performing model and hyperparameter selection for your training, generating high quality models much more quickly than manually specifying several parameter combinations and running training jobs. Automated ML provides visibility and access to all the training jobs and the performance characteristics of the models, to help you further tune the pipeline if you desire.
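As a minimal sketch of what that looks like with the azureml SDK (the workspace, dataset name, experiment name and label column below are placeholders, not taken from a specific notebook in this repo):

```python
# Minimal sketch (placeholder names): point automated ML at a labelled
# tabular dataset and let it search models and hyperparameters for you.
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                                   # workspace from config.json
train_data = Dataset.get_by_name(ws, "my-training-dataset")    # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="classification",              # or "regression" / "forecasting"
    training_data=train_data,
    label_column_name="label",          # the column you want to predict
    primary_metric="AUC_weighted",      # metric used to pick the best model
    experiment_timeout_hours=0.5,
    n_cross_validations=5,
)

run = Experiment(ws, "my-automl-experiment").submit(automl_config, show_output=True)
best_run, fitted_model = run.get_output()   # best child run and its trained model
```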
Below are the three execution environments supported by automated ML.
- Open the Azure ML portal
- Select Compute
- Select Compute Instances
- Click New
- Type a Compute Name, select a Virtual Machine type and select a Virtual Machine size
- Click Create
To run these notebooks on your own notebook server, use these installation instructions. The instructions below will install everything you need and then start a Jupyter notebook.
1. Install Miniconda from here, choosing 64-bit Python 3.7 or higher.
- Note: if you already have conda installed, you can keep using it, but it should be version 4.4.10 or later (as shown by `conda -V`). If you have an earlier version installed, you can update it using the command `conda update conda`. There's no need to install Miniconda specifically.
2. Download the sample notebooks from GitHub as a zip file and extract the contents to a local directory. The automated ML sample notebooks are in the "automl-with-azureml" folder.
The automl_setup script creates a new conda environment, installs the necessary packages, configures the widget and starts a jupyter notebook. It takes the conda environment name as an optional parameter. The default conda environment name is azure_automl. The exact command depends on the operating system. See the specific sections below for Windows, Mac and Linux. It can take about 10 minutes to execute.
Packages installed by the automl_setup script:
- python
- nb_conda
- matplotlib
- numpy
- cython
- urllib3
- scipy
- scikit-learn
- pandas
- tensorflow
- py-xgboost
- azureml-sdk
- azureml-widgets
- pandas-ml
For more details, refer to automl_env.yml.
Start an Anaconda Prompt window, cd to the automl-with-azureml folder where the sample notebooks were extracted and then run:
automl_setup
Install "Command line developer tools" if it is not already installed (you can use the command: xcode-select --install
).
Start a Terminal window, cd to the automl-with-azureml folder where the sample notebooks were extracted and then run:
bash automl_setup_mac.sh
cd to the automl-with-azureml folder where the sample notebooks were extracted and then run:
bash automl_setup_linux.sh
4. Running setup-workspace.py
- Before running any samples, you first need to create a Workspace by running the setup-workspace.py script (see the sketch after this list).
- Please make sure you use the Python [conda env:azure_automl] kernel when trying the sample notebooks.
- Follow the instructions in the individual notebooks to explore various features in automated ML.
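As a minimal sketch (assuming setup-workspace.py wrote a config.json for your workspace), you can verify the connection at the top of any sample notebook:

```python
# Load the workspace created by setup-workspace.py; from_config() looks for
# config.json in the current directory and its parents.
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep="\n")
```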
To start your Jupyter notebook manually, use:
conda activate azure_automl
jupyter notebook
or on Mac or Linux:
source activate azure_automl
jupyter notebook
NOTE: Please create your Azure Databricks cluster as v7.1 (high concurrency preferred) with Python 3 selected from the dropdown.
NOTE: You should have at least contributor access to your Azure subscription to run the notebook.
- You can find the detailed Readme instructions on GitHub.
- Download the sample notebook automl-databricks-local-01.ipynb from GitHub and import it into the Azure Databricks workspace.
- Attach the notebook to the cluster.
- Classify Credit Card Fraud
- Dataset: Kaggle's credit card fraud detection dataset
- Jupyter Notebook (remote run)
- run the experiment remotely on AML Compute cluster
- test the performance of the best model in the local environment
- Jupyter Notebook (local run)
- run experiment in the local environment
- use Mimic Explainer for computing feature importance
- deploy the best model along with the explainer to an Azure Kubernetes (AKS) cluster, which will compute the raw and engineered feature importances at inference time
- Predict Term Deposit Subscriptions in a Bank
- Dataset: UCI's bank marketing dataset
- Jupyter Notebook
- run experiment remotely on AML Compute cluster to generate ONNX compatible models
- view the featurization steps that were applied during training
- view feature importance for the best model
- download the best model in ONNX format and use it for inferencing using ONNXRuntime
- deploy the best model in PKL format to Azure Container Instance (ACI)
- Predict Newsgroup based on Text from News Article
- Dataset: 20 newsgroups text dataset
- Jupyter Notebook
- AutoML highlights here include using deep neural networks (DNNs) to create embedded features from text data
- AutoML will use Bidirectional Encoder Representations from Transformers (BERT) when a GPU compute is used
- A Bidirectional Long Short-Term Memory network (BiLSTM) will be utilized when a CPU compute is used, thereby optimizing the choice of DNN
- Predict Performance of Hardware Parts
- Dataset: Hardware Performance Dataset
- Jupyter Notebook
- run the experiment remotely on AML Compute cluster
- get best trained model for a different metric than the one the experiment was optimized for
- test the performance of the best model in the local environment
- Jupyter Notebook (advanced)
- run the experiment remotely on AML Compute cluster
- customize featurization: override column purpose within the dataset, configure transformer parameters
- get best trained model for a different metric than the one the experiment was optimized for
- run a model explanation experiment on the remote cluster
- deploy the model along the explainer and run online inferencing
- Forecast Energy Demand
- Dataset: NYC energy demand data
- Jupyter Notebook
- run experiment remotely on AML Compute cluster
- use lags and rolling window features
- view the featurization steps that were applied during training
- get the best model, use it to forecast on test data and compare the accuracy of predictions against real data
- Forecast Orange Juice Sales (Multi-Series)
- Dataset: Dominick's grocery sales of orange juice
- Jupyter Notebook
- run experiment remotely on AML Compute cluster
- customize time-series featurization, change column purpose and override transformer hyper parameters
- evaluate locally the performance of the generated best model
- deploy the best model as a webservice on Azure Container Instance (ACI)
- get online predictions from the deployed model
- Forecast Demand of a Bike-Sharing Service
- Dataset: Bike demand data
- Jupyter Notebook
- run experiment remotely on AML Compute cluster
- integrate holiday features
- run rolling forecast for test set that is longer than the forecast horizon
- compute metrics on the predictions from the remote forecast
- The Forecast Function Interface
- Dataset: Generated for sample purposes
- Jupyter Notebook
- train a forecaster using a remote AML Compute cluster
- capabilities of forecast function (e.g. forecast farther into the horizon)
- generate confidence intervals
- Forecast Beverage Production
- Dataset: Monthly beer production data
- Jupyter Notebook
- train using a remote AML Compute cluster
- enable the DNN learning model
- forecast on a remote compute cluster and compare different model performance
- Hierarchical Time Series Forecasting
- Dataset: HTS dataset
- Jupyter Notebook
- train and forecast using a remote AML Compute cluster with multiple nodes
- multiple AutoML runs are triggered in parallel
- data aggregation is performed at train level
- Continuous Retraining with NOAA Weather Data
- Dataset: NOAA weather data from Azure Open Datasets
- Jupyter Notebook
- continuously retrain a model using Pipelines and AutoML
- create a Pipeline to upload a time series dataset to an Azure blob
- create a Pipeline to run an AutoML experiment and register the best resulting model in the Workspace
- publish the training pipeline created and schedule it to run daily
- Image Classification Multi-Class
- Dataset: Toy dataset with images of products found in a fridge
- Jupyter Notebook
- train an Image Classification Multi-Class model using AutoML
- tune hyperparameters of the model to optimize model performance
- deploy the model to use in inference scenarios
- Image Classification Multi-Label
- Dataset: Toy dataset with images of products found in a fridge
- Jupyter Notebook
- train an Image Classification Multi-Label model using AutoML
- tune hyperparameters of the model to optimize model performance
- deploy the model to use in inference scenarios
- Object Detection
- Dataset: Toy dataset with images of products found in a fridge
- Jupyter Notebook
- train an Object Detection model using AutoML
- tune hyperparameters of the model to optimize model performance
- deploy the model to use in inference scenarios
- Instance Segmentation
- Dataset: Toy dataset with images of products found in a fridge
- Jupyter Notebook
- train an Instance Segmentation model using AutoML
- tune hyperparameters of the model to optimize model performance
- deploy the model to use in inference scenarios
- Batch Scoring with an Image Classification Model
- Dataset: Toy dataset with images of products found in a fridge
- Jupyter Notebook
- register an Image Classification Multi-Class model already trained using AutoML
- create an Inference Dataset
- provision compute targets and create a Batch Scoring script
- use ParallelRunStep to do batch scoring
- build, run, and publish a pipeline
- enable a REST endpoint for the pipeline
See Configure automated machine learning experiments to learn more about the settings and features available for automated machine learning experiments.
Jupyter notebook provides a File / Download as / Python (.py) option for saving the notebook as a Python file. You can then run this file using the python command. However, on Windows the file needs to be modified before it can be run. The following condition must be added to the main code in the file:
if __name__ == "__main__":
The main code of the file must be indented so that it is under this condition.
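For example, a converted file might be restructured roughly as follows (a sketch with placeholder names; the guard is needed because on Windows worker processes re-import the script):

```python
# Sketch of a notebook exported as a .py file and adapted for Windows:
# everything is indented under the main guard so that child processes
# do not re-execute it when the module is imported.
from azureml.core import Workspace, Experiment

if __name__ == "__main__":
    ws = Workspace.from_config()
    experiment = Experiment(ws, "my-automl-experiment")   # placeholder name
    # ... rest of the code exported from the notebook, indented here ...
```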
- On Windows, make sure that you are running automl_setup from an Anaconda Prompt window rather than a regular cmd window. You can launch the "Anaconda Prompt" window by hitting the Start button and typing "Anaconda Prompt". If you don't see the application "Anaconda Prompt", you might not have conda or Miniconda installed. In that case, you can install it here.
- Check that you have conda 64-bit installed rather than 32-bit. You can check this with the command `conda info`. The `platform` should be `win-64` for Windows or `osx-64` for Mac.
- Check that you have conda 4.7.8 or later. You can check the version with the command `conda -V`. If you have a previous version installed, you can update it using the command `conda update conda`.
- On Linux, if the error is `gcc: error trying to exec 'cc1plus': execvp: No such file or directory`, install build essentials using the command `sudo apt-get install build-essential`.
- Pass a new name as the first parameter to automl_setup so that it creates a new conda environment. You can view existing conda environments using `conda env list` and remove them with `conda env remove -n <environmentname>`.
If automl_setup_linux.sh fails on Ubuntu Linux with the error `unable to execute 'gcc': No such file or directory`:
- Make sure that outbound ports 53 and 80 are enabled. On an Azure VM, you can do this from the Azure Portal by selecting the VM and clicking on Networking.
- Run the command: `sudo apt-get update`
- Run the command: `sudo apt-get install build-essential --fix-missing`
- Run `automl_setup_linux.sh` again.
If a sample notebook fails with an error that a property, method or library does not exist:
- Check that you have selected the correct kernel in the Jupyter notebook. The kernel is displayed in the top right of the notebook page. It can be changed using the `Kernel | Change Kernel` menu option. For Azure Notebooks, it should be `Python 3.6`. For local conda environments, it should be the conda environment name that you specified in automl_setup; the default is azure_automl. Note that the kernel is saved as part of the notebook, so if you switch to a new conda environment, you will have to select the new kernel in the notebook.
- Check that the notebook is for the SDK version that you are using. You can check the SDK version by executing `azureml.core.VERSION` in a Jupyter notebook cell (see the snippet below). You can download previous versions of the sample notebooks from GitHub by clicking the `Branch` button, selecting the `Tags` tab and then selecting the version.
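A quick way to check the installed SDK version from a notebook cell:

```python
# Print the installed azureml SDK version.
import azureml.core
print(azureml.core.VERSION)
```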
Some Windows environments see an error loading numpy with the latest Python version 3.6.8. If you see this issue, try with Python version 3.6.7.
Check the tensorflow version in the automated ML conda environment. Supported versions are < 1.13. Uninstall tensorflow from the environment if the version is >= 1.13. You may check the version of tensorflow and uninstall it as follows:
- Start a command shell and activate the conda environment where the automated ML packages are installed.
- Enter `pip freeze` and look for `tensorflow`; if found, the version listed should be < 1.13.
- If the listed version is not a supported version, run `pip uninstall tensorflow` in the command shell and enter y for confirmation.
Automated ML creates files under /tmp/azureml_runs for each iteration that it runs. It creates a folder with the iteration id, for example: AutoML_9a038a18-77cc-48f1-80fb-65abdbc33abe_93. Under this, there is an azureml-logs folder, which contains logs. If you run too many iterations on the same DSVM, these files can fill the disk. You can delete the files under /tmp/azureml_runs or just delete the VM and create a new one. If your get_data downloads files, make sure to delete them or they can use disk space as well. When using DataStore, it is good to specify an absolute path for the files so that they are downloaded just once. If you specify a relative path, it will download a file for each iteration.
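If you want to clean the files up without recreating the VM, a small hypothetical sketch (using the path described above) would be:

```python
# Remove leftover automated ML iteration folders from a DSVM.
import shutil
from pathlib import Path

for run_folder in Path("/tmp/azureml_runs").glob("AutoML_*"):
    shutil.rmtree(run_folder, ignore_errors=True)
```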
This can be caused by insufficient memory on the DSVM. Automated ML loads all training data into memory, so the available memory should be larger than the training data size. If you are using a remote DSVM, memory is needed for each concurrent iteration. The max_concurrent_iterations setting specifies the maximum number of concurrent iterations. For example, if the training data size is 8 GB and max_concurrent_iterations is set to 10, at least 80 GB of memory is required. To resolve this issue, allocate a DSVM with more memory or reduce the value specified for max_concurrent_iterations.
This can be caused by too many concurrent iterations for a remote DSVM. Each concurrent iteration usually takes 100% of a core while it is running, and some iterations can use multiple cores. So the max_concurrent_iterations setting should always be less than the number of cores of the DSVM. To resolve this issue, try reducing the value specified for the max_concurrent_iterations setting.
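As a back-of-the-envelope sketch combining the two points above (the numbers are just the example values from the text, not a recommendation):

```python
# Rough sizing check for a remote DSVM / compute node.
training_data_gb = 8            # size of the training data
max_concurrent_iterations = 10  # value set in the AutoML settings

required_memory_gb = training_data_gb * max_concurrent_iterations
print(f"Need at least {required_memory_gb} GB of RAM and "
      f"more than {max_concurrent_iterations} cores.")
```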