IPython Cookbook, Second Edition

This is one of the 100+ free recipes of the *IPython Cookbook, Second Edition*, by Cyrille Rossant, a guide to numerical computing and data science in the Jupyter Notebook. The ebook and printed book are available for purchase at Packt Publishing.

Text on GitHub with a CC-BY-NC-ND license
Code on GitHub with a MIT license

# Chapter 8: Machine Learning

## 8.6. Using a random forest to select important features for regression

Decision trees are frequently used to represent workflows or algorithms. They are also a nonparametric supervised learning method: a tree mapping observations to target values is learned on a training set and then used to predict the outcomes of new observations.

Random forests are ensembles of decision trees: many trees are trained and their predictions aggregated to form a model that typically outperforms any of the individual trees. This is the general idea behind ensemble learning.

There are many types of ensemble methods. Random forests are an instance of bootstrap aggregating, also called bagging, where each model is trained on a subset of the training set drawn at random with replacement.
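To make bagging concrete, here is a minimal sketch (the data and all names are illustrative, not from the recipe): each tree is fit on a bootstrap sample of the training set, and the ensemble's prediction is the average of the individual trees' predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n points with replacement.
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: average the trees' predictions.
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_pred = np.mean([t.predict(X_test) for t in trees], axis=0)
```

A true random forest additionally considers only a random subset of the features at each split; scikit-learn's `RandomForestRegressor` handles both ingredients for you.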

Random forests yield information about the importance of each feature for the classification or regression task. In this recipe, we will find the most influential features of Boston house prices, using a classic dataset that contains a range of indicators about the houses' neighborhoods.

## How to do it...

1. We import the packages:

```python
import numpy as np
import sklearn as sk
import sklearn.datasets as skd
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
%matplotlib inline
```
2. We load the Boston dataset:

```python
data = skd.load_boston()
```
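Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this step fails on recent versions. A sketch of a fallback that keeps the rest of the recipe runnable (the synthetic substitute matches the Boston data's shape, 506 samples by 13 features, but not its content):

```python
import sklearn.datasets as skd

try:
    # Works on scikit-learn < 1.2.
    data = skd.load_boston()
    X, y = data['data'], data['target']
except (ImportError, AttributeError):
    # load_boston was removed in scikit-learn 1.2; fall back to a
    # synthetic regression problem with the same shape (506 x 13).
    X, y = skd.make_regression(n_samples=506, n_features=13,
                               noise=10.0, random_state=0)
```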

The details of this dataset can be found in data['DESCR']. Here is the description of some features:

  • CRIM: Per capita crime rate by town
  • NOX: Nitric oxide concentration (parts per 10 million)
  • RM: Average number of rooms per dwelling
  • AGE: Proportion of owner-occupied units built prior to 1940
  • DIS: Weighted distances to five Boston employment centres
  • PTRATIO: Pupil-teacher ratio by town
  • LSTAT: Percentage of lower status of the population
  • MEDV: Median value of owner-occupied homes in $1000s

The target value is MEDV.

3. We create a `RandomForestRegressor` model:

```python
reg = ske.RandomForestRegressor()
```
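The recipe uses the default constructor. For reference, a sketch of the most commonly tuned hyperparameters (the values here are illustrative, not from the recipe):

```python
import sklearn.ensemble as ske

reg = ske.RandomForestRegressor(
    n_estimators=200,   # number of trees in the forest
    max_depth=None,     # grow each tree until its leaves are pure
    max_features=1.0,   # fraction of features considered at each split
    random_state=0,     # reproducible bootstrap samples
)
```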
4. We get the samples and the target values from this dataset:

```python
X = data['data']
y = data['target']
```
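The point of the recipe, as stated above, is to read feature importances off the fitted forest. A minimal, self-contained sketch of that step, using a synthetic stand-in for the Boston data so it runs on current scikit-learn versions where `load_boston` has been removed:

```python
import numpy as np
import sklearn.datasets as skd
import sklearn.ensemble as ske

# Synthetic stand-in: 13 features, only a few of them informative.
X, y = skd.make_regression(n_samples=506, n_features=13,
                           n_informative=4, noise=5.0, random_state=0)

reg = ske.RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

# feature_importances_ is normalized to sum to 1; larger values mean
# the feature contributed more to the trees' splits.
importances = reg.feature_importances_
ranking = np.argsort(importances)[::-1]
for i in ranking[:5]:
    print(f"feature {i}: importance {importances[i]:.3f}")
```

With the real Boston data, the same attribute lets you rank features such as `RM` or `LSTAT` by their influence on `MEDV`.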