
How is FastRF implemented?


In this section we explain the main classes involved in training a Random Forest and how they interact.

FastRandomForest

This is the "root" class of the forest. When you want to create a random forest you will create an instance of FastRandomForest, pass different options and then build the classifier.

FastRandomForest fastRandomForest = new FastRandomForest();
fastRandomForest.setOptions(stringArrayOptions);
fastRandomForest.buildClassifier(data);

The valid options are:

  • -I -> number of trees in the forest.

  • -K -> number of features to consider at each node.

  • -numFeatTree -> number of features to consider in each tree.

  • -S -> seed for the random number generator.

  • -depth -> maximum depth of a tree; 0 means unlimited.

  • -threads -> number of simultaneous threads to use (default = 0, autodetect the number of cores).

  • -import -> compute RF feature importance.

  • -importNew -> compute RF feature importance using the new method.

  • -interactions -> compute feature interactions.

  • -interactionsNew -> compute feature interactions using the new method.

  • -D -> debug mode.
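
For example, a hypothetical configuration (the option names are taken from the list above; the values are illustrative, not recommendations) might look like this:

String[] stringArrayOptions = new String[] {
    "-I", "100",       // build 100 trees
    "-K", "15",        // consider 15 candidate features at each node
    "-S", "42",        // fixed seed for reproducibility
    "-threads", "4",   // build 4 trees simultaneously
    "-import"          // compute RF feature importance
};
FastRandomForest fastRandomForest = new FastRandomForest();
fastRandomForest.setOptions(stringArrayOptions);
fastRandomForest.buildClassifier(data);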

When you call buildClassifier(), the values m_Kvalue (the number of features considered at each node) and m_FeatTree (the number of features considered for each tree) are set to the values passed via setOptions(), or to their defaults if they weren't specified. The defaults are m_Kvalue = (int) log2(n_features) + 5 and m_FeatTree = (int) pow(n_features, 0.6) + 60. These default values were found empirically on artificial datasets: we tried to maintain accuracy while improving speed as much as possible across the different combinations of numbers of instances and features.

If the default value of m_FeatTree is greater than the total number of features, m_FeatTree is set to n_features and m_Kvalue is set to (int) log2(n_features) + 1 (the same default as the Weka implementation and previous implementations of FRF).
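
As a worked example of those formulas (a sketch with illustrative variable names, not the exact FRF source), for a dataset with 1000 features:

int numFeatures = 1000;

// Default values, as described above.
int kValue   = (int) (Math.log(numFeatures) / Math.log(2)) + 5; // floor(log2(1000)) + 5 = 14
int featTree = (int) Math.pow(numFeatures, 0.6) + 60;           // floor(1000^0.6) + 60 = 123

// Fallback when the default exceeds the total number of features.
if (featTree > numFeatures) {
    featTree = numFeatures;
    kValue = (int) (Math.log(numFeatures) / Math.log(2)) + 1;   // classic Weka-style K
}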

If feature importance is to be calculated using the new method, the algorithm makes sure there will be some trees with a given feature and some trees without it. The expected number of trees containing a feature is nTrees/(n_features/m_FeatTree), so m_FeatTree is adjusted, if necessary, to keep this expression between two bounds:

minTrees < nTrees/(n_features/m_FeatTree) < nTrees - minTrees
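
Rearranging the inequality gives bounds on m_FeatTree itself: n_features * minTrees / nTrees < m_FeatTree < n_features * (nTrees - minTrees) / nTrees. Continuing the variables from the sketch above, the clamping could look like this (minTrees and the exact strategy are assumptions, not the actual FRF logic):

int nTrees = 100, minTrees = 10;  // assumed values for illustration

double lower = (double) minTrees * numFeatures / nTrees;            // below this, too few trees contain a feature
double upper = (double) (nTrees - minTrees) * numFeatures / nTrees; // above this, too few trees lack it
if (featTree <= lower) featTree = (int) lower + 1;
if (featTree >= upper) featTree = (int) upper - 1;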

Then an object of the FastRFBagging class is created; it is the one that builds the trees of the forest.

FastRFBagging

This is the class that actually creates the forest. It first processes the dataset and converts it from an Instances object to a DataCache object, because the trees read their data from a DataCache. Inside the buildClassifier() method, an array of FastRandomTree objects is created and submitted to a thread pool, which allows the trees to be built in parallel (a sketch follows). Once the trees are built, it takes from each tree an array of booleans indicating its inBag instances. Then the OOB error is calculated, also in parallel.
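
A minimal sketch of that pattern, assuming FastRandomTree implements Runnable via its run() method (names simplified, error handling kept minimal; not the exact FastRFBagging code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

void buildTreesInParallel(FastRandomTree[] trees, int numThreads) throws Exception {
    ExecutorService threadPool = Executors.newFixedThreadPool(numThreads);
    List<Future<?>> futures = new ArrayList<>();
    for (FastRandomTree tree : trees) {
        futures.add(threadPool.submit(tree)); // run() builds one tree
    }
    for (Future<?> f : futures) {
        f.get(); // block until every tree has been built
    }
    threadPool.shutdown();
}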

After the forest is built, it calculates the feature importance and the feature interactions, if these were requested in the options.

DataCache

This class gives us more efficient access to the data in the dataset. It is the object through which each tree accesses the dataset.

The values of the attributes of all the instances are stored in the float[][] vals matrix. It is indexed first by attribute and then by instance, so the number of rows equals numAttributes and the number of columns equals numInstances. This matrix is built once in the DataCache(Instances instances) constructor from an Instances object containing the dataset; every subsequent DataCache then holds a shallow copy of vals.
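
A sketch of how such a matrix can be filled from a Weka Instances object (illustrative, not the exact FRF constructor):

// vals is indexed first by attribute, then by instance.
float[][] vals = new float[instances.numAttributes()][instances.numInstances()];
for (int a = 0; a < instances.numAttributes(); a++) {
    for (int i = 0; i < instances.numInstances(); i++) {
        vals[a][i] = (float) instances.instance(i).value(a);
    }
}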

Another very important matrix in DataCache is int[][] sortedIndices. It has as many rows as attributes and as many columns as inBag instances. Each row contains all the instances sorted according to the attribute that the row represents. This helps a lot in splits on numerical attributes, because the instances are already sorted. Each DataCache object has its own deep copy of sortedIndices, because it will be modified by the tree during its construction.
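
For illustration, one way to build such a matrix, continuing the vals sketch above (FRF's actual construction works on the inBag instances only):

int[][] sortedIndices = new int[numAttributes][];
for (int a = 0; a < numAttributes; a++) {
    Integer[] idx = new Integer[numInstances];
    for (int i = 0; i < numInstances; i++) idx[i] = i;
    final int att = a;
    java.util.Arrays.sort(idx, (x, y) -> Float.compare(vals[att][x], vals[att][y]));
    sortedIndices[a] = new int[numInstances];
    for (int i = 0; i < numInstances; i++) sortedIndices[a][i] = idx[i]; // row a: instances sorted by attribute a
}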

Other arrays are:

  • int[] selectedAttributes Contains the indices of the attributes used for the tree.

  • int[] attInSortedIndices Contains the indices of the attributes that are represented in sortedIndices.

  • boolean[] inBag Each position represents an instance and the boolean value tells us whether that instance is inBag or OOB.

  • double[] instWeights Each position represents an instance and its value is the weight of that instance.

It also has some auxiliary arrays that are used many times by the tree. Keeping them around avoids the overhead of creating and destroying these arrays each time they are needed. These arrays are:

  • int[] whatGoesWhere Each position represents an instance and the value at that position indicates the branch the instance will go to. It is used in the FastRandomTree.splitDataNew() method (see the sketch after this list).

  • double[][] levelsClasses Each row represents a value of a categorical attribute and each position within the row corresponds to a class. It is used in the FastRandomTree.distributionSequentialAtt() method, for categorical attributes.
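
For example, a simplified sketch of how whatGoesWhere can be used to partition one row of sortedIndices while preserving the sort order (assumed logic, not the actual splitDataNew() code; att and numLeft are taken as given inputs):

// Partition the sorted indices of one attribute into the two branches.
int[] row = sortedIndices[att];
int[] left  = new int[numLeft];
int[] right = new int[row.length - numLeft];
int l = 0, r = 0;
for (int idx : row) {
    if (whatGoesWhere[idx] == 0) left[l++] = idx;  // branch 0
    else                         right[r++] = idx; // branch 1
}
// Both halves remain sorted by the attribute, so no re-sorting is needed.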

As was said at the beginning, the first DataCache object is constructed from an Instances object. The other DataCache objects are then constructed from this first DataCache. However, each tree must call DataCache.resample() to select the inBag instances and the features, and it must also call DataCache.createInBagSortedIndicesNew() to create the sortedIndices matrix.
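
Putting the per-tree setup together, the call sequence described above could be sketched as follows (the exact constructor and method signatures in FRF may differ):

DataCache masterCache = new DataCache(instances);   // built once from the Instances object
DataCache treeCache   = new DataCache(masterCache); // per-tree copy sharing vals
treeCache.resample();                               // select the inBag instances and the features
treeCache.createInBagSortedIndicesNew();            // deep copy of sortedIndices for this tree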

FastRandomTree

This tree is always a binary tree. All the splits are binary, including splits on categorical features; in that case it uses a one-vs-all analysis to split the node. This is the reason why FRF is quite slow on datasets with many categorical features that each have many distinct values. We expect to fix this in the future by making a normal (multiway) split for categorical features.
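
For example, for a categorical attribute with values {a, b, c}, the one-vs-all candidates are {a} vs {b, c}, {b} vs {a, c} and {c} vs {a, b}. A simplified sketch of evaluating them (evaluateGini() is a hypothetical helper, not an FRF method):

double bestScore = Double.NEGATIVE_INFINITY;
int bestValue = -1;
for (int v = 0; v < numValues; v++) {
    // Branch 0: instances whose attribute equals v; branch 1: everything else.
    double score = evaluateGini(att, v); // hypothetical helper
    if (score > bestScore) { bestScore = score; bestValue = v; }
}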

The calculations for this class start in the run() method. This method creates the tree's own copy of the DataCache object (a deep copy of only the fields that will be modified during the construction of the tree) and makes the first call of the recursive buildTree() method.

In buildTree(), the method first checks whether the node should be a leaf. If not, it analyzes K candidate attributes for splitting the node. If one of these attributes improves the Gini score, it splits the node on that attribute and recursively calls buildTree() for the two new child nodes.
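
A high-level sketch of that recursion (all helper names and the Split type are illustrative, not the actual FRF methods):

void buildTree(int[] indices, int depth) {
    if (isLeaf(indices, depth)) {                   // pure node, too few instances, or max depth reached
        storeClassDistribution(indices);
        return;
    }
    Split best = null;
    for (int att : pickRandomAttributes(kValue)) {  // K candidate attributes
        Split s = bestSplitOnAttribute(att, indices);
        if (best == null || s.giniGain > best.giniGain) best = s;
    }
    if (best == null || best.giniGain <= 0) {       // no attribute improves the Gini score
        storeClassDistribution(indices);
        return;
    }
    int[][] branches = splitData(best, indices);    // cf. splitDataNew() above
    buildTree(branches[0], depth + 1);              // left child
    buildTree(branches[1], depth + 1);              // right child
}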