Important FastRF Parameters
As with all RF implementations, FastRF 2.0 beta is quite robust to the choice of parameters. As always, larger forests (more trees) are preferable, as long as the increase in execution time can be tolerated.
One parameter that is known to be somewhat important for any RF algorithm is the number of attributes to examine per split (here, m_KValue); by default in Weka RF it is log2(num_features) + 1. In FastRF 2.0, we introduced an additional parameter, the number of attributes to which the entire tree will be limited (m_numFeatTree). We have empirically determined rule-of-thumb values for m_KValue and m_numFeatTree that appear to yield good accuracy over a range of synthetic datasets. In particular, we generated these datasets using the RDG1 and BayesNet generators from Weka 3.8, while varying the number of instances (from 100 to 25600) and the number of attributes (from 100 to 25600). The generated datasets are included in the GitHub repository.
The default values that work well for such data were estimated as m_KValue = log2(numAttributes) + 5 and m_numFeatTree = pow(numAttributes, 0.6) + 60. These exact values are subject to change in future versions, since our preliminary analyses suggest that on many real datasets other settings could be used to extract more speed from FastRF 2.0; the current settings are somewhat conservative to prioritize accuracy. Please see the accuracy benchmarks page for details.
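To get a feel for how these two rules of thumb scale with the number of attributes, the following is a minimal sketch that evaluates them in plain Java. It is not the FastRF source: the class and method names are hypothetical, and both the truncation to an integer and the clamping to numAttributes are assumptions, since the formulas above do not say how rounding or very small attribute counts are handled.

```java
// Hypothetical helper, not part of FastRF: evaluates the rule-of-thumb defaults
// m_KValue = log2(numAttributes) + 5 and m_numFeatTree = pow(numAttributes, 0.6) + 60.
class FastRfDefaultsSketch {

    // floor(log2(n)) for n >= 1, computed exactly with integer bit operations.
    static int floorLog2(int n) {
        return 31 - Integer.numberOfLeadingZeros(n);
    }

    // Assumption: log2 is truncated to an integer and the result is capped
    // at numAttributes so it never exceeds the number of available attributes.
    static int defaultKValue(int numAttributes) {
        if (numAttributes < 1) {
            throw new IllegalArgumentException("numAttributes must be >= 1");
        }
        return Math.min(floorLog2(numAttributes) + 5, numAttributes);
    }

    // Assumption: the value is truncated to an integer and capped at numAttributes.
    static int defaultNumFeatTree(int numAttributes) {
        if (numAttributes < 1) {
            throw new IllegalArgumentException("numAttributes must be >= 1");
        }
        int value = (int) (Math.pow(numAttributes, 0.6) + 60);
        return Math.min(value, numAttributes);
    }

    public static void main(String[] args) {
        // Attribute counts spanning the range of the synthetic datasets.
        for (int numAttributes : new int[] {100, 1600, 25600}) {
            System.out.printf("numAttributes=%d  m_KValue=%d  m_numFeatTree=%d%n",
                    numAttributes,
                    defaultKValue(numAttributes),
                    defaultNumFeatTree(numAttributes));
        }
    }
}
```

Under these assumptions, a dataset with 100 attributes gives m_KValue = 11 and m_numFeatTree = 75, so each tree would see roughly three quarters of the attributes; the fraction shrinks quickly as the attribute count grows.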
In most cases, a user should not need to change the values of m_KValue and m_numFeatTree, but they should be aware that the defaults will likely change in future FastRF versions.
The m_KValue parameter can be manually specified via the “-K” option in the Weka AbstractClassifier.setOptions(), and m_numFeatTree can be specified via the “-numFeatTree” option. Please see the How is FastRF implemented wiki page for a list of supported options.