Skip to content

ForestLayer: Efficient and scalable deep forest learning library based on Ray

License

Notifications You must be signed in to change notification settings

PasaLab/forestlayer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ForestLayer

ForestLayer is an highly efficient and scalable deep forest learning library based on Ray. It provides rich easy-to-use data processing, model training modules to help researchers and engineers build practical deep forest learning workflows. It internally embeds task parallelization mechanism using Ray, which is a popular flexible, high-performance distributed execution framework proposed by U.C.Berkeley.
ForestLayer aims to enable faster experimentation as possible and reduce the delay from idea to result.

Hope that ForestLayer can bring you good researches and good products.

You can refer to Deep Forest Paper, Ray Project to find more details.

News

  • [24 July] ForestLayer white paper is shared on arXiv.
  • [1 Feb] Forest Splitting mechanism are supported. Now ForestLayer can achieve 2.5x speedup to gcForest v1.0 with 8 nodes.
  • [10 Jan] You can now use ForestLayer for regression task or use it in your data science algorithm competitions! We recommend using small layer of cascade in regression task since it's easy to overfit the data.
  • [8 Jan] You can now use ForestLayer for classification tasks. See examples

Prerequisites

numpy, scikit-learn, keras, ray, joblib, xgboost, psutil, matplotlib, pandas

Installation

ForestLayer has install prerequisites including scikit-learn, keras, numpy, ray and joblib. For GPU support, CUDA and cuDNN are required, but now we have not support GPU yet. The simplest way to install ForestLayer in your python program is:

[for master version] pip install git+https://github.com/whatbeg/forestlayer.git
[for stable version] pip install forestlayer

Alternatively, you can install ForestLayer from the github source:

$ git clone https://github.com/whatbeg/forestlayer.git

$ cd forestlayer
$ python setup.py install

What is deep forest?

The architecture of deep forest. from [1]

[1] Deep Forest: Towards An Alternative to Deep Neural Networks, by Zhi-Hua Zhou and Ji Feng. IJCAI 2017

Getting Started: 30 seconds to ForestLayer

The core data structure of ForestLayer is layers and graph. Layers are basic modules to implement different data processing, and the graph is like a model that organize layers, the basic type of graph is a stacking of layers, and now we only support this type of graph.

Take MNIST classification task as an example.

First, we use the Keras API to load mnist data and do some pre-processing.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# TODO: preprocessing...

Next, we construct multi-grain scan windows and estimators every window and then initialize a MultiGrainScanLayer. The Window class is lies in forestlayer.layers.window package and the estimators are represented as EstimatorArguments, which will be used later in layers to create actual estimator object.

rf1 = ExtraRandomForestConfig(min_samples_leaf=10, max_features='sqrt')
rf2 = RandomForestConfig(min_samples_leaf=10)

windows = [Window(win_x=7, win_y=7, stride_x=2, stride_y=2, pad_x=0, pad_y=0),
           Window(11, 11, 2, 2)]

est_for_windows = [[rf1, rf2], [rf1, rf2]]

mgs = MultiGrainScanLayer(windows=windows, est_for_windows=est_for_windows, n_class=10)

After multi-grain scan, we consider that building a pooling layer to reduce the dimension of generated feature vectors, so that reduce the computation and storage complexity and risk of overfiting.

pools = [[MaxPooling(2, 2), MaxPooling(2, 2)],
         [MaxPooling(2, 2), MaxPooling(2, 2)]]

pool = PoolingLayer(pools=pools)

And then we add a concat layer to concatenate the output of estimators of the same window.

concatlayer = ConcatLayer()

Then, we construct the cascade part of the model, we use an auto-growing cascade layer to build our deep forest model.

est_configs = [
    ExtraRandomForestConfig(),
    ExtraRandomForestConfig(),
    RandomForestConfig(),
    RandomForestConfig()
]

auto_cascade = AutoGrowingCascadeLayer(est_configs=est_configs, early_stopping_rounds=4,
                                       stop_by_test=True, n_classes=10, distribute=False)

Last, we construct a graph to stack these layers to make them as a complete model.

model = Graph()
model.add(mgs)
model.add(pool)
model.add(concatlayer)
model.add(auto_cascade)

You also can call model.summary() like Keras to see the appearance of the model. Summary info could be like follows table,

____________________________________________________________________________________________________
Layer                    Description                                                        Param #
====================================================================================================
MultiGrainScanLayer      [win/7x7, win/11x11]                                               params
                         [[FLCRF, FLRF][FLCRF, FLRF]]
____________________________________________________________________________________________________
PoolingLayer             [[maxpool/2x2, maxpool/2x2][maxpool/2x2, maxpool/2x2]]             params
____________________________________________________________________________________________________
ConcatLayer              ConcatLayer(axis=-1)                                               params
____________________________________________________________________________________________________
AutoGrowingCascadeLayer  maxlayer=0, esrounds=3                                             params
                         Each Level:
                         [FLCRF, FLCRF, FLRF, FLRF]
====================================================================================================

After building the model, you can fit the model, and then evaluate or predict using the fit model. Because the model is often very large (dozens of trained forests, every of them occupies hundreds of megabytes), so we highly recommend users to use fit_transform to train and test data, thus we can make keep_in_mem=False to avoid the cost of caching model in memory or disk.

# 1
model.fit(x_train, y_train)
model.evaluate(x_test, y_test)
result = model.predict(x_in)

# 2 (recommend)
model.fit_transform(x_train, y_train, x_test, y_test)

For more examples and tutorials, you can refer to examples to find more details.

Examples

See examples.

etc.

Contributions

Welcome contributions, you can contact huqiu00 at 163.com.

License

Please contact the authors for the licence info of the source code.

About

ForestLayer: Efficient and scalable deep forest learning library based on Ray

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published