Initial GPU acceleration support for LightGBM (#368)

* add dummy gpu solver code
* initial GPU code
* fix crash bug
* first working version
* use asynchronous copy
* use a better kernel for root
* parallel read histogram
* sparse features now work, but without acceleration; computed on CPU
* compute sparse feature on CPU simultaneously
* fix big bug; add gpu selection; add kernel selection
* better debugging
* clean up
* add feature scatter
* Add sparse_threshold control
* fix a bug in feature scatter
* clean up debug
* temporarily add OpenCL kernels for k=64,256
* fix up CMakeList and definition USE_GPU
* add OpenCL kernels as string literals
* Add boost.compute as a submodule
* add boost dependency into CMakeList
* fix opencl pragma
* use pinned memory for histogram
* use pinned buffer for gradients and hessians
* better debugging message
* add double precision support on GPU
* fix boost version in CMakeList
* Add a README
* reconstruct GPU initialization code for ResetTrainingData
* move data to GPU in parallel
* fix a bug during feature copy
* update gpu kernels
* update gpu code
* initial port to LightGBM v2
* speedup GPU data loading process
* Add 4-bit bin support to GPU
* re-add sparse_threshold parameter
* remove kMaxNumWorkgroups and allow an unlimited number of features
* add feature mask support for skipping unused features
* enable kernel cache
* use GPU kernels without feature masks when all features are used
* README.
* README.
* update README
* fix typos (#349)
* change compile to gcc on Apple as default
* clean vscode related file
* refine api of constructing from sampling data.
* fix bug in the last commit.
* more efficient algorithm to sample k from n.
* fix bug in filter bin
* change to boost from average output.
* fix tests.
* only stop training when all classes are finished in multi-class.
* limit the max tree output. change hessian in multi-class objective.
* robust tree model loading.
* fix test.
* convert the probabilities to raw score in boost_from_average of classification.
* fix the average label for binary classification.
* Add boost_from_average to docs (#354)
* don't use "ConvertToRawScore" for self-defined objective function.
* boost_from_average doesn't seem to work well in binary classification. remove it.
* For a better jump link (#355)
* Update Python-API.md
* for a better jump in page

  A space is needed between `#` and the header's content according to GitHub's markdown format [guideline](https://guides.github.com/features/mastering-markdown/). After adding the spaces, we can jump to the exact position in the page by clicking the link.
* fixed something mentioned by @wxchan
* Update Python-API.md
* add FitByExistingTree.
* adapt GPU tree learner for FitByExistingTree
* avoid NaN output.
* update boost.compute
* fix typos (#361)
* fix broken links (#359)
* update README
* disable GPU acceleration by default
* fix image url
* cleanup debug macro
* remove old README
* do not save sparse_threshold_ in FeatureGroup
* add details for new GPU settings
* ignore submodule when doing pep8 check
* allocate workspace for at least one thread during building Feature4
* move sparse_threshold to class Dataset
* remove duplicated code in GPUTreeLearner::Split
* Remove duplicated code in FindBestThresholds and BeforeFindBestSplit
* do not rebuild ordered gradients and hessians for sparse features
* support feature groups in GPUTreeLearner
* Initial parallel learners with GPU support
* add option device, cleanup code
* clean up FindBestThresholds; add some omp parallel
* constant hessian optimization for GPU
* Fix GPUTreeLearner crash when there is zero feature
* use np.testing.assert_almost_equal() to compare lists of floats in tests
* travis for GPU
microsoft · Apr 9, 2017 · 0bb4a82
1 parent db3d1f8
commit 0bb4a82

Showing 30 changed files with 4,163 additions and 246 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "include/boost/compute"]
+	path = compute
+	url = https://github.com/boostorg/compute
diff --git a/.travis.yml b/.travis.yml
@@ -11,24 +11,48 @@ before_install:
 - export PATH="$HOME/miniconda/bin:$PATH"
 - conda config --set always_yes yes --set changeps1 no
 - conda update -q conda
+- sudo add-apt-repository ppa:george-edison55/cmake-3.x -y
+- sudo apt-get update -q
+- bash .travis/amd_sdk.sh;
+- tar -xjf AMD-SDK.tar.bz2;
+- AMDAPPSDK=${HOME}/AMDAPPSDK;
+- export OPENCL_VENDOR_PATH=${AMDAPPSDK}/etc/OpenCL/vendors;
+- mkdir -p ${OPENCL_VENDOR_PATH};
+- sh AMD-APP-SDK*.sh --tar -xf -C ${AMDAPPSDK};
+- echo libamdocl64.so > ${OPENCL_VENDOR_PATH}/amdocl64.icd;
+- export LD_LIBRARY_PATH=${AMDAPPSDK}/lib/x86_64:${LD_LIBRARY_PATH};
+- chmod +x ${AMDAPPSDK}/bin/x86_64/clinfo;
+- ${AMDAPPSDK}/bin/x86_64/clinfo;
+- export LIBRARY_PATH="$HOME/miniconda/lib:$LIBRARY_PATH"
+- export LD_RUN_PATH="$HOME/miniconda/lib:$LD_RUN_PATH"
+- export CPLUS_INCLUDE_PATH="$HOME/miniconda/include:$AMDAPPSDK/include/:$CPLUS_INCLUDE_PATH"
 
 install:
 - sudo apt-get install -y libopenmpi-dev openmpi-bin build-essential
+- sudo apt-get install -y cmake
 - conda install --yes atlas numpy scipy scikit-learn pandas matplotlib
+- conda install --yes -c conda-forge boost=1.63.0
 - pip install pep8
 
-
 script:
 - cd $TRAVIS_BUILD_DIR
 - mkdir build && cd build && cmake .. && make -j
 - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
 - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
 - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
-- cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 .
+- cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 --exclude=./compute .
 - rm -rf build && mkdir build && cd build && cmake -DUSE_MPI=ON ..&& make -j
 - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
 - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
 - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
+- cd $TRAVIS_BUILD_DIR
+- rm -rf build && mkdir build && cd build && cmake -DUSE_GPU=ON -DBOOST_ROOT="$HOME/miniconda/" -DOpenCL_INCLUDE_DIR=$AMDAPPSDK/include/ ..
+- sed -i 's/std::string device_type = "cpu";/std::string device_type = "gpu";/' ../include/LightGBM/config.h
+- make -j$(nproc)
+- sed -i 's/std::string device_type = "gpu";/std::string device_type = "cpu";/' ../include/LightGBM/config.h
+- cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
+- cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
+- cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
 
 notifications:
   email: false

diff --git a/.travis/amd_sdk.sh b/.travis/amd_sdk.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# Original script from https://github.com/gregvw/amd_sdk/
+
+# Page from which to get the nonce and file name
+URL="http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/amd-accelerated-parallel-processing-app-sdk/"
+URLDOWN="http://developer.amd.com/amd-license-agreement-appsdk/"
+
+NONCE1_STRING='name="amd_developer_central_downloads_page_nonce"'
+FILE_STRING='name="f"'
+POSTID_STRING='name="post_id"'
+NONCE2_STRING='name="amd_developer_central_nonce"'
+
+#For newest FORM=`wget -qO - $URL | sed -n '/download-2/,/64-bit/p'`
+FORM=`wget -qO - $URL | sed -n '/download-5/,/64-bit/p'`
+
+# Get nonce from form
+NONCE1=`echo $FORM | awk -F ${NONCE1_STRING} '{print $2}'`
+NONCE1=`echo $NONCE1 | awk -F'"' '{print $2}'`
+echo $NONCE1
+
+# get the postid
+POSTID=`echo $FORM | awk -F ${POSTID_STRING} '{print $2}'`
+POSTID=`echo $POSTID | awk -F'"' '{print $2}'`
+echo $POSTID
+
+# get file name
+FILE=`echo $FORM | awk -F ${FILE_STRING} '{print $2}'`
+FILE=`echo $FILE | awk -F'"' '{print $2}'`
+echo $FILE
+
+FORM=`wget -qO - $URLDOWN --post-data "amd_developer_central_downloads_page_nonce=${NONCE1}&f=${FILE}&post_id=${POSTID}"`
+
+NONCE2=`echo $FORM | awk -F ${NONCE2_STRING} '{print $2}'`
+NONCE2=`echo $NONCE2 | awk -F'"' '{print $2}'`
+echo $NONCE2
+
+wget --content-disposition --trust-server-names $URLDOWN --post-data "amd_developer_central_nonce=${NONCE2}&f=${FILE}" -O AMD-SDK.tar.bz2;
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -9,6 +9,7 @@ PROJECT(lightgbm)
 
 OPTION(USE_MPI "MPI based parallel learning" OFF)
 OPTION(USE_OPENMP "Enable OpenMP" ON)
+OPTION(USE_GPU "Enable GPU-accelerated training (EXPERIMENTAL)" OFF)
 
 if(APPLE)
     OPTION(APPLE_OUTPUT_DYLIB "Output dylib shared library" OFF)
@@ -34,8 +35,17 @@ else()
     endif()
 endif(USE_OPENMP)
 
+if(USE_GPU)
+    find_package(OpenCL REQUIRED)
+    include_directories(${OpenCL_INCLUDE_DIRS})
+    MESSAGE(STATUS "OpenCL include directory:" ${OpenCL_INCLUDE_DIRS})
+    find_package(Boost 1.56.0 COMPONENTS filesystem system REQUIRED)
+    include_directories(${Boost_INCLUDE_DIRS})
+    ADD_DEFINITIONS(-DUSE_GPU)
+endif(USE_GPU)
+
 if(UNIX OR MINGW OR CYGWIN)
-    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11")
+    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11 -Wno-ignored-attributes")
 endif()
 
 if(MSVC)
@@ -65,11 +75,13 @@ endif()
 
 
 SET(LightGBM_HEADER_DIR ${PROJECT_SOURCE_DIR}/include)
+SET(BOOST_COMPUTE_HEADER_DIR ${PROJECT_SOURCE_DIR}/compute/include)
 
 SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
 SET(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
 
 include_directories (${LightGBM_HEADER_DIR})
+include_directories (${BOOST_COMPUTE_HEADER_DIR})
 
 if(APPLE)
   if (APPLE_OUTPUT_DYLIB)
@@ -105,6 +117,11 @@ if(USE_MPI)
   TARGET_LINK_LIBRARIES(_lightgbm ${MPI_CXX_LIBRARIES})
 endif(USE_MPI)
 
+if(USE_GPU)
+  TARGET_LINK_LIBRARIES(lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
+  TARGET_LINK_LIBRARIES(_lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
+endif(USE_GPU)
+
 if(WIN32 AND (MINGW OR CYGWIN))
     TARGET_LINK_LIBRARIES(lightgbm Ws2_32)
     TARGET_LINK_LIBRARIES(_lightgbm Ws2_32)
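
Note on `ADD_DEFINITIONS(-DUSE_GPU)`: the CMake option above only defines a preprocessor macro, so C++ sources can guard GPU-specific code with `#ifdef USE_GPU`. The following is a minimal sketch of that pattern, not code from this commit. `BuiltWithGpu` and `EffectiveDeviceType` are hypothetical names, and the silent fallback to `"cpu"` is an assumption made for illustration only.

```cpp
#include <cassert>
#include <string>

// True only when this translation unit was compiled with -DUSE_GPU,
// the macro that the CMake option adds through ADD_DEFINITIONS.
inline bool BuiltWithGpu() {
#ifdef USE_GPU
  return true;
#else
  return false;
#endif
}

// Hypothetical policy (an assumption, not LightGBM's documented behavior):
// fall back to "cpu" when "gpu" is requested but GPU support was not built.
inline std::string EffectiveDeviceType(const std::string& requested) {
  if (requested == "gpu" && !BuiltWithGpu()) {
    return "cpu";
  }
  return requested;
}
```

With a default build (`USE_GPU` is `OFF`), `EffectiveDeviceType("gpu")` in this sketch yields `"cpu"`; configuring with `-DUSE_GPU=ON` makes it yield `"gpu"`.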

diff --git a/compute b/compute
diff --git a/include/LightGBM/bin.h b/include/LightGBM/bin.h
@@ -59,7 +59,6 @@ class BinMapper {
   explicit BinMapper(const void* memory);
   ~BinMapper();
 
-  static double kSparseThreshold;
   bool CheckAlign(const BinMapper& other) const {
     if (num_bin_ != other.num_bin_) {
       return false;
@@ -258,6 +257,7 @@ class BinIterator {
   * \return Bin data
   */
   virtual uint32_t Get(data_size_t idx) = 0;
+  virtual uint32_t RawGet(data_size_t idx) = 0;
   virtual void Reset(data_size_t idx) = 0;
   virtual ~BinIterator() = default;
 };
@@ -383,12 +383,13 @@ class Bin {
   * \param num_bin Number of bin
   * \param sparse_rate Sparse rate of this bins( num_bin0/num_data )
   * \param is_enable_sparse True if enable sparse feature
+  * \param sparse_threshold Threshold for treating a feature as a sparse feature
   * \param is_sparse Will set to true if this bin is sparse
   * \param default_bin Default bin for zeros value
   * \return The bin data object
   */
   static Bin* CreateBin(data_size_t num_data, int num_bin,
-    double sparse_rate, bool is_enable_sparse, bool* is_sparse);
+    double sparse_rate, bool is_enable_sparse, double sparse_threshold, bool* is_sparse);
 
   /*!
   * \brief Create object for bin data of one feature, used for dense feature

diff --git a/include/LightGBM/config.h b/include/LightGBM/config.h
@@ -97,6 +97,11 @@ struct IOConfig: public ConfigBase {
   int num_iteration_predict = -1;
   bool is_pre_partition = false;
   bool is_enable_sparse = true;
+  /*! \brief The threshold on the percentage of zero elements for treating a feature as a sparse feature.
+   *  Default is 0.8, where a feature is treated as a sparse feature when there are over 80% zeros.
+   *  When set to 1.0, all features are processed as dense features.
+   */
+   */
+  double sparse_threshold = 0.8;
   bool use_two_round_loading = false;
   bool is_save_binary_file = false;
   bool enable_load_from_binary_file = true;
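
Note on `sparse_threshold`: the new field feeds the comparison that `Bin::CreateBin` performs in `src/io/bin.cpp` (`sparse_rate >= sparse_threshold && is_enable_sparse`). The sketch below restates that decision in isolation. `BinLayout` and `ChooseBinLayout` are hypothetical names introduced here and are not part of LightGBM.

```cpp
#include <cassert>

// Hypothetical stand-in for the two storage layouts Bin::CreateBin chooses between.
enum class BinLayout { kDense, kSparse };

// Restates the condition from the diff: a feature is stored sparsely only when
// sparse support is enabled and its share of zeros reaches the threshold.
inline BinLayout ChooseBinLayout(double sparse_rate, bool is_enable_sparse,
                                 double sparse_threshold) {
  assert(sparse_rate >= 0.0 && sparse_rate <= 1.0);  // sparse_rate is a fraction
  if (is_enable_sparse && sparse_rate >= sparse_threshold) {
    return BinLayout::kSparse;
  }
  return BinLayout::kDense;
}
```

With the default of 0.8, a feature that is 90% zeros is stored sparsely and one that is 50% zeros is stored densely. Raising the threshold pushes more features onto the dense path.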
@@ -188,6 +193,16 @@ struct TreeConfig: public ConfigBase {
   // max_depth < 0 means no limit
   int max_depth = -1;
   int top_k = 20;
+  /*! \brief OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
+   *  Default value is -1, using the system-wide default platform
+   */
+  int gpu_platform_id = -1;
+  /*! \brief OpenCL device ID in the specified platform. Each GPU in the selected platform has a
+   *  unique device ID. Default value is -1, using the default device in the selected platform
+   */
+  int gpu_device_id = -1;
+  /*! \brief Set to true to use double precision math on GPU (default using single precision) */
+  bool gpu_use_dp = false;
   LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
 };
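
Note on `gpu_platform_id` and `gpu_device_id`: both default to -1, meaning "use the default". One way to read that convention is sketched below over a plain in-memory list rather than a real OpenCL runtime. `PlatformSketch`, `ResolveId`, and `PickDevice` are hypothetical, and treating "default" as the first entry is an assumption; the actual default is whatever the OpenCL runtime reports.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of what an OpenCL runtime might report.
struct PlatformSketch {
  std::string name;
  std::vector<std::string> devices;
};

// Maps an ID to an index. -1 means "use the default", modeled here as entry 0
// (an assumption). Returns false when the ID is out of range.
inline bool ResolveId(int id, std::size_t count, std::size_t* index) {
  assert(index != nullptr);
  if (count == 0) return false;
  if (id == -1) {
    *index = 0;
    return true;
  }
  if (id < 0 || static_cast<std::size_t>(id) >= count) return false;
  *index = static_cast<std::size_t>(id);
  return true;
}

// Returns the chosen device name, or an empty string when nothing matches.
inline std::string PickDevice(const std::vector<PlatformSketch>& platforms,
                              int gpu_platform_id, int gpu_device_id) {
  std::size_t p = 0;
  std::size_t d = 0;
  if (!ResolveId(gpu_platform_id, platforms.size(), &p)) return "";
  if (!ResolveId(gpu_device_id, platforms[p].devices.size(), &d)) return "";
  return platforms[p].devices[d];
}
```

The device ID is resolved inside the selected platform, which matches the comment above: each GPU in the selected platform has its own device ID.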
 
@@ -216,11 +231,14 @@ struct BoostingConfig: public ConfigBase {
   // only used for the regression. Will boost from the average labels.
   bool boost_from_average = true;
   std::string tree_learner_type = "serial";
+  std::string device_type = "cpu";
   TreeConfig tree_config;
   LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
 private:
   void GetTreeLearnerType(const std::unordered_map<std::string,
     std::string>& params);
+  void GetDeviceType(const std::unordered_map<std::string,
+    std::string>& params);
 };
 
 /*! \brief Config for Network */

diff --git a/include/LightGBM/dataset.h b/include/LightGBM/dataset.h
@@ -355,6 +355,9 @@ class Dataset {
   inline int Feture2SubFeature(int feature_idx) const {
     return feature2subfeature_[feature_idx];
   }
+  inline uint64_t GroupBinBoundary(int group_idx) const {
+    return group_bin_boundaries_[group_idx];
+  }
   inline uint64_t NumTotalBin() const {
     return group_bin_boundaries_.back();
   }
@@ -421,19 +424,36 @@ class Dataset {
     const int sub_feature = feature2subfeature_[i];
     return feature_groups_[group]->bin_mappers_[sub_feature]->num_bin();
   }
+
+  inline int FeatureGroupNumBin(int group) const {
+    return feature_groups_[group]->num_total_bin_;
+  }
 
   inline const BinMapper* FeatureBinMapper(int i) const {
     const int group = feature2group_[i];
     const int sub_feature = feature2subfeature_[i];
     return feature_groups_[group]->bin_mappers_[sub_feature].get();
   }
 
+  inline const Bin* FeatureBin(int i) const {
+    const int group = feature2group_[i];
+    return feature_groups_[group]->bin_data_.get();
+  }
+
+  inline const Bin* FeatureGroupBin(int group) const {
+    return feature_groups_[group]->bin_data_.get();
+  }
+
   inline BinIterator* FeatureIterator(int i) const {
     const int group = feature2group_[i];
     const int sub_feature = feature2subfeature_[i];
     return feature_groups_[group]->SubFeatureIterator(sub_feature);
   }
 
+  inline BinIterator* FeatureGroupIterator(int group) const {
+    return feature_groups_[group]->FeatureGroupIterator();
+  }
+
   inline double RealThreshold(int i, uint32_t threshold) const {
     const int group = feature2group_[i];
     const int sub_feature = feature2subfeature_[i];
@@ -461,6 +481,9 @@ class Dataset {
   /*! \brief Get Number of used features */
   inline int num_features() const { return num_features_; }
 
+  /*! \brief Get Number of feature groups */
+  inline int num_feature_groups() const { return num_groups_;}
+
   /*! \brief Get Number of total features */
   inline int num_total_features() const { return num_total_features_; }
 
@@ -516,6 +539,8 @@ class Dataset {
   Metadata metadata_;
   /*! \brief index of label column */
   int label_idx_ = 0;
+  /*! \brief Threshold for treating a feature as a sparse feature */
+  double sparse_threshold_;
   /*! \brief store feature names */
   std::vector<std::string> feature_names_;
   /*! \brief store feature names */

diff --git a/include/LightGBM/feature_group.h b/include/LightGBM/feature_group.h
@@ -25,10 +25,11 @@ class FeatureGroup {
   * \param bin_mappers Bin mapper for features
   * \param num_data Total number of data
   * \param is_enable_sparse True if enable sparse feature
+  * \param sparse_threshold Threshold for treating a feature as a sparse feature
   */
   FeatureGroup(int num_feature,
     std::vector<std::unique_ptr<BinMapper>>& bin_mappers,
-    data_size_t num_data, bool is_enable_sparse) : num_feature_(num_feature) {
+    data_size_t num_data, double sparse_threshold, bool is_enable_sparse) : num_feature_(num_feature) {
     CHECK(static_cast<int>(bin_mappers.size()) == num_feature);
     // use bin at zero to store default_bin
     num_total_bin_ = 1;
@@ -46,7 +47,7 @@ class FeatureGroup {
     }
     double sparse_rate = 1.0f - static_cast<double>(cnt_non_zero) / (num_data);
     bin_data_.reset(Bin::CreateBin(num_data, num_total_bin_,
-      sparse_rate, is_enable_sparse, &is_sparse_));
+      sparse_rate, is_enable_sparse, sparse_threshold, &is_sparse_));
   }
   /*!
   * \brief Constructor from memory
@@ -120,6 +121,18 @@ class FeatureGroup {
     uint32_t default_bin = bin_mappers_[sub_feature]->GetDefaultBin();
     return bin_data_->GetIterator(min_bin, max_bin, default_bin);
   }
+
+  /*!
+   * \brief Returns a BinIterator that can access the entire feature group's raw data.
+   *        The RawGet() function of the iterator should be called for best efficiency.
+   * \return A pointer to the BinIterator object
+   */
+  inline BinIterator* FeatureGroupIterator() {
+    uint32_t min_bin = bin_offsets_[0];
+    uint32_t max_bin = bin_offsets_.back() - 1;
+    uint32_t default_bin = 0;
+    return bin_data_->GetIterator(min_bin, max_bin, default_bin);
+  }
 
   inline data_size_t Split(
     int sub_feature,

diff --git a/include/LightGBM/tree_learner.h b/include/LightGBM/tree_learner.h
@@ -24,8 +24,9 @@ class TreeLearner {
   /*!
   * \brief Initialize tree learner with training dataset
   * \param train_data The used training data
+  * \param is_constant_hessian True if all hessians share the same value
   */
-  virtual void Init(const Dataset* train_data) = 0;
+  virtual void Init(const Dataset* train_data, bool is_constant_hessian) = 0;
 
   virtual void ResetTrainingData(const Dataset* train_data) = 0;
 
@@ -71,10 +72,12 @@ class TreeLearner {
 
   /*!
   * \brief Create object of tree learner
-  * \param type Type of tree learner
+  * \param learner_type Type of tree learner
+  * \param device_type Type of device to run the tree learner on, e.g. "cpu" or "gpu"
   * \param tree_config config of tree
   */
-  static TreeLearner* CreateTreeLearner(const std::string& type,
+  static TreeLearner* CreateTreeLearner(const std::string& learner_type,
+    const std::string& device_type,
     const TreeConfig* tree_config);
 };
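
Note on the new `CreateTreeLearner` signature: the factory now selects on two strings, the learner type and the device type. The implementation is not part of this excerpt, so the following is only a sketch of how such a two-key dispatch could look. Every type and function name in it is hypothetical, and it covers only the `"serial"` learner.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical minimal interface standing in for TreeLearner.
struct LearnerSketch {
  virtual ~LearnerSketch() = default;
  virtual std::string Name() const = 0;
};

struct CpuSerialSketch : LearnerSketch {
  std::string Name() const override { return "serial/cpu"; }
};

struct GpuSerialSketch : LearnerSketch {
  std::string Name() const override { return "serial/gpu"; }
};

// Dispatches on (learner_type, device_type); returns nullptr for any
// combination this sketch does not know about.
inline std::unique_ptr<LearnerSketch> CreateLearnerSketch(
    const std::string& learner_type, const std::string& device_type) {
  if (learner_type == "serial") {
    if (device_type == "cpu") {
      return std::unique_ptr<LearnerSketch>(new CpuSerialSketch());
    }
    if (device_type == "gpu") {
      return std::unique_ptr<LearnerSketch>(new GpuSerialSketch());
    }
  }
  return nullptr;
}
```

Keeping the device as a separate key means a new device type adds one branch per learner type and leaves the existing learner names unchanged.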
 

diff --git a/src/boosting/gbdt.cpp b/src/boosting/gbdt.cpp
@@ -92,10 +92,10 @@ void GBDT::ResetTrainingData(const BoostingConfig* config, const Dataset* train_
 
   if (train_data_ != train_data && train_data != nullptr) {
     if (tree_learner_ == nullptr) {
-      tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, &new_config->tree_config));
+      tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, new_config->device_type, &new_config->tree_config));
     }
     // init tree learner
-    tree_learner_->Init(train_data);
+    tree_learner_->Init(train_data, is_constant_hessian_);
 
     // push training metrics
     training_metrics_.clear();

diff --git a/src/io/bin.cpp b/src/io/bin.cpp
@@ -339,12 +339,10 @@ template class OrderedSparseBin<uint8_t>;
 template class OrderedSparseBin<uint16_t>;
 template class OrderedSparseBin<uint32_t>;
 
-double BinMapper::kSparseThreshold = 0.8f;
-
 Bin* Bin::CreateBin(data_size_t num_data, int num_bin, double sparse_rate, 
-  bool is_enable_sparse, bool* is_sparse) {
+  bool is_enable_sparse, double sparse_threshold, bool* is_sparse) {
   // sparse threshold
-  if (sparse_rate >= BinMapper::kSparseThreshold && is_enable_sparse) {
+  if (sparse_rate >= sparse_threshold && is_enable_sparse) {
     *is_sparse = true;
     return CreateSparseBin(num_data, num_bin);
   } else {