LightGBM4j is a zero-dependency Java wrapper for the LightGBM project. Its main goal is to provide a 1-1 mapping for all LightGBM API methods in a Java-friendly flavor.
LightGBM itself has a SWIG-generated JNI interface, which can be used directly from Java. The problem with SWIG wrappers is that they are extremely low-level. For example, to pass a Java array through SWIG, you need to do something horrible:
SWIGTYPE_p_float dataBuffer = new_floatArray(input.length);
for (int i = 0; i < input.length; i++) {
floatArray_setitem(dataBuffer, i, input[i]);
}
int result = <...>
if (result < 0) {
delete_floatArray(dataBuffer);
throw new Exception(LGBM_GetLastError());
} else {
delete_floatArray(dataBuffer);
<...>
}
This wrapper does all the dirty work for you:
- exposes native Java types for all supported API methods (so float[] instead of SWIGTYPE_p_float)
- handles manual memory management internally (so you don't need to care about JNI memory leaks)
- supports both float[] and double[] API flavours
- reduces the amount of boilerplate for basic tasks
The library is in an early development stage and does not yet cover 100% of the LightGBM API. The eventual goal is to merge with upstream LightGBM and become an official Java binding for the project.
To install, use the following Maven coordinates:
<dependency>
<groupId>io.github.metarank</groupId>
<artifactId>lightgbm4j</artifactId>
<version>3.3.2-2</version>
</dependency>
The versioning scheme attempts to match upstream, but adds an extra -N suffix when lightgbm4j-specific changes have been released on top of that upstream version.
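To make the scheme concrete, here is a minimal sketch that splits a version such as 3.3.2-2 into its upstream part and the wrapper revision. The WrapperVersion class is hypothetical and not part of LightGBM4j. It also assumes that the suffix is always numeric and that a version without a suffix can be treated as revision 0.

```java
// Hypothetical helper, NOT part of LightGBM4j: splits "3.3.2-2" into the
// upstream LightGBM version and the wrapper-specific revision.
class WrapperVersion {
    final String upstream;
    final int revision;

    private WrapperVersion(String upstream, int revision) {
        this.upstream = upstream;
        this.revision = revision;
    }

    static WrapperVersion parse(String version) {
        int dash = version.lastIndexOf('-');
        if (dash < 0) {
            // Assumption: no -N suffix means no wrapper-specific release yet.
            return new WrapperVersion(version, 0);
        }
        String upstream = version.substring(0, dash);
        // Assumption: the suffix is a plain integer.
        int revision = Integer.parseInt(version.substring(dash + 1));
        return new WrapperVersion(upstream, revision);
    }

    public static void main(String[] args) {
        WrapperVersion v = parse("3.3.2-2");
        System.out.println("upstream=" + v.upstream + " revision=" + v.revision);
    }
}
```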
There are two main classes available:
- LGBMDataset to manage input training and validation data.
- LGBMBooster to do training and inference.
All the public API methods in these classes should map to the LightGBM C API methods directly.
Note that both LGBMBooster and LGBMDataset hold handles to native memory data structures from LightGBM, so you need to explicitly call .close() when they are no longer used. Otherwise, you may get a native memory leak.
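One way to make sure .close() is always reached, even when something in between throws, is a try/finally block. The sketch below illustrates only the pattern: FakeNativeResource is a hypothetical stand-in written for this example, not a LightGBM4j class, so that the code runs without the native library.

```java
// Hypothetical stand-in for a class that owns native memory, such as
// LGBMDataset or LGBMBooster. It is NOT part of LightGBM4j.
class FakeNativeResource {
    private boolean closed = false;

    // Simulates a call that may fail after native memory was allocated.
    double use(boolean fail) {
        if (closed) throw new IllegalStateException("already closed");
        if (fail) throw new RuntimeException("simulated native error");
        return 42.0;
    }

    void close() { closed = true; }

    boolean isClosed() { return closed; }
}

class CloseExample {
    // Uses the resource and always releases it, whether or not use() throws.
    static boolean closedAfterUse(boolean fail) {
        FakeNativeResource res = new FakeNativeResource();
        try {
            res.use(fail);
        } catch (RuntimeException e) {
            // Swallowed only to keep the sketch short; real code should handle it.
        } finally {
            res.close(); // runs on both the success and the failure path
        }
        return res.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse(false));
        System.out.println(closedAfterUse(true));
    }
}
```

The same shape applies to real datasets and boosters: create them before the try, and call .close() in the finally.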
To load an existing model and run it:
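// model is a String holding a previously saved LightGBM model
LGBMBooster loaded = LGBMBooster.loadModelFromString(model);
// a 2x2 input matrix, flattened into a single array
float[] input = new float[] {1.0f, 1.0f, 1.0f, 1.0f};
double[] pred = loaded.predictForMat(input, 2, 2, true);
// release the native memory held by the booster
loaded.close();
To load a dataset from a Java matrix: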
float[] matrix = new float[] {1.0f, 1.0f, 1.0f, 1.0f};
LGBMDataset ds = LGBMDataset.createFromMat(matrix, 2, 2, true, "", null);
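Since createFromMat takes a flat array plus the row and column counts, a float[][] has to be flattened first. Here is a minimal sketch of such a helper. RowMajor is a hypothetical class, not part of LightGBM4j, and it assumes that the boolean argument passed as true above selects row-major layout (all values of row 0, then all values of row 1, and so on).

```java
// Hypothetical helper, NOT part of LightGBM4j: flattens a rectangular
// float[][] into one row-major float[] (row 0 first, then row 1, ...).
class RowMajor {
    static float[] flatten(float[][] rows) {
        int nRows = rows.length;
        int nCols = nRows == 0 ? 0 : rows[0].length;
        float[] flat = new float[nRows * nCols];
        for (int r = 0; r < nRows; r++) {
            if (rows[r].length != nCols) {
                throw new IllegalArgumentException("row " + r + " has a different length");
            }
            // copy the whole row into its slot in the flat array
            System.arraycopy(rows[r], 0, flat, r * nCols, nCols);
        }
        return flat;
    }

    public static void main(String[] args) {
        float[] flat = flatten(new float[][] {{1.0f, 2.0f}, {3.0f, 4.0f}});
        System.out.println(java.util.Arrays.toString(flat));
    }
}
```

The flattened array can then be passed together with the row count and column count of the original matrix.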
There are some rough edges in the LightGBM API when loading a dataset from matrices:
- createFromMat parameters cannot set the label or weight column. So if you pass parameters = "label=some_column_name", it will be ignored by LightGBM.
- label/weight/group columns are magical and should NOT be included in the input matrix for createFromMat
- to set these magical columns, you need to explicitly call the LGBMDataset.setField() method
- label and weight columns must be float[]
- group column must be int[]
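If your raw data keeps the label in the same table as the features, it has to be split out before calling createFromMat. The sketch below shows one way to do that in plain Java. LabelSplit is a hypothetical helper written for this example, not part of LightGBM4j, and it assumes a rectangular double[][] table with the label stored in one known column.

```java
// Hypothetical helper, NOT part of LightGBM4j: separates the label column
// from a table, so the features can go into createFromMat and the labels
// into setField("label", ...).
class LabelSplit {
    final double[] features; // row-major, label column removed
    final float[] labels;    // float[], as required for the label field
    final int rows;
    final int cols;          // number of feature columns

    private LabelSplit(double[] features, float[] labels, int rows, int cols) {
        this.features = features;
        this.labels = labels;
        this.rows = rows;
        this.cols = cols;
    }

    static LabelSplit split(double[][] table, int labelCol) {
        int rows = table.length;
        if (rows == 0) throw new IllegalArgumentException("empty table");
        int cols = table[0].length - 1;
        if (labelCol < 0 || labelCol > cols) {
            throw new IllegalArgumentException("label column out of range");
        }
        double[] features = new double[rows * cols];
        float[] labels = new float[rows];
        for (int r = 0; r < rows; r++) {
            if (table[r].length != cols + 1) {
                throw new IllegalArgumentException("row " + r + " has a different length");
            }
            int out = r * cols; // start of this row in the flat feature array
            for (int c = 0; c <= cols; c++) {
                if (c == labelCol) {
                    labels[r] = (float) table[r][c];
                } else {
                    features[out++] = table[r][c];
                }
            }
        }
        return new LabelSplit(features, labels, rows, cols);
    }
}
```

The resulting features, rows and cols map onto the first three arguments of createFromMat, and labels is what you pass to setField("label", ...).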
A full example of loading a dataset from a matrix, using a cancer dataset:
String[] columns = new String[] {
"Age","BMI","Glucose","Insulin","HOMA","Leptin","Adiponectin","Resistin","MCP.1"
};
double[] values = new double[] {
71,30.3,102,8.34,2.098344,56.502,8.13,4.2989,200.976,
66,27.7,90,6.042,1.341324,24.846,7.652055,6.7052,225.88,
75,25.7,94,8.079,1.8732508,65.926,3.74122,4.49685,206.802,
78,25.3,60,3.508,0.519184,6.633,10.567295,4.6638,209.749,
69,29.4,89,10.704,2.3498848,45.272,8.2863,4.53,215.769,
85,26.6,96,4.462,1.0566016,7.85,7.9317,9.6135,232.006,
76,27.1,110,26.211,7.111918,21.778,4.935635,8.49395,45.843,
77,25.9,85,4.58,0.960273333,13.74,9.75326,11.774,488.829,
45,21.30394858,102,13.852,3.4851632,7.6476,21.056625,23.03408,552.444,
45,20.82999519,74,4.56,0.832352,7.7529,8.237405,28.0323,382.955,
49,20.9566075,94,12.305,2.853119333,11.2406,8.412175,23.1177,573.63,
34,24.24242424,92,21.699,4.9242264,16.7353,21.823745,12.06534,481.949,
42,21.35991456,93,2.999,0.6879706,19.0826,8.462915,17.37615,321.919,
68,21.08281329,102,6.2,1.55992,9.6994,8.574655,13.74244,448.799,
51,19.13265306,93,4.364,1.0011016,11.0816,5.80762,5.57055,90.6,
62,22.65625,92,3.482,0.790181867,9.8648,11.236235,10.69548,703.973
};
float[] labels = new float[] {
0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
};
LGBMDataset dataset = LGBMDataset.createFromMat(values, 16, columns.length, true, "", null);
dataset.setFeatureNames(columns);
dataset.setField("label", labels);
return dataset;
Also, see a working example of different ways to deal with input datasets in the LightGBM4j tests.
// cancer dataset from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra
// with labels altered to fit the [0,1] range
LGBMDataset train = LGBMDataset.createFromFile("cancer.csv", "header=true label=name:Classification", null);
LGBMDataset test = LGBMDataset.createFromFile("cancer-test.csv", "header=true label=name:Classification", train);
LGBMBooster booster = LGBMBooster.create(train, "objective=binary label=name:Classification");
booster.addValidData(test);
for (int i=0; i<10; i++) {
booster.updateOneIter();
double[] evalTrain = booster.getEval(0);
double[] evalTest = booster.getEval(1);
System.out.println("train: " + evalTrain[0] + " test: " + evalTest[0]);
}
booster.close();
train.close();
test.close();
This code is tested to work well on Linux (Ubuntu 20.04), Windows (Server 2019) and macOS 10.15/11. Mac M1 is also supported. Supported Java versions are 8, 11 and 17.
Not all LightGBM API methods are covered in this wrapper. PRs are welcome!
Supported methods:
- LGBM_BoosterAddValidData
- LGBM_BoosterCreate
- LGBM_BoosterCreateFromModelfile
- LGBM_BoosterFree
- LGBM_BoosterGetEval
- LGBM_BoosterGetFeatureNames
- LGBM_BoosterFeatureImportance
- LGBM_BoosterGetEvalNames
- LGBM_BoosterGetNumFeature
- LGBM_BoosterLoadModelFromString
- LGBM_BoosterPredictForMat
- LGBM_BoosterPredictForMatSingleRow
- LGBM_BoosterSaveModel
- LGBM_BoosterSaveModelToString
- LGBM_BoosterUpdateOneIter
- LGBM_DatasetCreateFromFile
- LGBM_DatasetCreateFromMat
- LGBM_DatasetFree
- LGBM_DatasetGetFeatureNames
- LGBM_DatasetGetNumData
- LGBM_DatasetGetNumFeature
- LGBM_GetLastError
- LGBM_DatasetSetFeatureNames
- LGBM_DatasetSetField
- LGBM_DatasetDumpText
Not yet supported:
- LGBM_BoosterCalcNumPredict
- LGBM_BoosterDumpModel
- LGBM_BoosterFreePredictSparse
- LGBM_BoosterGetCurrentIteration
- LGBM_BoosterGetEvalCounts
- LGBM_BoosterGetLeafValue
- LGBM_BoosterGetLowerBoundValue
- LGBM_BoosterGetNumClasses
- LGBM_BoosterGetNumPredict
- LGBM_BoosterGetPredict
- LGBM_BoosterGetUpperBoundValue
- LGBM_BoosterMerge
- LGBM_BoosterNumberOfTotalModel
- LGBM_BoosterNumModelPerIteration
- LGBM_BoosterPredictForCSC
- LGBM_BoosterPredictForCSR
- LGBM_BoosterPredictForCSRSingleRow
- LGBM_BoosterPredictForCSRSingleRowFast
- LGBM_BoosterPredictForCSRSingleRowFastInit
- LGBM_BoosterPredictForFile
- LGBM_BoosterPredictForMats
- LGBM_BoosterPredictForMatSingleRowFast
- LGBM_BoosterPredictForMatSingleRowFastInit
- LGBM_BoosterPredictSparseOutput
- LGBM_BoosterRefit
- LGBM_BoosterResetParameter
- LGBM_BoosterResetTrainingData
- LGBM_BoosterRollbackOneIter
- LGBM_BoosterSetLeafValue
- LGBM_BoosterShuffleModels
- LGBM_BoosterUpdateOneIterCustom
- LGBM_DatasetAddFeaturesFrom
- LGBM_DatasetCreateByReference
- LGBM_DatasetCreateFromCSC
- LGBM_DatasetCreateFromCSR
- LGBM_DatasetCreateFromCSRFunc
- LGBM_DatasetCreateFromMats
- LGBM_DatasetCreateFromSampledColumn
- LGBM_DatasetGetField
- LGBM_DatasetGetSubset
- LGBM_DatasetPushRows
- LGBM_DatasetPushRowsByCSR
- LGBM_DatasetSaveBinary
- LGBM_DatasetUpdateParamChecking
- LGBM_FastConfigFree
As LightGBM4j repackages bits of SWIG wrapper code from the original LightGBM authors, it also uses exactly the same license.
The MIT License (MIT)
Copyright (c) Microsoft Corporation
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.