Support training data with dir [multi libsvm data] in the CLI version #5417

Closed · zwqjoy opened this issue Aug 12, 2022 · 5 comments

zwqjoy commented Aug 12, 2022
I am aware of the MMLSpark solution you referenced.

I use distributed XGBoost (not with Spark), and it supports a directory of multiple files as input. But I want to use LightGBM to train my model.
To overcome memory bottlenecks with pandas, I use the CLI version, which is very efficient.
The problem is that the CLI version expects ONE training file as input. If the training data is spread across many files (libsvm), which it usually is, concatenating them is a huge pain. If it is all going to end up inside LightGBM anyway, why not read the individual files and concatenate the data inside LightGBM?

Many thanks!

jameslamb (Collaborator) commented

Thanks for using LightGBM!

Since you mentioned pandas, I'm assuming you are comfortable working in Python.

Option 1 - use Dask

Would you consider the Dask interface in lightgbm.dask? You could construct a dask.Array from a directory of libsvm files, then pass that into LightGBM training.

If you're open to that, I'd be happy to provide a reproducible example showing how to do that.
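For illustration, a minimal sketch of what that could look like (this is not a tested recipe from the thread: the ./data/*.svm paths, the fixed N_FEATURES, and densifying with .toarray() are all assumptions):

```python
# Hedged sketch: build a dask.Array from a directory of libsvm files and
# train with lightgbm.dask. Assumes all files share one known feature count.
import glob

import dask.array as da
import numpy as np
from dask import delayed
from dask.distributed import Client
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumption: consistent, known feature count across files


def load_features(path):
    # load_svmlight_file returns (sparse X, y); densify for simplicity here
    X, _ = load_svmlight_file(path, n_features=N_FEATURES)
    return X.toarray()


def load_labels(path):
    _, y = load_svmlight_file(path, n_features=N_FEATURES)
    return y


if __name__ == "__main__":
    client = Client()  # local cluster; point at a real cluster for big data

    paths = sorted(glob.glob("./data/*.svm"))

    # Each file becomes one chunk; row counts are unknown up front (np.nan)
    X = da.concatenate([
        da.from_delayed(delayed(load_features)(p),
                        shape=(np.nan, N_FEATURES), dtype=np.float64)
        for p in paths
    ])
    y = da.concatenate([
        da.from_delayed(delayed(load_labels)(p),
                        shape=(np.nan,), dtype=np.float64)
        for p in paths
    ])

    model = lgb.DaskLGBMRegressor(n_estimators=100)
    model.fit(X, y)
```

Note the sketch reads each file twice (once for features, once for labels); a cleaner version would load each file once and split the result.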

Option 2 - use lightgbm.Sequence

Alternatively, you could try the lightgbm.Sequence interface in the Python package, which allows creating a Dataset from batches of data.

See https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py for an example of how to do this with a directory of HDF5 files. You could try modifying that code to work with libsvm files.
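As a rough illustration of the API shape (again with assumptions: ./data/*.svm paths, a known feature count, and eager per-file loading, whereas a genuinely out-of-core version would read rows lazily):

```python
# Hedged sketch: adapt the multi-file Dataset idea to libsvm via lgb.Sequence.
import glob

import numpy as np
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumption: consistent feature count across files


class LibsvmSequence(lgb.Sequence):
    def __init__(self, path, batch_size=4096):
        # Eagerly loads one file for brevity; an out-of-core version would
        # read rows lazily from a row-addressable format instead.
        X, y = load_svmlight_file(path, n_features=N_FEATURES)
        self.data = X.toarray()
        self.label = y
        self.batch_size = batch_size  # rows LightGBM reads per batch

    def __getitem__(self, idx):
        # must support both a single row index and a slice of rows
        return self.data[idx]

    def __len__(self):
        return self.data.shape[0]


paths = sorted(glob.glob("./data/*.svm"))
seqs = [LibsvmSequence(p) for p in paths]
label = np.concatenate([s.label for s in seqs])

# a list of Sequence objects can back a single Dataset
train_set = lgb.Dataset(seqs, label=label)
booster = lgb.train({"objective": "regression"}, train_set)
```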

zwqjoy (Author) commented Aug 25, 2022

@jameslamb
The training data may be 100 GB or larger, so loading it with pandas may cause OOM. So the CLI is the best way for me to train distributed LightGBM, but the training data is split across many parts.

jameslamb (Collaborator) commented

> loading it with pandas may cause OOM. So the CLI is the best way for me to train distributed LightGBM

To be clear, I didn't recommend using pandas. The Dask interface performs distributed training the same way the CLI does. It's true that reading files into Python data structures like numpy arrays might require more memory than the CLI uses when reading files, but I recommend trying it before assuming it definitely won't work with the amount of data you have.

The Sequence interface I recommended also does not require pandas, and allows you to construct a Dataset from a directory of files by reading in one file at a time and incrementally updating the Dataset. With that interface, the entire raw training set never needs to be held in memory at one time.


There is already a feature request in this project's backlog for supporting a directory of files as input to CLI training (#2031), along with a few other related conversations.

Just to set the right expectation: I doubt that feature will be implemented by maintainers soon. There is significant other work that needs to be done in the project to get to its 4.0.0 release (see the conversation in #5153).

So if the Python options I've provided above don't work for your use case, and neither does Spark (as @StrikerRUS recommended to you in the discussion in #2031), then you will either need to watch those issues and wait for them to be implemented, or attempt to implement this support yourself and open a pull request adding it.

github-actions (bot) commented

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions (bot) commented

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023