Support training data with dir [multi libsvm data] in the CLI version #5417

Closed · zwqjoy opened this issue Aug 12, 2022 · 5 comments

zwqjoy commented Aug 12, 2022
I am aware of the MMLSpark solution you referenced.

I use distributed XGBoost (not with Spark), and it supports a directory of multiple files as input. But I want to use LightGBM to train my model.
To overcome memory bottlenecks with pandas, I use the CLI version, which is very efficient.
The problem is that the CLI version expects ONE training file as input. If the training data is spread across many files (libsvm), which it usually is, concatenating them is a huge pain. If it is all going to end up inside LightGBM anyway, why not read the individual files and concatenate the data inside LightGBM?

Many thanks!

jameslamb (Collaborator) commented

Thanks for using LightGBM!

Since you mentioned pandas, I'm assuming you are comfortable working in Python.

Option 1 - use Dask

Would you consider the Dask interface in lightgbm.dask? You could construct a dask.Array from a directory of libsvm files, then pass that into LightGBM training.

If you're open to that, I'd be happy to provide a reproducible example showing how to do that.
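For illustration, a minimal sketch of what that could look like (this is not a tested recipe from the thread: the ./data/*.svm paths, the fixed N_FEATURES, and densifying with .toarray() are all assumptions):

```python
# Hedged sketch: build a dask.Array from a directory of libsvm files and
# train with lightgbm.dask. Assumes all files share one known feature count.
import glob

import dask.array as da
import numpy as np
from dask import delayed
from dask.distributed import Client
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumption: consistent, known feature count across files


def load_features(path):
    # load_svmlight_file returns (sparse X, y); densify for simplicity here
    X, _ = load_svmlight_file(path, n_features=N_FEATURES)
    return X.toarray()


def load_labels(path):
    _, y = load_svmlight_file(path, n_features=N_FEATURES)
    return y


if __name__ == "__main__":
    client = Client()  # local cluster; point at a real cluster for big data

    paths = sorted(glob.glob("./data/*.svm"))

    # Each file becomes one chunk; row counts are unknown up front (np.nan)
    X = da.concatenate([
        da.from_delayed(delayed(load_features)(p),
                        shape=(np.nan, N_FEATURES), dtype=np.float64)
        for p in paths
    ])
    y = da.concatenate([
        da.from_delayed(delayed(load_labels)(p),
                        shape=(np.nan,), dtype=np.float64)
        for p in paths
    ])

    model = lgb.DaskLGBMRegressor(n_estimators=100)
    model.fit(X, y)
```

Note the sketch reads each file twice (once for features, once for labels); a cleaner version would load each file once and split the result.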

Option 2 - use lightgbm.Sequence

Alternatively, you could try the lightgbm.Sequence interface in the Python package, which allows creating a Dataset from batches of data.

See https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py for an example of how to do this with a directory of HDF5 files. You could try modifying that code to work with libsvm files.
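As a rough illustration of the API shape (again with assumptions: ./data/*.svm paths, a known feature count, and eager per-file loading, whereas a genuinely out-of-core version would read rows lazily):

```python
# Hedged sketch: adapt the multi-file Dataset idea to libsvm via lgb.Sequence.
import glob

import numpy as np
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumption: consistent feature count across files


class LibsvmSequence(lgb.Sequence):
    def __init__(self, path, batch_size=4096):
        # Eagerly loads one file for brevity; an out-of-core version would
        # read rows lazily from a row-addressable format instead.
        X, y = load_svmlight_file(path, n_features=N_FEATURES)
        self.data = X.toarray()
        self.label = y
        self.batch_size = batch_size  # rows LightGBM reads per batch

    def __getitem__(self, idx):
        # must support both a single row index and a slice of rows
        return self.data[idx]

    def __len__(self):
        return self.data.shape[0]


paths = sorted(glob.glob("./data/*.svm"))
seqs = [LibsvmSequence(p) for p in paths]
label = np.concatenate([s.label for s in seqs])

# a list of Sequence objects can back a single Dataset
train_set = lgb.Dataset(seqs, label=label)
booster = lgb.train({"objective": "regression"}, train_set)
```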

zwqjoy (Author) commented Aug 25, 2022

@jameslamb
The training data may be 100 GB or larger, so loading it with pandas may cause OOM. So the CLI is the best way for me to train distributed LightGBM, but the training data is split across many parts.

jameslamb (Collaborator) commented

> loading it with pandas may cause OOM. So the CLI is the best way for me to train distributed LightGBM

To be clear, I didn't recommend using pandas. The Dask interface performs distributed training the same way the CLI does. It's true that reading files into Python data structures like numpy arrays might require more memory than the CLI uses when reading files, but I recommend trying it before assuming it definitely won't work with the amount of data you have.

The Sequence interface I recommended also does not require pandas, and allows you to construct a Dataset from a directory of files by reading in one file at a time and incrementally updating the Dataset. With that interface, the entire raw training set never needs to be held in memory at one time.


There is already a feature request in this project's backlog for supporting a directory of files as input to CLI training (#2031), along with a few other related conversations.

Just to set the right expectation: I doubt that feature will be implemented by maintainers soon. There is significant other work that needs to be done in the project to get to its 4.0.0 release (see the conversation in #5153).

So if the Python options I've provided above don't work for your use case, and neither does Spark (as @StrikerRUS recommended to you in the discussion in #2031), then you will either need to watch those issues and wait for them to be implemented, or attempt to implement this support yourself and open a pull request adding it.

github-actions (bot) commented

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions (bot) commented

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023