Support training data from a directory [multiple libsvm files] in the CLI version #5417
Comments
Thanks for using LightGBM! Since you mentioned
Option 1 - use Dask
Would you consider the Dask interface in
If you're open to that, I'd be happy to provide a reproducible example showing how to do that.
Option 2 - use
@jameslamb
To be clear, I didn't recommend using The There is already a feature request in this project's backlog for supporting providing a directory of files as input to training in the CLI (#2031). And linking a few other related conversations:
Just to set the right expectation: I doubt that that feature will be implemented by maintainers soon. There is significant other work that needs to be done in the project to get to its 4.0.0 release (see the conversation in #5153). So if the Python options I've provided above don't work for your use case, and neither does Spark (as @StrikerRUS recommended to you in the discussion in #2031), then you will either need to watch those issues and wait for them to be implemented, or attempt to implement this support yourself and open a pull request adding it.
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed.
I am aware of the MMLSpark solution you referenced.
I use distributed XGBoost (without Spark), which supports a directory of multiple files as input. But I want to use LightGBM to train my model.
To overcome memory bottlenecks with pandas, I utilize the CLI version which is super efficient.
The problem is that the CLI version wants ONE training file as input. If the training data is spread across many libsvm files, which it usually is, concatenating them is a huge pain in the neck. If it is all going to end up inside LightGBM anyway, why not read the individual files and concatenate the data inside LightGBM?
Many thanks!
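Until the CLI accepts a directory, one stopgap for the single-file limitation described above is to stream the part files into one file without loading them into memory. The sketch below is only an illustration, not something from the LightGBM project: the helper name `concat_libsvm_files`, the `*.libsvm` glob pattern, and the output file name are all assumptions. It relies on libsvm files having no header row, so appending them byte-for-byte (with a newline added wherever a part does not end in one) yields a valid combined file.

```python
from pathlib import Path


def concat_libsvm_files(input_dir, output_path, pattern="*.libsvm", chunk_size=1 << 20):
    """Stream every file matching `pattern` in `input_dir` into `output_path`.

    Hypothetical helper: the name and the default pattern are assumptions.
    Returns the number of part files that were appended.
    """
    output_path = Path(output_path)
    # Sort for a deterministic row order; skip the output file itself in case
    # it lives in the same directory and matches the pattern.
    parts = sorted(
        p for p in Path(input_dir).glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for part in parts:
            last_byte = b"\n"  # an empty part needs no separator
            with open(part, "rb") as src:
                # Copy in fixed-size chunks so memory use stays constant.
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Keep the last row of this part from merging with the first
            # row of the next part.
            if last_byte != b"\n":
                out.write(b"\n")
    return len(parts)
```

The combined file can then be passed to the CLI through its `data` parameter, for example `data=train.libsvm` in the config file. This still costs one extra copy of the data on disk, which is the overhead that native directory support would avoid.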