Multi file as input for LightGBM #2031
Comments
Good advice. I'm supporting a folder of files as input in my private LightGBM fork.
@andrewliuxxx Great! Would you mind creating a PR?
I have also been working on fixing pipe read support, e.g.
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
One more request for using multiple files (preferably in Parquet format) on one machine as input: #2638.
@StrikerRUS Hi, is there a target date for supporting training data from multiple Parquet files with the CLI version?
@zwqjoy Hey! To my knowledge, no one has picked this feature request up yet. Maybe MMLSpark or dataset creation from multiple files in Python will fit your needs?
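For illustration, a minimal Python sketch of the second option (building one `lgb.Dataset` from several Parquet part files) might look like the following; the file pattern and the `label` column name are assumptions, not anything prescribed by LightGBM:

```python
# Minimal sketch: read several Parquet part files, concatenate them,
# and build a single lgb.Dataset. Paths and the "label" column are hypothetical.
import glob

import pandas as pd
import lightgbm as lgb

# Collect all part files produced by an upstream job (hypothetical path/pattern).
part_files = sorted(glob.glob("data/train/part-*.parquet"))

# Concatenate the parts into one in-memory frame.
df = pd.concat((pd.read_parquet(p) for p in part_files), ignore_index=True)

# "label" is an assumed column name; adjust to your schema.
X = df.drop(columns=["label"])
y = df["label"]

train_set = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary", "verbose": -1}, train_set, num_boost_round=100)
```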
@StrikerRUS Thanks. Do you mean that if I want to train distributed LightGBM (data in Hadoop, stored as many part files), I need to use MMLSpark rather than the CLI distributed method, because the CLI input does not support many part files?
@zwqjoy Yeah, exactly. The CLI distributed version requires that the entire training data file be present on each machine. Please try MMLSpark for chunked data.
Actually, distributed training with the CLI supports partitioning the data across machines. See
@shiyu1994 Ah, forgot about this option, thanks for correcting me!
@shiyu1994 In CLI mode, both non-pre-partitioned and pre-partitioned data can work. If I use pre-partitioned data, I set pre_partition=true. For example, I use 4 workers and store the data on NFS, split into data_split1, data_split2, data_split3, and data_split4 (the 4 file names differ, but all are stored on NFS and all 4 workers can access the NFS storage); worker1 uses data_split1, worker2 uses data_split2, and so on.
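For illustration, a hedged sketch of per-worker config files for a setup like the one described above might look like this. The paths, port, and mlist.txt file are assumptions rather than a tested recipe; the parameter names (pre_partition, tree_learner, num_machines, machine_list_filename) are taken from the LightGBM parameter docs:

```python
# Generate one CLI config per worker; each config points at that worker's
# own data_split file on shared NFS storage. All paths are hypothetical.
CONFIG_TEMPLATE = """task = train
objective = binary
tree_learner = data
pre_partition = true
num_machines = 4
local_listen_port = 12400
machine_list_filename = /nfs/mlist.txt
data = /nfs/data_split{rank}
output_model = model_rank{rank}.txt
"""

for rank in range(1, 5):
    with open(f"train_worker{rank}.conf", "w") as f:
        f.write(CONFIG_TEMPLATE.format(rank=rank))

# Each worker then runs its own config, e.g.: lightgbm config=train_worker1.conf
```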
@StrikerRUS @shiyu1994
Hi @zwqjoy
And you should be careful that, with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.
@shiyu1994 About pre_partition (https://lightgbm.readthedocs.io/en/latest/Parameters.html#pre_partition), the doc says "true if training data are pre-partitioned, and different machines use different partitions", and you say that different names for the training data files on different machines are perfectly OK. Do you mean that each worker has a local train config file (containing train = part-[number])?
@shiyu1994 You say that with pre_partition=true, each machine only outputs the metric value calculated on the data partitioned to that machine.
When I want to use LightGBM on 'Aether' (a platform in Microsoft), multiple files as input would be faster to upload or to set up as a folder, but the current LightGBM doesn't support multiple files or a dataset folder as input. Because of that, if we want to use it, we have to merge the multiple files into one file, which is time-consuming, especially when the data gets bigger. I wonder, will this be supported in the future?
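For illustration, a minimal sketch of that merge workaround might look like the following, assuming headerless text/CSV part files; the file names are hypothetical:

```python
# Concatenate many part files into a single file that the current CLI
# can accept as its `data` parameter. File names/patterns are hypothetical.
import glob
import shutil

part_files = sorted(glob.glob("upload/train.part-*"))

with open("train_merged.txt", "wb") as merged:
    for path in part_files:
        with open(path, "rb") as part:
            shutil.copyfileobj(part, merged)

# Then point the CLI config at the merged file: data = train_merged.txt
```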