Create Dataset from Arrow format #3369
Comments
Does it mean that right now one can use MMLSpark (https://github.com/Azure/mmlspark) for Arrow + LightGBM (similarly to Parquet, #1286 (comment))?
Possibly! But that definitely does not satisfy this feature request. Spark is a heavy dependency that many users are unlikely to have access to.
I think @shiyu1994 can help with this. He has had some ideas recently about refining the dataset class.
I think it would be the other way around. If LightGBM implemented `datasetFromArrow`, it would probably be useful for speeding up / improving efficiency from within MMLSpark.
Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
For reference: there is a Parquet data reader implementation in XGBoost with an optional Arrow dependency at compile time.
Linking the eventual XGBoost implementation: dmlc/xgboost#7512
Sorry, this was locked accidentally. Just unlocked it.
Summary
Apache Arrow is an open source columnar in-memory storage format that's well-suited to tabular data. It offers efficient data loading from files or other input streams, and zero-copy data sharing between languages.
Motivation
I think that this feature could allow for faster data loading, especially from the Parquet and CSV file formats. It would also allow training directly on Arrow tables, so we might be able to avoid some data copying in language wrappers (e.g. converting to a `pandas` data frame or an R `data.frame`). `pyarrow` offers a fast, efficient Parquet reader. I believe that reading from Parquet files directly into Arrow, then being able to efficiently create a LightGBM Dataset from that `pyarrow` table, would allow for faster I/O and better memory efficiency by avoiding the need to ever create a `pandas` data frame: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
Description
I'm admittedly not very experienced with C++, so maybe others can expand this description. But basically, I think it would involve adding a `LGBM_DatasetCreateFromArrow` similar to `LGBM_DatasetCreateFromCSR` (src/c_api.cpp, line 1245 at commit 82e2ff7).
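Purely as a hypothetical illustration (nothing below existed in LightGBM at the time of this issue), a Python wrapper built on such a C API entry point might let users skip the pandas step entirely. `Table.drop` and `Table.column` are real `pyarrow` methods; an Arrow-aware `lgb.Dataset` is the assumption being sketched.

```python
import lightgbm as lgb
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")  # placeholder file name

# Hypothetical: build the Dataset straight from Arrow data, backed by the
# proposed LGBM_DatasetCreateFromArrow -- no intermediate pandas DataFrame.
features = table.drop(["label"])  # pyarrow.Table.drop(columns)
label = table.column("label")     # returns a pyarrow.ChunkedArray
dataset = lgb.Dataset(features, label=label)  # assumed Arrow support
booster = lgb.train({"objective": "regression"}, dataset)
```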
Arrow is a fairly heavy dependency (and `pyarrow` in Python / `{arrow}` in R, by extension), so an implementation should also explore how to make these optional for users who do not need the Arrow features.
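On the Python side, one common way to keep such a dependency optional is a guarded import; a minimal sketch follows, with the module-level flag and helper names invented for illustration.

```python
# Guarded import: Arrow features degrade gracefully when pyarrow is absent.
try:
    import pyarrow as pa
    PYARROW_INSTALLED = True
except ImportError:
    PYARROW_INSTALLED = False

def _check_arrow_available():
    """Raise a helpful error if Arrow features are requested without pyarrow."""
    if not PYARROW_INSTALLED:
        raise ImportError(
            "pyarrow is required for Arrow-based Dataset construction; "
            "install it with `pip install pyarrow`."
        )
```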
References
There is an in-progress PR to add this feature to XGBoost: dmlc/xgboost#5667
Spark added support for Arrow as a memory representation in PySpark in 2017: https://arrow.apache.org/blog/2017/07/26/spark-arrow/.