optimize(fuse): record scalar column in meta file (or parquet meta)?
Summary
For a row in a large, wide table, many (even most) columns may be null or set to their default values. Such a table might be loaded with a SQL command like COPY INTO wide_table(c1, c100) FROM ..., while wide_table itself may contain 1000 columns.
In memory, the unused columns are represented as Value::Scalar in the DataBlock, which speeds up computation significantly. However, when we translate the DataBlock into an Arrow RecordBatch, each scalar column is flattened into a fully materialized array. This results in:
- Slower load progress.
- When we read the data back, those columns are represented as Value::Column.
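To make the cost concrete, here is a minimal, self-contained Rust sketch of the flattening step. The Value, DataBlock, and flatten names below are simplified stand-ins for illustration, not Databend's actual types:

```rust
// Hypothetical, simplified stand-ins (not Databend's real API).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Scalar(i64),       // one value shared by every row: O(1) memory
    Column(Vec<i64>),  // fully materialized: O(num_rows) memory
}

struct DataBlock {
    num_rows: usize,
    columns: Vec<Value>,
}

/// Flattening analogous to the DataBlock -> Arrow RecordBatch conversion:
/// every Scalar is repeated num_rows times, losing the compact representation.
fn flatten(block: &DataBlock) -> Vec<Vec<i64>> {
    block
        .columns
        .iter()
        .map(|v| match v {
            Value::Scalar(s) => vec![*s; block.num_rows], // materialization cost paid here
            Value::Column(c) => c.clone(),
        })
        .collect()
}

fn main() {
    // A wide block: 1 real column plus 999 default-valued scalar columns.
    let mut columns = vec![Value::Column((0..1_000_000).collect())];
    columns.extend(std::iter::repeat(Value::Scalar(0)).take(999));
    let block = DataBlock { num_rows: 1_000_000, columns };

    // Before flattening, the 999 scalars cost O(1) each;
    // afterwards, each occupies 1M slots.
    let flat = flatten(&block);
    assert_eq!(flat[1].len(), 1_000_000);
}
```

Recording which columns were scalar in the block meta (or the Parquet metadata) would let both the writer and the reader skip this expansion.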
Impact
- The flattening during conversion to an Arrow RecordBatch introduces performance overhead, causing slower load times.
- Reading the unused columns back as Value::Column instead of Value::Scalar wastes memory and slows down subsequent computation.
> For a row in a large, wide table, many (even most) columns may be null or set to their default values. Such a table might be loaded with a SQL command like COPY INTO wide_table(c1, c100) FROM ...
This looks like the 'alter table t add column c int' or 'alter table t add column c int default 1' case; maybe we don't need to "materialize" those columns at all?
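One possible shape for that idea, in the same simplified model as the sketch above (BlockMeta, write_block, and read_block are hypothetical names, not Databend's actual API): record the constant value in the block's metadata instead of writing num_rows copies, and rebuild Value::Scalar directly on read:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Scalar(i64),
    Column(Vec<i64>),
}

/// Hypothetical per-block metadata: column index -> constant value.
/// Columns listed here are not written to the data file at all.
#[derive(Default)]
struct BlockMeta {
    scalar_columns: HashMap<usize, i64>,
}

/// Writing: divert scalar columns into metadata; only real columns hit storage.
fn write_block(columns: &[Value]) -> (BlockMeta, Vec<Vec<i64>>) {
    let mut meta = BlockMeta::default();
    let mut stored = Vec::new();
    for (idx, v) in columns.iter().enumerate() {
        match v {
            Value::Scalar(s) => { meta.scalar_columns.insert(idx, *s); }
            Value::Column(c) => stored.push(c.clone()),
        }
    }
    (meta, stored)
}

/// Reading: rebuild Value::Scalar from metadata; the constant is never materialized.
fn read_block(meta: &BlockMeta, stored: &[Vec<i64>], num_cols: usize) -> Vec<Value> {
    let mut stored_iter = stored.iter();
    (0..num_cols)
        .map(|idx| match meta.scalar_columns.get(&idx) {
            Some(s) => Value::Scalar(*s),
            None => Value::Column(stored_iter.next().expect("stored column").clone()),
        })
        .collect()
}

fn main() {
    let cols = vec![Value::Column(vec![1, 2, 3]), Value::Scalar(0), Value::Scalar(7)];
    let (meta, stored) = write_block(&cols);
    assert_eq!(stored.len(), 1);                      // only one column written
    assert_eq!(read_block(&meta, &stored, 3), cols);  // scalars round-trip compactly
}
```

The trade-off between the two options in the title: if the constant is recorded only in the fuse meta file, external Parquet tools cannot reconstruct the full schema on their own; recording it in the Parquet metadata keeps the files self-describing but ties the format to Parquet.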