Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray() #44252

Merged
merged 7 commits into from
Nov 8, 2024

Conversation

kou
Copy link
Member

@kou kou commented Sep 30, 2024

Rationale for this change

Statistics schema for Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union). So reusable implementation to make statistics array is useful.

What changes are included in this PR?

arrow::RecordBatch::MakeStatisticsArray() is a convenient function that converts arrow::ArrayStatistics in a arrow::RecordBatch to arrow::Array for the Arrow C data interface.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

Copy link

⚠️ GitHub issue #44010 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/compare.cc Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 903e3f4 to 92afc83 Compare September 30, 2024 08:17
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 92afc83 to b194430 Compare September 30, 2024 09:03
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 30, 2024
@kou
Copy link
Member Author

kou commented Oct 2, 2024

@pitrou @ianmcook What do you think about this?

Statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact but it may be complex to build. Because it uses many nested types.

@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 5a00c48 to 12b1a97 Compare October 30, 2024 06:32
const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may forgot a bit but we don't distinct "bytes" and "utf8" in stats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we didn't discuss it...
Let's discuss it in #44579.

We can assume "utf8" here for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a // TODO(GH-44579) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I should have added it...
I've added it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to push the commit... I pushed now.

@@ -80,6 +80,24 @@ struct ArrowArray {
void* private_data;
};

# define ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT "ARROW:average_byte_width:exact"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know constexpr std::string_view is better or this is better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use constexpr because this header may be used by C programs.

statistics.nth_statistics = 0;
statistics.start_new_column = true;
statistics.nth_column = std::nullopt;
statistics.key = ARROW_STATISTICS_KEY_ROW_COUNT_EXACT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So RowCount is also handled as a stats 🤔?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Statistics array will be passed to consumer before consumer receives a record batch.
So this may be useful for consumer.

But DuckDB doesn't have row count in its BaseStatistics...: https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
This may not be useful...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this for now to demonstrate table/record batch level statistics.

cpp/src/arrow/record_batch.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 8e4d618 to 9c529d1 Compare October 31, 2024 23:48
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 31, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a careful pass tonight

const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a // TODO(GH-44579) here?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM but I'm not an expert on C ABI and data layer


// Statistics schema doesn't define static dense union type for
// values. Each statistics schema have a dense union type that has
// needled value types. The following block collects these types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually this is logically a "set" prepared for items?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

If there are the same types, the first type is only used.

};
using OnStatistics =
std::function<Status(const EnumeratedStatistics& enumerated_statistics)>;
Status EnumerateStatistics(const RecordBatch& record_batch, OnStatistics on_statistics) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually this is for a two-phase building, one pass for types, and one-pass for data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

I think that it's one of complexities.
So I sent https://lists.apache.org/thread/0c9jftkspvj7yw1lpo73s3vtp6vfjqv8 to the mailing list. But nobody agreed it. So this complexity will be acceptable...

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 4, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 9c529d1 to ab80bf9 Compare November 6, 2024 22:00
It's a convenient function that converts `arrow::ArrayStatistics` in a
`arrow::RecordBatch` to `arrow::Array` for the Arrow C data interface.
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from ab80bf9 to e93d0f4 Compare November 6, 2024 22:00
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 6, 2024
Copy link
Member Author

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this in a few days is nobody objects this.

const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to push the commit... I pushed now.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 6, 2024
@kou kou merged commit d748ace into apache:main Nov 8, 2024
41 checks passed
@kou kou deleted the cpp-record-batch-make-statistics-array branch November 8, 2024 20:37
@kou kou removed the awaiting changes Awaiting changes label Nov 8, 2024
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit d748ace.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants