Add Aggregation fuzzer framework #12667

Merged
merged 41 commits into from
Oct 9, 2024
Changes from 29 commits
6514cd2
impl primitive arrays generator.
Rachelint Sep 28, 2024
1a11133
sort out the test record batch generating codes.
Rachelint Sep 28, 2024
e0ea349
draft for `DataSetsGenerator`.
Rachelint Sep 28, 2024
c952bdf
tmp
Rachelint Sep 29, 2024
214d67f
improve the data generator, and start to impl the session context gen…
Rachelint Oct 1, 2024
04b4246
impl context generator.
Rachelint Oct 2, 2024
6b2af7f
tmp
Rachelint Oct 2, 2024
77d2268
define the `AggregationFuzzer`.
Rachelint Oct 3, 2024
4bef192
add ut for data generator.
Rachelint Oct 3, 2024
e7fbf47
improve comments for `SessionContextGenerator`.
Rachelint Oct 3, 2024
984f6aa
define `GeneratedSessionContextBuilder` to reduce repeated codes.
Rachelint Oct 3, 2024
12e3f37
extract the check equality logic for reusing.
Rachelint Oct 3, 2024
ca4a40c
add ut for `SessionContextGenerator`.
Rachelint Oct 3, 2024
a4639de
tmp
Rachelint Oct 3, 2024
0cfd035
finish the main logic of `AggregationFuzzer`.
Rachelint Oct 3, 2024
8271079
try to rewrite some test using the fuzzer.
Rachelint Oct 3, 2024
d6e358e
fix header.
Rachelint Oct 3, 2024
2279ab7
expose table name through `AggregationFuzzerBuilder`.
Rachelint Oct 3, 2024
7deced4
throw err to aggr fuzzer, and expect them then.
Rachelint Oct 3, 2024
c5d80ce
switch to Arc<str> to slightly improve performance.
Rachelint Oct 3, 2024
b50ea49
throw more errors to fuzzer.
Rachelint Oct 3, 2024
7a9118f
print task information before panic.
Rachelint Oct 3, 2024
ea6ad89
improve comments.
Rachelint Oct 4, 2024
3d9bc15
support printing generated session context params in error reporting.
Rachelint Oct 4, 2024
bf7fc82
add todo.
Rachelint Oct 4, 2024
2e35985
add some new fuzz case based on `AggregationFuzzer`.
Rachelint Oct 4, 2024
0090e6c
fix lint.
Rachelint Oct 4, 2024
90cb038
print more information in error report.
Rachelint Oct 4, 2024
c2dcb60
fix clippy.
Rachelint Oct 4, 2024
d90b92b
improve comment of `SessionContextGenerator`.
Rachelint Oct 8, 2024
58c0777
just use fixed `data_gen_rounds` and `ctx_gen_rounds` currently, beca…
Rachelint Oct 8, 2024
d5ff6ec
improve comments for rounds constants.
Rachelint Oct 8, 2024
4b18d53
small improvements.
Rachelint Oct 8, 2024
fbf3a6e
select sql from some candidates rather than fixed one.
Rachelint Oct 8, 2024
79b0734
make `data_gen_rounds` able to set again, and add more tests.
Rachelint Oct 8, 2024
ca36a88
add no group cases.
Rachelint Oct 8, 2024
ea5e80b
add fuzz test for basic string aggr.
Rachelint Oct 8, 2024
7f08f2b
make `data_gen_rounds` smaller.
Rachelint Oct 8, 2024
9b0005b
add comments.
Rachelint Oct 8, 2024
8040dc3
fix typo.
Rachelint Oct 8, 2024
5c90a6b
fix comment.
Rachelint Oct 8, 2024
113 changes: 113 additions & 0 deletions datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs
@@ -44,6 +44,118 @@ use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tokio::task::JoinSet;

use crate::fuzz_cases::aggregation_fuzzer::{
    AggregationFuzzerBuilder, ColumnDescr, DatasetGeneratorConfig,
};

// ========================================================================
// The new aggregation fuzz tests based on [`AggregationFuzzer`]
// ========================================================================

// TODO: write more test cases to cover more `group by`s and `aggregation function`s
// TODO: maybe we can use a macro to simplify case creation

/// Fuzz test for group by `single int64`
#[tokio::test(flavor = "multi_thread")]
async fn test_group_by_single_int64() {
    let builder = AggregationFuzzerBuilder::default();

    // Define data generator config
    let columns = vec![
        ColumnDescr::new("a", DataType::Int64),
        ColumnDescr::new("b", DataType::Int64),
        ColumnDescr::new("c", DataType::Int64),
    ];
    let sort_keys_set = vec![
        vec!["b".to_string()],
        vec!["c".to_string(), "b".to_string()],
    ];
    let data_gen_config = DatasetGeneratorConfig {
        columns,
        rows_num_range: (512, 1024),
        sort_keys_set,
    };

    // Build fuzzer
    let fuzzer = builder
        .data_gen_config(data_gen_config)
        .data_gen_rounds(20)
        .sql("SELECT b, sum(a) FROM fuzz_table GROUP BY b")
        .table_name("fuzz_table")
        .build();

    fuzzer.run().await;
}

/// Fuzz test for group by `single string`
#[tokio::test(flavor = "multi_thread")]
async fn test_group_by_single_string() {
    let builder = AggregationFuzzerBuilder::default();

    // Define data generator config
    let columns = vec![
        ColumnDescr::new("a", DataType::Int64),
        ColumnDescr::new("b", DataType::Utf8),
        ColumnDescr::new("c", DataType::Int64),
    ];
    let sort_keys_set = vec![
        vec!["b".to_string()],
        vec!["c".to_string(), "b".to_string()],
    ];
    let data_gen_config = DatasetGeneratorConfig {
        columns,
        rows_num_range: (512, 1024),
        sort_keys_set,
    };

    // Build fuzzer
    let fuzzer = builder
        .data_gen_config(data_gen_config)
        .data_gen_rounds(20)
        .sql("SELECT b, sum(a) FROM fuzz_table GROUP BY b")
        .table_name("fuzz_table")
        .build();

    fuzzer.run().await;
}

/// Fuzz test for group by `string + int64`
#[tokio::test(flavor = "multi_thread")]
Contributor

I wonder if it would be easier to see what was happening if we made a few distinct, explicit tests (rather than a single multi-threaded one), though I see you are just following the existing pattern.

Contributor Author

Rachelint, Oct 8, 2024

Sorry, I am not quite clear on what distinct, explicit tests would look like. Could you give some examples? I would be pleased to try it.

Contributor

I meant like

#[tokio::test]
async fn test_basic_string_aggr_group_by_mixed_string_int64_1() {
...
}

#[tokio::test]
async fn test_basic_string_aggr_group_by_mixed_string_int64_2() {
...
}

Rather than a single test that was multi-threaded

Contributor Author

Got it. I also find the error messages messy with the current pattern.
🤔 But the datasets and session contexts in these cases are randomly generated, so it seems hard to split the cases?

We can certainly think more about how to make the tests easier to follow.
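
One possible way to reconcile random generation with separately named tests is to derive each case from an explicit seed, so that a failing case can be re-run on its own. The following is only a minimal sketch of that idea, not what this PR does: `XorShift64` is a hand-rolled, std-only stand-in for a seeded `StdRng`, `generate_column` is a hypothetical helper, and treating `rows_num_range` as a half-open range is an assumption.

```rust
/// Tiny xorshift64 generator: a std-only stand-in for a seeded `StdRng`.
/// (Assumption for illustration; the PR itself uses the `rand` crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns a value in `[low, high)`; assumes `low < high`.
    fn gen_range(&mut self, low: u64, high: u64) -> u64 {
        low + self.next_u64() % (high - low)
    }
}

/// Hypothetical helper: builds one `i64` column whose length falls in
/// `rows_num_range` (treated here as half-open, which is an assumption).
fn generate_column(seed: u64, rows_num_range: (u64, u64)) -> Vec<i64> {
    let mut rng = XorShift64::new(seed);
    let rows = rng.gen_range(rows_num_range.0, rows_num_range.1);
    (0..rows).map(|_| (rng.next_u64() % 100) as i64).collect()
}

fn main() {
    // Each seed could back one separately named test case.
    for seed in [1_u64, 2, 3] {
        let first = generate_column(seed, (512, 1024));
        let second = generate_column(seed, (512, 1024));
        // The same seed always reproduces the same data.
        assert_eq!(first, second);
        println!("seed {}: {} rows", seed, first.len());
    }
}
```

With this shape, an error report would only need to print the seed for the failing dataset to be regenerated in isolation.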

async fn test_group_by_mixed_string_int64() {
    let builder = AggregationFuzzerBuilder::default();

    // Define data generator config
    let columns = vec![
        ColumnDescr::new("a", DataType::Int64),
        ColumnDescr::new("b", DataType::Utf8),
        ColumnDescr::new("c", DataType::Int64),
        ColumnDescr::new("d", DataType::Int32),
    ];
    let sort_keys_set = vec![
        vec!["b".to_string(), "c".to_string()],
        vec!["d".to_string(), "b".to_string(), "c".to_string()],
    ];
    let data_gen_config = DatasetGeneratorConfig {
        columns,
        rows_num_range: (512, 1024),
        sort_keys_set,
    };

    // Build fuzzer
    let fuzzer = builder
        .data_gen_config(data_gen_config)
        .data_gen_rounds(20)
        .sql("SELECT b, sum(a) FROM fuzz_table GROUP BY b,c")
        .table_name("fuzz_table")
        .build();

    fuzzer.run().await;
}
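
The three tests above differ only in their columns, sort keys, and SQL, which is what the TODO about using a macro points at. Below is a rough sketch of how a declarative macro might capture that repetition. It is not part of this PR: `DataType`, `ColumnDescr`, and `FuzzCase` are simplified, hypothetical stand-ins so the example compiles with the standard library alone, and the `fuzz_case!` macro name is invented for illustration. A real version would presumably expand to a `#[tokio::test]` function that feeds these values into `AggregationFuzzerBuilder`.

```rust
// Hypothetical stand-ins for the real types, so the sketch is std-only.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnDescr {
    name: String,
    column_type: DataType,
}

impl ColumnDescr {
    fn new(name: &str, column_type: DataType) -> Self {
        Self { name: name.to_string(), column_type }
    }
}

/// Plain description of one fuzz case (hypothetical; illustration only).
#[derive(Debug)]
struct FuzzCase {
    columns: Vec<ColumnDescr>,
    sort_keys_set: Vec<Vec<String>>,
    sql: String,
}

/// Declares a function that returns the described case.
macro_rules! fuzz_case {
    (
        $name:ident,
        columns: [$(($col:expr, $ty:expr)),+ $(,)?],
        sort_keys: [$([$($key:expr),+ $(,)?]),+ $(,)?],
        sql: $sql:expr $(,)?
    ) => {
        fn $name() -> FuzzCase {
            FuzzCase {
                columns: vec![$(ColumnDescr::new($col, $ty)),+],
                sort_keys_set: vec![$(vec![$($key.to_string()),+]),+],
                sql: $sql.to_string(),
            }
        }
    };
}

fuzz_case!(
    group_by_single_int64,
    columns: [("a", DataType::Int64), ("b", DataType::Int64), ("c", DataType::Int64)],
    sort_keys: [["b"], ["c", "b"]],
    sql: "SELECT b, sum(a) FROM fuzz_table GROUP BY b",
);

fn main() {
    let case = group_by_single_int64();
    assert_eq!(case.columns.len(), 3);
    assert_eq!(case.sort_keys_set.len(), 2);
    println!("{:?}", case);
}
```

Keeping the case description as data also leaves room to generate several named tests from one list, which may help with the readability concern raised in the review thread.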

// ========================================================================
// The old aggregation fuzz tests
// ========================================================================
/// Tracks if this stream is generating input or output
/// Tests that streaming aggregate and batch (non streaming) aggregate produce
/// same results
#[tokio::test(flavor = "multi_thread")]
@@ -311,6 +423,7 @@ async fn group_by_string_test(
    let actual = extract_result_counts(results);
    assert_eq!(expected, actual);
}

async fn verify_ordered_aggregate(frame: &DataFrame, expected_sort: bool) {
    struct Visitor {
        expected_sort: bool,