test(perf): add mv write test that increases latency of regular reads #8724
Conversation
performance_regression_test.py (Outdated)
node1.run_cqlsh("CREATE TABLE IF NOT EXISTS scylla_bench.test (pk bigint,ck bigint,v blob,PRIMARY KEY(pk, ck)) WITH compression = { }")
node1.run_cqlsh("CREATE MATERIALIZED VIEW IF NOT EXISTS scylla_bench.view_test AS SELECT * FROM scylla_bench.test where v IS NOT NULL AND ck IS NOT NULL AND pk IS NOT NULL PRIMARY KEY (v, pk, ck)")
start_time = time.time()
while True:
We have sdcm.utils.nemesis_utils.indexes.wait_for_view_to_be_built for waiting for an index or view to be built.
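If that helper cannot be reused here, the manual polling loop in the diff boils down to something like this sketch. The `system_distributed.view_build_status` table is the standard Scylla/Cassandra place to check build progress, but the function names and signatures below are made up for illustration, not the SCT API:

```python
import time


def view_build_done(statuses):
    """True when every replica reports SUCCESS for the view build."""
    return bool(statuses) and all(s == "SUCCESS" for s in statuses)


def wait_for_view_built(fetch_statuses, timeout=3600, poll_interval=10):
    """Poll until all replicas finished building the view, or raise on timeout.

    fetch_statuses is a callable returning per-host status strings, e.g. from:
      SELECT status FROM system_distributed.view_build_status
      WHERE keyspace_name='scylla_bench' AND view_name='view_test'
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if view_build_done(fetch_statuses()):
            return True
        time.sleep(poll_interval)
    raise TimeoutError("materialized view build did not finish in time")
```

The point of splitting out `view_build_done` is that the completion predicate can be unit-tested without a cluster.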
n_loaders: 4
n_monitor_nodes: 1

instance_type_loader: 'c5.2xlarge'
Let's use something newer to avoid capacity problems, e.g. c6i.
I just used perf-regression-latency-250gb-with-nemesis.yaml as a base. I can change it and test on the suggested instances.
instance_type_loader: 'c5.2xlarge'
instance_type_monitor: 't3.large'
instance_type_db: 'i3.2xlarge'
Still testing on i3?
I just used perf-regression-latency-250gb-with-nemesis.yaml as a base. I can change it and test on the suggested instances.
@@ -0,0 +1,31 @@
test_duration: 680
prepare_write_cmd: ["cassandra-stress write no-warmup cl=ALL n=100000 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=100 -col 'size=FIXED(128) n=FIXED(8)' -pop seq=1..2000000",
"scylla-bench -workload=sequential -mode=write -replication-factor=2 -partition-count=10 -partition-offset=0 -clustering-row-count=1000000 -clustering-row-size=uniform:100..5120 -concurrency=200 -rows-per-request=10 -timeout=30s -connection-count 200 -consistency-level=all",
Looks like a 'large partition' workload; is this the 'special' one?
It just prepares a separate table with 'large partitions'.
The 'special workload' is a combination of a table with large partitions + an MV + small writes to the table that cause a lot of MV updates.
stress_cmd_no_mv and stress_cmd_mv are the same, but the difference in latency while executing this command is huge: https://github.com/scylladb/qa-tasks/issues/1706#issuecomment-2351624595
performance_regression_test.py (Outdated)
def test_read_mv_latency(self):
    # the next line is a workaround to make this work inside `latency_calculator_decorator`
    self.cluster = self.db_cluster  # pylint: disable=attribute-defined-outside-init
I'm not sure it's needed anymore; this was fixed recently.
This hack still exists in PerformanceRegressionUpgradeTest.run_workload_and_upgrade, where latency_calculator_decorator is used, but I will double-check.
stress_cmd_mv: "scylla-bench -workload=uniform -mode=write -replication-factor=2 -partition-count=30 -clustering-row-count=1000000 -clustering-row-size=uniform:100..5120 -concurrency=200 -max-rate=4000 -rows-per-request=1 -timeout=30s -connection-count 200 -consistency-level=one -iterations=0 -duration=15m"

n_db_nodes: 3
n_loaders: 4
Do we need that number of loaders? Did you check how they are utilized?
You are right, we probably don't need 4 (I did not check, I just copied it from the latency-125gb test). Is it important? The test is just a reproducer of this problem and should not be used in regular runs.
I managed to reproduce the problem on 2 nodes.
"scylla-bench -workload=sequential -mode=write -replication-factor=2 -partition-count=10 -partition-offset=10 -clustering-row-count=1000000 -clustering-row-size=uniform:100..5120 -concurrency=200 -rows-per-request=10 -timeout=30s -connection-count 200 -consistency-level=all",
"scylla-bench -workload=sequential -mode=write -replication-factor=2 -partition-count=10 -partition-offset=20 -clustering-row-count=1000000 -clustering-row-size=uniform:100..5120 -concurrency=200 -rows-per-request=10 -timeout=30s -connection-count 200 -consistency-level=all"]

stress_cmd_r: "cassandra-stress read cl=ALL duration=600m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=10 throttle=100/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..100000,50000,50000)' "
The latency decorator verifies the HDR reports from this command; does 100/s throttling make any sense?
The cluster is not loaded much.
The cluster is actually overloaded during the test (but not by this command). The latency for this command during the do_rewrite_workload_with_mv step is 224412.04, or it even failed under the I/O load.
@@ -891,3 +892,51 @@ def test_latency_write_with_upgrade(self):
    def test_latency_mixed_with_upgrade(self):
        self._prepare_latency_with_upgrade()
        self.run_workload_and_upgrade(stress_cmd=self.params.get('stress_cmd_m'))


class PerformanceRegressionMaterializedViewLatencyTest(PerformanceRegressionTest):
It would be good to add a docstring describing the purpose of this test, e.g. measuring the impact of creating an MV on read latency, and describing the specifics of the "special" workload.
Added a docstring.
test_name: "performance_regression_test.PerformanceRegressionMaterializedViewLatencyTest",
test_config: """["test-cases/performance/perf-regression-latency-mv-read-concurrency.yaml"]""",
sub_tests: ["test_read_mv_latency"],
email_recipients: 'wojciech.mitros@scylladb.com,artsiom.mishuta@scylladb.com,piodul@scylladb.com'
Only those 3 names? Better to use a group.
The test is just a reproducer of this problem and should not be used in regular runs.
sdcm/results_analyze/__init__.py (Outdated)
dataset_size_match = re.search(r'(\d{3}gb)', config_files)
if dataset_size_match is None:
    dataset_size = 'unknown size'
else:
    dataset_size = dataset_size_match.group()
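For illustration, the extraction above behaves like this minimal standalone sketch (the wrapper function and the example file names are made up; only the regex and fallback come from the diff):

```python
import re


def extract_dataset_size(config_files: str) -> str:
    """Pull a three-digit '<N>gb' dataset size out of the config file list."""
    match = re.search(r'(\d{3}gb)', config_files)
    return match.group() if match else 'unknown size'


# A config name carrying a size yields the size; one without it falls back.
print(extract_dataset_size("test-cases/performance/perf-regression-latency-250gb-with-nemesis.yaml"))  # 250gb
print(extract_dataset_size("test-cases/performance/perf-regression-latency-mv-read.yaml"))  # unknown size
```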
This changed recently; you need to rebase.
rebased
sdcm/results_analyze/__init__.py (Outdated)
@@ -481,8 +481,11 @@ def check_regression(self, test_id, data, is_gce=False, node_benchmarks=None, em
kernel_callstack_events_summary = {Severity.DEBUG.name: len(kernel_callstack_events)}

config_files = ' '.join(doc["_source"]["setup_details"]["config_files"])
search_size = re.search(r'(\d.*(?#t|g)b)', config_files)
dataset_size = search_size.group() if search_size else 'unknown size'
dataset_size_match = re.search(r'(\d.*(?#t|g)b)', config_files)
This change is now redundant.
You are right, removed.
The idea is to test the hardest case: modifying a column that is a regular column in the base table but one of the primary key columns in the materialized view.
LGTM
LGTM, I have several comments.
node1.run_cqlsh(
    "CREATE TABLE IF NOT EXISTS scylla_bench.test (pk bigint,ck bigint,v blob,PRIMARY KEY(pk, ck)) WITH compression = { }")
Do you need this command? This table should already have been created during prepare.
self.run_fstrim_on_all_db_nodes()

self.create_test_stats(sub_type="read", append_sub_test_to_name=False, test_index="mv-overloading-latency-read")
self.run_stress_thread(stress_cmd=self.params.get('stress_cmd_r'), stress_num=1,
You don't need the result of this command, do you? It is just a background workload?
Test steps:
1 - 3-node cluster with 2 tables
2 - run the special prepare cmd for table 1, and use table 2 for the latency perf test (prepare_write_cmd)
3 - start a read workload for table 2 and measure latency for table 2 (10 min) (stress_cmd_r)
4 - run the special rewrite workload for table 1 and measure latency for table 2 while the changes to table 1 are applied (stress_cmd_no_mv)
5 - create the MV and wait for it to sync, measuring latency for table 2 while the MV is syncing
6 - run the special rewrite workload for table 1 again and measure latency for table 2 while the changes are applied (stress_cmd_mv)
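The measurement phases described above can be sketched as a dry-run orchestration; the phase names below come from the config keys in this PR, but the `run_step` helper is a stand-in for illustration, not the actual SCT API:

```python
# Record the order of phases instead of launching real stress commands,
# so the flow of the test can be shown (and checked) without a cluster.
executed = []


def run_step(name):
    # In the real test each phase starts a stress command or waits for the MV;
    # here we only record the sequence.
    executed.append(name)


for phase in ("prepare_write_cmd",   # populate both tables
              "stress_cmd_r",        # background read workload on table 2
              "stress_cmd_no_mv",    # rewrite table 1 before the MV exists
              "create_mv_and_wait",  # create the MV and wait for it to sync
              "stress_cmd_mv"):      # rewrite table 1 with the MV in place
    run_step(phase)

print(executed)
```

Throughout, latency is measured only on table 2's read workload; the rewrites of table 1 are what generate the MV update pressure.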
fixes: https://github.com/scylladb/qa-tasks/issues/1706