
CI Failure (Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000) in ClusterConfigAliasTest.test_aliasing_with_upgrade #13787

Closed
bharathv opened this issue Sep 29, 2023 · 11 comments · Fixed by #14581 · May be fixed by #13820
Assignees: andijcr
Labels: area/storage, ci-failure, sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

@bharathv (Contributor)

https://buildkite.com/redpanda/vtools/builds/9713

Module: rptest.tests.cluster_config_test
Class: ClusterConfigAliasTest
Method: test_aliasing_with_upgrade
Arguments: {
    "prop_set": [
        "log_retention_ms",
        "delete_retention_ms",
        [   
            23, 
            3   
        ],  
        [   
            1000000,
            300000,
            500000
        ]   
    ],  
    "wipe_cache": true
}
test_id:    ClusterConfigAliasTest.test_aliasing_with_upgrade
status:     FAIL
run time:   10.059 seconds

AssertionError('Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000 (!=1000000)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 1471, in test_aliasing_with_upgrade
    self._check_value_everywhere(prop_set.aliased_name,
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 174, in _check_value_everywhere
    assert actual_value == expect_value, f"Wrong value on node {node.account.hostname}: {key}={actual_value} (!={expect_value})"
AssertionError: Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000 (!=1000000)
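
For context, a minimal sketch of the kind of per-node check this assertion comes from: poll each node's admin API and compare the reported value. The names here are illustrative, not the actual rptest helpers; the /v1/cluster_config endpoint and port 9644 are taken from the logs further down.

import requests

def check_value_everywhere(hostnames, key, expect_value, admin_port=9644):
    # Query each node's admin API and compare the reported config value.
    for host in hostnames:
        resp = requests.get(f"http://{host}:{admin_port}/v1/cluster_config")
        resp.raise_for_status()
        actual_value = resp.json().get(key)
        assert actual_value == expect_value, \
            f"Wrong value on node {host}: {key}={actual_value} (!={expect_value})"
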
@andijcr andijcr self-assigned this Sep 29, 2023
@andijcr andijcr added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Sep 29, 2023
@andijcr (Contributor) commented Oct 6, 2023

Reviewing the failure, it might be a manifestation of the issue fixed in PR #13964, because the test blocks at the first value: it set the first value but wasn't able to find it when reading back the cluster config.

@piyushredpanda (Contributor)

Let's close this then; we can always reopen if the issue reappears.

@rockwotj (Contributor)

@andijcr (Contributor) commented Oct 31, 2023

@andijcr (Contributor) commented Oct 31, 2023

Of note: cluster_config.yaml in the CI run has this line

log_retention_ms: 500000

which is the correct value for the last step of the test. Depending on when this file is generated, this might be a timing issue.

@andijcr (Contributor) commented Oct 31, 2023

So cluster_config.yaml is read in the RedpandaService.stop method, via rpk.cluster_config_export(cluster_config_filename, True). Either we are seeing an incorrect transient state, or the nodes genuinely hold different state.
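
For reference, a rough sketch of that export step done directly with the rpk CLI (the flag names are from memory of rpk cluster config export and should be double-checked; the boolean in the rptest wrapper presumably corresponds to including defaulted properties):

import subprocess
import yaml

def export_cluster_config(path="/tmp/cluster_config.yaml"):
    # Export the full cluster config, including properties still at their defaults,
    # then parse the resulting YAML.
    subprocess.run(["rpk", "cluster", "config", "export", "--filename", path, "--all"],
                   check=True)
    with open(path) as f:
        return yaml.safe_load(f)

cfg = export_cluster_config()
print(cfg.get("log_retention_ms"))  # 500000 in the CI run above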

@andijcr (Contributor) commented Oct 31, 2023

Just noticed that the last failure is for test_aliasing, while the original failure and the recent vtools failures are for test_aliasing_with_upgrade.

The area is adjacent but the underlying problem seems different; copying the last messages over to #13507.

@andijcr (Contributor) commented Oct 31, 2023

Back to the recent failures:

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigAliasTest
Method: test_aliasing_with_upgrade
Arguments:
{
  "prop_set": [
    "log_retention_ms",
    "delete_retention_ms",
    [
      23,
      3
    ],
    [
      1000000,
      300000,
      500000
    ]
  ],
  "wipe_cache": false
}

        Validate that a property written under an alias in a previous release
        is read correctly after upgrade.

        :param wipe_cache: if true, erase cluster config cache to ensure that the
                           upgraded node is reading from the controller log rather
                           than just cache.
AssertionError('Wrong value on node ip-172-31-6-224: delete_retention_ms=604800000 (!=1000000)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 1464, in test_aliasing_with_upgrade
    self._check_value_everywhere(prop_set.aliased_name,
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 174, in _check_value_everywhere
    assert actual_value == expect_value, f"Wrong value on node {node.account.hostname}: {key}={actual_value} (!={expect_value})"
AssertionError: Wrong value on node ip-172-31-6-224: delete_retention_ms=604800000 (!=1000000)

Also in this case, cluster_config.yaml reports the correct value:

# delete segments older than this - default 1 week (may be nil)
delete_retention_ms: 1000000

But the property is not supposed to be called delete_retention_ms anymore in v23.3.

Reading the log from https://buildkite.com/redpanda/vtools/builds/9995, the version is

INFO  2023-10-15 15:58:27,833 [shard 0] main - application.cc:349 - Redpanda v23.2.12 - 4ff29d7deb7ef445c5d292f31f45bdf4975a24f5

So, are all the failures from backports?

Edit: it's an upgrade test, so of course I'm seeing v23.2.12. The original point still stands: it might just need the backport to be released.
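
To make the quoted scenario concrete, a pseudocode sketch of the flow the docstring describes (helper names are hypothetical, not the actual rptest code):

def aliasing_upgrade_flow(cluster, wipe_cache):
    # 1. On the old version, write the property under its legacy alias.
    cluster.set_cluster_config({"delete_retention_ms": 1000000})

    # 2. Upgrade every node; optionally wipe the local config cache so the upgraded
    #    node has to replay the controller log instead of reading the cached file.
    for node in cluster.nodes:
        cluster.stop_node(node)
        if wipe_cache:
            cluster.wipe_config_cache(node)  # hypothetical helper
        cluster.start_node(node, version="new")

    # 3. The value must now be visible under the new canonical name on every node.
    for node in cluster.nodes:
        assert cluster.get_config(node)["log_retention_ms"] == 1000000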

@andijcr (Contributor) commented Oct 31, 2023

The original fix is not backported yet: #13992

@andijcr (Contributor) commented Nov 2, 2023

As with #13507, the test observes a transition and fails: GET cluster_config is issued before the node has a chance to apply the config delta.

PR #14581 relaxes the check.

DEBUG 2023-10-15 15:58:30,658 [shard 1] admin_api_server - admin_server.cc:581 - [_anonymous] GET http://ip-172-31-6-224:9644/v1/cluster_config
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:213 - configuration_manager_take_bytes 4486 += 107 (current 2684350074)
TRACE 2023-10-15 15:58:30,658 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2849 - Follower commit index updated 23
TRACE 2023-10-15 15:58:30,658 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1810 - Received append entries request: {raft_group:{0}, commit_index:{24}, term:{1}, prev_log_index:{24}, prev_log_term:{1}, last_visible_index:{24}}
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:226 - stm_take_bytes 4593 += 107 (current 2684349967)
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:213 - configuration_manager_take_bytes 4593 += 107 (current 2684349967)
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - readers_cache.cc:101 - {redpanda/controller/0} - trying to get reader for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:0, aborted:0}
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - readers_cache.cc:134 - {redpanda/controller/0} - reader cache hit for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:0, aborted:0}
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_frontend.cc:140 - set_next_version: 4
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:878 - apply_delta: 1 upserts, 0 removes
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 3] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 2] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
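
One plausible shape for the relaxed check (an assumption about the approach, not the actual diff of #14581): wrap the per-node assertion in ducktape's wait_until so a transient stale read is retried instead of failing the test immediately.

from ducktape.utils.util import wait_until

def check_value_everywhere_relaxed(check_fn, timeout_sec=30):
    # check_fn() raises AssertionError while any node still reports the old value;
    # keep retrying until the config delta has been applied cluster-wide.
    def converged():
        try:
            check_fn()
            return True
        except AssertionError:
            return False

    wait_until(converged, timeout_sec=timeout_sec, backoff_sec=1,
               err_msg="cluster config value did not converge on all nodes")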
