
CI Failure (Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000) in ClusterConfigAliasTest.test_aliasing_with_upgrade #13787

Closed
bharathv opened this issue Sep 29, 2023 · 11 comments · Fixed by #14581 · May be fixed by #13820
Assignees: andijcr
Labels: area/storage, ci-failure, sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

@bharathv (Contributor)

https://buildkite.com/redpanda/vtools/builds/9713

Module: rptest.tests.cluster_config_test
Class: ClusterConfigAliasTest
Method: test_aliasing_with_upgrade
Arguments: {
    "prop_set": [
        "log_retention_ms",
        "delete_retention_ms",
        [   
            23, 
            3   
        ],  
        [   
            1000000,
            300000,
            500000
        ]   
    ],  
    "wipe_cache": true
}
test_id:    ClusterConfigAliasTest.test_aliasing_with_upgrade
status:     FAIL
run time:   10.059 seconds

AssertionError('Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000 (!=1000000)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 1471, in test_aliasing_with_upgrade
    self._check_value_everywhere(prop_set.aliased_name,
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 174, in _check_value_everywhere
    assert actual_value == expect_value, f"Wrong value on node {node.account.hostname}: {key}={actual_value} (!={expect_value})"
AssertionError: Wrong value on node ip-172-31-6-42: delete_retention_ms=604800000 (!=1000000)
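
For context, a minimal sketch of the kind of per-node check this assertion comes from: poll each node's admin API and compare the reported value. The names here are illustrative, not the actual rptest helpers; the /v1/cluster_config endpoint and port 9644 are taken from the logs further down.

import requests

def check_value_everywhere(hostnames, key, expect_value, admin_port=9644):
    # Query each node's admin API and compare the reported config value.
    for host in hostnames:
        resp = requests.get(f"http://{host}:{admin_port}/v1/cluster_config")
        resp.raise_for_status()
        actual_value = resp.json().get(key)
        assert actual_value == expect_value, \
            f"Wrong value on node {host}: {key}={actual_value} (!={expect_value})"
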
@andijcr andijcr self-assigned this Sep 29, 2023
@andijcr andijcr added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Sep 29, 2023
@andijcr (Contributor) commented Oct 6, 2023

Reviewing the failure, it might be a manifestation of the issue fixed in PR #13964, because the test blocks at the first value: it set the first value but wasn't able to find it when reading back the cluster config.

@piyushredpanda (Contributor)

Let's close this then; we can always reopen if the issue reappears.

@rockwotj (Contributor)

@andijcr (Contributor) commented Oct 31, 2023

@andijcr (Contributor) commented Oct 31, 2023

Of note: cluster_config.yaml in the CI run has this line

log_retention_ms: 500000

which is the correct value for the last step of the test. Depending on when this file is generated, this might be a timing issue.

@andijcr (Contributor) commented Oct 31, 2023

So cluster_config.yaml is read in the RedpandaService.stop method, via rpk.cluster_config_export(cluster_config_filename, True). Either we are seeing an incorrect transient state, or the nodes genuinely hold different state.
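
For reference, a rough sketch of that export step done directly with the rpk CLI (the flag names are from memory of rpk cluster config export and should be double-checked; the boolean in the rptest wrapper presumably corresponds to including defaulted properties):

import subprocess
import yaml

def export_cluster_config(path="/tmp/cluster_config.yaml"):
    # Export the full cluster config, including properties still at their defaults,
    # then parse the resulting YAML.
    subprocess.run(["rpk", "cluster", "config", "export", "--filename", path, "--all"],
                   check=True)
    with open(path) as f:
        return yaml.safe_load(f)

cfg = export_cluster_config()
print(cfg.get("log_retention_ms"))  # 500000 in the CI run above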

@andijcr (Contributor) commented Oct 31, 2023

Just noticed that the last failure is for test_aliasing, while the original failure and the recent vtools failures are for test_aliasing_with_upgrade.

The area is adjacent but the underlying problem seems different; copying the last messages over to #13507.

@andijcr (Contributor) commented Oct 31, 2023

Back to the recent failures:

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigAliasTest
Method: test_aliasing_with_upgrade
Arguments:
{
  "prop_set": [
    "log_retention_ms",
    "delete_retention_ms",
    [
      23,
      3
    ],
    [
      1000000,
      300000,
      500000
    ]
  ],
  "wipe_cache": false
}

        Validate that a property written under an alias in a previous release
        is read correctly after upgrade.

        :param wipe_cache: if true, erase cluster config cache to ensure that the
                           upgraded node is reading from the controller log rather
                           than just cache.
AssertionError('Wrong value on node ip-172-31-6-224: delete_retention_ms=604800000 (!=1000000)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 1464, in test_aliasing_with_upgrade
    self._check_value_everywhere(prop_set.aliased_name,
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_config_test.py", line 174, in _check_value_everywhere
    assert actual_value == expect_value, f"Wrong value on node {node.account.hostname}: {key}={actual_value} (!={expect_value})"
AssertionError: Wrong value on node ip-172-31-6-224: delete_retention_ms=604800000 (!=1000000)

Also in this case, cluster_config.yaml reports the correct value:

# delete segments older than this - default 1 week (may be nil)
delete_retention_ms: 1000000

But the property is not supposed to be called delete_retention_ms anymore in v23.3.

Reading the log from https://buildkite.com/redpanda/vtools/builds/9995, the version is

INFO  2023-10-15 15:58:27,833 [shard 0] main - application.cc:349 - Redpanda v23.2.12 - 4ff29d7deb7ef445c5d292f31f45bdf4975a24f5

So, are all the failures from backports?

Edit: it's an upgrade test, so of course I'm seeing v23.2.12. The original point still stands: it might just need the backport to be released.
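
To make the quoted scenario concrete, a pseudocode sketch of the flow the docstring describes (helper names are hypothetical, not the actual rptest code):

def aliasing_upgrade_flow(cluster, wipe_cache):
    # 1. On the old version, write the property under its legacy alias.
    cluster.set_cluster_config({"delete_retention_ms": 1000000})

    # 2. Upgrade every node; optionally wipe the local config cache so the upgraded
    #    node has to replay the controller log instead of reading the cached file.
    for node in cluster.nodes:
        cluster.stop_node(node)
        if wipe_cache:
            cluster.wipe_config_cache(node)  # hypothetical helper
        cluster.start_node(node, version="new")

    # 3. The value must now be visible under the new canonical name on every node.
    for node in cluster.nodes:
        assert cluster.get_config(node)["log_retention_ms"] == 1000000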

@andijcr (Contributor) commented Oct 31, 2023

The original fix is not backported yet: #13992

@andijcr (Contributor) commented Nov 2, 2023

As with #13507, the test observes a transition and fails: GET cluster_config is issued before the node has a chance to apply the config delta.

PR #14581 relaxes the check.

DEBUG 2023-10-15 15:58:30,658 [shard 1] admin_api_server - admin_server.cc:581 - [_anonymous] GET http://ip-172-31-6-224:9644/v1/cluster_config
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:213 - configuration_manager_take_bytes 4486 += 107 (current 2684350074)
TRACE 2023-10-15 15:58:30,658 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2849 - Follower commit index updated 23
TRACE 2023-10-15 15:58:30,658 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1810 - Received append entries request: {raft_group:{0}, commit_index:{24}, term:{1}, prev_log_index:{24}, prev_log_term:{1}, last_visible_index:{24}}
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:226 - stm_take_bytes 4593 += 107 (current 2684349967)
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - storage_resources.cc:213 - configuration_manager_take_bytes 4593 += 107 (current 2684349967)
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - readers_cache.cc:101 - {redpanda/controller/0} - trying to get reader for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:0, aborted:0}
TRACE 2023-10-15 15:58:30,658 [shard 0] storage - readers_cache.cc:134 - {redpanda/controller/0} - reader cache hit for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:0, aborted:0}
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_frontend.cc:140 - set_next_version: 4
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:878 - apply_delta: 1 upserts, 0 removes
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 0] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 3] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
TRACE 2023-10-15 15:58:30,658 [shard 2] cluster - config_manager.cc:647 - apply: upsert delete_retention_ms=1000000
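
One plausible shape for the relaxed check (an assumption about the approach, not the actual diff of #14581): wrap the per-node assertion in ducktape's wait_until so a transient stale read is retried instead of failing the test immediately.

from ducktape.utils.util import wait_until

def check_value_everywhere_relaxed(check_fn, timeout_sec=30):
    # check_fn() raises AssertionError while any node still reports the old value;
    # keep retrying until the config delta has been applied cluster-wide.
    def converged():
        try:
            check_fn()
            return True
        except AssertionError:
            return False

    wait_until(converged, timeout_sec=timeout_sec, backoff_sec=1,
               err_msg="cluster config value did not converge on all nodes")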
