Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (kafka-topics.sh --alter --topic --partitions 7 fails with "Partition count must be greater then current number of partitions") in ControllerUpgradeTest.test_updating_cluster_when_executing_operations #8637

Closed
dlex opened this issue Feb 6, 2023 · 11 comments · Fixed by #8705
Assignees
Labels
area/controller ci-failure ci-rca/infra CI Root Cause Analysis - Infrastructure Issue kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@dlex
Copy link
Contributor

dlex commented Feb 6, 2023

https://buildkite.com/redpanda/redpanda/builds/22535

Module: rptest.tests.controller_upgrade_test
Class:  ControllerUpgradeTest
Method: test_updating_cluster_when_executing_operations
test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   3 minutes 21.306 seconds


    CalledProcessError(1, ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-14:9092,docker-rp-13:9092,docker-rp-15:9092,docker-rp-11:9092,docker-rp-12:9092', '--alter', '--topic', 'fuzzy-operator-1365-odqcci', '--partitions', '7'])
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/controller_upgrade_test.py", line 113, in test_updating_cluster_when_executing_operations
    admin_fuzz.wait(num_executed_before_restart + 2, 240)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 608, in wait
    wait_until(check, timeout_sec=timeout, backoff_sec=2)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 596, in check
    raise self.error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 488, in thread_loop
    if self.execute_with_retries(op_type, op):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 527, in execute_with_retries
    raise error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 520, in execute_with_retries
    return op.execute(self.operation_ctx)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 233, in execute
    cli.create_topic_partitions(self.topic, self.total)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 88, in create_topic_partitions
    return self._run("kafka-topics.sh", args)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 401, in _run
    return self._execute(cmd)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 411, in _execute
    res = subprocess.check_output(cmd,
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-14:9092,docker-rp-13:9092,docker-rp-15:9092,docker-rp-11:9092,docker-rp-12:9092', '--alter', '--topic', 'fuzzy-operator-1365-odqcci', '--partitions', '7']' returned non-zero exit status 1.

The chronology of the error:

  • The topic fuzzy-operator-1365-odqcci was created with 1 partition, then upgraded to 4, then to 6, and from 6 is was upgraded to 11 before the issues started.
[INFO  - 2023-02-04 20:22:09,961 - admin_ops_fuzzer - execute - lineno:229]: Updating topic: fuzzy-operator-1365-odqcci partitions count. Current partition count: 6 new partition count: 11                                   
[DEBUG - 2023-02-04 20:22:09,962 - kafka_cli_tools - create_topic_partitions - lineno:83]: Adding 11 partitions to topic: fuzzy-operator-1365-odqcci                                                                           
[DEBUG - 2023-02-04 20:22:09,962 - kafka_cli_tools - _execute - lineno:409]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-12:9092,docker-rp-13:9092,docker-rp-14:9092,docker-rp-15:9092,docker-rp-11:9092', '--alter', '--topic', 'fuzzy-operator-1365-odqcci', '--partitions', '11']
[DEBUG - 2023-02-04 20:22:24,779 - rpk - _execute - lineno:814]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-01b7d8c66b156ce8e-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-11:9092,docker-rp-12:9092,docker-rp-15:9092,docker-rp-13:9092', 'list']
[DEBUG - 2023-02-04 20:22:24,838 - redpanda - __is_status_ready - lineno:1252]: node docker-rp-14 not yet accepting connections                                                                                               
[DEBUG - 2023-02-04 20:22:24,840 - rpk - _execute - lineno:827]:                                                                                                                                                              
NAME                        PARTITIONS  REPLICAS                                                                                                                                                                              
fuzzy-operator-1365-jaoawi  4           3                                                                                                                                                                                     
fuzzy-operator-1365-odqcci  11          3                                                                                                                                                                                     
fuzzy-operator-1365-pggyok  1           3                                                                                                                                                                                     
fuzzy-operator-1365-rvkmim  1           3                                                                                                                                                                                     
fuzzy-operator-1365-sjogjr  6           3                                                                                                                                                                                     
fuzzy-operator-1365-szsuul  3           3                                                                                                                                                                                     
fuzzy-operator-1365-wtcylv  1           3                                                                                                                                                                                     
topic-yidfsiebpv            6           3                                                                                                                                                                                     
  • However the rpk topic describe still reports 6 partitions.
[DEBUG - 2023-02-04 20:22:25,895 - rpk - _execute - lineno:814]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-01b7d8c66b156ce8e-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-14:9092,docker-rp-12:9092,docker-rp-11:9092,docker-rp-15:9092,docker-rp-13:9092', 'describe',  'fuzzy-operator-1365-odqcci', '-p']          
[DEBUG - 2023-02-04 20:22:26,585 - rpk - _execute - lineno:827]:                                                                                                                                                               
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK                                                                                                                                                             
0          1	 2	[1 2 4]   0                 0                                                                                                                                                                        
1          2	 3	[1 2 3]   0                 0                                                                                                                                                                        
2          5	 1	[2 3 5]   0                 0                                                                                                                                                                        
3          5	 2	[2 4 5]   0                 0                                                                                                                                                                        
4          5	 1	[2 3 5]   0                 0                                                                                                                                                                        
5          1	 2	[1 4 5]   0                 0                                                                                                                                                                        
  • The test still thinks that there are 6 partitions in the topic, and wants to increase the number from 6 to 7 in the next iteration.
[INFO  - 2023-02-04 20:22:26,585 - admin_ops_fuzzer - execute - lineno:229]: Updating topic: fuzzy-operator-1365-odqcci partitions count. Current partition count: 6 new partition count: 7                                    
[DEBUG - 2023-02-04 20:22:26,585 - kafka_cli_tools - create_topic_partitions - lineno:83]: Adding 7 partitions to topic: fuzzy-operator-1365-odqcci                                                                            
[DEBUG - 2023-02-04 20:22:26,585 - kafka_cli_tools - _execute - lineno:409]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-13:9092,docker-rp-15:9092,docker-rp-14:9092,docker-rp-12:9092,docker-rp-11:9092', '--alter', '--topic', 'fuzzy-operator-1365-odqcci', '--partitions', '7']
  • There is a 37:create_partitions call to the controller log leader, but from the log we don't know what the arguments of the call are. An immediate failure response it sent as if there already was the requested number of partitions in the topic.
[docker-rp-15]/redpanda.log│TRACE 2023-02-04 20:22:28,106 [shard 0] kafka - requests.cc:91 - [172.16.16.28:40456] processing name:create_partitions, key:37, version:3 for adminclient-1, mem_units: 8128                       
[docker-rp-15]/redpanda.log│TRACE 2023-02-04 20:22:28,106 [shard 0] kafka - request_context.h:168 - [172.16.16.28:40456] sending 37:create_partitions for {adminclient-1}, response {throttle_time_ms=0 results={{name={fuzzy-operator-1365-odqcci} error_code={ error_code: invalid_request [42] } error_message={Partition count must be greater then current number of partitions}}}}

see https://github.com/redpanda-data/redpanda/blob/dev/src/v/kafka/server/handlers/create_partitions.cc#L189-L200

  • The error is surfaced in DT via kafka-topics.sh
[DEBUG - 2023-02-04 20:22:28,437 - kafka_cli_tools - _execute - lineno:417]: Error (1) executing command: Error while executing topic command : Partition count must be greater then current number of partitions   
[2023-02-04 20:22:28,110] ERROR org.apache.kafka.common.errors.InvalidRequestException: Partition count must be greater then current number of partitions                                                           
 (kafka.admin.TopicCommand$)                                                                                                                                                                                        
                                                                                                                                                                                                                    
[INFO  - 2023-02-04 20:22:28,437 - admin_ops_fuzzer - execute_with_retries - lineno:523]: Operation: RedpandaAdminOperation.ADD_PARTITIONS, retries left: 5/5                                                       
Traceback (most recent call last):                                                                                                                                                                                  
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 520, in execute_with_retries                                                                                                                         
    return op.execute(self.operation_ctx)                                                                                                                                                                           
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 233, in execute                                                                                                                                      
    cli.create_topic_partitions(self.topic, self.total)                                                                                                                                                             
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 88, in create_topic_partitions                                                                                                                         
    return self._run("kafka-topics.sh", args)                                                                                                                                                                       
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 401, in _run                                                                                                                                           
    return self._execute(cmd)                                                                                                                                                                                       
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 411, in _execute                                                                                                                                       
    res = subprocess.check_output(cmd,                                                                                                                                                                              
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output                                                                                                                                               
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,                                                                                                                                                
  File "/usr/lib/python3.10/subprocess.py", line 524, in run                                                                                                                                                        
    raise CalledProcessError(retcode, process.args,                                                                                                                                                                 
  • The attempt is retries 5 times with same arguments and same outcome, then the test fails.
@NyaliaLui
Copy link
Contributor

There is something strange going on here between the AdminServer, rpk, and registering new partitions.

  • From the debug log we see that the topic fuzzy-operator-1365-odqcci was successfully updated from 6 partitions to 11 partitions
[DEBUG - 2023-02-04 20:22:21,057 - rpk - _execute - lineno:814]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-01b7d8c66b156ce8e-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-11:9092,docker-rp-15:9092,docker-rp-13:9092,docker-rp-12:9092', 'describe', 'fuzzy-operator-1365-odqcci', '-p']
[DEBUG - 2023-02-04 20:22:21,431 - rpk - _execute - lineno:827]: 
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 4]   0                 0
1          2       3      [1 2 3]   0                 0
2          5       1      [2 3 5]   0                 0
3          5       2      [2 4 5]   0                 0
4          5       1      [2 3 5]   0                 0
5          1       2      [1 4 5]   0                 0
6          5       1      [1 4 5]   0                 0
7          5       1      [3 4 5]   0                 0
8          2       1      [2 4 5]   0                 0
9          1       1      [1 3 4]   0                 0
10         1       1      [1 4 5]   0                 0
  • And then, for some reason that I'm still looking for, we issue another rpk topic describe but it sees 6 partitions and attempts to increase to partition count of 7
[DEBUG - 2023-02-04 20:22:25,895 - rpk - _execute - lineno:814]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-01b7d8c66b156ce8e-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-14:9092,docker-rp-12:9092,docker-rp-11:9092,docker-rp-15:9092,docker-rp-13:9092', 'describe', 'fuzzy-operator-1365-odqcci', '-p']
[DEBUG - 2023-02-04 20:22:26,585 - rpk - _execute - lineno:827]: 
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
0          1       2      [1 2 4]   0                 0
1          2       3      [1 2 3]   0                 0
2          5       1      [2 3 5]   0                 0
3          5       2      [2 4 5]   0                 0
4          5       1      [2 3 5]   0                 0
5          1       2      [1 4 5]   0                 0

[INFO  - 2023-02-04 20:22:26,585 - admin_ops_fuzzer - execute - lineno:229]: Updating topic: fuzzy-operator-1365-odqcci partitions count. Current partition count: 6 new partition count: 7
  • Then we see the CalledProcess error in this ticket. So I imagine the error is because of this inconsistency between reporting partitions.
[INFO  - 2023-02-04 20:22:28,437 - admin_ops_fuzzer - execute_with_retries - lineno:523]: Operation: RedpandaAdminOperation.ADD_PARTITIONS, retries left: 5/5
Traceback (most recent call last):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 520, in execute_with_retries
    return op.execute(self.operation_ctx)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 233, in execute
    cli.create_topic_partitions(self.topic, self.total)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 88, in create_topic_partitions
    return self._run("kafka-topics.sh", args)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 401, in _run
    return self._execute(cmd)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 411, in _execute
    res = subprocess.check_output(cmd,
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', 

It is important to note that one of the brokers was down for the entire duration when we were increasing partition count from 6->11.

[DEBUG - 2023-02-04 20:22:12,802 - redpanda - __is_status_ready - lineno:1252]: node docker-rp-14 not yet accepting connections

Therefore, I'm adding sev/medium until I find more information because a crash did not occur (so not sev/high) and it is not a paper cut (so not sev/low) but there could be a bug in Redpanda

@NyaliaLui NyaliaLui added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Feb 6, 2023
@rystsov
Copy link
Contributor

rystsov commented Feb 6, 2023

My guess it's caused the same problem as #8562 which was supposed to be fixed in #8569

The guess is that rpk topic describe output depends on the node it connects to so we should treat the result as eventual consistent and probably stale. The test changes number of partitions waits until at least a single node respond a success and then change it again. If it's unlucky it may hit not updated yet node, got old data and try to update topic with wrong number of partitions.

Probably the test should not depend on rpk and use admin api to query and wait until all nodes are updated.

// cc @mmaslankaprv

@mmaslankaprv
Copy link
Member

exactly as @rystsov said it is a bug in test related with eventual consistency of metadata.

@dotnwat
Copy link
Member

dotnwat commented Feb 7, 2023

@mmaslankaprv is this a ticket your team is taking up? i can't quite figure out if this is an area/kafka issue

@rystsov
Copy link
Contributor

rystsov commented Feb 7, 2023

@dotnwat it's common thing but initially was introduced by the replication team, putting it in our queue

@piyushredpanda
Copy link
Contributor

Is this a sev/low though given it appears to be a test bug than a Redpanda one?

@mmaslankaprv mmaslankaprv added sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages and removed sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. labels Feb 8, 2023
@mmaslankaprv
Copy link
Member

marked as sev/low as this is test setup issue

@mmaslankaprv mmaslankaprv self-assigned this Feb 8, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 8, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 15, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 15, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 17, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 17, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@jcsp
Copy link
Contributor

jcsp commented Feb 20, 2023

https://buildkite.com/redpanda/redpanda/builds/23564#01866ea0-18d6-4480-b837-c726ba0b6b48

This is a different backtrace but also appears to be a CLI tool complaining about something out of whack with metadata

[DEBUG - 2023-02-20 12:15:40,725 - rpk - _execute - lineno:824]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-051762c7eb1df377d-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-15:9092,docker-rp-14:9092,docker-rp-11:9092,docker-rp-13:9092,docker-rp-12:9092', 'alter-config', 'fuzzy-operator-5939-hhllrg', '--set', 'retention.ms=680977']
[DEBUG - 2023-02-20 12:15:40,773 - rpk - _execute - lineno:837]: 
TOPIC                       STATUS 
fuzzy-operator-5939-hhllrg  UNKNOWN_TOPIC_OR_PARTITION 

[INFO  - 2023-02-20 12:15:40,773 - admin_ops_fuzzer - execute_with_retries - lineno:529]: Operation: RedpandaAdminOperation.UPDATE_TOPIC, retries left: 5/5
Traceback (most recent call last): 
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 526, in execute_with_retries 
    return op.execute(self.operation_ctx)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 188, in execute
    ctx.rpk().alter_topic_config(self.topic, self.property,
  File "/root/tests/rptest/clients/rpk.py", line 467, in alter_topic_config
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<Unexpected output, expected 'fuzzy-operator-5939-hhllrg\s+OK' got 'fuzzy-operator-5939-hhllrg  UNKNOWN_TOPIC_OR_PARTITION' on setting fuzzy-operator-5939-hhllrg retention.ms=680977> 

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 28, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@VladLazar
Copy link
Contributor

I ran into another failure mode here (similar to the one John encountered), but reading through the ticket, I think the cause may be the same. In this case, the partition was not found within the 5 retries:

====================================================================================================
test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   2 minutes 3.363 seconds


    RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0a1ae0015bf2744ea-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk topic --brokers docker-rp-12:9092,docker-rp-13:9092,docker-rp-15:9092,docker-rp-14:9092,docker-rp-11:9092 describe fuzzy-operator-1740-jjdirz -c returned 1, output: ', 'config response contained error: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition.\n', 1)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/controller_upgrade_test.py", line 113, in test_updating_cluster_when_executing_operations
    admin_fuzz.wait(num_executed_before_restart + 2, 240)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 622, in wait
    wait_until(check, timeout_sec=timeout, backoff_sec=2)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 610, in check
    raise self.error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 499, in thread_loop
    if self.execute_with_retries(op_type, op):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 538, in execute_with_retries
    raise error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 529, in execute_with_retries
    if op.validate(self.operation_ctx):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 204, in validate
    desc = ctx.rpk().describe_topic_configs(self.topic)
  File "/root/tests/rptest/clients/rpk.py", line 438, in describe_topic_configs
    output = self._run_topic(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 723, in _run_topic
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 841, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0a1ae0015bf2744ea-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk topic --brokers docker-rp-12:9092,docker-rp-13:9092,docker-rp-15:9092,docker-rp-14:9092,docker-rp-11:9092 describe fuzzy-operator-1740-jjdirz -c returned 1, output:  error: config response contained error: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition.
 returncode: 1>

@ztlpn
Copy link
Contributor

ztlpn commented Mar 7, 2023

looks like a similar issue: https://buildkite.com/redpanda/redpanda/builds/24566#0186bd22-ca67-4000-8962-e7a32a39050f

[DEBUG - 2023-03-07 17:50:47,073 - kafka_cli_tools - _execute - lineno:416]: Error (1) executing command: Error while executing topic command : Topic 'fuzzy-operator-3247-tnmyjf' does not exist as expected
[2023-03-07 17:50:46,741] ERROR java.lang.IllegalArgumentException: Topic 'fuzzy-operator-3247-tnmyjf' does not exist as expected
        at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:373)
        at kafka.admin.TopicCommand$TopicService.alterTopic(TopicCommand.scala:264)
        at kafka.admin.TopicCommand$.main(TopicCommand.scala:57)
        at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)

[INFO  - 2023-03-07 17:50:47,073 - admin_ops_fuzzer - execute_with_retries - lineno:534]: Operation: RedpandaAdminOperation.ADD_PARTITIONS, retries left: 4/5
Traceback (most recent call last):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 531, in execute_with_retries
    return op.execute(self.operation_ctx)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 238, in execute
    cli.create_topic_partitions(self.topic, self.total)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 90, in create_topic_partitions
    return self._run("kafka-topics.sh", args)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 400, in _run
    return self._execute(cmd)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 410, in _execute
    res = subprocess.check_output(cmd,
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-14:9092,docker-rp-12:9092,docker-rp-13:9092,docker-rp-11:9092,docker-rp-15:9092', '--alter', '--topic', 'fuzzy-operator-3247-tnmy
jf', '--partitions', '4']' returned non-zero exit status 1.

@dlex
Copy link
Contributor Author

dlex commented Mar 8, 2023

looks like a similar issue as well: on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24566#0186bd22-ca67-4000-8962-e7a32a39050f

[DEBUG - 2023-03-07 17:50:40,059 - kafka_cli_tools - _execute - lineno:416]: Error (1) executing command: Error while executing topic command : Topic 'fuzzy-operator-3247-tnmyjf' does not exist as expected                  
[2023-03-07 17:50:39,727] ERROR java.lang.IllegalArgumentException: Topic 'fuzzy-operator-3247-tnmyjf' does not exist as expected                                                                                              
  at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:373)                                                                                                                             
  at kafka.admin.TopicCommand$TopicService.alterTopic(TopicCommand.scala:264)                                                                                                                                                  
  at kafka.admin.TopicCommand$.main(TopicCommand.scala:57)                                                                                                                                                                     
  at kafka.admin.TopicCommand.main(TopicCommand.scala)                                                                                                                                                                         
 (kafka.admin.TopicCommand$)                                                                                                                                                                                                   
                                                                                                                                                                                                                               
[INFO  - 2023-03-07 17:50:40,059 - admin_ops_fuzzer - execute_with_retries - lineno:534]: Operation: RedpandaAdminOperation.ADD_PARTITIONS, retries left: 5/5                                                                  
Traceback (most recent call last):                                                                                                                                                                                             
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 531, in execute_with_retries                                                                                                                                    
    return op.execute(self.operation_ctx)                                                                                                                                                                                      
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 238, in execute                                                                                                                                                 
    cli.create_topic_partitions(self.topic, self.total)                                                                                                                                                                        
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 90, in create_topic_partitions                                                                                                                                    
    return self._run("kafka-topics.sh", args)                                                                                                                                                                                  
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 400, in _run                                                                                                                                                      
    return self._execute(cmd)                                                                                                                                                                                                  
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 410, in _execute                                                                                                                                                  
    res = subprocess.check_output(cmd,                                                                                                                                                                                         
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output                                                                                                                                                          
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,                                                                                                                                                           
  File "/usr/lib/python3.10/subprocess.py", line 524, in run                                                                                                                                                                   
    raise CalledProcessError(retcode, process.args,                                                                                                                                                                            
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-12:9092,docker-rp-11:9092,docker-rp-15:9092,docker-rp-13:9092,docker-rp-14:9092', '--alter', '--topic', 'fuzzy-operator-3247-tnmyjf', '--partitions', '4']' returned non-zero exit status 1

although this time rpk topic list has reported 3 partitions in the topic

[DEBUG - 2023-03-07 17:50:38,068 - rpk - _execute - lineno:837]:  
NAME                        PARTITIONS  REPLICAS                  
fuzzy-operator-3247-dhukck  9           3                         
fuzzy-operator-3247-klpvny  3           3                         
fuzzy-operator-3247-oiquvm  2           3                         
fuzzy-operator-3247-pvzwif  4           3                         
fuzzy-operator-3247-spaxuw  4           3                         
fuzzy-operator-3247-tnmyjf  3           3      <=== this one                        
fuzzy-operator-3247-zthhwi  3           3                         
fuzzy-operator-3247-zulqdj  5           3                         
topic-ejjjnmbftu            6           3                         

and immediatelly after that rpk topic describe has reported 0 partitions:

[DEBUG - 2023-03-07 17:50:38,068 - rpk - _execute - lineno:824]: Executing command: [                                                                                                                                       
    '/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0421e35e7c65ad7ce-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk',               
    'topic',                                                                                                                                                  
    '--brokers',                                                                                                                                              
    'docker-rp-14:9092,docker-rp-11:9092,docker-rp-13:9092,docker-rp-12:9092,docker-rp-15:9092',                                                              
    'describe',                                                                                                                                               
    'fuzzy-operator-3247-tnmyjf',                                                                                                                             
    '-p'                                                                                                                                                      
]                                                                                                                                                             
[DEBUG - 2023-03-07 17:50:38,179 - rpk - _execute - lineno:837]:                                                                                                                                                               
PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK                                                                                                                                                           
                                                                                                                                                                                                                               
[INFO  - 2023-03-07 17:50:38,180 - admin_ops_fuzzer - execute - lineno:234]: Updating topic: fuzzy-operator-3247-tnmyjf partitions count. Current partition count: 0 new partition count: 4                                    
[DEBUG - 2023-03-07 17:50:38,180 - kafka_cli_tools - create_topic_partitions - lineno:85]: Adding 4 partitions to topic: fuzzy-operator-3247-tnmyjf                                                                            
[DEBUG - 2023-03-07 17:50:38,180 - kafka_cli_tools - _execute - lineno:408]: ]: Executing command: [                                                                                                    
    '/opt/kafka-3.0.0/bin/kafka-topics.sh',                                                                                
    '--bootstrap-server',                                                                                                  
    'docker-rp-12:9092,docker-rp-11:9092,docker-rp-15:9092,docker-rp-13:9092,docker-rp-14:9092',                           
    '--alter',                                                                                                             
    '--topic',                                                                                                             
    'fuzzy-operator-3247-tnmyjf',                                                                                          
    '--partitions',                                                                                                        
    '4'                                                                                                                    
]                                                                                                                          

ballard26 pushed a commit to ballard26/redpanda that referenced this issue May 9, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 13, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 21, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Aug 2, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
rockwotj pushed a commit to rockwotj/redpanda that referenced this issue Aug 15, 2023
Since metadata are eventually consistent we query all nodes about
current partition count of a topic. Only if all values are the same we
allow to execute add partitions operation. This way operation result
will always be deterministic.

Fixes: redpanda-data#8637

Signed-off-by: Michal Maslanka <michal@redpanda.com>
(cherry picked from commit aab3b0c)
@rystsov rystsov reopened this Sep 2, 2023
@rystsov rystsov closed this as completed Sep 2, 2023
@piyushredpanda piyushredpanda added the ci-rca/infra CI Root Cause Analysis - Infrastructure Issue label May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller ci-failure ci-rca/infra CI Root Cause Analysis - Infrastructure Issue kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants