Get cluster load API (sometimes) ignores `start` and `end` arguments #2154

AlbertoPeon · 2024-05-23T12:34:04Z

Hello,

We have noticed that the `GET /kafkacruisecontrol/load` endpoint ignores the `start`, `end` and `time` parameters under some conditions (which we have not yet been able to identify).

This is easy to reproduce with cccli: requesting the cluster load for the same fixed one-hour window does not always return the same results.

The first run returns what looks like the correct average for every dimension over the one-hour window:

$  cccli -a kafka-dev-cruise-control-headless:9090 load --add-parameter start=1716454190232 end=1716457790233
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False&start=1716454190232&end=1716457790233

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10466.635/00.88,                  1,         9.706,               97656.000,                 361.137,                 711.288,              195312.000,           1081.202,           3224.528,            62/185
-,         10001,us-east-1b,         1192092.000,          10466.635/00.88,                  1,         9.014,               97656.000,                 319.845,                 752.581,              195312.000,            962.311,           3224.528,            58/185
-,         10002,us-east-1c,         1192092.000,          10466.635/00.88,                  1,        12.585,               97656.000,                 391.444,                 680.982,              195312.000,           1181.015,           3224.528,            65/185

However, running the same query again a little later returns different values. We suspect these values are the cluster load for the default time window, meaning that the start and end parameters are effectively ignored. I believe the default window spans from the earliest available timestamp to the current time.

$ cccli -a kafka-dev-cruise-control-headless:9090 load --add-parameter start=1716454190232 end=1716457790233
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False&start=1716454190232&end=1716457790233

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10931.121/00.92,                  1,        11.381,               97656.000,                 386.458,                 762.798,              195312.000,           1155.897,           3446.797,            62/185
-,         10001,us-east-1b,         1192092.000,          10931.121/00.92,                  1,        10.193,               97656.000,                 343.802,                 805.454,              195312.000,           1032.157,           3446.797,            58/185
-,         10002,us-east-1c,         1192092.000,          10931.121/00.92,                  1,        10.482,               97656.000,                 418.996,                 730.259,              195312.000,           1258.743,           3446.797,            65/185

In fact, running the same command without the start and end arguments returns exactly the same values as the previous run:

$ cccli -a kafka-dev-cruise-control-headless:9090 load
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10931.121/00.92,                  1,        11.381,               97656.000,                 386.458,                 762.798,              195312.000,           1155.897,           3446.797,            62/185
-,         10001,us-east-1b,         1192092.000,          10931.121/00.92,                  1,        10.193,               97656.000,                 343.802,                 805.454,              195312.000,           1032.157,           3446.797,            58/185
-,         10002,us-east-1c,         1192092.000,          10931.121/00.92,                  1,        10.482,               97656.000,                 418.996,                 730.259,              195312.000,           1258.743,           3446.797,            65/185

When a command with start and end (or time) parameters is run periodically, it inconsistently returns one result set or the other.

Plotting the results confirms that the reported cluster load oscillates between the two time windows:
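For anyone who wants to script the check instead of comparing tables by eye, here is a minimal sketch. It assumes the endpoint accepts `json=true` and returns a JSON body; the helper names (`build_load_url`, `fingerprint`, `window_was_ignored`) are made up for illustration, and it does not handle the long-running poll that cccli performs.

```python
import hashlib
import json
import urllib.parse
import urllib.request

# Host taken from the cccli commands above; adjust for your environment.
BASE_URL = "http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load"


def build_load_url(base_url, start_ms=None, end_ms=None):
    """Build the load URL, adding start/end only when they are given."""
    params = {"json": "true"}  # assumption: JSON output is requested this way
    if start_ms is not None:
        params["start"] = str(start_ms)
    if end_ms is not None:
        params["end"] = str(end_ms)
    return base_url + "?" + urllib.parse.urlencode(params)


def fingerprint(payload):
    """Hash a parsed JSON payload so two responses can be compared cheaply."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fetch_load(url, timeout_s=30):
    """Fetch and parse one load response (assumes the server returns JSON)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


def window_was_ignored(windowed_payload, default_payload):
    """True when the windowed response is identical to the default one."""
    return fingerprint(windowed_payload) == fingerprint(default_payload)
```

Fetching `build_load_url(BASE_URL, 1716454190232, 1716457790233)` and `build_load_url(BASE_URL)` back to back and passing both payloads to `window_was_ignored` should flag the runs in which the windowed query silently falls back to the default window.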

In blue we can see the live system metrics while in purple we see the cluster load as reported by the Cruise Control endpoint.

After cutting Kafka traffic in half, the Cruise Control load reflects the change after a delay, which makes sense because it is not live data but an average accumulated over the time window. What is not expected is that the values show "waves". From observation, we suspect the low points of the waves correspond to queries answered with the one-hour time window, since they converge with the system metric after about one hour. The high points take longer to converge, approximately 4 hours, which we suspect is the default time window.
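To make the convergence argument concrete, here is a toy model (not Cruise Control code). It assumes one sample per minute, a plain trailing mean, and a step change in load; the 100 and 50 values are arbitrary and the 4-hour window is only our suspicion about the default.

```python
def trailing_average(samples, window):
    """Mean of the last `window` samples (or all of them if fewer exist)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


def minutes_to_converge(window_minutes, before=100.0, after=50.0, tolerance=1e-9):
    """Minutes after a step change until the trailing average equals `after`."""
    # One sample per minute; start with a full window of the old load.
    samples = [before] * window_minutes
    minutes = 0
    while abs(trailing_average(samples, window_minutes) - after) > tolerance:
        samples.append(after)
        minutes += 1
    return minutes


if __name__ == "__main__":
    for window in (60, 240):
        print(window, minutes_to_converge(window))
```

In this model a trailing average only matches the new load once the whole window has been filled with new samples, so a one-hour window catches up after 60 minutes and a four-hour window after 240. That matches the two convergence times we see in the graph.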

Could you help me understand why this is happening and how to prevent it? Thank you very much!

AlbertoPeon mentioned this issue Nov 4, 2024

Do not reuse broker stats cache if time window is different DataDog/cruise-control#5

Merged
