
Get cluster load API (sometimes) ignores start and end arguments #2154

Open
AlbertoPeon opened this issue May 23, 2024 · 0 comments
Hello,

We have noticed that the GET /kafkacruisecontrol/load endpoint ignores the start, end and time parameters under some conditions (which we have not yet been able to identify).
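For context, the raw request is just a GET with start/end (or time) given as epoch-millisecond query parameters. A minimal sketch using Python requests, assuming the same host and timestamps as in the cccli examples below (note that cccli does a long-running poll for us; a single GET may return an in-progress response while the load model is still being generated):

import requests

# Hypothetical host taken from the cccli examples below; start/end are epoch milliseconds.
BASE = "http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol"

resp = requests.get(
    f"{BASE}/load",
    params={
        "start": 1716454190232,
        "end": 1716457790233,
        "allow_capacity_estimation": "false",
        "json": "true",  # ask for the JSON response instead of the plain-text table
    },
)
resp.raise_for_status()
print(resp.json())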

This can be easily reproduced using cccli. For instance, retrieving the cluster load for a given 1-hour time window does not always return the same results.

The first run shows the expected averages across every dimension for the 1-hour time window.

$  cccli -a kafka-dev-cruise-control-headless:9090 load --add-parameter start=1716454190232 end=1716457790233
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False&start=1716454190232&end=1716457790233

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10466.635/00.88,                  1,         9.706,               97656.000,                 361.137,                 711.288,              195312.000,           1081.202,           3224.528,            62/185
-,         10001,us-east-1b,         1192092.000,          10466.635/00.88,                  1,         9.014,               97656.000,                 319.845,                 752.581,              195312.000,            962.311,           3224.528,            58/185
-,         10002,us-east-1c,         1192092.000,          10466.635/00.88,                  1,        12.585,               97656.000,                 391.444,                 680.982,              195312.000,           1181.015,           3224.528,            65/185

However, waiting a bit and running the same query returns different values. We suspect these values correspond to the cluster load over the default time window, effectively ignoring the start and end parameters; we believe that default window spans from the earliest available timestamp to the current time.

$ cccli -a kafka-dev-cruise-control-headless:9090 load --add-parameter start=1716454190232 end=1716457790233
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False&start=1716454190232&end=1716457790233

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10931.121/00.92,                  1,        11.381,               97656.000,                 386.458,                 762.798,              195312.000,           1155.897,           3446.797,            62/185
-,         10001,us-east-1b,         1192092.000,          10931.121/00.92,                  1,        10.193,               97656.000,                 343.802,                 805.454,              195312.000,           1032.157,           3446.797,            58/185
-,         10002,us-east-1c,         1192092.000,          10931.121/00.92,                  1,        10.482,               97656.000,                 418.996,                 730.259,              195312.000,           1258.743,           3446.797,            65/185

In fact, running the same command without the start and end arguments returns the same values as the previous one:

$ cccli -a kafka-dev-cruise-control-headless:9090 load
Starting long-running poll of http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol/load?allow_capacity_estimation=False

HOST         BROKER      RACK         DISK_CAP(MB)            DISK(MB)/_(%)_            CORE_NUM         CPU(%)          NW_IN_CAP(KB/s)       LEADER_NW_IN(KB/s)     FOLLOWER_NW_IN(KB/s)         NW_OUT_CAP(KB/s)        NW_OUT(KB/s)       PNW_OUT(KB/s)    LEADERS/REPLICAS
-,         10000,us-east-1a,         1192092.000,          10931.121/00.92,                  1,        11.381,               97656.000,                 386.458,                 762.798,              195312.000,           1155.897,           3446.797,            62/185
-,         10001,us-east-1b,         1192092.000,          10931.121/00.92,                  1,        10.193,               97656.000,                 343.802,                 805.454,              195312.000,           1032.157,           3446.797,            58/185
-,         10002,us-east-1c,         1192092.000,          10931.121/00.92,                  1,        10.482,               97656.000,                 418.996,                 730.259,              195312.000,           1258.743,           3446.797,            65/185

Periodically running the command with start and end (or time) parameters inconsistently returns one set of values or the other.
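To make the flip-flopping easier to catch, we can poll the endpoint for a fixed, fully-past window and flag whenever the response changes. A small sketch (same hypothetical host and window as above; it just hashes the raw response body, so it does not depend on the exact JSON schema):

import hashlib
import time

import requests

# Hypothetical host and window taken from the examples above.
BASE = "http://kafka-dev-cruise-control-headless:9090/kafkacruisecontrol"
PARAMS = {
    "start": 1716454190232,
    "end": 1716457790233,
    "allow_capacity_estimation": "false",
    "json": "true",
}

last_digest = None
while True:
    body = requests.get(f"{BASE}/load", params=PARAMS).text
    digest = hashlib.sha256(body.encode()).hexdigest()[:12]
    if digest != last_digest:
        # The requested window is fixed and entirely in the past, so the response
        # should not change between polls; every change printed here is suspect.
        print(time.strftime("%H:%M:%S"), "response changed:", digest)
        last_digest = digest
    time.sleep(60)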

Plotting this on a graph, we can confirm that the reported cluster load oscillates between the two time windows:

[Screenshot 2024-05-23 13:58:23: graph comparing live system metrics with the cluster load reported by Cruise Control]

In blue we can see the live system metrics while in purple we see the cluster load as reported by the Cruise Control endpoint.

After cutting Kafka traffic in half, we can see that the Cruise Control load reflects the change after a delay (which makes sense, as it is not live data but an average accumulated over the last time window). What is not expected, however, is that the values show "waves". From observation, we suspect the low points of the waves correspond to queries resolved within the 1-hour time window, as they converge with the system metric after roughly that time. The high points take longer to converge, approximately 4 hours, which we suspect is the default time window.
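This convergence behaviour is consistent with simple trailing-average arithmetic (a back-of-the-envelope sketch, not Cruise Control code): an average over a window of length W only fully reflects a step change in traffic W after the step, so a 1-hour window converges after 1 hour and a 4-hour window after 4 hours.

# Back-of-the-envelope: trailing average of a traffic rate that halves at t = 0.
def trailing_avg(t_hours, window_hours, before=100.0, after=50.0):
    """Average rate over [t - window, t], for traffic that drops from `before` to `after` at t = 0."""
    if t_hours <= 0:
        return before
    if t_hours >= window_hours:
        return after
    # Window straddles the step: weight the two levels by time spent at each.
    frac_after = t_hours / window_hours
    return before * (1 - frac_after) + after * frac_after

for t in (0.5, 1, 2, 4):
    print(f"t={t}h  1h-window avg={trailing_avg(t, 1):.1f}  4h-window avg={trailing_avg(t, 4):.1f}")

# The 1h-window average reaches the new level after 1 hour, while the 4h-window
# average only gets there after 4 hours, matching the low/high points of the "waves".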

Could you help me understand why this is happening and how to prevent it? Thank you very much!
