Invalid units on network page (/network) #15070

Closed
cipherboy opened this issue Dec 22, 2020 · 15 comments

@cipherboy

Cockpit version: cockpit-234-1.fc32.x86_64 / cockpit-networkmanager-234-1.fc32.noarch / cockpit-pcp-234-1.fc32.x86_64
OS: Fedora 32
Page: Network

Units sometimes are incorrect on the network page. See screenshot:

Screenshot from 2020-12-22 13-36-32

I'm not really sure how to reproduce it, other than sending varying amounts of traffic. :-)

The relevant devices are:

01:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
05:00.0 Network controller: Ralink corp. RT3290 Wireless 802.11n 1T/1R PCIe


Model name:                      AMD A10-5700 APU

There's no way I peaked anywhere close to 1000Gbps like the chart on the left implies. I'm fairly sure the board and the processor don't have that much total bandwidth. :-)

@marusak
Member

marusak commented Jan 4, 2021

I remember that @mvollmer was looking into this and found that we can sometimes get huge numbers because 64-bit numbers were being used incorrectly to represent 32-bit values, or something like that...
@mvollmer do you know what I am talking about? If so, is it possible this is the problem?

@mvollmer
Member

mvollmer commented Jan 7, 2021

> @mvollmer do you know what I am talking about?

Yes. We would receive garbage bits for some specific storage metrics.

> If so, is it possible this is the problem?

Very unlikely. The bug was caused by misconfiguration of specific metrics and is very likely not present in the network metrics we use for this plot. Also, the garbage bits would be constant, which would result in a single enormous spike in the plot (since the plot shows the differences between samples, not the samples directly). There is a pronounced constant plateau here, which wouldn't be caused by that bug.
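
The "single spike" argument can be illustrated with a small sketch. It assumes, purely for illustration, a cumulative byte counter and a hypothetical corruption that adds the same constant offset to every sample from some point on. The names `toRates` and `GARBAGE` are made up and are not part of Cockpit or PCP.

```javascript
// Hypothetical illustration: plotting counter *differences* turns a constant
// offset into one spike, not a plateau.

// Convert cumulative counter samples into per-interval differences.
function toRates(samples) {
    const rates = [];
    for (let i = 1; i < samples.length; i++)
        rates.push(samples[i] - samples[i - 1]);
    return rates;
}

// A counter that grows by 1000 bytes per sample.
const clean = [0, 1000, 2000, 3000, 4000, 5000];

// Assumed corruption: a constant garbage offset appears from the 4th sample on.
const GARBAGE = 2 ** 40;
const corrupted = clean.map((v, i) => (i >= 3 ? v + GARBAGE : v));

console.log(toRates(clean));     // steady 1000 per interval
console.log(toRates(corrupted)); // one huge value at index 2, 1000 elsewhere
```

Only the interval in which the offset first appears is affected. Every later difference cancels the offset again, which is why a sustained plateau points to a different cause.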

@mvollmer
Member

mvollmer commented Jan 7, 2021

> I'm not really sure how to reproduce it, other than sending varying amounts of traffic. :-)

Is there a specific pattern of traffic that triggers this? Can you make it happen on purpose and then record a video?

We have a whole new implementation for the graphs lined up: #14913. Are you able to try out that pull request? Otherwise it would probably make sense to wait a bit for it to be released and reach you (hopefully in two or three weeks or so).

@cipherboy
Author

Ask and you shall receive. My architecture is as such:

internet -> recon7
           /      \
         nas      (other hosts)

Upstream internet comes in via Realtek card and gets split internally via the Qualcomm card. If my network is mostly idling (admittedly some video streams, VPN connections and other stuff) -- I get a nice graph:

Screenshot from 2021-01-07 12-10-57

However, if I spike it with iperf3, I'm instantly able to get the bogus units:

Screenshot from 2021-01-07 12-11-30

Like I said, the board is far too old (and running over 1GbE / cat6) to support that throughput :)

How can I tell if the Qualcomm card is just giving bogus data?

From iperf3 I see:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-42.94  sec  4.66 GBytes   933 Mbits/sec    0             sender
[  5]   0.00-42.94  sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated

and in iftop I see reasonable numbers when looking at that device:

Screenshot from 2021-01-07 12-16-31

Screenshot from 2021-01-07 12-16-46

Any thoughts for how to debug this further?

@cipherboy
Author

(I'm happy to test an RPM build of the PR if someone has one, but I don't really want to figure out how to build cockpit :-)

@mvollmer
Member

mvollmer commented Jan 8, 2021

> Any thoughts for how to debug this further?

How do the numbers from iperf3 and iftop compare to Cockpit's numbers in the "reasonable" scenario? Maybe the whole Cockpit plot is just wrong by some factor.

Cockpit gets its numbers from PCP, or from its own internal source if PCP is not available. If they come from PCP, there is a cockpit-pcp process running. In that case, you can check with

# pmval network.interface.in.bytes
# pmval network.interface.out.bytes

Note that Cockpit shows the traffic in bits per second, while pmval will show it in bytes per second.

If there is no PCP, cockpit-bridge will directly read /proc/net/dev.
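
For checking the raw counters by hand in the no-PCP case, here is a minimal sketch of a parser for the text layout of /proc/net/dev. It is not how cockpit-bridge reads the file; the function name `parseProcNetDev` and the sample numbers are made up for illustration.

```javascript
// Hypothetical parser for the text format of /proc/net/dev (not Cockpit's code).
// After two header lines, each line is "<iface>: <16 numeric fields>";
// field 0 is received bytes and field 8 is transmitted bytes.
function parseProcNetDev(text) {
    const result = {};
    for (const line of text.split("\n").slice(2)) {
        const colon = line.indexOf(":");
        if (colon < 0)
            continue; // skip blank or malformed lines
        const name = line.slice(0, colon).trim();
        const fields = line.slice(colon + 1).trim().split(/\s+/).map(Number);
        result[name] = { rxBytes: fields[0], txBytes: fields[8] };
    }
    return result;
}

// Made-up sample input in the kernel's layout; the numbers are not from this issue.
const sample = [
    "Inter-|   Receive                                                |  Transmit",
    " face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed",
    "    lo:   1000  10 0 0 0 0 0 0   1000  10 0 0 0 0 0 0",
    "enp1s0: 500000 400 0 0 0 0 0 3 900000 700 0 0 0 0 0 0",
].join("\n");

console.log(parseProcNetDev(sample));
```

The values are cumulative counters, so a rate comes from sampling twice and dividing the difference by the elapsed time, the same conversion pmval reports as "converting to rate".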

I'll check here as well if all the numbers agree.

@mvollmer
Member

mvollmer commented Jan 8, 2021

> I'll check here as well if all the numbers agree.

It all checks out here, running iperf3 between a VM and its host. I get a peak rate of about 10 Gbits/s according to iperf3 itself, and all of iftop, pmval, the Cockpit graphs and the numerical display of Cockpit (in the "Interfaces panel") agree. Hmm.

@mvollmer
Member

mvollmer commented Jan 8, 2021

> (I'm happy to test an RPM build of the PR if someone has one, but I don't really want to figure out how to build cockpit :-)

Here they are: https://copr.fedorainfracloud.org/coprs/mvo/pr-14913/

@cipherboy
Author

cipherboy commented Jan 8, 2021

@mvollmer -- I'm guessing it is an issue in the display logic. I get:

[root@recon7 cipherboy]# pmval network.interface.out.bytes

metric:    network.interface.out.bytes
host:      recon7.cipherboy.com
semantics: cumulative counter (converting to rate)
units:     byte (converting to byte / sec)
samples:   all

                lo                enp1s0                enp4s0              wlp5s0f0                  tun0    
               0.0                3018.                 4398.                    0.0                   0.0    
            1948.                 2596.                 1401.                    0.0                   0.0    
               0.0                2.605E+04              236.8                   0.0                   0.0    
               0.0                2.673E+04              221.9                   0.0                   0.0    
               0.0                1.225E+04              269.8                   0.0                   0.0    
               0.0                1.370E+04              842.2                   0.0                   0.0    
            1948.                 2165.                  281.7                   0.0                   0.0    
               0.0                1409.                  142.9                   0.0                   0.0    
               0.0                3165.                 1506.                    0.0                   0.0    
               0.0                9935.                 4956.                    0.0                   0.0    
             737.4                2272.                   65.94                  0.0                   0.0    
            2686.                 2784.                  495.5                   0.0                   0.0    
             737.4                2532.                  995.1                   0.0                   0.0    
               0.0                7700.                 6518.                    0.0                   0.0    
               0.0                7117.                 2864.                    0.0                   0.0    
               0.0                2351.                    0.0                   0.0                   0.0    
            1948.                 1.171E+08              375.7                   0.0                  51.95   
               0.0                1.219E+08              645.3                   0.0                   0.0    
               0.0                1.219E+08              117.9                   0.0                  51.95   
               0.0                1.219E+08              201.8                   0.0                   0.0    
               0.0                1.219E+08              348.6                   0.0                   0.0    
            2685.                 1.219E+08             5482.                    0.0                   0.0    
               0.0                1.219E+08              257.7                   0.0                   0.0    
               0.0                1.219E+08             4182.                    0.0                  51.95   
               0.0                1.219E+08               65.93                  0.0                   0.0    
               0.0                1.219E+08              660.2                   0.0                   0.0    

(spike at the end is iperf3). According to my calculations, that's just around 100MB/s or 800Mbit/s.
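
As a quick check of that conversion, the sketch below turns the plateau value from the pmval output above into megabits per second, assuming SI units (1 Mbit = 10^6 bits). The helper name `bytesPerSecToMbps` is made up for illustration.

```javascript
// Convert a pmval rate in bytes per second to megabits per second (SI, 10^6).
function bytesPerSecToMbps(bytesPerSec) {
    return (bytesPerSec * 8) / 1e6;
}

// Plateau value for enp1s0, taken from the pmval output above.
const plateau = 1.219e8;
console.log(bytesPerSecToMbps(plateau).toFixed(1) + " Mbit/s"); // 975.2 Mbit/s
```

Taken at face value, the plateau is about 122 MB/s, or roughly 975 Mbit/s. That is somewhat above the rough estimate but in the same range as the 933 Mbits/sec iperf3 reported, and nowhere near 1000 Gbps.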

The rest of the traffic lines up well in the chart, and iftop matches the graph on lower traffic values over the interface. Curious! :)

@cipherboy
Author

cipherboy commented Jan 8, 2021

I installed the new package.

Screenshot from 2021-01-08 11-44-27

Same bug.

About Web Console
Cockpit is an interactive Linux server admin interface.
Project website
Version 234.
Licensed under: GNU LGPL version 2.1

Interestingly, if I reload the page after spiking the traffic, I get the right units again:

Screenshot from 2021-01-08 11-42-44

But the recorded traffic is off, by a factor of 2ish:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.77  sec  1.17 GBytes   933 Mbits/sec    0             sender
[  5]   0.00-10.77  sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated

Oddly, this persists -- if you look at the screenshot above, both plateaus are using the same iperf and achieve the same rate, within 20 to 30Mbps. The idle floor on the network is probably around 2-10Mbps.


I closed the tab and re-opened it after installing the new package, though I honestly don't see any difference in the charts :)


Oddly, if I watch the chart as traffic ramps up, I notice that the units were correct (up to around 800Mbps), but as soon as it shows 1200, it switches to Gbps. I'm guessing there's a subtle logic bug somewhere: 1200Mbps is 1.2Gbps, so the units switch over, even though it is wrong. So either the numerical axis labels need to be updated to be in Gbps, or the label still needs to show Mbps until we get into higher values of Gbps.

@mvollmer
Member

> I'm guessing there's a subtle logic bug somewhere: 1200Mbps is 1.2Gbps, so the units switch over, even though it is wrong. So either the numerical axis labels need to be updated to be in Gbps, or the label still needs to show Mbps until we get into higher values of Gbps.

Yes, this makes a lot of sense. (And that's why I was asking for a video. If it's not too much trouble, it would be nice to see it in action, but since you have seen it and described the effect, I don't think I could get any more details out of a video, actually...)

I'll do some code reading here and mock up some artificial traffic ramps.

@mvollmer
Member

> I installed the new package.

You should be seeing version 235. You need to at least log out and log in, I guess. Executing systemctl restart cockpit can't hurt. To be super sure, just reboot. :-)

And thanks a lot for all the effort on your side in figuring this out! I think we are getting closer!

@cipherboy
Author

I pulled the repo as of:

commit daac62b7bd1b3deee1517f6c8f5e6592a82b19cc (HEAD -> master, origin/master, origin/HEAD)
Author: Matej Marusak <mmarusak@redhat.com>
Date:   Fri Dec 18 10:04:50 2020 +0100

    test: Replace `wait_present` with `wait_visible`

and went on with building & running instructions (thanks @mvollmer!).

Some observations:

#1: Using inspector, I was able to see that the data sent from the server is correct (i.e., I'm seeing the expected traffic rate in bytes/second).

#2: When I logged calls to the interesting plot functions (bits_per_sec_tick_unit and format_bits_per_sec_tick_no_unit), I saw something weird: the y-axis numerical values and the y-axis unit label were being calculated from different yaxis values (I noticed this because max differs, see below).

Log messages of plot function calls while calculating problematic graph

These are for the y-axis values:

bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 Mbps plot.js:878:12
format_bits_per_sec_tick_no_unit 0 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 0 plot.js:884:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 Mbps plot.js:878:12
format_bits_per_sec_tick_no_unit 50000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 400 plot.js:884:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 Mbps plot.js:878:12
format_bits_per_sec_tick_no_unit 100000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 800 plot.js:884:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 Mbps plot.js:878:12
format_bits_per_sec_tick_no_unit 150000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 124433136.04970215, … }
 1200 plot.js:884:12

These are for the y-axis unit labels:

bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121993270.63696289, used: true, show: true, reserveSpace: true, min: 0, max: 150000000, … }
 Gbps

#3: When I changed the y-axis calculation to use datamax instead of max, I was able to get the right units. However, I'm not sure of the implications of this; it might not be the right solution:

Log messages of plot function calls after switching to datamax

These are for the y-axis values:

bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 Mbps plot.js:879:12
format_bits_per_sec_tick_no_unit 0 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 0 plot.js:885:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 Mbps plot.js:879:12
format_bits_per_sec_tick_no_unit 50000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 400 plot.js:885:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 Mbps plot.js:879:12
format_bits_per_sec_tick_no_unit 100000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 800 plot.js:885:12
bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 Mbps plot.js:879:12
format_bits_per_sec_tick_no_unit 150000000 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 124408651.25724609, … }
 1200 plot.js:885:12

These are for the y-axis unit label:

bits_per_sec_tick_unit 
Object { n: 1, direction: "y", options: {…}, datamin: 0, datamax: 121969265.93847656, used: true, show: true, reserveSpace: true, min: 0, max: 150000000, … }
 Mbps

The patch I made in both cases was fairly simple:

export function bits_per_sec_tick_unit(axis) {
    // Here, I made the change to prefer datamax over max:
    // const max_value = axis.datamax ? axis.datamax : axis.max
    const ret = cockpit.format_bits_per_sec(axis.max * 8, 1000, true)[1];
    console.log("bits_per_sec_tick_unit", axis, ret);
    return ret;
}

export function format_bits_per_sec_tick_no_unit(val, axis) {
    const ret = cockpit.format_bits_per_sec(val * 8, bits_per_sec_tick_unit(axis), true)[0];
    console.log("format_bits_per_sec_tick_no_unit", val, axis, ret);
    return ret;
}

export function format_bits_per_sec_tick(val, axis) {
    const ret = cockpit.format_bits_per_sec(val * 8, 1000);
    console.log("format_bits_per_sec_tick", val, axis);
    return ret;
}

To me, this says that there's an issue with the plotting library: the same value needs to be used on both the unit label and the numerical values.
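
The mismatch in the first set of logs can be reproduced in isolation. The sketch below uses a hypothetical stand-in for the unit formatting (`pickUnit` and `inUnit` are made-up helpers, not `cockpit.format_bits_per_sec`) and feeds it the two `max` values that were logged for the same axis.

```javascript
// Hypothetical stand-in for the unit formatting; NOT cockpit.format_bits_per_sec.
const UNITS = ["bps", "Kbps", "Mbps", "Gbps", "Tbps"];

// Pick the largest SI unit that keeps the value at or above 1.
function pickUnit(bitsPerSec) {
    let i = 0;
    while (bitsPerSec >= 1000 && i < UNITS.length - 1) {
        bitsPerSec /= 1000;
        i++;
    }
    return UNITS[i];
}

// Express a value in a given unit, without the unit suffix.
function inUnit(bitsPerSec, unit) {
    return bitsPerSec / 1000 ** UNITS.indexOf(unit);
}

// axis.max as logged while the ticks were formatted, and later for the label.
const maxDuringTicks = 124433136.04970215; // bytes/s
const maxDuringLabel = 150000000;          // bytes/s, after the axis was adjusted

const tickUnit = pickUnit(maxDuringTicks * 8);  // "Mbps"
const labelUnit = pickUnit(maxDuringLabel * 8); // "Gbps"

const ticks = [0, 50000000, 100000000, 150000000].map(v => inUnit(v * 8, tickUnit));
console.log(ticks, labelUnit); // numbers in Mbps, shown next to a "Gbps" label
```

The tick numbers come out as 0, 400, 800 and 1200 (in Mbps) while the label reads Gbps, which matches the "1200 Gbps" effect seen in the screenshots.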

Since your branch didn't help (sorry!) I used the current Fedora 32 cockpit version for the video. From the loaded page, I refreshed. Waited for graphs to load. Then kicked off Sending iperf3. Watched graph. Then kicked off receiving iperf3.

cockpit-network-issue.mp4

@mvollmer
Member

mvollmer commented Jan 14, 2021

> To me, this says that there's an issue with the plotting library: the same value needs to be used on both the unit label and the numerical values.

Excellent, you nailed it down perfectly, as far as I can tell. I had assumed that yaxis.max would be the same when the labels are formatted and after the plot. But flot seems to change it, probably rounding it to a nice value.

Using datamax should work, since we don't need the exact value of the yaxis range, but we need something that is stable.
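
A minimal sketch of that idea, under the assumption that `datamax` stays the same for the whole layout pass: derive the unit once from the stable value and reuse it for both the tick numbers and the label. The helper `makeAxisFormatter` is hypothetical and is neither the old plot.js code nor the new implementation.

```javascript
// Hypothetical sketch of the "stable input" idea; names are illustrative only.
const UNITS = ["bps", "Kbps", "Mbps", "Gbps", "Tbps"];

// Build one formatter per axis from a value that does not change during layout.
function makeAxisFormatter(axis) {
    // Assumption: prefer datamax (largest data value) and fall back to max.
    const stableMax = axis.datamax ? axis.datamax : axis.max;
    let scale = 1;
    let index = 0;
    while ((stableMax * 8) / scale >= 1000 && index < UNITS.length - 1) {
        scale *= 1000;
        index++;
    }
    return {
        unit: UNITS[index],                            // used for the axis label
        tick: bytesPerSec => (bytesPerSec * 8) / scale // used for the tick numbers
    };
}

// Same datamax, different max, as in the second set of logs.
const during = makeAxisFormatter({ datamax: 121969265.93847656, max: 124408651.25724609 });
const after = makeAxisFormatter({ datamax: 121969265.93847656, max: 150000000 });

console.log(during.unit, after.unit); // both "Mbps"
console.log(after.tick(150000000));   // 1200
```

The top tick can still read 1200 Mbps rather than 1.2 Gbps, but the numbers and the label now always agree, which is the property that matters here.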

I don't think the new code has this bug, since it is all done under our control:

function value_ticks(data, config) {

It might have other bugs, so I would really appreciate it if you could test it (once it is released).

@mvollmer
Member

#14913 has been merged to master, so it will be in this week's release, 236. I'll close this issue, thanks a lot for the investigations!
