Memory leak #125

Closed
CHmSID opened this issue Sep 2, 2024 · 5 comments · Fixed by bisdn/meta-ofdpa#74
Labels: bug (Something isn't working)

Comments

@CHmSID commented Sep 2, 2024

We found a steady memory leak while repeatedly calling the command client_drivshell ps.

(Attached screenshot: Screenshot_20240902_105653)

Hardware: Agema ag7648
OS: issue found on both BISDN 4.7.0 and 5.2.0

Steps to reproduce:

  1. Start with a freshly installed BISDN OS (4.7.0 or 5.2.0; the version doesn't matter)
  2. Run sudo watch -n1 "client_drivshell ps" in a screen session for an extended period of time
  3. With memory graphs the issue is already visible within a few hours; alternatively, save the output of free -m every hour and compare (see the sketch below)
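
A minimal sketch for step 3, in case no monitoring graphs are available (the output path is just an example):

# record memory usage once per hour for later comparison
# (the log path below is arbitrary; any writable location works)
while true; do
    date >> /home/admin/memlog.txt
    free -m >> /home/admin/memlog.txt
    sleep 3600
done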

Notes:

  • We found that the file /tmp/ofdpa_client.log is constantly growing (and thus might be contributing to the memory usage). I will attach this file here for investigation (truncated to fit within GitHub's upload limits).
  • In this file we can also see that each call to client_drivshell ps opens a new datagram socket, but the socket files are not removed after the command finishes. Indeed, ls -l /tmp/ | grep fpc. | wc -l returned 167391 on our SDN after a few days of running (see the cleanup sketch after these notes). I'm not sure whether this is related to the memory leak, though.
  • I also checked whether memory is lost with other commands such as onlpdump -S, but at least that command does not appear to be leaking memory.
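
A possible cleanup sketch for the leftover socket files; it assumes no client_* command or other OF-DPA client is running at that moment, since an in-use socket could otherwise be removed:

# remove stale fpc.* datagram socket files left behind in /tmp
# (assumption: only safe while no client_* command is currently running)
find /tmp -maxdepth 1 -name 'fpc.*' -type s -delete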

Attachment: ofdpa_client.log

I hope this is helpful. Let me know if you have any other questions.

Our use case is to determine each port's status (up or down) so that we can report it in our metrics. We do not run baseboxd, so the ports are not visible to the kernel. Do you know of any alternative that can be used to get this information?

@KanjiMonster (Contributor)

Hi, thank you for the report! I have an idea about where this comes from, at least for the growing log and the leftover datagram sockets, though fixing this won't be super simple.

The port status information is exposed via OpenFlow in the Port Table, and in userspace via client_port_table_dump:

client_port_table_dump
0x00000001 (type: physical (0x0), index = 1) | port1:
	config = 0x00000001, state = 0x00000001, mac = f88e.a127.ef97, max. speed = 1000000 kbps, current speed = 1000000 kbps
	CurrFeature: 0x2810, AdvertFeature: 0x402f, SupportedFeature: 0xe82f, PeerFeature: 0x0
...
0x00000035 (type: physical (0x0), index = 53) | port53:
	config = 0x00000000, state = 0x00000000, mac = f88e.a127.efcb, max. speed = 100000000 kbps, current speed = 100000000 kbps
	CurrFeature: 0x900, AdvertFeature: 0x0, SupportedFeature: 0xfc80, PeerFeature: 0x0

where the lowest bit in config means "port set down", and the lowest bit in state means "no link" (so all zeroes means the port does have a link). These are equivalent to the OpenFlow config and state fields of ports.
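
Purely to illustrate those bit semantics (not as a recommended polling method, for the reason below), a rough sketch that reduces the dump to per-port admin/link status, assuming the output format shown above:

# print "portN: admin up/down, link up/down" by checking only the lowest bit
# of config and state (via the parity of the last hex digit)
client_port_table_dump | awk '
  /\| port[0-9]+:/ { port = $NF; sub(/:$/, "", port) }
  /config = 0x/ {
    gsub(/,/, "")
    admin = (substr($3, length($3)) ~ /[13579bdfBDF]/) ? "down" : "up"
    link  = (substr($6, length($6)) ~ /[13579bdfBDF]/) ? "down" : "up"
    print port ": admin " admin ", link " link
  }'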

But client_port_table_dump uses the same interface for accessing OF-DPA as client_drivshell does, so it will also cause the same log and socket spam. This affects all other client_* utilities, as well as the python bindings.

So currently the only way of obtaining that information without triggering the issue is via OpenFlow from ofagent.

KanjiMonster added the bug label Sep 2, 2024
@KanjiMonster (Contributor)

As a simple workaround against the log file growing, you can mark it as read only:

chattr +i /tmp/ofdpa_client.log

The side effect of that is that it will start logging to stderr instead of to the file:

client_port_table_dump  1
<0> 2024-09-02 10:48:35 [03881]: Entering rpccltCommSetup: inst=1 msgseq=0x66d59803
<1> 2024-09-02 10:48:35 [03881]: Created datagram socket: inst=1 sockid=3881 sockfd=3 cliaddr=/tmp/fpc.03881
<0> 2024-09-02 10:48:35 [03881]: Leaving rpccltCommSetup: inst=1
<1> 2024-09-02 10:48:35 [03881]: rpccltSuppWrapperOfdpa (inst=1 grp=2): Initialized RPC client OFDPA function table.
0x00000001 (type: physical (0x0), index = 1) | port1:
	config = 0x00000001, state = 0x00000001, mac = f88e.a127.ef97, max. speed = 1000000 kbps, current speed = 1000000 kbps
	CurrFeature: 0x2810, AdvertFeature: 0x402f, SupportedFeature: 0xe82f, PeerFeature: 0x0

But as the log lines go to stderr while the normal output goes to stdout, you can at least filter them out.
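
For example, to keep only the table output and discard the log lines:

# the RPC log lines now arrive on stderr, so a simple redirect hides them
client_port_table_dump 2>/dev/null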

@CHmSID (Author) commented Sep 2, 2024

Hi, thank you for your swift response.

This affects all other client_* utilities, as well as the python bindings.

By Python bindings, do you mean the OFDPA bindings for Python? That is what we use for our custom controller running on the SDN. It has not caused any memory issues so far, even though we query packet counters through it (we migrated from Ryu to OFDPA_python a year or two ago and had no issues until we recently introduced the client_drivshell ps call).

Given the above, I think I will try to query the port status via the bindings and hope it works. When we migrated away from Ryu we also stopped using ofagent, and I'd prefer not to reintroduce it unless necessary.

As a simple workaround against the log file growing, you can mark it as read only

This seems to do the trick; the log file itself is no longer growing.

I hope a fix can be found for the leak. If not closely monitored, running out of memory on the SDN is very costly: the switch becomes inaccessible via SSH and the serial console, meaning that someone has to physically go to the site to reboot it, not to mention the interruption to customers' network traffic.

@KanjiMonster (Contributor)

By Python bindings, do you mean the OFDPA bindings for Python?

Yes, though I think there isn't much logged by default apart from startup, so if your controller keeps a persistent connection, the log file doesn't grow beyond the first few lines.

AFAICT the issue is that every new connection to OF-DPA writes a few lines to the log, which accumulate over time. And since the log file lives in /tmp, and /tmp is a ramdisk, it slowly eats up memory.
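
A quick way to check how much of the /tmp ramdisk the log and the leftover sockets are currently consuming:

# current log size, number of leftover fpc.* socket files, and /tmp usage
du -h /tmp/ofdpa_client.log
ls /tmp/ | grep -c '^fpc\.'
df -h /tmp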

@KanjiMonster (Contributor)

Both issues are fixed internally, and upcoming nightlies will include the fixes.
