[telemetry] Rotate streaming telemetry secrets. #9600
base: master
Changes from 9 commits
a921b8e
7da4005
ede36b9
7c3c7d3
90d9d03
0bfcec9
9db5b04
b1844b7
e28f64d
536fc4c
17c4d60
59dfe9b
ce9f713
68bd0b6
6ad0ef5
6ce4e4d
6512a6f
c6e2fdd
3dbb79d
0881704
41377a1
@@ -0,0 +1,138 @@
#!/usr/bin/env python3

"""
certificate_rollover_checker

This script will be leveraged to periodically check whether the certificate and private key
files of streaming telemetry were rolled over by dSMS service or not. The streaming telemetry
container will be restarted if the certificate and private key are rolled over by dSMS service
and then updated by ACMS agent running in ACMS container.
"""

import os
import signal
import sys
import syslog
import time

from swsscommon import swsscommon

CERTIFICATE_CHECKING_INTERVAL_SECS = 3600


def get_file_last_mod_time(file_path):
    """Gets the last modification time of a specific file.

    Args:
        file_path: A string representing the file path.

    Returns:
        last_mod_time: A float representing the last modification time of the file,
        in seconds since the epoch.
    """
    last_mod_time = 0.0

    try:
        last_mod_time = os.path.getmtime(file_path)
    except OSError as error:
        syslog.syslog(syslog.LOG_ERR,
                      "Could not get last modification time of the file and error message is '{}'.".format(error))
        sys.exit(1)

    return last_mod_time


def restart_streaming_telemetry():
    """Restarts the streaming telemetry container by terminating the root process.

    Args:
        None

    Returns:
        None
    """
    root_process_pid = os.getppid()
    syslog.syslog(syslog.LOG_INFO,
                  "Restarting streaming telemetry service by terminating the process with pid: '{}'".format(root_process_pid))
    os.kill(root_process_pid, signal.SIGTERM)


def certificate_rollover_check():
    """Checks certificate and key files and restarts the streaming telemetry container if necessary.

    Checks the last modification time of certificate and private key files of streaming telemetry
    to see whether they were already rolled over by dSMS service and updated by ACMS agent running
    in ACMS container. The streaming telemetry container will be restarted if they were rolled over.

    Args:
        None

    Returns:
        None
    """
    certificate_path = ""
    private_key_path = ""
    certificate_last_mod_time = 0
    private_key_last_mod_time = 0

    config_db = swsscommon.DBConnector("CONFIG_DB", 0)
    telemetry_table = swsscommon.Table(config_db, "TELEMETRY")
    telemetry_table_keys = telemetry_table.getKeys()
    if "certs" in telemetry_table_keys:
        certs_info = dict(telemetry_table.get("certs")[1])
        if "server_crt_acms" in certs_info and "server_key_acms" in certs_info:
            certificate_path = certs_info["server_crt_acms"]
            private_key_path = certs_info["server_key_acms"]
            syslog.syslog(syslog.LOG_INFO, "Path of certificate file is '{}'".format(certificate_path))
            syslog.syslog(syslog.LOG_INFO, "Path of key file is '{}'".format(private_key_path))
        else:
            syslog.syslog(syslog.LOG_ERR,
                          "Failed to retrieve the path of certificate and key file from 'TELEMETRY' table!")
            sys.exit(2)
    else:
        syslog.syslog(syslog.LOG_ERR,
                      "Failed to retrieve the certificate information from 'TELEMETRY' table!")
        sys.exit(3)

    while True:
        if not os.path.exists(certificate_path) or not os.path.exists(private_key_path):
            syslog.syslog(syslog.LOG_ERR,
                          "Certificate or key file does not exist on device; sleeping '{}' seconds before checking again ...".format(CERTIFICATE_CHECKING_INTERVAL_SECS))
            time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS)
        else:
            break

    certificate_last_mod_time = get_file_last_mod_time(certificate_path)
    private_key_last_mod_time = get_file_last_mod_time(private_key_path)

    while True:
        certificate_mod_time = get_file_last_mod_time(certificate_path)
        private_key_mod_time = get_file_last_mod_time(private_key_path)
        syslog.syslog(syslog.LOG_INFO,
                      "Last modification time of certificate file is: '{}'".format(time.ctime(certificate_last_mod_time)))
        syslog.syslog(syslog.LOG_INFO,
                      "Last modification time of key file is: '{}'".format(time.ctime(private_key_last_mod_time)))

        if (certificate_mod_time > certificate_last_mod_time
                or private_key_mod_time > private_key_last_mod_time):
            syslog.syslog(syslog.LOG_INFO,
                          "Last modification time of certificate file is changed to '{}'".format(time.ctime(certificate_mod_time)))
            syslog.syslog(syslog.LOG_INFO,
                          "Last modification time of key file is changed to '{}'".format(time.ctime(private_key_mod_time)))
            syslog.syslog(syslog.LOG_INFO,
                          "Secrets were rolled over and restarting streaming telemetry service ...")
            restart_streaming_telemetry()

        # Wait for specified seconds and then do the next round checking
        syslog.syslog(syslog.LOG_INFO,
                      "Sleeping '{}' seconds before doing the next round rollover checking ...".format(CERTIFICATE_CHECKING_INTERVAL_SECS))
        time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS)


def main():
    certificate_rollover_check()
Could you try inotify (https://www.linuxjournal.com/content/linux-filesystem-events-inotify) instead of reinventing the wheel? #Closed
Updated.


if __name__ == "__main__":
    main()
    sys.exit(0)
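For illustration, here is a minimal sketch of what the inotify suggestion in the review comment could look like in place of the hourly mtime polling. It is not the code that was eventually merged. The helper names (`is_rollover_event`, `wait_for_rollover`) and the choice of event types are assumptions, as is the idea that the ACMS agent either rewrites the files in place or renames new files over the old ones. The `inotify` module is the third-party package installed by `pip3 install inotify`.

```python
import os

# Event names that usually mean a file was rewritten or replaced.
# Assumption: the agent writes in place or renames a new file over the old one.
ROLLOVER_EVENTS = {"IN_CLOSE_WRITE", "IN_MOVED_TO", "IN_CREATE"}


def is_rollover_event(type_names, full_path, watched_paths):
    """Returns True if an inotify event looks like a rollover of a watched file."""
    return full_path in watched_paths and bool(ROLLOVER_EVENTS.intersection(type_names))


def wait_for_rollover(certificate_path, private_key_path):
    """Blocks until the certificate or key file is rewritten, then returns its path.

    Hypothetical helper, not part of the PR.
    """
    # Third-party module; imported lazily so is_rollover_event() works without it.
    import inotify.adapters

    watched_paths = {certificate_path, private_key_path}
    notifier = inotify.adapters.Inotify()
    # Watch the parent directories rather than the files themselves, so that
    # a rename over the old file is still observed.
    for directory in {os.path.dirname(path) for path in watched_paths}:
        notifier.add_watch(directory)

    for _, type_names, path, filename in notifier.event_gen(yield_nones=False):
        full_path = os.path.join(path, filename)
        if is_rollover_event(type_names, full_path, watched_paths):
            return full_path
```

With this shape, the checker would call `wait_for_rollover(certificate_path, private_key_path)` in a loop and trigger the restart each time it returns, instead of sleeping for `CERTIFICATE_CHECKING_INTERVAL_SECS` between checks.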
@@ -1,2 +1,3 @@
program:telemetry
program:dialout
program:certificate_rollover_checker
@@ -127,6 +127,10 @@ sudo rm -rf $FILESYSTEM_ROOT/$REDIS_DUMP_LOAD_PY3_WHEEL_NAME
# Install Python module for psutil
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install psutil

# Install Python module for inotify
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install inotify


Remove extra empty line. #Closed
Updated.

# Install Python module for ipaddr
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install ipaddr

The modern convention is to use SIGHUP (ref: https://stackoverflow.com/a/28327659/2514803).
The benefit is that the other process is not explicitly terminated, so critical process monitor alerts are not triggered.
You may need to add a SIGHUP handler to sonic-telemetry if it is not already supported.
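As a rough sketch of the SIGHUP convention described in this comment: the server installs a handler that only records the request, and its main loop reloads the credentials at a safe point. The checker then sends SIGHUP instead of SIGTERM. This is written in Python for consistency with the checker script; sonic-telemetry itself would need an equivalent handler in its own code. All names here (`install_reload_handler`, `request_reload`, `reload_if_requested`, `load_credentials`) are hypothetical.

```python
import os
import signal

# Set by the signal handler, consumed by the main loop.
_reload_pending = False


def _handle_sighup(signum, frame):
    """Only records the request; the actual reload happens in the main loop."""
    global _reload_pending
    _reload_pending = True


def install_reload_handler():
    """Server side: must be called from the main thread."""
    signal.signal(signal.SIGHUP, _handle_sighup)


def request_reload(pid):
    """Checker side: replaces os.kill(pid, signal.SIGTERM) in the rollover path."""
    os.kill(pid, signal.SIGHUP)


def reload_if_requested(load_credentials):
    """Runs load_credentials() if a SIGHUP is pending; returns True if it ran."""
    global _reload_pending
    if not _reload_pending:
        return False
    _reload_pending = False
    load_credentials()
    return True
```

Because the process is never terminated, nothing exits unexpectedly, which is why this approach should not trip the critical process monitor.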

I think we can probably use the `supervisorctl restart telemetry` command to restart only the streaming telemetry server process once the secrets are rotated. This avoids triggering the critical process alerts.
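A minimal sketch of how the checker might invoke that command from Python, for comparison with the `os.kill` approach in the script. The function name and the injectable `run` parameter are hypothetical. The sketch also assumes a non-zero exit status signals failure, which should be verified for the supervisor version in use.

```python
import subprocess


def restart_telemetry_process(run=subprocess.run):
    """Restarts only the 'telemetry' program via supervisorctl.

    `run` defaults to subprocess.run and is injectable for testing.
    Returns True if the command exited with status 0.
    """
    result = run(["supervisorctl", "restart", "telemetry"],
                 capture_output=True, text=True)
    return result.returncode == 0
```

Note that this still stops and starts the server process, so it does not by itself address the graceful shutdown concern raised later in the thread.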

Restarting uses `SIGTERM` and `SIGKILL` internally. One big concern is graceful shutdown. Since the client will often fetch large amounts of data, a graceful shutdown makes things easier for the client.

I got the point. We can do this by either sending the signal `SIGKILL` or executing the command `supervisorctl restart telemetry`. However, our main focus is how to do some cleanup before gracefully stopping the telemetry server process and disconnecting from the gNMI client.
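To make the graceful shutdown concern concrete, here is a small sketch of one common pattern: stop accepting new requests, then wait up to a deadline for in-flight ones to finish before closing connections. The `InFlightTracker` class is hypothetical and is not taken from sonic-telemetry. It is written in Python only to match the checker script.

```python
import threading


class InFlightTracker:
    """Counts active requests so that shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._accepting = True

    def try_begin(self):
        """Registers a new request; returns False once draining has started."""
        with self._cond:
            if not self._accepting:
                return False
            self._active += 1
            return True

    def end(self):
        """Marks one request as finished and wakes any waiting drain() call."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def drain(self, timeout_secs):
        """Stops accepting requests and waits for active ones.

        Returns True if all requests finished before the timeout.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._active == 0,
                                       timeout=timeout_secs)
```

A server would call `drain()` from its main loop after a signal handler has set a flag, not from inside the handler itself, and would close client connections only after `drain()` returns.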

Please check my first comment in this thread. You did not get the point of using SIGHUP.