[telemetry] Rotate streaming telemetry secrets. #9600
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,225 @@ | ||
#!/usr/bin/env python3 | ||
|
||
""" | ||
certificate_rotation_checker | ||
|
||
This script will be leveraged to periodically check whether the certificate and private key | ||
files of streaming telemetry were rotated by dSMS service or not. The streaming telemetry | ||
server process will be restarted if the certificate and private key are rotated by dSMS service | ||
and then updated by the acms agent running in ACMS container. | ||
""" | ||
|
||
import os | ||
import subprocess | ||
import sys | ||
import syslog | ||
import time | ||
|
||
import inotify.adapters | ||
|
||
from swsscommon import swsscommon | ||
|
||
MAX_RETRY_TIMES = 10 | ||
CERTIFICATE_CHECKING_INTERVAL_SECS = 3600 | ||
|
||
CREDENTIALS_DIR_PATH = "/etc/sonic/credentials/" | ||
|
||
|
||
def get_command_result(command): | ||
"""Executes the command and returns the exit code and resulting output. | ||
|
||
Args: | ||
command: A string containing the command to be executed. | ||
|
||
Returns: | ||
An integer indicating the exit code. | ||
A string containing the output of the command. | ||
""" | ||
command_stdout = "" | ||
command_stderr = "" | ||
|
||
try: | ||
proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, | ||
shell=True, universal_newlines=True) | ||
command_stdout, command_stderr = proc_instance.communicate() | ||
except (OSError, ValueError) as err: | ||
syslog.syslog(syslog.LOG_ERR, "Failed to execute the command '{}'. Error: '{}'" | ||
.format(command, err)) | ||
return 2, command_stderr | ||
|
||
return proc_instance.returncode, command_stdout.strip() | ||
|
||
|
||
def check_telemetry_server_running(): | ||
"""Checks whether the telemetry server process is running. | ||
|
||
Args: | ||
None. | ||
|
||
Returns: | ||
None. | ||
""" | ||
processes_status_cmd = "supervisorctl status" | ||
retry_times = 0 | ||
is_running = False | ||
|
||
while retry_times <= MAX_RETRY_TIMES: | ||
retry_times += 1 | ||
exit_code, command_stdout = get_command_result(processes_status_cmd) | ||
if exit_code != 0: | ||
syslog.syslog(syslog.LOG_INFO, | ||
"Failed to get the processes running status in telemetry container and retry after 60 seconds ...") | ||
time.sleep(60) | ||
else: | ||
for line in command_stdout.splitlines(): | ||
if "telemetry" in line and "RUNNING" in line: | ||
is_running = True | ||
break | ||
if is_running: | ||
syslog.syslog(syslog.LOG_INFO, | ||
"Telemetry server process is running after certificate and private key were rotated!") | ||
break | ||
|
||
if not is_running: | ||
syslog.syslog(syslog.LOG_ERR, | ||
"Telemetry server process is not running after certificate and private key were rotated and exiting ...") | ||
sys.exit(1) | ||
|
||
|
||
def restart_telemetry_server(): | ||
"""Restarts the telemetry server process by Supervisord and then checks | ||
it is actually running. | ||
|
||
Args: | ||
None | ||
|
||
Returns: | ||
None | ||
""" | ||
restart_telemetry_server_cmd = "supervisorctl restart telemetry" | ||
retry_times = 0 | ||
|
||
while retry_times <= MAX_RETRY_TIMES: | ||
retry_times += 1 | ||
exit_code, command_stdout = get_command_result(restart_telemetry_server_cmd) | ||
if exit_code != 0: | ||
syslog.syslog(syslog.LOG_INFO, | ||
"Failed to restart telemetry server process and retry after 60 seconds ...") | ||
time.sleep(60) | ||
else: | ||
break | ||
|
||
if retry_times > MAX_RETRY_TIMES: | ||
syslog.syslog(syslog.LOG_ERR, | ||
"Failed to restart telemetry server process after trying '{}' times and exiting ..." | ||
.format(MAX_RETRY_TIMES)) | ||
sys.exit(2) | ||
|
||
check_telemetry_server_running() | ||
|
||
|
||
def check_certificate_rotated(certificate_file_name, private_key_file_name): | ||
"""Leverages the 'inotify' module to monitor the file system events under the | ||
directory which stores the SONiC credentials and restarts telemetry server | ||
process if certificate and private key were rotated. | ||
|
||
|
||
Args: | ||
certificate_file_name: A string indicating the telemetry certificate file name. | ||
private_key_file_name: A string indicating the telemetry private key file name. | ||
|
||
Returns: | ||
None. | ||
""" | ||
certificate_file_rotated = False | ||
private_key_file_rotated = False | ||
|
||
inotify_instance = inotify.adapters.Inotify() | ||
inotify_instance.add_watch(CREDENTIALS_DIR_PATH) | ||
for event in inotify_instance.event_gen(yield_nones=False): | ||
header, event_type, monitoring_path, file_name = event | ||
if (file_name == certificate_file_name | ||
The FSM logic is complex and may be broken by some input sequences. Could you use one file as the main indicator, and always rotate if that file changed?
Updated to use the rotation of the certificate file as the main indicator.
Please make sure to describe the main file in the design document. This is a very critical design assumption, and the cert rotator should treat it as a contract.
Updated the design document. |
||
and ("IN_CREATE" in event_type or "IN_MOVED_TO" in event_type)): | ||
certificate_file_rotated = True | ||
if (file_name == private_key_file_name | ||
and ("IN_CREATE" in event_type or "IN_MOVED_TO" in event_type)): | ||
private_key_file_rotated = True | ||
|
||
if certificate_file_rotated and private_key_file_rotated: | ||
certificate_file_rotated = False | ||
private_key_file_rotated = False | ||
syslog.syslog(syslog.LOG_INFO, | ||
"Certificate and private key were rotated and restarting telemetry server process ...") | ||
restart_telemetry_server() | ||
|
||
# Wait for specified seconds and then do the next round checking | ||
Initially what I am thinking is since the directory I will let
I understand your motivation, and the consideration may be helpful. However, even if you sleep, the events will be queued and processed after each sleep, so you don't save much.
Agreed, this will not save much. |
||
syslog.syslog(syslog.LOG_INFO, | ||
"Sleeping '{}' seconds before doing the next round certificate rotation checking ..." | ||
.format(CERTIFICATE_CHECKING_INTERVAL_SECS)) | ||
time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS) | ||
|
||
|
||
def certificate_rotated_checker(): | ||
"""Checks rotation of certificate and key files and restarts streaming telemetry server if necessary. | ||
|
||
Leverages 'inotify' module to check whether the certificate and private key files of | ||
streaming telemetry were already rotated by dSMS service and updated by acms agent running | ||
in ACMS container. The streaming telemetry server process will be restarted if they were rotated. | ||
|
||
Args: | ||
None | ||
|
||
Returns: | ||
None | ||
""" | ||
certificate_file_path = "" | ||
private_key_file_path = "" | ||
certificate_file_name = "" | ||
private_key_file_name = "" | ||
|
||
config_db = swsscommon.DBConnector("CONFIG_DB", 0) | ||
telemetry_table = swsscommon.Table(config_db, "TELEMETRY") | ||
telemetry_table_keys = telemetry_table.getKeys() | ||
if "certs" in telemetry_table_keys: | ||
certs_info = dict(telemetry_table.get("certs")[1]) | ||
if "server_crt" in certs_info and "server_key" in certs_info: | ||
certificate_file_path = certs_info["server_crt"] | ||
private_key_file_path = certs_info["server_key"] | ||
syslog.syslog(syslog.LOG_INFO, "Path of certificate file is '{}'".format(certificate_file_path)) | ||
syslog.syslog(syslog.LOG_INFO, "Path of key file is '{}'".format(private_key_file_path)) | ||
else: | ||
syslog.syslog(syslog.LOG_ERR, | ||
"Failed to retrieve the path of certificate and key file from 'TELEMETRY' table!") | ||
sys.exit(3) | ||
else: | ||
syslog.syslog(syslog.LOG_ERR, | ||
"Failed to retrieve the certificate and key information from 'TELEMETRY' table!") | ||
sys.exit(4) | ||
|
||
while True: | ||
Updated to check the existence of both files. |
||
if not os.path.exists(certificate_file_path) or not os.path.exists(private_key_file_path): | ||
syslog.syslog(syslog.LOG_ERR, | ||
||
"Certificate or key file did not exist on device and sleep '{}' seconds to check again ..." | ||
.format(CERTIFICATE_CHECKING_INTERVAL_SECS)) | ||
time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS) | ||
else: | ||
break | ||
|
||
certificate_file_name = certificate_file_path.strip().split("/")[-1] | ||
private_key_file_name = private_key_file_path.strip().split("/")[-1] | ||
syslog.syslog(syslog.LOG_INFO, "cer_file_name: {}, key_file_name: {}".format(certificate_file_name, private_key_file_name)) | ||
if not certificate_file_name or not private_key_file_name: | ||
syslog.syslog(syslog.LOG_ERR, | ||
"Failed to retrieve the file name of certificate or private key!") | ||
sys.exit(5) | ||
|
||
check_certificate_rotated(certificate_file_name, private_key_file_name) | ||
|
||
|
||
def main(): | ||
certificate_rotated_checker() | ||
|
||
|
||
if __name__ == "__main__": | ||
main() | ||
sys.exit(0) |
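The inline review thread on `check_certificate_rotated` settled on using the certificate file alone as the rotation indicator. The following is a minimal sketch of that simplification, not the code from the PR. It assumes event tuples shaped like the ones unpacked in the diff (`header, event_type, monitoring_path, file_name`). The helper name `is_certificate_rotation` and the sample file names are hypothetical.

```python
# Event type names that indicate the file was (re)created in the watched directory.
ROTATION_EVENT_TYPES = {"IN_CREATE", "IN_MOVED_TO"}


def is_certificate_rotation(event, certificate_file_name):
    """Return True if the event shows the certificate file was created or moved in.

    `event` is assumed to be a 4-tuple (header, event_types, watch_path, file_name),
    matching how the diff unpacks events. Only the certificate file is consulted;
    events for the private key are ignored on purpose.
    """
    _header, event_types, _watch_path, file_name = event
    return (file_name == certificate_file_name
            and not ROTATION_EVENT_TYPES.isdisjoint(event_types))


# Hypothetical event sequence: key replaced, certificate modified, certificate replaced.
sample_events = [
    (None, ["IN_MOVED_TO"], "/etc/sonic/credentials/", "server.key"),
    (None, ["IN_MODIFY"], "/etc/sonic/credentials/", "server.cer"),
    (None, ["IN_MOVED_TO"], "/etc/sonic/credentials/", "server.cer"),
]

# Count how many restarts this sequence would trigger.
restart_count = sum(
    1 for event in sample_events if is_certificate_rotation(event, "server.cer")
)
print(restart_count)
```

With a single indicator there is no pair of flags to keep in sync, so the order in which the certificate and key events arrive no longer matters. The trade-off, as the thread notes, is that the rotator has to treat "the certificate is written last" as a contract.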
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
program:telemetry | ||
program:dialout | ||
program:certificate_rollover_checker |
In an extreme case where the file is deleted by a malicious user, will the inotify_instance still work? I think the watch is linked to the inode, and deleting the file will destroy the inode.
If this is true, a crash is better than a dead loop.
`inotify` is `inode` based, and it monitors the credentials directory `/etc/sonic/credentials/` to see whether the telemetry certificate file was rotated or not. If the certificate file is deleted accidentally, the `inotify_instance` will not be impacted.
I updated the PR to log an error message if the certificate was deleted. My thinking is that if the certificate is restored later, it can be treated as a kind of `rotation` operation, and the telemetry server will be restarted by this script.
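As a sketch of the behaviour described in this reply, the function below maps the event types seen on the watched directory to an action. A deletion of the certificate is only logged, while a later creation or move into place counts as a rotation. The function name, the returned action strings, and the sample file name are hypothetical. The event type names are standard inotify event names.

```python
def classify_certificate_event(file_name, event_types, certificate_file_name):
    """Map a directory event to an action for the certificate file.

    Returns one of the hypothetical action strings:
      "restart"   - certificate was created or moved into place (treated as rotation)
      "log_error" - certificate was deleted or moved away
      "ignore"    - event concerns another file or an unrelated event type
    """
    if file_name != certificate_file_name:
        return "ignore"
    if "IN_CREATE" in event_types or "IN_MOVED_TO" in event_types:
        return "restart"
    if "IN_DELETE" in event_types or "IN_MOVED_FROM" in event_types:
        return "log_error"
    return "ignore"


# Deleting and then restoring the certificate yields an error log followed by a restart.
actions = [
    classify_certificate_event("server.cer", ["IN_DELETE"], "server.cer"),
    classify_certificate_event("server.cer", ["IN_CREATE"], "server.cer"),
]
print(actions)
```

Because the watch is placed on the directory rather than on the file, the same watch sees both the deletion and the later restore.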
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the certificate file is deleted accidentally, what is the expected behavior?
I am considering that in this case we could kill the telemetry daemon.
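One possible reading of this suggestion is sketched below, under assumptions: on a deletion event for the certificate, the checker asks Supervisord to stop the `telemetry` program (the program name used in the diff). The function names and the injectable `runner` parameter are hypothetical and are not part of the PR. The thread does not say whether this behaviour was adopted.

```python
import subprocess

# Assumes the telemetry server is managed by Supervisord under the name "telemetry".
STOP_TELEMETRY_CMD = ["supervisorctl", "stop", "telemetry"]


def run_command(command):
    """Run a command without a shell and return its exit code."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return completed.returncode


def handle_certificate_deleted(event_types, runner=run_command):
    """Stop the telemetry server if the event types include a deletion.

    `runner` is injectable so the logic can be exercised without Supervisord.
    Returns True if a stop was attempted, False otherwise.
    """
    if "IN_DELETE" in event_types:
        runner(STOP_TELEMETRY_CMD)
        return True
    return False
```

Passing a stub as `runner` makes it possible to check the decision logic on a machine where `supervisorctl` is not installed.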