Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[telemetry] Rotate streaming telemetry secrets. #9600

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
a921b8e
[telemetry] Roll over streaming telemetry secrets by ACMS.
yozhao101 Dec 21, 2021
7da4005
[telemetry] Fix the indent in the script `telemetry.sh`.
yozhao101 Dec 21, 2021
ede36b9
[telemetry] Fix the indent issue in the script `telemetry.sh`.
yozhao101 Dec 21, 2021
7c3c7d3
[telemetry] Add the field of `buffer_size=1024` in supervisord config…
yozhao101 Dec 21, 2021
90d9d03
[telemetry] Fix the indent issue.
yozhao101 Dec 21, 2021
0bfcec9
[telemetry] Fix a syntax error.
yozhao101 Dec 21, 2021
9db5b04
[telemetry] Fix a syntax error.
yozhao101 Dec 21, 2021
b1844b7
[telemetry] Install `inotify` module.
yozhao101 Jan 6, 2022
e28f64d
[telemetry] Install the inotify.
yozhao101 Jan 7, 2022
536fc4c
[telemetry] Install the `inotify` in telemetry container.
yozhao101 Jan 10, 2022
17c4d60
[telemetry] Use `inotify` module to monitor the rotation of certifica…
yozhao101 Jan 10, 2022
59dfe9b
[telemetry] Remove the unused import from script.
yozhao101 Jan 10, 2022
ce9f713
[telemetry] Use the rotation of certificate file as an indicator.
yozhao101 Jan 12, 2022
68bd0b6
[telemetry] Change the script to be executable.
yozhao101 Jan 12, 2022
6ad0ef5
[telemetry] Remove the unused import.
yozhao101 Jan 12, 2022
6ce4e4d
[telemetry] Change the script name.
yozhao101 Jan 12, 2022
6512a6f
[telemetry] Checks whether both certificate and private key files exi…
yozhao101 Jan 18, 2022
c6e2fdd
[telemetry] Reload telemetry server configuration by sending a `SIGHU…
yozhao101 Jan 25, 2022
3dbb79d
[telemetry] Fix the undefinied fields issue in telemetry server start…
yozhao101 Jan 26, 2022
0881704
[telemetry] Fix the indentation.
yozhao101 Jan 26, 2022
41377a1
[telemetry] Fix the indentation issue.
yozhao101 Jan 26, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion dockers/docker-sonic-telemetry/Dockerfile.j2
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ RUN apt-get clean -y && \
apt-get autoremove -y && \
rm -rf /debs

COPY ["start.sh", "telemetry.sh", "dialout.sh", "/usr/bin/"]
COPY ["start.sh", "telemetry.sh", "dialout.sh", "certificate_rollover_checker", "/usr/bin/"]
COPY ["telemetry_vars.j2", "/usr/share/sonic/templates/"]
COPY ["supervisord.conf", "/etc/supervisor/conf.d/"]
COPY ["files/supervisor-proc-exit-listener", "/usr/bin"]
Expand Down
136 changes: 136 additions & 0 deletions dockers/docker-sonic-telemetry/certificate_rollover_checker
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
#!/usr/bin/env python3

"""
certificate_rollover_checker

This script will be leveraged to periodically check whether the certificate and private key
files of streaming telemetry were rolled over by dSMS service or not. The streaming telemetry
container will be restarted if the certificate and private key are rolled over by dSMS service
and then updated by ACMS agent running in ACMS container.
"""

import os
import signal
import sys
import syslog
import time

from swsscommon import swsscommon

CERTIFICATE_CHECKING_INTERVAL_SECS = 3600


def get_file_last_mod_time(file_path):
"""Gets the last modification time of a specific file. Args:
file_path: A string represents the file path.

Returns:
last_mod_time: A float number in seconds represents the last moditification time of file
since epoch.
"""
last_mod_time = 0.0

try:
last_mod_time = os.path.getmtime(file_path)
except OSError as error:
syslog.syslog(syslog.LOG_ERR,
"Could not get last modification time of the file and error message is '{}'.".format(error))
sys.exit(1)

return last_mod_time


def restart_streaming_telemetry():
"""Restarts the streaming telemetry container by terminating the root process.

Args:
None

Returns:
None
"""
root_process_pid = os.getppid()
syslog.syslog(syslog.LOG_INFO,
"Restarting streaming telemetry service by terminating the process with pid: '{}'".format(root_process_pid))
os.kill(root_process_pid, signal.SIGTERM)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGTERM

The modern convention is to use sighup (ref: https://stackoverflow.com/a/28327659/2514803 ).

The benefit is not to explicitly terminate the other process and trigger critical process monitor alerts.

You may need to add the sighup handler in sonic-telemetry if not supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably use the supervisorctl restart telemetry command to only restart the streaming telemetry server process once the secrets were rotated. This can avoid triggerring the critical process alerts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarting is using SIGTERM and SIGKILL internally. One big concern is graceful shutdown. Considering the client will often fetch large amount of data, graceful shutdown will make client easier.

Copy link
Contributor Author

@yozhao101 yozhao101 Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the point. We can do this by either sending the signal SIGKILL or executing the command supervisorctl restart telemetry. However, our main focus is how we can do some cleanup before gracefully stopping the telemetry server process and disconnecting with gNMI client side.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my first comment in this thread. You did not get the point of using sighup.



def certificate_rollover_check():
"""Checks certificate and key files and restart streaming telemetry contianer if necessary.

Checks the last modification time of certificate and private key files of streaming telemetry
to see whether they were already rolled over by dSMS service and updated by ACMS agent running
in ACMS container. The streaming telemetry container will be restarted if they were rolled over.

Args:
None

Returns:
None
"""
certificate_path = ""
private_key_path = ""
certificate_last_mod_time = 0
private_key_last_mod_time = 0

config_db = swsscommon.DBConnector("CONFIG_DB", 0)
telemetry_table = swsscommon.Table(config_db, "TELEMETRY")
telemetry_table_keys = telemetry_table.getKeys()
if "certs" in telemetry_table_keys:
certs_info = dict(telemetry_table.get("certs")[1])
if "server_crt_acms" in certs_info and "server_key_acms" in certs_info:
certificate_path = certs_info["server_crt_acms"]
private_key_path = certs_info["server_key_acms"]
syslog.syslog(syslog.LOG_INFO, "Path of certificate file is '{}'".format(certificate_path))
syslog.syslog(syslog.LOG_INFO, "Path of key file is '{}'".format(private_key_path))
else:
syslog.syslog(syslog.LOG_ERR,
"Failed to retrieve the path of certificate and key file from 'TELEMETRY' table!")
sys.exit(2)
else:
syslog.syslog(syslog.LOG_ERR,
"Failed to retrieve the certificate information from 'TELEMETRY' table!")
sys.exit(3)

while True:
if not os.path.exists(certificate_path) or not os.path.exists(private_key_path):
syslog.syslog(syslog.LOG_ERR,
"Certificate or key file did not exist on device and sleep '{}' seconds to check again ...".format(CERTIFICATE_CHECKING_INTERVAL_SECS))
time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS)
else:
break

certificate_last_mod_time = get_file_last_mod_time(certificate_path)
private_key_last_mod_time = get_file_last_mod_time(private_key_path)

while True:
certificate_mod_time = get_file_last_mod_time(certificate_path)
private_key_mod_time = get_file_last_mod_time(private_key_path)
syslog.syslog(syslog.LOG_INFO,
"Last modification time of certificate file is: '{}'".format(time.ctime(certificate_last_mod_time)))
syslog.syslog(syslog.LOG_INFO,
"Last modification time of key file is: '{}'".format(time.ctime(private_key_last_mod_time)))

if (certificate_mod_time > certificate_last_mod_time
or private_key_mod_time > private_key_last_mod_time):
syslog.syslog(syslog.LOG_INFO,
"Last modification time of certificate file is changed to '{}': ".format(time.ctime(certificate_mod_time)))
syslog.syslog(syslog.LOG_INFO,
"Last modification time of key file is changed to '{}': ".format(time.ctime(private_key_mod_time)))
syslog.syslog(syslog.LOG_INFO,
"Secrets were rolled over and restarting streaming telemetry service ...")
restart_streaming_telemetry()

# Wait for specified seconds and then do the next round checking
syslog.syslog(syslog.LOG_INFO,
"Sleeping '{}' seconds before doing the next round rollover checking ...".format(CERTIFICATE_CHECKING_INTERVAL_SECS))
time.sleep(CERTIFICATE_CHECKING_INTERVAL_SECS)


def main():
certificate_rollover_check()
Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

certificate_rollover_check

Could you try inotify (https://www.linuxjournal.com/content/linux-filesystem-events-inotify ) instead of reinvent the wheel? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.



if __name__ == "__main__":
main()
sys.exit(0)
1 change: 1 addition & 0 deletions dockers/docker-sonic-telemetry/critical_processes
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
program:telemetry
program:dialout
program:certificate_rollover_checker
10 changes: 10 additions & 0 deletions dockers/docker-sonic-telemetry/supervisord.conf
Original file line number Diff line number Diff line change
Expand Up @@ -58,3 +58,13 @@ stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=telemetry:running

[program:certificate_rollover_checker]
command=/usr/bin/certificate_rollover_checker
priority=5
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=start:exited
78 changes: 53 additions & 25 deletions dockers/docker-sonic-telemetry/telemetry.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,35 +19,63 @@ CERTS=$(echo $TELEMETRY_VARS | jq -r '.certs')
TELEMETRY_ARGS=" -logtostderr"
export CVL_SCHEMA_PATH=/usr/sbin/schema

if [ -n "$CERTS" ]; then
SERVER_CRT=$(echo $CERTS | jq -r '.server_crt')
SERVER_KEY=$(echo $CERTS | jq -r '.server_key')
if [ -z $SERVER_CRT ] || [ -z $SERVER_KEY ]; then
TELEMETRY_ARGS+=" --insecure"
else
TELEMETRY_ARGS+=" --server_crt $SERVER_CRT --server_key $SERVER_KEY "
fi
while true
do
if [ -n "$CERTS" ]; then
SERVER_CRT=$(echo $CERTS | jq -r '.server_crt')
SERVER_KEY=$(echo $CERTS | jq -r '.server_key')
CA_CRT=$(echo $CERTS | jq -r '.ca_crt')

CA_CRT=$(echo $CERTS | jq -r '.ca_crt')
if [ ! -z $CA_CRT ]; then
TELEMETRY_ARGS+=" --ca_crt $CA_CRT"
fi
elif [ -n "$X509" ]; then
SERVER_CRT=$(echo $X509 | jq -r '.server_crt')
SERVER_KEY=$(echo $X509 | jq -r '.server_key')
if [ -z $SERVER_CRT ] || [ -z $SERVER_KEY ]; then
TELEMETRY_ARGS+=" --insecure"
logger "Trying to retrieve server certificate, key and Root CA certificate managed by HwProxy ..."
logger "The file path of server certificate in CONFIG_DB is: $SERVER_CRT"
logger "The file path of server provate key in CONFIG_DB is: $SERVER_KEY"
logger "The file path of Root CA certificate in CONFIG_DB is: $CA_CRT"

if [[ -f $SERVER_CRT && -f $SERVER_KEY && -f $CA_CRT ]]; then
logger "Succeeded in retrieving server certificate, key and Root CA certificate from HwProxy."
TELEMETRY_ARGS+=" --server_crt $SERVER_CRT --server_key $SERVER_KEY --ca_crt $CA_CRT"
break
else
logger "Failed to retrieve server certificate, key or Root CA certificate from HwProxy!"
fi

SERVER_CRT_ACMS=$(echo $CERTS | jq -r '.server_crt_acms')
SERVER_KEY_ACMS=$(echo $CERTS | jq -r '.server_key_acms')
CA_CRT_ACMS=$(echo $CERTS | jq -r '.ca_crt_acms')

logger "Trying to retrieve server certificate, key and Root CA certificate managed by ACMS ..."
logger "The file path of server certificate in CONFIG_DB is: $SERVER_CRT_ACMS"
logger "The file path of server private key in CONFIG_DB is: $SERVER_KEY_ACMS"
logger "The file path of Root CA certificate in CONFIG_DB is: $CA_CRT_ACMS"

if [[ -f $SERVER_CRT_ACMS && -f $SERVER_KEY_ACMS && -f $CA_CRT_ACMS ]]; then
logger "Succeeded in retrieving the certificate, key and Root CA certificate from ACMS."
continue
else
logger "Failed to retrieve server certificate, key or Root CA certificate from ACMS!"
fi
elif [ -n "$X509" ]; then
SERVER_CRT=$(echo $X509 | jq -r '.server_crt')
SERVER_KEY=$(echo $X509 | jq -r '.server_key')
if [ -z $SERVER_CRT ] || [ -z $SERVER_KEY ]; then
TELEMETRY_ARGS+=" --insecure"
else
TELEMETRY_ARGS+=" --server_crt $SERVER_CRT --server_key $SERVER_KEY "
fi

CA_CRT=$(echo $X509 | jq -r '.ca_crt')
if [ ! -z $CA_CRT ]; then
TELEMETRY_ARGS+=" --ca_crt $CA_CRT"
fi
break
else
TELEMETRY_ARGS+=" --server_crt $SERVER_CRT --server_key $SERVER_KEY "
TELEMETRY_ARGS+=" --noTLS"
break
fi

CA_CRT=$(echo $X509 | jq -r '.ca_crt')
if [ ! -z $CA_CRT ]; then
TELEMETRY_ARGS+=" --ca_crt $CA_CRT"
fi
else
TELEMETRY_ARGS+=" --noTLS"
fi
logger "Sleeping 3600 seconds and checks the existence of secrets again ..."
sleep 3600
done

# If no configuration entry exists for TELEMETRY, create one default port
if [ -z "$GNMI" ]; then
Expand Down
4 changes: 4 additions & 0 deletions files/build_templates/sonic_debian_extension.j2
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,10 @@ sudo rm -rf $FILESYSTEM_ROOT/$REDIS_DUMP_LOAD_PY3_WHEEL_NAME
# Install Python module for psutil
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install psutil

# Install Python module for inotify
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install inotify


Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra empty line. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

# Install Python module for ipaddr
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install ipaddr

Expand Down
5 changes: 4 additions & 1 deletion src/sonic-config-engine/minigraph.py
Original file line number Diff line number Diff line change
Expand Up @@ -1635,7 +1635,10 @@ def parse_xml(filename, platform=None, port_config_file=None, asic_name=None, hw
'certs': {
'server_crt': '/etc/sonic/telemetry/streamingtelemetryserver.cer',
'server_key': '/etc/sonic/telemetry/streamingtelemetryserver.key',
'ca_crt': '/etc/sonic/telemetry/dsmsroot.cer'
'ca_crt': '/etc/sonic/telemetry/dsmsroot.cer',
'server_crt_acms': '/etc/sonic/credentials/streamingtelemetryserver.cer',
'server_key_acms': '/etc/sonic/credentials/streamingtelemetryserver.key',
'ca_crt_acms': '/etc/sonic/credentials/dsmsroot.cer'
}
}
results['RESTAPI'] = {
Expand Down