-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[telemetry] Rotate streaming telemetry secrets. #9600
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…uration file. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This pull request introduces 2 alerts and fixes 3 when merging 90d9d03 into 34be53a - view on LGTM.com new alerts:
fixed alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This pull request introduces 1 alert and fixes 3 when merging 0bfcec9 into 67e40b5 - view on LGTM.com new alerts:
fixed alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
root_process_pid = os.getppid() | ||
syslog.syslog(syslog.LOG_INFO, | ||
"Restarting streaming telemetry service by terminating the process with pid: '{}'".format(root_process_pid)) | ||
os.kill(root_process_pid, signal.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The modern convention is to use sighup (ref: https://stackoverflow.com/a/28327659/2514803 ).
The benefit is not to explicitly terminate the other process and trigger critical process monitor alerts.
You may need to add the sighup handler in sonic-telemetry if not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably use the supervisorctl restart telemetry
command to only restart the streaming telemetry server process once the secrets were rotated. This can avoid triggerring the critical process alerts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restarting is using SIGTERM
and SIGKILL
internally. One big concern is graceful shutdown. Considering the client will often fetch large amount of data, graceful shutdown will make client easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the point. We can do this by either sending the signal SIGKILL
or executing the command supervisorctl restart telemetry
. However, our main focus is how we can do some cleanup before gracefully stopping the telemetry server process and disconnecting with gNMI client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check my first comment in this thread. You did not get the point of using sighup.
# Install Python module for inotify | ||
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install inotify | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove extra empty line. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
||
|
||
def main(): | ||
certificate_rollover_check() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try inotify (https://www.linuxjournal.com/content/linux-filesystem-events-inotify ) instead of reinvent the wheel? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As comments
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…te and private key. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This pull request introduces 1 alert when merging 17c4d60 into 58c5bb6 - view on LGTM.com new alerts:
|
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
/azp run |
You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list. |
/azp run Azure.sonic-buildimage |
Azure Pipelines successfully started running 1 pipeline(s). |
"Certificate and private key were rotated and restarting telemetry server process ...") | ||
restart_telemetry_server() | ||
|
||
# Wait for specified seconds and then do the next round checking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially what I am thinking is since the directory /etc/sonic/credentials/
will store all the certificates and keys for different applications in the future and if one or several certificates under this directory are rotated too frequently (by accident?), this event loop will be executed continuously and then burn too much CPU cycles potentially. Another thinking the polling frequency of our internal tool to get the update from certificate storage is 8 ~ 24 hours. That's why I put a sleep function at here.
I will let inotify
to monitor the certificate file directly instead of monitoring the /etc/sonic/credentials/
and then we can remove this sleep function in event loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your motivation, and the considering may be helpful. However, event you sleep, the event will be queued and processed after each sleep, you don't save too much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree and this will not save too much!
inotify_instance.add_watch(CREDENTIALS_DIR_PATH) | ||
for event in inotify_instance.event_gen(yield_nones=False): | ||
header, event_type, monitoring_path, file_name = event | ||
if (file_name == certificate_file_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FSM logic is complex and may be messed up by some input sequence. Could you use one file as the main indicator, and always rotate if that file changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and use the rotation of certificate file as the main indicator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure describe the main file in document? This is very critical design assumption and the cert rotator should treat it as a contract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the design document.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
||
inotify_instance = inotify.adapters.Inotify() | ||
inotify_instance.add_watch(CREDENTIALS_DIR_PATH) | ||
for event in inotify_instance.event_gen(yield_nones=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inotify
is inode
based and it will monitor the credentials directory /etc/sonic/credentials/
to see whether the telemetry certificate file was rotated or not. If certificate file was deleted by accidentally, the inotify_instance
will not be impacted.
I updated the PR to log an error message if the certificate was deleted. What I am thinking is if the certificate was restored later, then it can be treated as a kind of rotation
operation and the telemetry server will be restarted by this script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If certificate file was deleted by accidentally, what is the expected behavior?
I am considering in this case, we can kill telemetry daemon.
"Failed to retrieve the certificate information from 'TELEMETRY' table!") | ||
sys.exit(4) | ||
|
||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and checks the existence of both two files.
…sted on device and log an error message if certificate file was deleted accidentally. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…P` signal. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This pull request introduces 1 alert when merging c6e2fdd into 61e9a76 - view on LGTM.com new alerts:
|
…ing script. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
This pull request introduces 1 alert when merging 41377a1 into 1ac140a - view on LGTM.com new alerts:
|
@yozhao101 could you please make sure your PR is rebased to latest and rerun checkers? |
@yozhao101 could you please rebase your work on latest and rerun checkers? |
@qiluo-msft Can you help me review this PR please? |
Thanks @liat-grozovik for your reviewing! I will rebase my PR. |
|
||
while True: | ||
if not os.path.exists(certificate_file_path) or not os.path.exists(private_key_file_path): | ||
syslog.syslog(syslog.LOG_ERR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yozhao101 any update on this PR? should it go to 202205? if so, can you add the proper label? |
@yozhao101 can you please help to response the comments and provide an update on this PR? Thanks. |
Hi @yozhao101 this looks like a very useful change. Would like to know what is the status of this PR and what's the plan to get it to master? Thanks. |
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Initially the secrets of Streaming Telemetry container were managed and deployed by several internal tools. Since the secrets of Streaming Telemetry were not always be able to be deployed successfully on SONiC production devices, we decide to leverage an exising feature in SONiC to manage and deploy the secrets.
How I did it
I mainly did the following two changes:
First I added a python script named
certificate_rotation_checker
in Streaming Telemtry container and this script will check periodically in background to see whether the certificate and private key were rotated and updated. The Streaming Telemetry server process will be restarted if and only if the certificate and key were actually rotated. We have another open PR to reload gNMI server configuration in telemetry repo: sonic-net/sonic-telemetry#92.Second, In the script
telemetry.sh
, the telemetry server process will be started if and only if the certificate and private key files did exist on the device. Otherwise, it will check the existence of certificate and private key files on device periodically.How to verify it
I did this implementation and tested them on a virtual switch.
Which release branch to backport (provide reason below if selected)
Description for the changelog
A picture of a cute animal (not mandatory but encouraged)