Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[telemetry] Rotate streaming telemetry secrets. #9600

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

yozhao101
Copy link
Contributor

@yozhao101 yozhao101 commented Dec 21, 2021

Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it

Initially the secrets of Streaming Telemetry container were managed and deployed by several internal tools. Since the secrets of Streaming Telemetry were not always be able to be deployed successfully on SONiC production devices, we decide to leverage an exising feature in SONiC to manage and deploy the secrets.

How I did it

I mainly did the following two changes:

First I added a python script named certificate_rotation_checker in Streaming Telemtry container and this script will check periodically in background to see whether the certificate and private key were rotated and updated. The Streaming Telemetry server process will be restarted if and only if the certificate and key were actually rotated. We have another open PR to reload gNMI server configuration in telemetry repo: sonic-net/sonic-telemetry#92.

Second, In the script telemetry.sh, the telemetry server process will be started if and only if the certificate and private key files did exist on the device. Otherwise, it will check the existence of certificate and private key files on device periodically.

How to verify it

I did this implementation and tested them on a virtual switch.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101 yozhao101 changed the title [telemetry] Roll over streaming telemetry secrets by ACMS. [telemetry] Roll over streaming telemetry secrets. Dec 21, 2021
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…uration file.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Dec 21, 2021

This pull request introduces 2 alerts and fixes 3 when merging 90d9d03 into 34be53a - view on LGTM.com

new alerts:

  • 1 for Too few arguments in formatting call
  • 1 for Syntax error

fixed alerts:

  • 2 for Unused local variable
  • 1 for Except block handles 'BaseException'

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Dec 21, 2021

This pull request introduces 1 alert and fixes 3 when merging 0bfcec9 into 67e40b5 - view on LGTM.com

new alerts:

  • 1 for Syntax error

fixed alerts:

  • 2 for Unused local variable
  • 1 for Except block handles 'BaseException'

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lguohan lguohan requested a review from qiluo-msft January 5, 2022 18:52
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
root_process_pid = os.getppid()
syslog.syslog(syslog.LOG_INFO,
"Restarting streaming telemetry service by terminating the process with pid: '{}'".format(root_process_pid))
os.kill(root_process_pid, signal.SIGTERM)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGTERM

The modern convention is to use sighup (ref: https://stackoverflow.com/a/28327659/2514803 ).

The benefit is not to explicitly terminate the other process and trigger critical process monitor alerts.

You may need to add the sighup handler in sonic-telemetry if not supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably use the supervisorctl restart telemetry command to only restart the streaming telemetry server process once the secrets were rotated. This can avoid triggerring the critical process alerts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarting is using SIGTERM and SIGKILL internally. One big concern is graceful shutdown. Considering the client will often fetch large amount of data, graceful shutdown will make client easier.

Copy link
Contributor Author

@yozhao101 yozhao101 Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the point. We can do this by either sending the signal SIGKILL or executing the command supervisorctl restart telemetry. However, our main focus is how we can do some cleanup before gracefully stopping the telemetry server process and disconnecting with gNMI client side.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my first comment in this thread. You did not get the point of using sighup.

# Install Python module for inotify
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip3 install inotify


Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra empty line. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.



def main():
certificate_rollover_check()
Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

certificate_rollover_check

Could you try inotify (https://www.linuxjournal.com/content/linux-filesystem-events-inotify ) instead of reinvent the wheel? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Collaborator

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comments

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…te and private key.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101 yozhao101 marked this pull request as ready for review January 10, 2022 19:52
@yozhao101 yozhao101 requested a review from lguohan as a code owner January 10, 2022 19:52
@yozhao101 yozhao101 changed the title [telemetry] Roll over streaming telemetry secrets. [telemetry] Rotate streaming telemetry secrets. Jan 10, 2022
@lgtm-com
Copy link

lgtm-com bot commented Jan 10, 2022

This pull request introduces 1 alert when merging 17c4d60 into 58c5bb6 - view on LGTM.com

new alerts:

  • 1 for Unused import

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@yozhao101
Copy link
Contributor Author

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

"Certificate and private key were rotated and restarting telemetry server process ...")
restart_telemetry_server()

# Wait for specified seconds and then do the next round checking
Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait for specified second

You don't need sleep in an event loop. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially what I am thinking is since the directory /etc/sonic/credentials/ will store all the certificates and keys for different applications in the future and if one or several certificates under this directory are rotated too frequently (by accident?), this event loop will be executed continuously and then burn too much CPU cycles potentially. Another thinking the polling frequency of our internal tool to get the update from certificate storage is 8 ~ 24 hours. That's why I put a sleep function at here.

I will let inotify to monitor the certificate file directly instead of monitoring the /etc/sonic/credentials/ and then we can remove this sleep function in event loop.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your motivation, and the considering may be helpful. However, event you sleep, the event will be queued and processed after each sleep, you don't save too much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and this will not save too much!

inotify_instance.add_watch(CREDENTIALS_DIR_PATH)
for event in inotify_instance.event_gen(yield_nones=False):
header, event_type, monitoring_path, file_name = event
if (file_name == certificate_file_name
Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if

The FSM logic is complex and may be messed up by some input sequence. Could you use one file as the main indicator, and always rotate if that file changed.

ref: https://github.com/Azure/sonic-restapi/blob/94805a39ac0712219f7dc08faa2cfdbf371dd177/go-server-server/main.go#L103 #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and use the rotation of certificate file as the main indicator.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure describe the main file in document? This is very critical design assumption and the cert rotator should treat it as a contract.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the design document.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

inotify_instance = inotify.adapters.Inotify()
inotify_instance.add_watch(CREDENTIALS_DIR_PATH)
for event in inotify_instance.event_gen(yield_nones=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inotify_instance

In extreme case, the file is deleted by a malicious user, will the inotify_instance still working? I think its link to inode, and deleting file will destroy the inode.

If this is true, a crash is better than a dead loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inotify is inode based and it will monitor the credentials directory /etc/sonic/credentials/ to see whether the telemetry certificate file was rotated or not. If certificate file was deleted by accidentally, the inotify_instance will not be impacted.

I updated the PR to log an error message if the certificate was deleted. What I am thinking is if the certificate was restored later, then it can be treated as a kind of rotation operation and the telemetry server will be restarted by this script.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If certificate file was deleted by accidentally, what is the expected behavior?

I am considering in this case, we can kill telemetry daemon.

"Failed to retrieve the certificate information from 'TELEMETRY' table!")
sys.exit(4)

while True:
Copy link
Collaborator

@qiluo-msft qiluo-msft Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while True:

The intention of this loop is to wait until two files exist. why only check one file? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and checks the existence of both two files.

…sted on device and

log an error message if certificate file was deleted accidentally.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…P` signal.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Jan 25, 2022

This pull request introduces 1 alert when merging c6e2fdd into 61e9a76 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

…ing script.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Jan 26, 2022

This pull request introduces 1 alert when merging 41377a1 into 1ac140a - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

@liat-grozovik
Copy link
Collaborator

@yozhao101 could you please make sure your PR is rebased to latest and rerun checkers?
this fix is needed for 202111 and we appreciate if it can be ready for signoff.

@liat-grozovik
Copy link
Collaborator

@yozhao101 could you please rebase your work on latest and rerun checkers?
@qiluo-msft could you review?

@yozhao101
Copy link
Contributor Author

@qiluo-msft Can you help me review this PR please?

@yozhao101
Copy link
Contributor Author

@yozhao101 could you please rebase your work on latest and rerun checkers? @qiluo-msft could you review?

Thanks @liat-grozovik for your reviewing! I will rebase my PR.


while True:
if not os.path.exists(certificate_file_path) or not os.path.exists(private_key_file_path):
syslog.syslog(syslog.LOG_ERR,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG_ERR

Instead of error message and keep loop, it is better to just simply kill telemetry daemon.

@yozhao101 yozhao101 requested a review from a team as a code owner June 10, 2022 02:01
@liat-grozovik
Copy link
Collaborator

@yozhao101 any update on this PR? should it go to 202205? if so, can you add the proper label?

@zhangyanzhao
Copy link
Collaborator

@yozhao101 can you please help to response the comments and provide an update on this PR? Thanks.

@jeflo
Copy link

jeflo commented Jul 19, 2023

Hi @yozhao101 this looks like a very useful change. Would like to know what is the status of this PR and what's the plan to get it to master? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants