
What is the meaning of KEY FEATURE[ Anonymization ] in README? #541

Closed
2 of 4 tasks
jatinmehrotra opened this issue Jul 5, 2023 · 7 comments

Comments

@jatinmehrotra
Contributor

Checklist

  • I've searched for similar issues and couldn't find anything matching
  • I've included steps to reproduce the behavior

Affected Components

  • K8sGPT (CLI)
  • K8sGPT Operator

K8sGPT Version

No response

Kubernetes Version

No response

Host OS and its Version

No response

Steps to reproduce

Searched the entire repo for the meaning of the events to which anonymization would not apply.

Expected behaviour

I am trying to understand this particular line which is mentioned in README -> Key Features -> Anonymization

Anonymization does not currently apply to events.

What kind of events are we talking about, and what details will be shared with the AI backend in the case of such events?

Actual behaviour

No response

Additional Information

No response

@arbreezy
Member

hey @jatinmehrotra,
K8sGPT has analysers for certain K8s resources (StatefulSets, Deployments, Pods, etc.).

In a few analysers, like Pod, we feed the AI backend event messages whose content is not known beforehand, so we are not masking them for the time being.

Further research is needed to understand the patterns and be able to mask the sensitive parts of an event, such as the pod name and namespace.

The majority of the analysers produce custom errors that we have created ourselves and are able to mask.

By masking I mean swapping sensitive strings (e.g. namespace and pod names) in the error messages with random hashes, which are then shared with the AI backend of your choice; in the analysis report we swap them back and present the original pod and namespace strings to the user. Hope that makes more sense.

@jatinmehrotra
Contributor Author

Thank you @arbreezy for the explanation. It is really helpful and definitely makes sense 💯

Based on the above explanation I want to confirm a few things, as I am planning to introduce k8sgpt in one of my projects.

In a few analysers like Pod, we feed to the AI backend the event messages which are not known beforehand thus we are not masking them for the time being.

  • Can you provide me with the list of analysers in which masking is taking place, and in which it is not?
  • Can you provide me with the list of parameters which are not being masked and are sent to the backend AI?
  • By when is the k8sgpt team planning to implement masking for unknown events, i.e. is it in the pipeline? ( I know this can be difficult to answer )

By masking I mean, swapping sensitive strings( e.g namespace and pod names ) of the error messages

  • What are the details which are currently being masked and sent to backend AI, can you provide me with the list of strings being masked and sent to backend AI?

Unrelated to the above explanation

  • In the docs it is mentioned that k8sgpt is being used in production for customer projects. If that's the case, and unknown events are not masked as mentioned above, does that not pose a security risk to sensitive customer information? In other words, is it safe to assume that the information which is not masked poses no security risk to sensitive customer information? Please clarify this.

@AlexsJones
Member

AlexsJones commented Jul 12, 2023

Hi, thanks for your interest, and thanks to @arbreezy for answering some of the questions.

I am one of the founders of the project, and I am thrilled to see the involvement and discussion here.
I also wanted to extend some of the answers and hopefully give some satisfactory responses!

Thank you @arbreezy for the explanation. It is really helpful and definitely makes sense 💯

Based on the above explanation I want to confirm a few things, as I am planning to introduce k8sgpt in one of my projects.

In a few analysers like Pod, we feed to the AI backend the event messages which are not known beforehand thus we are not masking them for the time being.

  • can you provide me with the list of analysers in which masking is not taking place and in which it is taking place?

Masking

  • Statefulset
  • Service
  • PodDisruptionBudget
  • Node
  • NetworkPolicy
  • Ingress
  • HPA
  • Deployment
  • Cronjob

We typically will not mask the below because we don't send any identifying information, only that one of these things has been detected as incorrect.

No Masking

  • ReplicaSet
  • PersistentVolumeClaim
  • Pod
  • Events

  • Can you provide me with the list of parameters which are not being masked and sent to backend AI?

Fields:

  • Describe
  • ObjectStatus
  • Replicas
  • ContainerStatus
  • Event Message
  • ReplicaStatus
  • Count (Pod)

  • By when is the k8sgpt team planning to implement masking for unknown events, i.e. is it in the pipeline? ( I know this can be difficult to answer )

It's planned for V2, which will come later this year, in Q4.

By masking I mean, swapping sensitive strings( e.g namespace and pod names ) of the error messages

  • What are the details which are currently being masked and sent to backend AI, can you provide me with the list of strings being masked and sent to backend AI?

Please see https://docs.k8sgpt.ai/reference/guidelines/privacy/

I don't have an exact list of strings being sent; maybe I misunderstood the question.

Unrelated to the above explanation

  • In the docs it was mentioned that k8gpt is being used in production for customer projects, if that's the case and like it is mentioned that unknown events are not being masked, does it not pose a security risk to the customer sensitive information? In other words is it safe to assume that information which is not being masked will not pose any security risk to customer sensitive information? Please clarify on this.

The bottom line is that in critical production environments (like one of the banks I used to work at) I would recommend an entirely different backend: use a local model. Then you can rest easy knowing it's inside your DMZ and nothing is leaking.
If there is even a hint of uncertainty about sending up data that might be business critical, I would not advise doing so to a public LLM; it's very nebulous how some of them grow their corpus of data from your questions.

If you would like an example of how to use LocalAI (one of our providers), which lets you use your own models, we would be happy to share docs, blogs, and posts.
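As a rough sketch, wiring up a local backend looks something like the following (the model name and URL are placeholders for your own setup, and flag names may differ between k8sgpt versions, so check k8sgpt auth --help):

```shell
# Point k8sgpt at a LocalAI server running inside your own network
# (model name and base URL below are placeholders, not recommendations)
k8sgpt auth add --backend localai \
  --model ggml-gpt4all-j \
  --baseurl http://localhost:8080/v1

# Run the analysis against the local backend, so no cluster data
# leaves your environment
k8sgpt analyze --explain --backend localai
```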

@jatinmehrotra
Contributor Author

@AlexsJones Thank you so much for the explanation and your conclusion to use local AI.

We typically will not mask the below because we don't send any identifying information, just that one of these things has been detected to be incorrect

Fields:

  • Describe
  • ObjectStatus
  • Replicas
  • ContainerStatus
  • Event Message
  • ReplicaStatus
  • Count (Pod)

If my understanding is correct, out of the unmasked fields, Event Message is the one which contains identifying information (the others only indicate that "one of these things has been detected to be incorrect"), isn't it? If so, is there an example of an Event Message I can refer to, to gauge what extent of identifying information is being sent to the backend AI?

@AlexsJones
Member

I always err on the side of caution, so yes, it is quite possible the payload of the event might have something like "super-secret-project-pod-X crashed", which we don't currently redact.

As an example, if you enable the Trivy integration with "k8sgpt integration enable trivy", you'll see unredacted events like this:

 Message:             Created pod: scan-vulnerabilityreport-9c4c6f747-4g879    

@jatinmehrotra
Contributor Author

Thank you so much @AlexsJones for your explanation. Really helpful.

I would like to send a PR to update the README section on Anonymization based on our discussion, as I am sure there are others wondering the same. I will push a PR by the end of the day.

@AlexsJones
Member

Thank you so much @AlexsJones for your explanation. Really helpful.

I would like to send a PR to update the README section on Anonymization based on our discussion, as I am sure there are others wondering the same. I will push a PR by the end of the day.

Sounds great, I will close this issue for now but please feel free to reference/re-open if needed
