Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1248 deduper using infinispan cache #639

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jpinsonneau
Copy link
Collaborator

@jpinsonneau jpinsonneau commented Mar 19, 2024

Description

Implements infinispan shared cache:

  • Add cache functions for deduper case
    Keeps Egress flows between two nodes in cache
    Then merge interfaces and ifdirections when Ingress is captured
  • Make url configurable
  • Auth / tls
  • Implements extra cases

Merged flows contains a list of interfaces / ifdirections with a * next to names coming from Egress flow:
Screenshot from 2024-03-18 20-19-43
Screenshot from 2024-03-18 20-19-55

Steps to use this code:

  • go mod tidy && go mod vendor
  • Install Infinispan Operator OR Red Hat Data Grid Operator
  • Create cache cluster in netobserv namespace with disabled TLS / Auth
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: cache
  namespace: netobserv
spec:
  security:
    endpointAuthentication: false
    endpointEncryption:
      clientCert: None
      type: None
    endpointSecretName: cache-generated-secret
  container:
    memory: 1Gi
  service:
    replicationFactor: 2
    type: Cache
  jmx: {}
  configListener:
    enabled: true
    logging:
      level: info
  upgrades:
    type: Shutdown
  version: 8.4.6-2
  replicas: 1
  • Create deduper cache
apiVersion: infinispan.org/v2alpha1
kind: Cache
metadata:
  name: deduper
  namespace: netobserv
spec:
  clusterName: cache
  name: deduper
  updates:
    strategy: retain

These will provide a service cache.netobserv.svc.cluster.local on port 11222.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Copy link

openshift-ci bot commented Mar 19, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

openshift-ci bot commented Mar 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jpinsonneau. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


func (c *Cache) getDedupKey(v config.GenericMap) string {
return regex.ReplaceAllString(
fmt.Sprintf("%d-%d-%s-%d-%s-%d", v["TimeFlowEndMs"], v["Proto"], v["SrcAddr"], v["SrcPort"], v["DstAddr"], v["DstPort"]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put TimeFlowEndMs on top of the 5 tuple to avoid conflicts when 2 Egress comes before first Ingress.
As is, I miss some flows as you can see in the screens. We need to find a better way to unique identify each flows.

Any thoughts ?

Copy link
Member

@jotak jotak Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can do an assumption here, that the interfaces are not going to change over time across flows in the same 5-tuples. So we could remove the timestamp from the key. In fact I'm already doing this same assumption here
On top of that, it should be much better for the memory footprint.

The downside is that it works for merging interfaces, but might not work if we want to reuse the cache for other purposes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cache is only for deduper so I'm fine with that;

However are we sure that the interfaces are not changing when the pod is recreated for example ?
The ip addresses can be the same but ovn generated interfaces could change 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on top of that, I could push only fields we want to merge in the cache instead of the whole flow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However are we sure that the interfaces are not changing when the pod is recreated for example ?
The ip addresses can be the same but ovn generated interfaces could change

A pod restart would very very very likely use a different IP as far as I know; and even if that's not the case, the connection would need to be reinitiated so the source port would change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but perhaps in any case there should be an expiry mechanism

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made changes in 335d737

Comment on lines 38 to 43
Endpoint: "http://cache.netobserv.svc.cluster.local:11222",
/*Credentials: &Credentials{
Username: "netobserv",
Password: "netobserv",
},*/
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to make these configurable and add TLS from secret

Comment on lines 141 to 148
log.Debugf("merging cached flow %v with %v ...", srcFlow, outputEntry)
if len(srcFlow["Interfaces"].([]interface{})) == len(srcFlow["IfDirections"].([]interface{})) {
for i := 0; i < len(srcFlow["Interfaces"].([]interface{})); i++ {
outputEntry["Interfaces"] = append(outputEntry["Interfaces"].([]string), fmt.Sprintf("%s*", srcFlow["Interfaces"].([]interface{})[i].(string)))
outputEntry["IfDirections"] = append(outputEntry["IfDirections"].([]int), int(srcFlow["IfDirections"].([]interface{})[i].(float64)))
}
}
log.Debugf("merged flow: %v", outputEntry)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👆 merging Egress flow content in Ingress one.
For now just merging Interfaces and IfDirections and adding a * next to the names.
We could imagine merging more feature related contents, TCP flags and also check Bytes / Packets amounts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if Egress comes after Ingress, it will be ignored? Perhaps it's another reason to not have the timestamp in the key as discussed here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah currently if Egress comes after Ingress, it's pushed in the cache and never retreived.
Only Ingress will appear in storage and will not have merged fields.

An improvment could be to send first flow to the cache whatever direction is found and merge on the "second" one; but that will double the cache queries

@jotak jotak added ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. and removed ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. labels Mar 19, 2024
)

var log = logrus.WithField("component", "transform.Network.Cache")
var regex = regexp.MustCompile(`[^a-zA-Z0-9 ]+`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there forbidden characters in key that require this replaceAll ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had issues with my initial attempt either with . or : when I had standart ips + ports key
I can give another try to get rid of this regex and improve perfs indeed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also addressed in 335d737

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 20, 2024
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 20, 2024
reinterpretDirection(outputEntry, &n.DirectionInfo)
isInner, isSrcReporter := reinterpretDirection(outputEntry, &n.DirectionInfo)
if n.cache != nil {
return n.cache.DedupFlows(isInner, isSrcReporter, outputEntry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was weirdly seeing no flow under layer=app filter and now I understand : this return breaks the rules loop processing :-)
It should be continue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case push flow to cache so you don't need to manage extra transform rules on it right ?
The goal is actually to break the pipeline here.

When the second flow will hit this stage, the first flow is retreived from cache and other transform rules are applied such as layer

@jotak
Copy link
Member

jotak commented Mar 20, 2024

The new version is better but still increases the overall resource usage in my tests, although it's decrease for Loki & storage:
https://docs.google.com/spreadsheets/d/1qakBaK1dk_rERO30k1cSR4W-Nn0SXW4A3lqQ1sZC4rE/edit#gid=705512397

(By overall I mean: everything deployed in netobserv* namespaces - so that includes the infinispan deployment and loki)

CPU overall +146% (1.68)
Memory overall +30% (3GB)

Loki CPU -44% (0.035)
Loki memory -46% (652MB)
Loki storage -53%

So the benefit here is really on storage, and comes at a price in CPU and memory. We could run tests at a bigger scale to check if it's different.

Also one thing that seems wrong from my tests is that the captured traffic has dropped quite a lot, as if we were removing more flows than expected. That's to investigate.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 9, 2024
@jpinsonneau jpinsonneau marked this pull request as ready for review April 9, 2024 16:31
@jpinsonneau
Copy link
Collaborator Author

Rebased this PR and added CacheEndpoint config to condition cache usage

@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jpinsonneau
Copy link
Collaborator Author

@jotak is this something we should still consider ?

I can replace the json Marshal / Unmarshal by fmt.Sprintf to improve FLP CPU usage but we will still have the infinispan resource impact 🤔
If you think it's not worth I will just close that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants