NETOBSERV-314: NETOBSERV-1274 index type and duplicate (improve query perfs) #409

Merged: 2 commits merged into netobserv:main on Sep 27, 2023

jotak (Member) commented Sep 7, 2023

Description

Index the duplicate field in Loki to improve query performance.
This field has very low cardinality (2 values) and is heavily used in queries, so indexing it is an obvious improvement.

Dependencies

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

jotak (Member, Author) commented Sep 7, 2023

I'll share some performance tests here...

jotak added the ok-to-test label ("To set manually when a PR is safe to test. Triggers image build on PR.") on Sep 7, 2023

github-actions bot commented Sep 7, 2023

New images:

  • quay.io/netobserv/network-observability-operator:5ff4147
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5ff4147
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5ff4147

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:5ff4147 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5ff4147

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5ff4147
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

jotak (Member, Author) commented Sep 7, 2023

Some quick perf tests below. tl;dr: it's not like chalk and cheese, but still a slight improvement...

Test scenario:
- Delete any FlowCollector, LokiStack, clean up AWS S3, delete any Loki PVC
- Install mesh-arena (once - no need to clean up)
- (Re)set plugin image to `e50845a`
- Install LokiStack
- Install FlowCollector with sampling=1
- Wait 10 minutes (it's important to wait the same amount of time on every run)
- Open UI: https://console-openshift-console.apps.jtakvori-sep07-0.devcluster.openshift.com/netflow-traffic?timeRange=900&limit=50&match=all&showDup=false&function=last&type=bytes&packetLoss=all&recordType=flowLog&filters=dst_kind%3DPod%3Bsrc_kind%3DPod%3Bdst_namespace%21%3Dopenshift-%2Cnetobserv%3Bsrc_namespace%21%3Dopenshift-%2Cnetobserv&bnf=true
- Open the browser's JavaScript console and monitor network events for calls to "topology"
- Refresh a couple of times to get a mean value / stats

----
Run with operator's `main`:
~1.1s each (min=1.08s, max=1.21s)

Run with operator's `5ff4147`:
~900ms each (min=805ms, max=954ms)

jotak (Member, Author) commented Sep 7, 2023

Oh but I get muuuuch better results when I index resource types:

~170ms each (min=150ms, max=192ms)

This one seems to be a win :-)
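
To put the three runs side by side, here is a quick back-of-the-envelope calculation. It only uses the approximate means quoted above (1.1s, 900ms, 170ms); the dictionary keys and helper names are made up for the illustration, and the result is no more precise than those rough means.

```python
# Approximate mean latency of the "topology" calls, in milliseconds,
# taken from the figures reported in this thread.
measurements_ms = {
    "main": 1100,
    "duplicate indexed (5ff4147)": 900,
    "duplicate + resource types indexed": 170,
}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def reduction_pct(baseline_ms, candidate_ms):
    """Latency reduction relative to the baseline, in percent."""
    return 100 * (baseline_ms - candidate_ms) / baseline_ms


baseline = measurements_ms["main"]
for name, value in measurements_ms.items():
    print(
        f"{name}: {value} ms, "
        f"x{speedup(baseline, value):.2f} vs main, "
        f"{reduction_pct(baseline, value):.0f}% lower"
    )
```

With these inputs, indexing the duplicate field alone is roughly a 1.2x speedup, while also indexing the resource types is roughly 6.5x.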

github-actions bot removed the ok-to-test label on Sep 7, 2023
jotak changed the title from "NETOBSERV-1274 index duplicate (improve query perfs)" to "NETOBSERV-314: NETOBSERV-1274 index duplicate (improve query perfs)" on Sep 7, 2023
openshift-ci-robot (Collaborator) commented Sep 7, 2023

@jotak: This pull request references NETOBSERV-314 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jotak changed the title from "NETOBSERV-314: NETOBSERV-1274 index duplicate (improve query perfs)" to "NETOBSERV-314: NETOBSERV-1274 index type and duplicate (improve query perfs)" on Sep 7, 2023
jotak added the ok-to-test label on Sep 7, 2023

github-actions bot commented Sep 7, 2023

New images:

  • quay.io/netobserv/network-observability-operator:36c54f4
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-36c54f4
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-36c54f4

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:36c54f4 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-36c54f4

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-36c54f4
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m


codecov bot commented Sep 7, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (623a3b2) 55.70% compared to head (9a0872f) 55.70%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #409   +/-   ##
=======================================
  Coverage   55.70%   55.70%           
=======================================
  Files          46       46           
  Lines        5960     5960           
=======================================
  Hits         3320     3320           
  Misses       2410     2410           
  Partials      230      230           
Flag Coverage Δ
unittests 55.70% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
controllers/flowlogspipeline/flp_common_objects.go 80.68% <ø> (ø)

jpinsonneau (Contributor) commented

> Oh but I get muuuuch better results when I index resource types:
>
> ~170ms each (min=150ms, max=192ms)
>
> This one seems to be a win :-)

Interesting that the type has so much impact. I would not have expected that!

Do you think we should also index Proto? Maybe Interface?
The hard part here is identifying which of these users would filter on most. Exposing a list of additional indexes in the CR might cover all these cases 😄

We should also index PktDropLatestState / PktDropLatestDropCause / DnsLatencyMs / DnsFlagsResponseCode / TimeFlowRttNs depending on which features are enabled, but that can be a follow-up.

Code looks good to me. Thanks @jotak

jotak (Member, Author) commented Sep 7, 2023

> Interesting that the type has so much impact. I would not have expected that!

Actually, I should mention that this was with the default filters, which include "SrcType=Pod && DstType=Pod"; without those filters, I guess we wouldn't see such a difference.
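
To illustrate why the default filters matter here, the sketch below builds two LogQL queries for the same filters: one where the fields are only present in the JSON payload, and one where they are indexed labels. This is not the console plugin's actual query builder; the label names (`SrcK8S_Type`, `DstK8S_Type`, `Duplicate`), the `app` selector and the helper functions are assumptions made for the example.

```python
# Hypothetical base selector and label names, chosen for illustration only.
BASE_SELECTOR = {"app": "netobserv-flowcollector"}


def stream_selector(labels):
    """Render a LogQL stream selector such as {a="x", b="y"}."""
    inner = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return "{" + inner + "}"


def query_with_line_filters(filters):
    """Fields are not indexed: Loki reads the whole stream, parses JSON, then filters."""
    stages = " ".join(f'| {key}="{value}"' for key, value in filters.items())
    return f"{stream_selector(BASE_SELECTOR)} | json {stages}"


def query_with_indexed_labels(filters):
    """Fields are indexed: filters become label matchers, so only matching streams are read."""
    return stream_selector({**BASE_SELECTOR, **filters})


default_filters = {"SrcK8S_Type": "Pod", "DstK8S_Type": "Pod", "Duplicate": "false"}
print(query_with_line_filters(default_filters))
print(query_with_indexed_labels(default_filters))
```

In the second form the work moves from scanning log lines to an index lookup, which is consistent with the gain only showing up when those filters are actually part of the query.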

> Do you think we should also index Proto? Maybe Interface? The hard part here is identifying which of these users would filter on most. Exposing a list of additional indexes in the CR might cover all these cases 😄

Proto could be interesting, as you say, if people commonly use it as a filter, but I don't have the impression that it's a common one. Or maybe it's just me, I hardly ever filter on it...
Interface is likely not a good fit because it has high cardinality.

> We should also index PktDropLatestState / PktDropLatestDropCause / DnsLatencyMs / DnsFlagsResponseCode / TimeFlowRttNs depending on which features are enabled, but that can be a follow-up.

Yeah, but always keeping the cardinality aspect in mind: latencies can take pretty much any value, so indexing them would certainly generate indexes that are too big. But when values are bounded, like states, causes or codes, yeah, that could work.
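
A rough way to reason about this: Loki creates one stream per distinct combination of label values, so the worst-case number of streams is the product of the per-label cardinalities. The sketch below uses made-up cardinalities (they are not measured on any cluster) and hypothetical label sets, just to show how a bounded field compares with an unbounded one.

```python
from math import prod


def max_streams(label_cardinalities):
    """Worst-case stream count: one stream per distinct combination of label values."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, for illustration only.
base = {"SrcK8S_Namespace": 50, "DstK8S_Namespace": 50, "FlowDirection": 2}
with_duplicate = {**base, "Duplicate": 2}       # bounded: true / false
with_latency = {**base, "DnsLatencyMs": 1000}   # effectively unbounded; 1000 is a guess

print(max_streams(base))
print(max_streams(with_duplicate))
print(max_streams(with_latency))
```

Under these assumptions, a two-valued label at most doubles the stream count, whereas a latency label multiplies it by however many distinct values show up.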

> Code looks good to me. Thanks @jotak

openshift-ci-robot (Collaborator) commented Sep 8, 2023

@jotak: This pull request references NETOBSERV-314 which is a valid jira issue.

openshift-ci-robot (Collaborator) commented Sep 8, 2023

@jotak: This pull request references NETOBSERV-314 which is a valid jira issue.


openshift-ci bot commented Sep 8, 2023

New changes are detected. LGTM label has been removed.

github-actions bot removed the ok-to-test label on Sep 8, 2023
jotak (Member, Author) commented Sep 27, 2023

As discussed in Jira, this is a dev-only PR.
/approve


openshift-ci bot commented Sep 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jotak added the no-qe and no-doc labels on Sep 27, 2023
jotak merged commit 50bcbdf into netobserv:main on Sep 27, 2023
16 of 17 checks passed
jpinsonneau added the breaking-change label on Oct 5, 2023
jpinsonneau (Contributor) commented

@Amoghrd @nathan-weinberg I'm marking this PR as a breaking change, because it only works correctly in combination with netobserv/network-observability-console-plugin#380.

Any pending console PR that doesn't contain these changes will show "No result found" in every tab.

Labels

  • approved
  • breaking-change: This pull request has breaking changes. They should be described in PR description.
  • jira/valid-reference
  • no-doc: This PR doesn't require documentation change on the NetObserv operator
  • no-qe: This PR doesn't necessitate QE approval