
NETOBSERV-255: FLP multiple deployments #78

Merged: 9 commits merged into netobserv:main on May 17, 2022

Conversation

OlivierCazade (Contributor):

This PR reworks the FLP reconciler into two different ones (see the sketch after this list):

  • a single reconciler in charge of managing a single FLP deployment
  • a parent reconciler in charge of managing the global resources (ClusterRole) and of driving the single reconcilers
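
A minimal sketch of how such a split can be structured, assuming illustrative type and method names rather than the PR's actual code (imports omitted; reconcilers and flowsv1alpha1 refer to the operator's own packages as used elsewhere in this PR):

```go
// singleDeploymentReconciler manages one FLP instance (all-in-one, ingester
// or transformer) and its namespaced resources.
type singleDeploymentReconciler struct {
	client   reconcilers.ClientHelper
	confKind string // e.g. ConfSingle, ConfKafkaIngestor, ConfKafkaTransformer
}

func (r *singleDeploymentReconciler) Reconcile(ctx context.Context, desired *flowsv1alpha1.FlowCollector) error {
	// Build or update the deployment/daemonset, configmap and service
	// for this particular FLP instance.
	return nil
}

// flpParentReconciler owns the cluster-scoped resources (cluster role,
// bindings) and drives the per-deployment reconcilers.
type flpParentReconciler struct {
	client  reconcilers.ClientHelper
	singles []*singleDeploymentReconciler
}

func (r *flpParentReconciler) Reconcile(ctx context.Context, desired *flowsv1alpha1.FlowCollector) error {
	// Reconcile the global resources first, then delegate to each
	// single-deployment reconciler.
	for _, s := range r.singles {
		if err := s.Reconcile(ctx, desired); err != nil {
			return err
		}
	}
	return nil
}
```

The idea is that the parent owns whatever must exist once per cluster, while each single reconciler owns the objects belonging to one FLP deployment.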

OlivierCazade changed the title from "FLP multiple deployments" to "NETOBSERV-255: FLP multiple deployments" on Apr 7, 2022
required:
- address
- topic
type: object
Contributor:

For the case where FLP is used as ingress, we will have multiple FLP instances consuming from the same topic. We need to make sure we have some control over how Kafka distributes data across those FLP instances; going forward this will be very important once we implement the stateful connection tracking capabilities. There might be several different configurations depending on load, number of consumers, etc., and those might need to be exposed and made configurable for users.

OlivierCazade (Author):

Yes, the Kafka configuration block is meant to change, but I would prefer to change it again once we have connection tracking, so we can test the different configurations and choose what we need to expose.
Would you be fine with that approach?

Contributor:

Yes, agreed, it doesn't have to be in this PR ... maybe just open a ticket somewhere so we remember that we need to improve that later.

@@ -46,6 +46,10 @@ type FlowCollectorSpec struct {
// Loki contains settings related to the loki client
Loki FlowCollectorLoki `json:"loki,omitempty"`

// Kafka configurations, if empty the operator will deploy a all-in-one FLP
// +optional
Kafka *FlowCollectorKafka `json:"kafka,omitempty"`
Member:

We need an explicit flag to enable/disable Kafka, rather than relying on a nil pointer. This is because of how the CRD is handled in the OLM/console form: it doesn't allow removing a section. We discovered it the hard way for eBPF: https://issues.redhat.com/browse/NETOBSERV-319

OlivierCazade (Author):

Good to know, thanks. I will add an enable flag.

OlivierCazade (Author):

Enable flag added
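
For illustration, a minimal sketch of what such a section can look like with an explicit enable flag; apart from address and topic (which the CRD excerpt above marks as required), the field names here are assumptions, not necessarily the PR's final API:

```go
// FlowCollectorKafka defines the desired Kafka configuration (illustrative sketch).
type FlowCollectorKafka struct {
	// Enable explicitly turns the Kafka pipeline on or off, instead of
	// relying on the presence or absence of the section (nil pointer).
	Enable bool `json:"enable,omitempty"`

	// Address of the Kafka server
	Address string `json:"address"`

	// Topic to use for flow logs
	Topic string `json:"topic"`
}
```

With an explicit flag, the console/OLM form can disable Kafka without having to delete the whole section.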

jotak (Member) commented on Apr 29, 2022:

I haven't reviewed everything yet, will continue next week

@@ -32,40 +32,54 @@ const (
startupPeriodSeconds = 10
)

const (
ConfSingle = "allInOne"
ConfKafkaIngestor = "kafkaIngestor"
Member:

actually, don't we say Ingester? :)

Comment on lines 43 to 44
ConfKafkaIngestor: "-kingestor",
ConfKafkaTransformer: "-ktransform",
Member:

I would remove the "k"
"-ingester"
"-transformer" (missing -er)

	}
}

func buildClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name: constants.FLPName,
-			Labels: buildAppLabel(),
+			Labels: buildAppLabel(""),
		},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
Member:

On cluster roles, actually, they need to be split: they are not the same for the ingester and for the transformer.
Only the transformer needs to watch pods/services/nodes/replicasets, and I guess only the ingester needs to use hostnetwork (I'm actually not sure why we need that, @mariomac can you remind me?).

We can create two ClusterRole objects (even in the case of a single FLP, if it makes things simpler), and deal with cluster role attribution via ClusterRoleBindings.
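
A rough sketch of what that split could look like, following the buildClusterRole pattern shown above; the rule sets here are assumptions for illustration (as noted later in the thread, the PR initially gives both roles the same permissions):

```go
// buildIngesterClusterRole holds the permissions only the ingester needs,
// such as using the hostnetwork SCC (assumed rule set, for illustration).
func buildIngesterClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name:   constants.FLPName + "-ingester",
			Labels: buildAppLabel(""),
		},
		Rules: []rbacv1.PolicyRule{{
			APIGroups:     []string{"security.openshift.io"},
			Verbs:         []string{"use"},
			Resources:     []string{"securitycontextconstraints"},
			ResourceNames: []string{"hostnetwork"},
		}},
	}
}

// buildTransformerClusterRole holds the informer permissions
// (pods/services/nodes/replicasets) that only the transformer needs.
func buildTransformerClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{
			Name:   constants.FLPName + "-transformer",
			Labels: buildAppLabel(""),
		},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Verbs:     []string{"list", "get", "watch"},
			Resources: []string{"pods", "services", "nodes"},
		}, {
			APIGroups: []string{"apps"},
			Verbs:     []string{"list", "get", "watch"},
			Resources: []string{"replicasets"},
		}},
	}
}
```

Each role can then be granted to the right service account through its own ClusterRoleBinding, as suggested above.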

Member:

Also, I don't know why we need horizontalpodautoscalers here: it's not FLP itself that creates the HPA, is it?

Contributor:

As far as I know, hostNetwork: true is only required by the eBPF agent, so as long as that is associated with another cluster role, the hostnetwork rule here shouldn't be necessary.

OlivierCazade (Author) on May 9, 2022:

Don't we need host network for the daemonset deployment, so we can bind the node port used by OVS?

Member:

Yeah, I retrieved it: #29
So that's an ingester thing.

OlivierCazade (Author) on May 12, 2022:

> it will be automatically bound to the host port 9999 unless you explicitly set another hostPort property.

I am very surprised about this.

Do you have documentation about it? What kind of cluster are you using for this test? The k8s official documentation does not mention such an automatic bind and, on the contrary, states the following:

> Note that the containers are not using port 80 on the node, nor are there any special NAT rules to route traffic to the pod.

Contributor:

You are right. The documentation says that you need to specify a hostPort in the container (but it still doesn't require setting hostNetwork): https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#communicating-with-daemon-pods

So maybe this is an "undocumented" or just "undefined" behavior. For more safety I'll replace the containerPort in my example daemonset with hostPort.

Contributor:

Self-correction: I can't find the behavior described in the official K8s documentation, but I found some other sources describing it:

Contributor:

For more safety, I'd set hostPort to the same value as containerPort, but it should still work without setting hostNetwork: true.
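
A minimal sketch of that suggestion, with a hypothetical helper and port name (not the PR's actual builder code; the protocol is also an assumption):

```go
import corev1 "k8s.io/api/core/v1"

// buildIngesterPorts is an illustrative sketch: it exposes the collector
// port on the node by setting hostPort to the same value as containerPort,
// which avoids requiring hostNetwork: true.
func buildIngesterPorts(port int32) []corev1.ContainerPort {
	return []corev1.ContainerPort{{
		Name:          "flp-collector", // hypothetical port name
		ContainerPort: port,
		HostPort:      port, // same value as containerPort, for safety
		Protocol:      corev1.ProtocolUDP, // assumed protocol for the flow traffic
	}}
}
```

With most CNIs, the port mapping then forwards traffic hitting that node port to the pod, which is why hostNetwork: true shouldn't be needed for the ingester daemonset.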

Member:

JIRA created, thanks for clearing that up! https://issues.redhat.com/browse/NETOBSERV-362

@@ -40,7 +40,7 @@ func NewFlowsConfigController(client reconcilers.ClientHelper,
// Reconcile reconciles the status of the ovs-flows-config configmap with
// the target FlowCollector ipfix section map
func (c *FlowsConfigController) Reconcile(
ctx context.Context, target *flowsv1alpha1.FlowCollector) error {
ctx context.Context, target *flowsv1alpha1.FlowCollector, flpServiceName string) error {
Member:

I think it's possible to avoid the OVS reconciliation here: if we keep the same service regardless of whether it's flp or flp-ingester, it would avoid the OVS reconfig.
As you prefer, if you think it makes things simpler.

OlivierCazade (Author):

Both deployments now use the same service, but since the refactoring was already done, I kept passing an argument to the OVS reconciliation instead of relying on a constant.

Is that fine with you?

@@ -75,7 +75,7 @@ func (m *NamespacedObjectManager) CleanupNamespace(ctx context.Context) {
ref.SetNamespace(namespace)
log.Info("Deleting old "+obj.kind, "Namespace", namespace, "Name", obj.name)
err := m.client.Delete(ctx, ref)
if err != nil {
if client.IgnoreNotFound(err) != nil {
Member:

nice!

jotak (Member) left a review comment:

Tested and it looks good; a couple of comments, mostly on naming and splitting cluster roles.

OlivierCazade force-pushed the cr-kafka-rework branch 2 times, most recently from e86d190 to 83e1672, on May 13, 2022 at 07:40
OlivierCazade (Author):

I pushed a few new commits:

  • ingestor to ingester name correction
  • deployment suffixes changed from -kingestor to -ingester and from -ktransform to -transformer (for the record, -ktransformer was hitting a length limit, which is why I had moved to -ktransform, but -transformer also fits)
  • refactored the cluster role, splitting it into two different roles; note that for now both roles have the same permissions, because the ingester and the transformer currently do the same thing; some permissions will be removed once the different Kafka configurations are in place
  • flp-ingester and the single FLP configuration now use the same service, to avoid unnecessary OVS reconciliation

@@ -358,24 +372,27 @@ func (b *builder) service(old *corev1.Service) *corev1.Service {
}
// In case we're updating an existing service, we need to build from the old one to keep immutable fields such as clusterIP
newService := old.DeepCopy()
newService.Spec.Selector = b.selector
newService.Spec.SessionAffinity = corev1.ServiceAffinityClientIP
Member:

Oh actually, I guess this session affinity would be needed even without your work on Kafka, right?

jotak (Member) commented on May 16, 2022:

While testing I noticed the transformer pod was stuck in a pending state; it turns out it's because it is trying to listen on node port 2055, which it obviously shouldn't (only the ingester should listen there).
But maybe that's something you plan to change in NETOBSERV-256?

jotak (Member) commented on May 16, 2022:

Also, I'm seeing a couple of errors in the logs, which aren't too concerning (on first run, it's trying to delete stuff that doesn't exist), but we should try to fix them, maybe via another PR / JIRA (I'm not sure if there's an easy fix or if it needs some refactoring).

OlivierCazade (Author):

> While testing I noticed the transformer pod was stuck in a pending state; it turns out it's because it is trying to listen on node port 2055, which it obviously shouldn't (only the ingester should listen there).
> But maybe that's something you plan to change in NETOBSERV-256?

Yes, the goal of this PR is to prepare the Kafka configuration by deploying FLP twice when Kafka is enabled; for now they both have the same configuration and only the ingester is useful.

With the next PR, the transformer will have its own config and will no longer listen on this port.

> Also, I'm seeing a couple of errors in the logs, which aren't too concerning (on first run, it's trying to delete stuff that doesn't exist), but we should try to fix them, maybe via another PR / JIRA (I'm not sure if there's an easy fix or if it needs some refactoring).

I missed these errors, thanks. They come from calling cleanNamespace without having a previous namespace (the namespace field is empty on the first iteration). I don't see an easy fix either; are you fine with addressing this in a new PR?

jotak (Member) commented on May 16, 2022:

> > Also, I'm seeing a couple of errors in the logs, which aren't too concerning (on first run, it's trying to delete stuff that doesn't exist), but we should try to fix them, maybe via another PR / JIRA (I'm not sure if there's an easy fix or if it needs some refactoring).
>
> I missed these errors, thanks. They come from calling cleanNamespace without having a previous namespace (the namespace field is empty on the first iteration). I don't see an easy fix either; are you fine with addressing this in a new PR?

Sure, we can do that, I'll create another ticket.

/lgtm
thanks @OlivierCazade !

openshift-ci bot added the lgtm label on May 16, 2022
jotak (Member) commented on May 16, 2022

OlivierCazade (Author):

/approve

openshift-ci bot commented on May 17, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: OlivierCazade

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot merged commit 3189c39 into netobserv:main on May 17, 2022