[issue-368] knative integration with DataIndex and JobService #467

Open
wants to merge 1 commit into
base: main

Conversation

jianrongzhang89
Contributor

@jianrongzhang89 jianrongzhang89 commented May 20, 2024

Implements the Knative integration for DataIndex and JobService as well as the workflow as outlined in the ADR:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit#heading=h.ds8q4xtkmu64

Unit test cases are updated. See the ADR for common test cases:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing

Motivation for the change:
Knative integration

Fix #368

Checklist

  • Add or Modify a unit test for your change
  • Have you verified that all the tests are passing?

@jianrongzhang89 jianrongzhang89 marked this pull request as draft May 20, 2024 02:54
Member

@ricardozanini ricardozanini left a comment

Thank you for taking care of this implementation. I'm sorry for the number of comments, but I guess you're getting familiar with the code base.

Let's try not to change the base API for object management. The client reference is cached for use across the code base.

api/v1alpha08/sonataflowplatform_types.go (outdated, resolved)
api/v1alpha08/sonataflowplatform_types.go (outdated, resolved)
api/v1alpha08/sonataflowplatform_types.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/profiles/common/mutate_visitors.go (outdated, resolved)
controllers/profiles/common/mutate_visitors.go (outdated, resolved)
controllers/profiles/common/object_creators.go (outdated, resolved)
controllers/profiles/common/properties/managed.go (outdated, resolved)
workflowproj/operator.go (resolved)
@ricardozanini
Member

To pass the file generation check, make sure to run:

make generate-all, then make vet fmt before committing.

@jianrongzhang89
Contributor Author

Thanks!

@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 3c95aac to e6b4601 Compare May 21, 2024 01:39
@wmedvede
Contributor

Hi @jianrongzhang89, this is starting to look good!

Some results here:

Case1:

DI, JS and workflows take the eventing configuration from the platform

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The use case works; events are produced/registered as expected.

  2. Expected triggers are created

NAME                                                        BROKER    SINK                                             AGE   CONDITIONS   READY   REASON
callbackstatetimeouts-callbackevent-trigger                 default   service:callbackstatetimeouts                    50m   7 OK / 7     True
sonataflow-platform-data-index-jobs-trigger                 default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-definition-trigger   default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-error-trigger        default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-node-trigger         default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-sla-trigger          default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-state-trigger        default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-data-index-process-variable-trigger     default   service:sonataflow-platform-data-index-service   52m   7 OK / 7     True
sonataflow-platform-job-service-create-job-trigger          default   service:sonataflow-platform-jobs-service         52m   7 OK / 7     True
sonataflow-platform-job-service-delete-job-trigger          default   service:sonataflow-platform-jobs-service         52m   7 OK / 7     True

  3. Expected sinkbindings are created

callbackstatetimeouts-sb   SinkBinding   sinkbindings.sources.knative.dev   broker:default   True
jobs-service-sb            SinkBinding   sinkbindings.sources.knative.dev   broker:default   True

@jianrongzhang89 to be continued tomorrow.

@ricardozanini
Member

@jianrongzhang89 I'll do another review round tomorrow.

@wmedvede
Contributor

Hi @jianrongzhang89, I was executing the following use case in OpenShift (before, I had worked with minikube only) and I can see the following weird behaviour.

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The Job service deployment is produced.
  2. And after some minutes, when the callbackstatecallback workflow build finishes, the corresponding deployment is also produced.

But, if I wait some time, I start to see the workflow and the JS restarting forever.

See some cases here:

[Screenshots from 2024-05-30 (11-22-50, 11-22-41, 11-24-17); images not included]

@jianrongzhang89
Contributor Author

jianrongzhang89 commented May 30, 2024

@wmedvede this issue is now fixed. Please check again.

Member

@ricardozanini ricardozanini left a comment

Overall, looks good! I think we can turn a few knobs here and there and it should be ready to go for QE to verify.

api/condition_types.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/knative/knative.go (outdated, resolved)
controllers/profiles/dev/object_creators_dev.go (outdated, resolved)
controllers/profiles/dev/profile_dev_test.go (outdated, resolved)
controllers/profiles/dev/states_dev.go (outdated, resolved)
controllers/profiles/dev/states_dev.go (outdated, resolved)
controllers/profiles/preview/states_preview.go (outdated, resolved)
@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 35bffcc to 13b1bbb Compare June 5, 2024 02:38
Member

@ricardozanini ricardozanini left a comment

I think we can move on now to validate this PR. @domhanak can you take a look?

@wmedvede
Contributor

Hi @jianrongzhang89, here are some more test results with the latest changes:

Case 2) is still failing
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case2

After deploying the DI, JS, and the workflow, we have the same outcome as before:

kubectl get workflow -n case2-kn-eventing
 
NAME                    PROFILE   VERSION   URL   READY   REASON
callbackstatetimeouts   preview   0.0.1           False   DeploymentFailure

Case 5) is now working with the following considerations:
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Trigger names collision:

When we use the SFCP, with the cluster-wide instance of the DI, JS, and Broker, all the triggers must be created in the namespace of the SFP that defines these objects (case5-kn-eventing in the example). This is necessary from the point of view of Knative Eventing.

Thus, if we deploy a workflow "callbackstatetimeouts" in namespace1, a trigger with the name callbackstatetimeouts-callbackevent-trigger will be created in "case5-kn-eventing".

Now, if we deploy "callbackstatetimeouts" in namespace2, the operator will try to create callbackstatetimeouts-callbackevent-trigger in "case5-kn-eventing" again and will produce an error.

Of course, the error only happens when we use that cluster-wide configuration.

Considering that we are working in parallel on supporting the same workflow name deployed in different namespaces, I think this trigger issue must be fixed as part of this PR. It should not be much work.

note: we should include the namespace in the trigger name only for workflows that are deployed in a different ns I think.
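
The note above could be sketched roughly as follows. This is only an illustration, not the operator's code: `triggerName` and its parameters are hypothetical, and the `-trigger` suffix simply mirrors the names in the listings earlier in this thread.

```go
package main

import "fmt"

// triggerName is a hypothetical helper. It builds a trigger name from the
// workflow and event names, and adds the workflow namespace only when the
// workflow lives outside the namespace where the trigger is created.
func triggerName(workflow, workflowNS, triggerNS, event string) string {
	if workflowNS == triggerNS {
		return fmt.Sprintf("%s-%s-trigger", workflow, event)
	}
	return fmt.Sprintf("%s-%s-%s-trigger", workflowNS, workflow, event)
}

func main() {
	// Same workflow name in two namespaces, triggers created in a third one.
	fmt.Println(triggerName("callbackstatetimeouts", "namespace1", "case5-kn-eventing", "callbackevent"))
	fmt.Println(triggerName("callbackstatetimeouts", "namespace2", "case5-kn-eventing", "callbackevent"))
}
```

Plain concatenation makes the names longer, which matters for the length limit described next.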

Trigger names length:

It looks like the Trigger names (IDK why) must be no longer than 63 characters. With a longer name, the trigger appears to be created but doesn't work, and querying the triggers with "kn trigger list" shows the following output.

sonataflow-broker service:sonataflow-platform-data-index-service 9s 3 OK / 6 False NotSubscribed : Subscription.messaging.knative.dev "sonataflow-broker-sonataflow-pla6ff7c12b8027f7bef859225cc5ef7cf" is invalid: metadata.labels: Invalid value: "sonataflow-platform-data-index-service-process-def-triggerssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff": must be no more than 63 characters

Considering that we are not that far from reaching that limit, I think we must:

  1. Ask if this is an error in Knative Eventing. Why such short trigger names? Maybe this was fixed in some newer version, or maybe the limit is needed.

Maybe the rationale for that limit is this: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#:~:text=in%20RFC%201035.-,This%20means%20the%20name%20must%3A,start%20with%20an%20alphabetic%20character

  2. If that length is there for good reasons, we must apply a smart trimming of the names for the generated Triggers, otherwise the probability of a failure is huge and the integration won't work.

On the other hand, the SinkBinding names seem to support 253 characters like other k8s objects.

@ricardozanini
Member

ricardozanini commented Jun 13, 2024

@wmedvede @jianrongzhang89 For the naming problem, let's use trigger-<random-id>, as used by pods. Then we just have to make sure that this info is tied to the parent object. For a workflow, we can add an array of knative triggers to the .status.triggers attribute. The same goes for DI and JS. Then the name won't matter anymore and we have a good way of finding the relation between them.

@jianrongzhang89
Contributor Author

For case 2, there is no broker defined in the platform's spec.eventing field and the workflow does not have a sink defined. In this case, based on our previous discussions, the test is intended to fail. Please see the ADR https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing use case #1.1.
@ricardozanini could you please confirm this is the desired behavior?
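
The behavior described for case 2 could be sketched as below. The types and the `resolveSink` helper are hypothetical simplifications, and the precedence (workflow sink first, then platform broker) is an assumption made for illustration; only the error text is taken from the status shown later in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// workflow and platform are hypothetical, simplified views of the two resources.
type workflow struct {
	Sink string // empty when the workflow defines no sink
}

type platform struct {
	Broker          string // empty when spec.eventing has no broker
	ServicesEnabled bool   // true when Data Index or Job Service is enabled
}

var errNoSink = errors.New("no sink configured in the workflow or the platform when Job Service or Data Index Service is enabled")

// resolveSink prefers the workflow's own sink, then the platform broker
// (assumed order). With neither, it fails only if the platform services
// are enabled and therefore need eventing.
func resolveSink(w workflow, p platform) (string, error) {
	switch {
	case w.Sink != "":
		return w.Sink, nil
	case p.Broker != "":
		return p.Broker, nil
	case p.ServicesEnabled:
		return "", errNoSink
	default:
		return "", nil
	}
}

func main() {
	// Case 2: services enabled, but no broker and no workflow sink.
	_, err := resolveSink(workflow{}, platform{ServicesEnabled: true})
	fmt.Println("case 2:", err)
}
```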

@jianrongzhang89
Contributor Author

@ricardozanini @wmedvede this is a great idea. This limitation is indeed required by Knative Eventing itself, as it uses a label with the trigger name in the subscription objects. I updated the PR with a slightly different implementation that uses knative's ChildName() function to generate the trigger name. It is based on the UID of the parent object (data index, job service or workflow) and makes sure that the generated name won't be longer than 63 characters. Here is an example of the trigger names:
data-index-jobs-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-definition-73a585bd4a43f0ee289b57dfbedf39dd9
data-index-process-error-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-node-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-sla-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-state-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-variable-5fa0f63cae14d1df1c1bd23aaa599b559e8
jobs-service-create-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
jobs-service-delete-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
greetingtest5-greetingevent-be7d8bd8dce1feb49e4e476c2842512b8ea
greetingtest5-greetingevent2-0a73c1906bdc359d653befc487a9f7048e

Since these triggers have labels or owner references that indicate which workflows (or dataindex or job service's) they belong to already, it does not seem to me that we need to add the information to workflow's status. Please let me know if I missed something and this is really required. Thank you.
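
To illustrate the idea of a bounded, UID-based name, here is a minimal sketch. It is not knative's ChildName() and may not reproduce its exact output: `boundedName` is a hypothetical helper that keeps short names as-is and otherwise replaces the tail with an MD5 hash of the full name.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// maxNameLen is the 63-character limit discussed in this thread.
const maxNameLen = 63

// boundedName joins prefix and uid. If the result is too long, it keeps as
// much of the prefix as fits and appends a hash of the full name, so the
// result is deterministic and never exceeds maxNameLen.
func boundedName(prefix, uid string) string {
	full := prefix + uid
	if len(full) <= maxNameLen {
		return full
	}
	sum := md5.Sum([]byte(full))
	hash := hex.EncodeToString(sum[:]) // always 32 hex characters
	keep := maxNameLen - len(hash)
	if len(prefix) < keep {
		keep = len(prefix)
	}
	return prefix[:keep] + hash
}

func main() {
	uid := "9e8d24ce-5032-4771-9934-da0c15fd69cc"
	fmt.Println(boundedName("data-index-jobs-", uid))               // short enough, kept as-is
	fmt.Println(boundedName("data-index-process-definition-", uid)) // too long, hashed
}
```

Because the name depends on the parent UID rather than only on the workflow name, two workflows with the same name in different namespaces get different trigger names.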

@ricardozanini
Member

@jianrongzhang89 agreed with the naming approach, many thanks for looking into this!

Regarding the test failure, please see my changes here: https://github.com/apache/incubator-kie-kogito-serverless-operator/pull/487/files#diff-4782f3846c145011deb7edc727cc7dd3016d1683615d1f426ce246e57908daf2

It should help fix this bug and have access to utils.GetClient() across all tests.

@ricardozanini
Member

@ricardozanini could you please confirm this is the desired behavior?

It is, but we have to add an event or a clear status to the object alerting users about this behavior.

@jianrongzhang89
Contributor Author

@ricardozanini the sonataflow status already has error message in the condition:
status:
  address: {}
  conditions:
  - lastUpdateTime: "2024-06-15T01:16:08Z"
    status: "True"
    type: Built
  - lastUpdateTime: "2024-06-15T01:16:08Z"
    message: Error in deploy the workflow:no sink configured in the workflow or
      the platform when Job Service or Data Index Service is enabled
    reason: DeploymentFailure
    status: "False"
    type: Running
  observedGeneration: 1
  services:
    dataIndexRef:
      url: http://sonataflow-platform-data-index-service.sonataflow-infra
    jobServiceRef:
      url: http://sonataflow-platform-jobs-service.sonataflow-infra

Is there anything additional needed?

@jianrongzhang89
Contributor Author

@jianrongzhang89 I just tested with adding only the following to the sonataflowplatform:

eventing:
    broker:
      ref:
        name: kafka-broker
        namespace: sonataflow-infra
        apiVersion: eventing.knative.dev/v1
        kind: Broker

The triggers are created:

$ oc get trigger -A
NAMESPACE      NAME                                                              BROKER         SUBSCRIBER_URI   AGE     READY   REASON
orchestrator   data-index-jobs-2ac1baab-d856-40bc-bcec-c6dd50951419              kafka-broker                    6m51s           
orchestrator   data-index-process-definition-634c6f230b6364cdda8272f98c5d58722   kafka-broker                    6m51s           
orchestrator   data-index-process-error-2ac1baab-d856-40bc-bcec-c6dd50951419     kafka-broker                    6m51s           
orchestrator   data-index-process-node-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                    6m51s           
orchestrator   data-index-process-sla-2ac1baab-d856-40bc-bcec-c6dd50951419       kafka-broker                    6m51s           
orchestrator   data-index-process-state-2ac1baab-d856-40bc-bcec-c6dd50951419     kafka-broker                    6m51s           
orchestrator   data-index-process-variable-6f721bf111e75efc394000bca9884ae22ac   kafka-broker                    6m51s           
orchestrator   jobs-service-create-job-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                    6m51s           
orchestrator   jobs-service-delete-job-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                    6m51s  

But they are not in the same namespace as the broker, is that normal?

Then, I create the kafka broker:

$ oc get broker -A
NAMESPACE          NAME           URL                                                                                            AGE     READY   REASON
sonataflow-infra   kafka-broker   http://kafka-broker-ingress.knative-eventing.svc.cluster.local/sonataflow-infra/kafka-broker   5m15s   True    

But the triggers are never ready, I guess because they should be in the same namespace as the broker?

Thank you for catching this. This issue is fixed.

@jianrongzhang89
Contributor Author

Hi @jianrongzhang89 , when doing knative deployments of a workflow I have found the following issue.

You can see the following Use case 3, knative services reproducer:

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case3-knative-services

In this case, we mark a workflow to produce a knative service deployment instead of a regular kubernetes one, by using the following field:

  podTemplate:
    deploymentModel: knative

see: https://github.com/flows-examples/techpreview2/blob/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case3-knative-services/07-sonataflow_eventstatetimeouts.sw.yaml#L26

With that configuration, we create a knative service to produce the workflow deployment.

And the Triggers must be adjusted to refer to a knative service instead.

This is what we see now:

eventstatetimeouts-event1-8b6712a2-8a8a-4def-a4a3-fdbc34b6e828 sonataflow-broker service:eventstatetimeouts 6m27s 1 OK / 7 False Unable to get the Subscriber's URI : failed to get object case3-kn-eventing-knservices/eventstatetimeouts: services "eventstatetimeouts" not found
eventstatetimeouts-event2-8b6712a2-8a8a-4def-a4a3-fdbc34b6e828 sonataflow-broker service:eventstatetimeouts 6m27s 1 OK / 7 False Unable to get the Subscriber's URI : failed to get object case3-kn-eventing-knservices/eventstatetimeouts: services "eventstatetimeouts" not found

I think the issue is not a big one; it's just a matter of taking into account, when the triggers are created, that the following type must be used instead:

  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service

And for the SinkBinding we might have to take into account a similar consideration.

DataIndex and JobsService are fine; they will never be knative services.

I updated the code to support knative services.
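
The subscriber selection described above might look something like this sketch. `objectRef` and `subscriberRef` are hypothetical stand-ins, not the operator's or Knative's types; the only values taken from this thread are `deploymentModel: knative` and the `serving.knative.dev/v1` Service reference. Treating every other deployment model as a regular Kubernetes Service is an assumption.

```go
package main

import "fmt"

// objectRef is a simplified stand-in for the reference a Trigger subscriber uses.
type objectRef struct {
	APIVersion string
	Kind       string
	Name       string
	Namespace  string
}

// subscriberRef picks the Service flavor from the workflow's deployment model.
// Any value other than "knative" is assumed to mean a core Kubernetes Service.
func subscriberRef(name, namespace, deploymentModel string) objectRef {
	ref := objectRef{APIVersion: "v1", Kind: "Service", Name: name, Namespace: namespace}
	if deploymentModel == "knative" {
		ref.APIVersion = "serving.knative.dev/v1"
	}
	return ref
}

func main() {
	fmt.Printf("%+v\n", subscriberRef("eventstatetimeouts", "case3-kn-eventing-knservices", "knative"))
	fmt.Printf("%+v\n", subscriberRef("callbackstatetimeouts", "case2-kn-eventing", ""))
}
```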

@jianrongzhang89
Contributor Author

If I rename the broker in the sonataflowplatform, the triggers are not updated. ie:

eventing:
    broker:
      ref:
        name: kafka-broker
        namespace: sonataflow-infra
        apiVersion: eventing.knative.dev/v1
        kind: Broker

Creates the triggers, then I remove data-index-jobs. The trigger is not re-created automatically, is that normal?

Then I edit the eventing spec and set the broker name to kafka-broker2, and then only the deleted trigger is (re)created; the others do not change:

$ oc get trigger -A
NAMESPACE      NAME                                                              BROKER          SUBSCRIBER_URI   AGE   READY   REASON
orchestrator   data-index-jobs-2ac1baab-d856-40bc-bcec-c6dd50951419              kafka-broker2                    35s           
orchestrator   data-index-process-definition-634c6f230b6364cdda8272f98c5d58722   kafka-broker                     19m           
orchestrator   data-index-process-error-2ac1baab-d856-40bc-bcec-c6dd50951419     kafka-broker                     19m           
orchestrator   data-index-process-node-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                     19m           
orchestrator   data-index-process-sla-2ac1baab-d856-40bc-bcec-c6dd50951419       kafka-broker                     19m           
orchestrator   data-index-process-state-2ac1baab-d856-40bc-bcec-c6dd50951419     kafka-broker                     19m           
orchestrator   data-index-process-variable-6f721bf111e75efc394000bca9884ae22ac   kafka-broker                     19m           
orchestrator   jobs-service-create-job-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                     19m           
orchestrator   jobs-service-delete-job-2ac1baab-d856-40bc-bcec-c6dd50951419      kafka-broker                     19m 

I would expect all triggers (and sinkbindings) to be recreated with the new value

Similarly, if I delete the sinkbinding, it is not automatically re-created

Fix is done so that when a sinkbinding or trigger is deleted, it will get recreated by the operator.
However, when the sonataflowplatform broker name is updated, the trigger's broker won't get updated because the trigger broker name is immutable in Knative. In this case, an error will be generated in the operator logs.
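
The resulting behavior can be summarized with a small decision sketch. It is a hypothetical simplification, not the operator's reconcile loop: `decide` only models the two outcomes described above, recreating a missing trigger and reporting an error when the broker name differs.

```go
package main

import (
	"errors"
	"fmt"
)

type action int

const (
	actionNone   action = iota // trigger exists and matches, nothing to do
	actionCreate               // trigger is missing and must be (re)created
)

var errBrokerImmutable = errors.New("trigger broker is immutable")

// decide compares the broker of an existing trigger (nil when the trigger
// was not found) with the broker currently configured in the platform.
func decide(existingBroker *string, desiredBroker string) (action, error) {
	if existingBroker == nil {
		return actionCreate, nil
	}
	if *existingBroker != desiredBroker {
		return actionNone, fmt.Errorf("%w: have %q, want %q", errBrokerImmutable, *existingBroker, desiredBroker)
	}
	return actionNone, nil
}

func main() {
	current := "kafka-broker"
	_, err := decide(&current, "kafka-broker2")
	fmt.Println(err)

	act, _ := decide(nil, "kafka-broker2")
	fmt.Println(act == actionCreate)
}
```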

eventing:
  broker:
    ref:
      apiVersion: eventing.knative.dev/v1
Contributor

where is this broker being created?

Contributor Author

It is created via make deploy-broker. We may add it to make deploy-knative if needed.

Contributor

@jianrongzhang89 if you look at the structure of many of the tests, each test has an independent kustomization.yaml file that creates all the objects used by the test, including the workflows.

look here:
https://github.com/apache/incubator-kie-kogito-serverless-operator/blob/5936beebd371f3bd9e847061be0c1b9d61acf33c/test/testdata/workflow/persistence/by_service/kustomization.yaml

If you now incorporate a broker as part of the test, you must add a 0x_broker.yaml file and make it part of the kustomization.

In this way, each test executes in isolation from the others. The pattern is normally like this:

  1. we create a random namespace testXXX
  2. we create all the objects for the test in that namespace
  3. when the test finishes, we delete the namespace

and so on.

This guarantees that the tests execute in isolation.
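
The three steps above could be sketched like this. Everything here is hypothetical: the `cluster` interface, `runIsolated`, and the in-memory `fakeCluster` only stand in for whatever the real test suite uses to talk to the cluster, and the directory path is a placeholder.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// cluster is a hypothetical abstraction over the calls a test makes.
type cluster interface {
	CreateNamespace(name string) error
	ApplyKustomization(dir, namespace string) error
	DeleteNamespace(name string) error
}

// randomNamespace returns a name such as "test-1a2b3c4d".
func randomNamespace() (string, error) {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return "test-" + hex.EncodeToString(b), nil
}

// runIsolated creates a random namespace, applies the test's kustomization
// into it, runs the test, and always deletes the namespace afterwards.
func runIsolated(c cluster, dir string, test func(namespace string) error) (err error) {
	ns, err := randomNamespace()
	if err != nil {
		return err
	}
	if err = c.CreateNamespace(ns); err != nil {
		return err
	}
	defer func() {
		// Report a cleanup failure only if the test itself succeeded.
		if derr := c.DeleteNamespace(ns); err == nil {
			err = derr
		}
	}()
	if err = c.ApplyKustomization(dir, ns); err != nil {
		return err
	}
	return test(ns)
}

// fakeCluster records calls so the sketch runs without a real cluster.
type fakeCluster struct{ calls []string }

func (f *fakeCluster) CreateNamespace(name string) error {
	f.calls = append(f.calls, "create "+name)
	return nil
}

func (f *fakeCluster) ApplyKustomization(dir, namespace string) error {
	f.calls = append(f.calls, "apply "+dir+" in "+namespace)
	return nil
}

func (f *fakeCluster) DeleteNamespace(name string) error {
	f.calls = append(f.calls, "delete "+name)
	return nil
}

func main() {
	c := &fakeCluster{}
	err := runIsolated(c, "path/to/test/kustomization", func(ns string) error {
		fmt.Println("running test in", ns)
		return nil
	})
	fmt.Println(c.calls, err)
}
```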

Contributor Author

Thank you @wmedvede. Done.

@wmedvede
Contributor

@jianrongzhang89 looks like we need rebasing when you can please.

@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 3 times, most recently from e1b98f5 to 1bd198b Compare July 25, 2024 20:28
@jianrongzhang89
Contributor Author

@wmedvede code rebased.

@ricardozanini
Member

@jianrongzhang89 can you please rebase again? 👍

@ricardozanini
Member

It will be easier if you squash your commits.

@ricardozanini
Member

@jianrongzhang89 please squash these commits so we have only one to merge. And your rebase will be way easier.

Successfully merging this pull request may close these issues.

Create knative resources for DataIndex and JobService
4 participants