Resync Theia Manager at startup #150
Conversation
3f7dc0c to 397f39c
/theia-test-e2e

1 similar comment

/theia-test-e2e
pkg/util/clickhouse/clickhouse.go
Outdated
```diff
 )
 
 var (
 	openSql         = sql.Open
 	createK8sClient = k8s.CreateK8sClient
 )
 
-func SetupConnection() (connect *sql.DB, err error) {
-	url, err := getClickHouseURL()
+func SetupConnection(client *kubernetes.Interface) (connect *sql.DB, err error) {
```
Suggested change:
```diff
-func SetupConnection(client *kubernetes.Interface) (connect *sql.DB, err error) {
+func SetupConnection(client kubernetes.Interface) (connect *sql.DB, err error) {
```
Go interface values already act like pointers, so we should avoid using pointers to interfaces.
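To illustrate the point, a minimal, hypothetical sketch (not this PR's code): a kubernetes.Interface is already a small reference to the concrete client, so tests can inject a fake clientset without any pointer-to-interface plumbing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// countNamespaces takes the interface by value, as the review suggests.
func countNamespaces(client kubernetes.Interface) (int, error) {
	nsList, err := client.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	return len(nsList.Items), nil
}

func main() {
	// The fake clientset satisfies kubernetes.Interface directly.
	client := fake.NewSimpleClientset()
	n, _ := countNamespaces(client)
	fmt.Println("namespaces:", n)
}
```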
Thanks Shawn for the review! Updated.
pkg/util/clickhouse/clickhouse.go
Outdated
```diff
@@ -91,17 +91,22 @@ func GetSecret(client kubernetes.Interface, namespace string) (username string,
 	return username, password, nil
 }
 
-func getClickHouseURL() (url string, err error) {
+func getClickHouseURL(clientPtr *kubernetes.Interface) (url string, err error) {
```
same here
Updated, thanks!
```diff
@@ -194,6 +199,10 @@ func (c *NPRecommendationController) Run(stopCh <-chan struct{}) {
 		return
 	}
 
+	// The key can be anything as we only have single item.
+	c.gcQueue.Add("key")
```
Maybe create a const for the key in this package and use that?
Updated the key to a dedicated gcKey constant to keep it separate, thanks!
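For reference, a small sketch of what the named constant could look like; the name, value, and queue wiring below are illustrative, not necessarily what the PR ended up with.

```go
package main

import "k8s.io/client-go/util/workqueue"

// gcKey is the single, well-known key used for the startup garbage-collection
// item, replacing the bare "key" literal from the diff above.
const gcKey = "gc"

func main() {
	gcQueue := workqueue.NewNamed("networkPolicyRecommendationGC")
	defer gcQueue.ShutDown()

	// The queue only ever holds this one item, so a named constant keeps the
	// producer and the worker in sync.
	gcQueue.Add(gcKey)
}
```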
```go
	}

	// Add scheduled/running NPR back to resync list
	nprList, err := c.ListNetworkPolicyRecommendation(env.GetTheiaNamespace())
```
Could there be a race when a job is deleted between ListNetworkPolicyRecommendation and the call to c.addPeriodicSync? It shouldn't be much of an issue though: even if that happens, the cleanup can be done without issue during the next resync.
I think it is fine: PeriodicSync just adds the NPR key to the main workqueue, and syncNPRecommendation will return nil if it does not find the NPR. As long as the Theia Manager is running, the deletion process will take care of the SparkApplication, the db entries, and also the PeriodicSyncSet.
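A minimal standalone sketch of that "not found is not an error" behavior (the helper below is illustrative, not the controller's actual syncNPRecommendation):

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// syncItem sketches the pattern: a NotFound error from the lookup means the
// object was deleted after it was queued, so the sync returns nil instead of
// requeuing.
func syncItem(lookup func() error) error {
	if err := lookup(); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // object already gone; nothing to do
		}
		return err // real error; caller will requeue
	}
	// ... normal sync work would go here ...
	return nil
}

func main() {
	notFound := apierrors.NewNotFound(
		schema.GroupResource{Group: "crd.theia.antrea.io", Resource: "networkpolicyrecommendations"}, "pr-123")
	err := syncItem(func() error { return notFound })
	fmt.Println(err) // <nil>: the deleted NPR is simply skipped
}
```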
```go
// handleStaleResources handles the stale Spark Applications and database entries.
// It will delete the dangling resources without a matching NetworkPolicyRecommendation
// and add the running NetworkPolicyRecommendation back to the periodical watch list.
func (c *NPRecommendationController) handleStaleResources() error {
```
I kind of think we should separate the sync/retry of the two things we're doing here. Currently, populating the resync list won't be done if errors occur when deleting stale SA or db entries.
Currently handleStaleResources will be retried, as I use a gcworker to run it and add the key back upon error. But it's true that if deleting stale SA or db entries never succeeds, populating the resync list will not be executed. Would you prefer adding another worker to take care of populating the resync list?
Or we can persist which parts of the GC have successfully finished in gcworker, and only retry the parts that have not. Also, if everything is done, should gcworker quit rather than blocking on the queue (returning false on L299)?
Sounds great, updated!
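A rough sketch of that idea, with field names guessed from this thread (the final PR code may differ): the gc item carries flags for the parts of the cleanup that still need to run, and a retry only repeats the parts that failed.

```go
package main

import "fmt"

// gcTasks records which parts of the startup garbage collection remain.
type gcTasks struct {
	removeStaleDbEntries bool
	removeStaleSparkApp  bool
	addResync            bool
}

// handleStaleResources runs only the outstanding tasks and clears each flag
// once that task succeeds, so a retry skips the parts already finished.
func handleStaleResources(t *gcTasks) error {
	if t.removeStaleDbEntries {
		// ... delete dangling ClickHouse entries here ...
		t.removeStaleDbEntries = false
	}
	if t.removeStaleSparkApp {
		// ... delete dangling SparkApplications here ...
		t.removeStaleSparkApp = false
	}
	if t.addResync {
		// ... add running NPRs back to the periodic sync list here ...
		t.addResync = false
	}
	return nil
}

func main() {
	tasks := &gcTasks{removeStaleDbEntries: true, removeStaleSparkApp: true, addResync: true}
	if err := handleStaleResources(tasks); err != nil {
		fmt.Println("will retry the remaining tasks:", err)
	}
}
```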
```go
	require.NoError(t, err)

	// Check the SparkApplication and database entries of jobName1 do not exist
	cmd = fmt.Sprintf("kubectl get sparkapplication %s -n flow-visibility", jobName1)
```
Maybe wrap this and L215 in some retries, just in case the theia manager gets slow in removing them?
Updated, thanks! Also added some retries for NPR deletion as I found some failures.
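A minimal sketch of the kind of retry discussed here, assuming a kubectl-based check like the one in the diff above (helper name and timeouts are illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForSparkAppGone polls "kubectl get" until the named SparkApplication is
// no longer found in the flow-visibility namespace, or the timeout expires.
func waitForSparkAppGone(name string) error {
	return wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
		out, err := exec.Command("kubectl", "get", "sparkapplication", name, "-n", "flow-visibility").CombinedOutput()
		if err != nil && strings.Contains(string(out), "not found") {
			return true, nil // the resource has been removed
		}
		return false, nil // still present (or a transient error); keep polling
	})
}

func main() {
	if err := waitForSparkAppGone("pr-job-1"); err != nil {
		fmt.Println("SparkApplication was not removed in time:", err)
	}
}
```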
LGTM.
```go
		removeStaleSparkApp: true,
		addResync:           true,
	})
	go wait.Until(c.gcworker, time.Second, stopCh)
```
I think we only add gcKey to gcQueue when theia-manager is started? I'm wondering why we use wait.Until. Won't processNextGcWorkItem get blocked at line #333?
As the key will be added back to the queue when there is an error in handleStaleResources, I apply the same logic as the other work queues. And since processNextGcWorkItem will return false when handleStaleResources succeeds, the goroutine will finish at that point and won't be blocked. It should be possible to replace this with a simpler loop, but I'm not sure it would be a better solution.
Thanks YunTang, you're right. Using wait.Until keeps the goroutine running until stopCh is closed. Updated!
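To make the agreed-upon behavior concrete, a minimal sketch (method names follow this thread, not necessarily the PR's final code): the gc worker is started once, requeues the key on failure, and exits once processNextGcWorkItem reports there is nothing left to do.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

const gcKey = "gc"

type controller struct {
	gcQueue workqueue.RateLimitingInterface
}

func (c *controller) handleStaleResources() error {
	// ... delete stale SparkApplications / db entries, populate resync ...
	return nil
}

// processNextGcWorkItem returns false once the gc work has succeeded (or the
// queue is shut down), which lets gcworker exit instead of blocking forever.
func (c *controller) processNextGcWorkItem() bool {
	key, quit := c.gcQueue.Get()
	if quit {
		return false
	}
	defer c.gcQueue.Done(key)
	if err := c.handleStaleResources(); err != nil {
		c.gcQueue.AddRateLimited(key) // retry later
		return true
	}
	c.gcQueue.Forget(key)
	return false
}

func (c *controller) gcworker() {
	for c.processNextGcWorkItem() {
	}
}

func main() {
	c := &controller{gcQueue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())}
	defer c.gcQueue.ShutDown()
	c.gcQueue.Add(gcKey)
	c.gcworker() // runs until the gc key has been handled successfully, then returns
	fmt.Println("startup GC finished")
}
```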
LGTM
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
8a134e0 to 8bdcffe
/theia-test-e2e
Theia Manager needs to be resynchronized with the SparkApplications and db entries when it recovers from downtime. This commit resyncs them at startup.

Fixes #135

Signed-off-by: Yanjun Zhou zhouya@vmware.com