Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Use PodTemplates to support ImageJobs #555

Conversation

pmengelbert
Copy link
Contributor

@pmengelbert pmengelbert commented Dec 19, 2022

This PR was caused by limits on the size of k8s API requests. By default, kubectl apply will add a field, metadata.annotations.kubectl."kubernetes.io/last-applied-configuration", which is a JSON representation of the object. This effectively increases the size of the request.

We ran into size limits when adding a v1 of the eraser API. Because we want to support a multi-version API, the request to create the CRD for imagejobs was essentially doubled by the addition of the new version.

The component of the CRD that took up the most space was the ImageJob.Spec field, a PodTemplateSpec type. The generated YAML for this type is extremely long, with a lot of documentation. Being a core kubernetes resource, a PodTemplateSpec is very long and complex.

This commit leverages the kubernetes resource PodTemplate in order to remove the ImageJob.Spec field entirely. Instead of using the ImageJob to specify configuration of the eraser or collector pods, that responsibility is offloaded to a PodTemplate.

There is a 1:1 ratio between imagejobs and podtemplates. A given imagejob is expected to have a corresponding podtemplate with the same name, in the same namespace as the controller-manager pod. If no such podtemplate exists, an error is returned and the reconciliation loop spins until the podtemplate is found.

The OwnerReference of the podtemplate is set to its corresponding imagejob. When the imagejob is deleted, the podtemplate will be cleaned up as well.

Finally, I made a change allowing the user to disable the immediate triggering of collector jobs. This was done to avoid situations in the tests where we needed to wait for/delete the initial imagejob in order to test imagelists. Commit 7b49359 should be reverted in a later PR, after #568 has been addressed.

Signed-off-by: Peter Engelbert pmengelbert@gmail.com

@pmengelbert pmengelbert changed the title Use PodTemplates to support ImageJobs chore: Use PodTemplates to support ImageJobs Dec 19, 2022
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch 3 times, most recently from 2769fd9 to dd3ee29 Compare December 19, 2022 17:37
@codecov-commenter
Copy link

codecov-commenter commented Dec 19, 2022

Codecov Report

Merging #555 (e308a0a) into main (5afc212) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #555   +/-   ##
=======================================
  Coverage   14.81%   14.81%           
=======================================
  Files          13       13           
  Lines        1512     1512           
=======================================
  Hits          224      224           
  Misses       1260     1260           
  Partials       28       28           
Flag Coverage Δ
unittests 14.81% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch 7 times, most recently from 4f8a58d to 7b49359 Compare December 29, 2022 18:54
@pmengelbert pmengelbert marked this pull request as ready for review December 29, 2022 19:30
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch from 7b49359 to d1a82b0 Compare December 29, 2022 19:54
@pmengelbert pmengelbert marked this pull request as draft December 29, 2022 20:18
@pmengelbert pmengelbert marked this pull request as ready for review December 29, 2022 21:15
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Dec 29, 2022
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert mentioned this pull request Dec 29, 2022
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch from 30f5dad to 130d466 Compare December 30, 2022 01:00
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Dec 30, 2022
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch from 130d466 to f5faff7 Compare December 30, 2022 14:45
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Dec 30, 2022
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch from f5faff7 to 55a9a67 Compare December 30, 2022 15:42
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Dec 30, 2022
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
This PR was caused by limits on the size of k8s API requests. By
default, `kubectl apply` will add a field, `metadata.annotations.kubectl."kubernetes.io/last-applied-configuration"`,
which is a JSON representation of the object. This effectively increases
the size of the request.

We ran into size limits when adding a `v1` of the eraser API. Because we
want to support a multi-version API, the request to create the CRD for
imagejobs was essentially doubled by the addition of the new version.

The component of the CRD that took up the most space was the
`ImageJob.Spec` field, a `PodTemplateSpec` type. The generated YAML for
this type is extremely long, with a lot of documentation. Being a core
kubernetes resource, a `PodTemplateSpec` is very long and complex.

This commit leverages the kubernetes resource `PodTemplate` in order to
remove the `ImageJob.Spec` field entirely. Instead of using the
`ImageJob` to specify configuration of the `eraser` or `collector` pods,
that responsibility is offloaded to a `PodTemplate`.

There is a 1:1 ratio between imagejobs and podtemplates. A given
imagejob is expected to have a corresponding podtemplate with the same
name, in the same namespace as the controller-manager pod. If no such
podtemplate exists, an error is returned and the reconciliation loop
spins until the podtemplate is found.

The OwnerReference of the podtemplate is set to its corresponding
imagejob. When the imagejob is deleted, the podtemplate will be cleaned
up as well.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
This code should be reverted once the timing logic has been sorted out.
See issue eraser-dev#568 for details.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Also, remove unnecessary goroutine.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert force-pushed the chore/podtemplate_to_circumvent_api_request_limits branch from 55a9a67 to 91c3382 Compare December 30, 2022 20:00
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Dec 30, 2022
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@@ -154,7 +155,13 @@ func add(mgr manager.Manager, r reconcile.Reconciler) error {
return err
}

go func() {
delay := *repeatPeriod
if *repeatImmediate {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for? is this different than setting repeat period to 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collect-scan-erase process can either begin immediately or after the specified delay. Our default has been to start immediately, with no option to change that behavior. This change makes it possible to do either.

The only reason I added this to this PR is because the collector pods were messing up the tests. Some tests (for example imagelist_rm_images) test the behavior of the imagelist, and have to jump through hoops to not conflict with the collector which begins immediately after deployment. example: https://github.com/Azure/eraser/blob/76aa58d70fb1cce6f2c61f466dd4b34dceb470b5/test/e2e/tests/imagelist_rm_images/eraser_test.go#L51

Ideally I'd like to move to a more robust scheduling system (example: allow the user to specify a cron string).

Copy link
Contributor Author

@pmengelbert pmengelbert Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for? is this different than setting repeat period to 0?

Sorry, misunderstood your question. Yes, it's different. Here's how it works:

  • If repeatPeriod = 30 min and repeatImmediate = true, then we will run one job at startup, then the next job 30 minutes after that
  • If repeatPeriod = 30 min and repeatImmediate = false, then we will wait 30 mins after startup to run the first job

On the other hand, setting repeatPeriod to 0 would start immediately and repeat endlessly with no delay

Copy link
Member

@sozercan sozercan Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, repeat seems a bit misleading. perhaps we can name it something like schedule-immediate, queue-immediate or erase-immediate (or anything else you can think of)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

},
Resources: corev1.ResourceRequirements{
Requests: corev1.ResourceList{
"cpu": resource.MustParse("7m"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we hardcoding these?

Copy link
Contributor Author

@pmengelbert pmengelbert Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been hardcoded for some time, the code just moved due to indentation changes, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this part of the config we talked about before?

Copy link
Contributor Author

@pmengelbert pmengelbert Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be. I'll leave a link to this comment in the configmap issue #446

}()

log.Info("...Queued first ImageCollector reconcile")
})

return nil
}

//+kubebuilder:rbac:groups="batch",resources=jobs,verbs=get;list;create;delete;watch
Copy link
Member

@sozercan sozercan Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this or is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still see this, do we want to remove in this pr?

Copy link
Contributor Author

@pmengelbert pmengelbert Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, you meant the kubebuilder comment? Or the log.Info? Both have been removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, should have clarified. I meant kubebuilder marker

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
- rename `repeatImmediate` to `scheduleImmediate`
- remove kubebuilder comment for k8s `jobs` which we no longer use

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Mistakenly had the old flag, `--repeat-immediate` instead of the new,
`--schedule-immediate`

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Copy link
Member

@sozercan sozercan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! LGTM

pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Jan 5, 2023
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert merged commit 653a120 into eraser-dev:main Jan 5, 2023
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Jan 5, 2023
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Jan 6, 2023
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert deleted the chore/podtemplate_to_circumvent_api_request_limits branch January 6, 2023 16:14
pmengelbert added a commit to pmengelbert/eraser that referenced this pull request Jan 6, 2023
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
pmengelbert added a commit that referenced this pull request Jan 6, 2023
* Create v1 api

In order to create the v1 api for eraser, a number of additional changes
had to be made.

Our eraser-tooling image had to be modified to include the
`conversion-gen` tool. This is to generate type conversions between the
older `v1alpha1` API and the newer `v1`.

To facilitate conversion between the two, an `unversioned` api was
created. This becomes a common hub between `v1` and `v1alpha1` and
facilitates conversion. It also makes it easier for our code: with the
`unversioned` api, we no longer need to update large swaths of code when
a new api version is introduced.

In order to get the `conversion-gen` tool to work, I needed to add
several magic comments throughout the api code. Most importantly, I
needed to add a `doc.go` file to each of the api version directories.
`conversion-gen` looks for this file and will not work without it. Above
the package name in those files, I added the following two magic
comments:

```
+// +k8s:conversion-gen=github.com/Azure/eraser/api/unversioned
+// -external-types=github.com/Azure/eraser/api/v1
```

This tells `conversion-gen` to use the `unversioned` api as the
conversion hub.

To skip kubebuilder processing the `unversioned` api,
`+kubebuilder:skip` had to be added before the package declaration. To
specify which api version should be stored in etcd, the
`+kubebuilder:storageversion` comment was added to the `v1` types. In
addition, `kubebuilder:deprecatedversion:warning="v1alpha1 of the eraser API has been deprecated"`
was added to the `v1alpha1` API in order to warn the user to upgrade.

In the `groupversion_info.go` file of each version directory, I had to
add the line

```go
localSchemeBuilder = runtime.NewSchemeBuilder(SchemeBuilder.AddToScheme)
```

To be perfectly honest, I'm not sure what this does, but it is required
for the `conversion-gen` process to generate working code.

Finally, a new file was added to `test/e2e/test-data` which deploys an
imagelist using the new API version. One of the tests was updated to use
this file instead.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Re-run generation and make manifests

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Remove unused lines

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* For now, use --server-side to circumvent CRD size problems

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Rebase on PodTemplate changes (#555) and regenerate

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Remove ImageJobSpec everywhere

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Address linter concerns

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Add empty package

In order to avoid changing the Makefile every time  a new version is
added, add an empty package to the `./api` directory in order to use
the `...` wildcard.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Remove --server-side flag from Makefile

This flag will no longer be needed, because we have reduced the size of
the CRDs.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Remove --server-side flag in Makefile (again)

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Instruct user to migrate to v1 in deprecation warning

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

* Re-generate manifests and codegen after rebasing

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants