Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mass pruning tool for unused RHCOS AMIs in rh-dev #3373

Merged
merged 4 commits into from
Dec 21, 2022

Conversation

jupierce
Copy link
Contributor

@jupierce jupierce commented Dec 8, 2022

No description provided.

@jupierce jupierce force-pushed the clean-up-snapshots branch 2 times, most recently from db3ce8f to a081589 Compare December 8, 2022 18:23
@jupierce jupierce force-pushed the clean-up-snapshots branch 2 times, most recently from cc930b9 to d12a701 Compare December 8, 2022 22:29
@mike-nguyen
Copy link
Contributor

This is great! Is there any way to get graph data for 4.5 and prior?

@jupierce
Copy link
Contributor Author

@mike-nguyen - Not that the data is gone, but thankfully I ran this before the release controller team archived older releases. You can see that the <= 4.5 releases are mentioned in the analysis I posted: https://drive.google.com/file/d/19BVPz7qn-K_ADO8WZKSQBgXHB-81-3ND/view?usp=share_link . I'll build off this cache if we run this. I'll tag AMIs that are preserved so we don't have to do any fancy work to remember this fact in the future.

@jlebon
Copy link
Contributor

jlebon commented Dec 12, 2022

Would be good to add a top-level docstring documenting the approach of this tool. IIUC, it's crawling through the update graph and then collating the AMIs to keep by looking at each node and its associated rhcos.json metadata in the corresponding openshift/installer repo commit.

I think it's a valid approach, but I think a better approach long-term is to prune at the cosa build level since it's not just AMIs that we want to prune. The code here could build a list of cosa build IDs we want to keep, and a GC process could prune all the cosa builds (and the referenced remote resources within, which include AMIs) which aren't on that list and older than X. The larger issue for that is tracked at coreos/fedora-coreos-tracker#99 and parts of the functionality exists in https://github.com/coreos/coreos-assembler/blob/main/src/cmd-remote-prune.

As is, this code will invalidate AMIs referenced in the meta.json of their associated cosa builds which isn't a huge concern, but isn't great either. I wouldn't consider it a blocker to do it this way.

@mike-nguyen
Copy link
Contributor

@mike-nguyen - Not that the data is gone, but thankfully I ran this before the release controller team archived older releases. You can see that the <= 4.5 releases are mentioned in the analysis I posted: https://drive.google.com/file/d/19BVPz7qn-K_ADO8WZKSQBgXHB-81-3ND/view?usp=share_link . I'll build off this cache if we run this. I'll tag AMIs that are preserved so we don't have to do any fancy work to remember this fact in the future.

I looked at the file in my browser and didn't realize it only loaded partially until I scrolled down more.

Copy link
Contributor

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems legit

# No machine-os version information in the image. In old releases, use image info for the
# machine-os-content image.
print(' Querying for machine-os-content info')
moc_pullspec = list(filter(lambda tag: tag['name'] == 'machine-os-content', release_info['references']['spec']['tags']))[0]['from']['name']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty soon we will cease to have a machine-os-content. a refinement here would look up the container configured as primary in group.yml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add that, but the method itself is a fallback. In current builds, the rhcos version is in the payload metadata and doesn't require inspecting the rhcos image: https://github.com/openshift/aos-cd-jobs/pull/3373/files#diff-a3f55c4ee0db478ae07381ad1394f41bc69ea3b33c5f606ace1e946afd082c27R87

# machine-os-content image.
print(' Querying for machine-os-content info')
moc_pullspec = list(filter(lambda tag: tag['name'] == 'machine-os-content', release_info['references']['spec']['tags']))[0]['from']['name']
if 'registry.svc' in moc_pullspec:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. that wouldn't ever happen in a release image we promoted would it? doesn't hurt to have this as failsafe but seems unnecessary. likewise below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this because some ancient CI payload did. Nothing ART promoted.

Comment on lines 153 to 157
if region_name not in installer_ami_ids:
installer_ami_ids[region_name] = {}
installer_ami_ids[region_name]['images'] = []
ami_id = region_entry['image']
installer_ami_ids[region_name]['images'].append(ami_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not important but i just prefer brevity

Suggested change
if region_name not in installer_ami_ids:
installer_ami_ids[region_name] = {}
installer_ami_ids[region_name]['images'] = []
ami_id = region_entry['image']
installer_ami_ids[region_name]['images'].append(ami_id)
ami_id = region_entry['image']
installer_ami_ids.setdefault(region_name, {}).setdefault('images', []).append(ami_id)

cache_entry = yaml.dump({
payload_key: rhcos_in_use[payload_key]
})
# Write, line by line, to the cache file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor note that yaml.dump includes a line break.

for a moment i was confused thinking each entry would be one line. perhaps better to say:

Suggested change
# Write, line by line, to the cache file
# Write to the cache file separated by a blank line

Comment on lines 209 to 223
if payload_key_references['payload_version'] in ami_info:
print(f'Will not clean up RHCOS AMI {image_id} ({image_name}) since it is referenced by a release controller release payload {payload_key}')
matches_payload_keys.append(payload_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually care about release payload references to AMIs? we only use AMIs at install time, right? and i'm pretty sure we don't reference AMIs anywhere besides build metadata and the installer so don't we only need to keep things that the installer refers to?

the difference isn't going to be huge though if you want to be cautious here and keep both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the difference isn't going to be huge though if you want to be cautious here and keep both.
I don't think we do need them. So, yes, this is just a cautionary step.

if matches_payload_keys:
will_preserve.add(f'{aws_region}:{image_id}')

if not matches_payload_keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not matches_payload_keys:
else:

? :-)

Comment on lines 79 to 91
if '.ci' in payload_version or '.nightly' in payload_version:
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... we don't want to preserve everything in a nightly for the indefinite future, but we do want to preserve it for long enough to become a promoted release. So should there also just be a blanket exemption for AMIs until they're old enough (say, a month)? Presumably garbage collection config would cover that, perhaps that's why it's not built in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - If we ran this again the future, older nightlies would have been garbage collected from the release controller. As written, this tool would not try to preserve them.

@mike-nguyen
Copy link
Contributor

I was just working on some boot image bumps and realized that this tool could mark images for deletion that are in openshift/installer pull requests that have not merged. I don't think that should affect this PR but just something to be aware of when this is setup to run.

With the new RHCOS pipeline, we won't be replicating to all regions unless we run the cloud-replicate job (which will only be used when creating boot images) so that should cut down on the number of snapshots on AWS.

- AMIs less than 30 days old will not be considered for pruning.
- Recycle bin enabling in all regions
- Tags are now being applied to images to indicate whether they should
  be garbage collected or preserved.
@jupierce
Copy link
Contributor Author

@mike-nguyen . In the latest push, I've excluded images created in the last 30 days.

RHCOS suggested that an exhaustive scan of installer commits
was safer than isolating AMIs associated with active releases.

git_depth = 0
while True:
installer_commit: Commit = f'upstream/master~{git_depth}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going down master only. I think we need to go down from all the release branches too, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call. I wasn't considering post-cut changes. Updated.

# or older format like https://github.com/openshift/installer/blob/bfdba2d19e3fa1105bab93aa4cce8bb7f44b4b2c/data/data/rhcos.json
amis_in_file = list(re.findall('ami-[a-z0-9]+', str(gitshow_process.stdout)))
amis_in_use.update(amis_in_file)
git_depth += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of visiting every commit on the branch, it'd be more efficient to ask git for only the commits which touch the files we care about and then only visit those. Something like:

git log --format=%H <branch> -- data/data/coreos/rhcos.json data/data/rhcos.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Updated.

@jupierce
Copy link
Contributor Author

That latest push scans all branches and finds 2159 AMIs from the installer repo.

Copy link
Contributor

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks sane to me! One thing I double-checked was whether we've used rhcos.json even from 4.1.0, which we did.

@jupierce jupierce merged commit d029e71 into openshift-eng:master Dec 21, 2022
@mike-nguyen
Copy link
Contributor

LGTM to run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants