[Retrospective] Release Version 2.17.0 #4909

Open
github-actions bot opened this issue Aug 3, 2024 · 11 comments
@github-actions
Contributor

github-actions bot commented Aug 3, 2024

Related release issue?

#4908

How to use this issue?

Please add comments to this issue; they can be small or large in scope. Honest feedback is important for improving our processes. Suggestions are also welcome but not required.

What will happen to this issue post release?

There will be one or more discussions about how the release went and how the next release can be improved. This issue will then be updated with the notes from those discussions alongside action items.

github-actions bot added the release, untriaged, and v2.17.0 labels Aug 3, 2024
gaiksaya removed the untriaged label Aug 15, 2024
gaiksaya self-assigned this Aug 21, 2024
gaiksaya pinned this issue Aug 28, 2024
@msfroh

msfroh commented Sep 5, 2024

I think our best bet for the future is to cut the release branch on core as the "freeze" trigger. Maybe core should freeze a couple of days before plugins?

Then backports to the release branch on core would only be for critical bug fixes. Compare that with the past ~24 hours, where a flood of features got merged to main, backported to 2.x, and backported to the release branch.

With the update to release+1 across main and 2.x, it would only disrupt merging of a small number of critical bug fixes (versus the current flood of last-minute features).

@msfroh

msfroh commented Sep 5, 2024

(Of course we should also stop trying to squeeze features in the last 24-48 hours before a release, but try telling that to people whose paychecks depend on shipping features.)

@harshavamsi

Coming from #4908 (comment), I think it would be nicer to have a trigger for the manifest update job so that artifacts are published for branches as soon as they are cut. Once a branch has been cut, the expectation is that the branch is ready to build and artifacts can be published. We then increment the version on 2.x to the next minor version, and bwc gets an update as well. Once 2.x is on the next minor version and has the current minor version as bwc, the gradle checks on main that trigger the bwc tests expect build artifacts to be ready for the current minor version. But since the manifest update runs as a cron job, it can take some time for this to happen, and PRs on main start failing. This happened after the 2.16 branch was cut, and again on 2.17.

A better way would be to trigger a manifest update along with the branch cut so that the artifacts are available immediately.
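
To make the timing gap concrete, here is a minimal sketch comparing the two approaches. The cron interval, build time, and all names are hypothetical and only illustrate the reasoning above; they are not taken from the actual manifest update workflow.

```java
import java.time.Duration;

// Hypothetical model: none of these names or numbers come from the real build workflows.
public class ManifestUpdateDelaySketch {

    // With a cron-driven manifest update, bwc checks on main can fail from the
    // branch cut until the next cron run has finished building artifacts.
    static Duration cronGap(Duration sinceLastCronRun, Duration cronInterval, Duration buildTime) {
        Duration untilNextRun = cronInterval.minus(sinceLastCronRun);
        return untilNextRun.plus(buildTime);
    }

    // With a trigger fired by the branch cut, only the build time remains.
    static Duration triggeredGap(Duration buildTime) {
        return buildTime;
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        Duration cronInterval = Duration.ofHours(4);
        Duration sinceLastRun = Duration.ofHours(1);
        Duration buildTime = Duration.ofHours(2);

        System.out.println("cron-driven gap: " + cronGap(sinceLastRun, cronInterval, buildTime));
        System.out.println("triggered gap:   " + triggeredGap(buildTime));
    }
}
```

Under these assumptions the triggered gap is never longer than the cron-driven one, since the wait for the next scheduled run disappears.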

@gaiksaya
Member

gaiksaya commented Sep 5, 2024

Delays in merging version increment PRs caused build failures and delayed RC generation.
reportsDashboards: opensearch-project/dashboards-reporting#404

@cwperks
Member

cwperks commented Sep 10, 2024

ISM and ISM Dashboards had failing tests in 2.16 that were flagged again in 2.17 release testing. The tests were failing because of logic in the ISM tests that cleans up test suites by deleting ISM system indices.

Originally planned for 2.16 (now 2.17), the security plugin introduced a change to identify whether a request matches system indices by checking a central System Index Registry from core. The index patterns in the registry are the indices registered through the SystemIndexPlugin.getSystemIndexDescriptors extension point. Before 2.17, the security plugin relied on an opensearch.yml setting to be aware of the system index patterns in the cluster. The list that is provided in the demo configuration can be found here and has never included ISM indices. As a result, ISM indices were treated as regular indices (from the security plugin's perspective) and regular users could access them like any other index. Security gives special protections to system indices and forbids regular users from having direct access. Actions like writing to the index and deleting the index are strictly forbidden for regular users. Plugins needing system index access have mechanisms for accessing the system index when needed.
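
The behavior change can be sketched with a toy model. This is not the security plugin's code or the OpenSearch API: the class, method names, and pattern lists below are hypothetical, and `.opendistro-ism-*` / `.opendistro-ism-config` are assumed ISM index names used only for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy model only: these names are illustrative and do not exist in OpenSearch or the security plugin.
public class SystemIndexCheckSketch {

    // Pre-2.17 style: patterns listed in a static setting (assumed here to omit ISM indices).
    static final List<String> YAML_PATTERNS = List.of(".opendistro_security");

    // 2.17 style: patterns collected from what plugins register (assumed here to include ISM).
    static final List<String> REGISTRY_PATTERNS = List.of(".opendistro_security", ".opendistro-ism-*");

    // Match an index name against a simple wildcard pattern where '*' means "any characters".
    static boolean matches(String wildcard, String indexName) {
        String regex = Pattern.quote(wildcard).replace("*", "\\E.*\\Q");
        return Pattern.matches(regex, indexName);
    }

    static boolean isSystemIndex(List<String> patterns, String indexName) {
        return patterns.stream().anyMatch(p -> matches(p, indexName));
    }

    // Regular users may delete an index only if it is not treated as a system index.
    static boolean regularUserMayDelete(List<String> patterns, String indexName) {
        return !isSystemIndex(patterns, indexName);
    }

    public static void main(String[] args) {
        String ismIndex = ".opendistro-ism-config";
        // Allowed while ISM indices are absent from the configured list.
        System.out.println(regularUserMayDelete(YAML_PATTERNS, ismIndex));     // true
        // Forbidden once the registry reports them as system indices.
        System.out.println(regularUserMayDelete(REGISTRY_PATTERNS, ismIndex)); // false
    }
}
```

In this model, test cleanup that deletes the ISM index as a regular user succeeds against the first list and is rejected against the second, which mirrors the failures described here.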

ISM tests were flagged in 2.17 again because the CI Checks in ISM and ISM Dashboards repos are insufficient.

The existing ISM tests with security only run a subset of the integ tests. Opened a PR here to address the ISM plugin issues.

ISM Dashboards has cypress tests that only run without security for PR checks. I opened a PR to address the failing checks for ISM Dashboards, but there should be a follow-up change to add a PR check that runs the cypress tests with security. A change was also made in the FTR repo, which has similar test cleanup logic that removes system indices.

@vikasvb90

@cwperks The real problem is not that ISM failed to make the required changes in time to support a breaking change pushed by security, but that security made breaking changes without any campaign. We have seen other instances, such as the CA certs, where changes pushed by security led to failures in other plugins. Adding a cypress test to PR checks is still a reactive approach: a plugin spends time investigating, figures out that the issue is not related to the plugin, follows up with security, and gets the issue fixed. I don't see any value in this approach, although I agree with the overall action item of executing tests on PR workflow runs.

@cwperks
Member

cwperks commented Sep 11, 2024

@vikasvb90 The ISM tests are an issue that needs to be addressed. If the ISM indices are supposed to be treated like regular indices, then there is no need to register them with the SystemIndexPlugin.getSystemIndexDescriptors() extension point.

An index is a system index if it's important to the integrity of the cluster and deleting it would cause the cluster to enter a corrupted state. For example, the security index (.opendistro_security) is a system index that contains the security posture of a cluster. Deleting the security index would render a cluster corrupt because the nodes would not be able to read the security posture.

All plugins should run integ tests with and without security to avoid catching test failures only at release time.

For the demo cert renewal, the demo certs were initially updated to modify the SAN to include the IPv6 loopback address. @DarshitChanpura updated the certs and opened PRs on all repos that maintained copies of the certs, but ISM was missed in 2.13: opensearch-project/security#4061

In a future release, I'd like to run a campaign across plugins to remove copies of the demo certs and instead refer to the certs centrally. See an example of how the SQL plugin pulls them from the security repo here: https://github.com/opensearch-project/sql/blob/4303a2ab755d53903094dd94a5100572677a27a1/integ-test/build.gradle#L107-L111

@vikasvb90

@cwperks I am not questioning the intention of the change. I agree that the change needs to be there, but such changes need to be called out proactively, through a campaign or some other mechanism. By the time a plugin finds this out and calls it out, it is already too late.
Also, I understand that issues happen and that we can take reactive measures to fix them, but we should at least try to reduce their likelihood by taking proactive measures before pushing breaking changes.

@DarshitChanpura
Member

@vikasvb90 One of the biggest pro-active measures to catch these early on is to add a CI check that runs tests with security enabled, since we run security-enabled tests for all RCs. This would enable plugin owners to debug any failures and/or reach out to the security team to get them addressed as soon as possible.

@vikasvb90

vikasvb90 commented Sep 12, 2024

@DarshitChanpura That's not being pro-active. If you rely on other plugins to come and tell you that something is buggy in security, then that is being reactive, and it ends up wasting a lot of dev cycles. Tests in other plugins should just be treated as the last line of defense.

@gaiksaya
Member

#5016
