[Retrospective] Release Version 2.17.0 #4909

github-actions · 2024-08-03T00:19:42Z

Related release issue?

How to use this issue?

Please add comments to this issue; they can be small or large in scope. Honest feedback is important for improving our processes. Suggestions are welcome but not required.

What will happen to this issue post release?

There will be one or more discussions about how the release went and how the next release can be improved. This issue will then be updated with the notes from those discussions alongside action items.

msfroh · 2024-09-05T06:59:37Z

I think our best bet for the future is to cut the release branch on core as the "freeze" trigger. Maybe core should freeze a couple of days before plugins?

Then backports to the release branch on core would only be for critical bug fixes. Compare that with the past ~24 hours, where a flood of features got merged to main, backported to 2.x, and backported to release.

With the update to release+1 across main and 2.x, it would only disrupt merging of a small number of critical bug fixes (versus the current flood of last-minute features).

msfroh · 2024-09-05T07:01:20Z

(Of course we should also stop trying to squeeze features in the last 24-48 hours before a release, but try telling that to people whose paychecks depend on shipping features.)

harshavamsi · 2024-09-05T07:04:05Z

Coming from #4908 (comment), I think that it would be nicer to have a trigger for the manifest update job to publish artifacts for branches that have been cut. Once a branch has been cut, the expectation is that the branch is ready to build and artifacts can be published. We then increment the version on 2.x to the next minor version, and bwc gets an update as well. Once 2.x is on the next minor version and has the current minor version as bwc, gradle checks on main that trigger the bwc test expect build artifacts to be ready for the current minor version. But since the manifest update runs on a cron job, it can take some time for this to happen, and PRs on main start failing. This happened after the 2.16 branch was cut as well as on 2.17.

A better way would be to trigger a manifest update along with the branch cut so that the artifacts are available immediately.
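To make the sequence above concrete, here is a minimal Python sketch of what a branch-cut handler might compute. Everything in it (the `Version` class, `on_branch_cut`, and the keys of the returned plan) is hypothetical and only illustrates the idea; it is not the actual build tooling or workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int = 0

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Expects a plain "major.minor.patch" string such as "2.17.0".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def next_minor(self) -> "Version":
        return Version(self.major, self.minor + 1, 0)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def on_branch_cut(released: str) -> dict:
    """Hypothetical handler: describe the follow-up work for a branch cut."""
    current = Version.parse(released)
    return {
        "release_branch": f"{current.major}.{current.minor}",
        "next_version_on_dev_branch": str(current.next_minor()),
        # BWC tests on the dev branches will look for artifacts of this version.
        "bwc_version_needing_artifacts": str(current),
        # Assumption: the manifest update is fired by this event, not by cron.
        "trigger_manifest_update": True,
    }


plan = on_branch_cut("2.17.0")
print(plan)
```

The point of the sketch is that the version bump, the new bwc version, and the manifest update all follow from the same event, so none of them needs to wait for a schedule.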

gaiksaya · 2024-09-05T20:20:33Z

Delays in merging version increment PRs caused build failures and delayed RC generation.
reportsDashboards: opensearch-project/dashboards-reporting#404

cwperks · 2024-09-10T21:02:39Z

ISM and ISM Dashboards had failing tests in 2.16 that were flagged again in 2.17 release testing. The tests were failing due to logic in ISM tests that would clean up test suites by deleting ISM system indices.

Originally planned in 2.16 (now 2.17), the security plugin introduced a change to identify if a request matches system indices by checking a central System Index Registry from core. The index patterns in the registry are indices registered with the SystemIndexPlugin.getSystemIndexDescriptors extension point. Before 2.17, the security plugin relied on an opensearch.yml setting to be aware of the system index patterns in the cluster. The list that is provided in the demo configuration can be found here and has never included ISM indices. As a result, ISM indices were treated as regular indices (from security plugin perspective) and regular users could access the indices as any other index. Security gives special protections to system indices and forbids regular users from having direct access. Actions like writing to the index and deleting the index are strictly forbidden for regular users. Plugins needing system index access have mechanisms for accessing the system index when needed.
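As a rough illustration of the registry check described above, the following Python sketch matches an index name against a list of registered patterns. It is not the security plugin's implementation: the pattern list, the `.example-plugin-config*` entry, and both function names are made up for the example, and real pattern matching in OpenSearch may behave differently from `fnmatch`.

```python
from fnmatch import fnmatchcase

# Hypothetical registry contents. ".opendistro_security" comes from this
# thread; ".example-plugin-config*" is an invented pattern for illustration.
SYSTEM_INDEX_PATTERNS = [".opendistro_security", ".example-plugin-config*"]


def is_system_index(index_name, patterns=SYSTEM_INDEX_PATTERNS):
    """Return True if the index name matches any registered pattern."""
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


def regular_user_may_access(index_name, patterns=SYSTEM_INDEX_PATTERNS):
    """Regular users get no direct access to indices in the registry."""
    return not is_system_index(index_name, patterns)


for name in [".opendistro_security", ".example-plugin-config-001", "logs-2024"]:
    print(name, "system index:", is_system_index(name))
```

Under this model, a test that cleans up by deleting an index matching a registered pattern would be rejected when it runs as a regular user, which is the failure mode described for the ISM tests.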

ISM tests were flagged in 2.17 again because the CI Checks in ISM and ISM Dashboards repos are insufficient.

The existing ISM tests with security are only operating on a subset of integ tests. Opened a PR here to address the ISM plugin issues.

ISM Dashboards has cypress tests that are only testing w/o security for PR checks. I opened a PR to address the failing checks for ISM Dashboards, but there should be a follow-up change to add a PR check to run cypress tests with security. There was also a change made in the FTR repo where there is similar test cleanup logic that removes system indices.

vikasvb90 · 2024-09-11T01:21:53Z

@cwperks The real problem is not that ISM failed to make the required changes in time to support a breaking change pushed by security, but that security made breaking changes without any campaign. We have seen other instances, such as the CA certs, where changes pushed by security led to failures in other plugins. Adding a cypress test in PRs is still a reactive approach where a plugin spends time investigating, figures out that the issue is not related to the plugin, follows up with security, and gets the issue fixed. I don't see any value in this approach, although I agree on the overall AI of executing tests on PR workflow runs.

cwperks · 2024-09-11T04:17:44Z

@vikasvb90 The ISM tests are an issue that needs to be addressed. If the ISM indices are supposed to be treated like regular indices, then there is no need to register them with the SystemIndexPlugin.getSystemIndexDescriptors() extension point.

An index is a system index if it's important to the integrity of the cluster and deleting it would cause the cluster to enter a corrupted state. For example, the security index (.opendistro_security) is a system index that contains the security posture of a cluster. Deleting the security index would render a cluster corrupt because the nodes would not be able to read the security posture.

All plugins should run integ tests with and without security to avoid catching test failures only at release time.

For the demo cert renewal, the demo certs were initially updated to modify the SAN to include IPv6 loopback address. @DarshitChanpura updated the certs and opened PRs on all repos that maintained copies of the certs, but ISM was missed in 2.13: opensearch-project/security#4061

In a future release, I'd like to run a campaign across plugins to remove copies of demo certs and instead to refer to the certs centrally. See example of how SQL plugin pulls them from the security repo here: https://github.com/opensearch-project/sql/blob/4303a2ab755d53903094dd94a5100572677a27a1/integ-test/build.gradle#L107-L111

vikasvb90 · 2024-09-11T04:22:26Z

@cwperks I am not questioning the intention of the change. I do agree that the change needs to be there, but such changes need to be called out proactively through a campaign or some other mechanism. By the time a plugin finds this out and calls it out, it is already too late.
Also, I understand that issues happen and we can take reactive measures to fix them, but attempts should at least be made to reduce their likelihood by taking proactive measures before pushing breaking changes.

DarshitChanpura · 2024-09-12T14:32:02Z

@vikasvb90 One of the biggest pro-active measures to catch these early-on is to add a CI check that runs tests with security enabled, since we run security-enabled tests for all RCs. This would enable plugin owners to debug any failures, and/or reach out to security team to get these addressed as soon as possible.

vikasvb90 · 2024-09-12T15:28:00Z

@DarshitChanpura That's not being pro-active. If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive. And it ends up wasting a lot of dev cycles. Tests in other plugins should just be treated as last line of defense.

gaiksaya · 2024-09-13T04:39:22Z

#5016

msfroh · 2024-09-16T20:25:16Z

@vikasvb90 -- what procedural change are you suggesting? You are saying no to improved CI checks in one plugin (which seems like a ridiculous argument to be making -- "No! I don't want to add more tests to cover my plugin in more scenarios!"). What do you think would be better?

Should the security plugin have an integration test that runs all plugins' integ tests with security enabled?

> If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive.

Also, what's buggy about the security plugin trying to protect system indices? That seems like a good idea and the intended behavior of the change. What's the bug?

gaiksaya · 2024-09-16T21:29:42Z

Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue.

OpenSearch-Dashboards. See data here

Core
observabilityDashboards
securityAnalyticsDashboards
notificationsDashboards
indexManagementDashboards (Had to run several times)
reportsDashboards (Was fixed recently but still need to check if that was the actual cause)

OpenSearch components: See data here

skills
ml-commons
alerting
index-management
geospatial

vikasvb90 · 2024-09-17T02:18:32Z

@msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.

msfroh · 2024-09-17T02:21:24Z

> @msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.

Yeah -- that seems like a reasonable ask. I would skip asking for sign-offs, personally. I would open issues saying "We're going to do X in a few days unless we hear back from you." Essentially, it would be opt-out instead of opt-in.
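The opt-out flow discussed here can be sketched as a small decision function. This is only an illustration in Python: the function name, the response values, and the rule that an explicit objection blocks the merge are assumptions made for the example, not an agreed process.

```python
from datetime import date, timedelta


def campaign_decision(opened_on, today, bake_days, responses):
    """Decide what to do with a breaking change announced to dependent plugins.

    `responses` maps a plugin name to "ok", "objection", or None (no reply).
    Returns a (decision, plugins) tuple.
    """
    objections = sorted(p for p, r in responses.items() if r == "objection")
    if objections:
        # Assumption: any explicit objection pauses the merge for follow-up.
        return "blocked", objections

    bake_ends = opened_on + timedelta(days=bake_days)
    pending = sorted(p for p, r in responses.items() if r is None)
    if today < bake_ends and pending:
        return "waiting", pending

    # Bake time is over (or everyone replied): silence counts as no objection.
    return "merge", []


opened = date(2024, 9, 1)
replies = {"plugin-a": "ok", "plugin-b": None}
print(campaign_decision(opened, date(2024, 9, 3), 7, replies))
print(campaign_decision(opened, date(2024, 9, 8), 7, replies))
```

The list of dependent plugins is an input here; how it is gathered (for example, from build.gradle files, as suggested above) is outside the sketch.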

sandeshkr419 · 2024-09-18T22:52:45Z

@gaiksaya

> Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....

I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.

navneet1v · 2024-09-19T07:16:53Z

> @gaiksaya Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....
>
> I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.

+1 on this. Flaky tests in core have been flagged so many times. @msfroh one reason for stalled PRs in core could be gradle checks taking forever to complete and then failing due to flaky tests. I know there has been a lot of brainstorming on this. I have one suggestion that might help reduce the time for gradle checks:

I think we should split core test runs into multiple CI runs to ensure that more tests can run in parallel and also if a re-run is required we are re-running only some tests.
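One way to picture the split is a deterministic assignment of tests to shards, so that each CI job runs one shard and a re-run only repeats that shard. The Python sketch below is an assumption-laden illustration: the function names and test names are invented, hashing by name is just one possible strategy, and it ignores test duration, which a real split would probably want to balance.

```python
import hashlib


def shard_for(test_name, shard_count):
    """Map a test name to a shard index in a stable, run-independent way."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(test_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def split_tests(test_names, shard_count):
    """Group test names into `shard_count` lists, one per CI job."""
    shards = [[] for _ in range(shard_count)]
    for name in sorted(test_names):
        shards[shard_for(name, shard_count)].append(name)
    return shards


# Hypothetical test class names, used only to exercise the functions.
tests = [f"org.example.SuiteTests{i}" for i in range(12)]
for index, shard in enumerate(split_tests(tests, 3)):
    print(f"shard {index}: {len(shard)} tests")
```

Because the assignment depends only on the test name and the shard count, a failed shard can be re-run on its own and will contain the same tests.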

navneet1v · 2024-09-19T07:22:38Z

The k-NN team had a learning in this release: we identified that all the integTests that run during CI run without the heap CB enabled, but the Jenkins pipeline uses the RC to create the cluster. The RC's default settings are different from those of the integTest cluster. Heap CB not being enabled was one such difference, and this led to integTest failures for k-NN in the Jenkins pipeline while the same tests were successful in the CIs.
More details can be found on this GH issue: opensearch-project/OpenSearch#15849
I have added the capability to override the setting to enable heap CB if required by plugins ref: opensearch-project/OpenSearch#15906. But I think this is something which all plugins should enable in their CIs.

kolchfa-aws · 2024-09-19T19:37:21Z

The doc team received a lot of last-minute PRs, and even issues that were entered after the first RC date. We are suggesting the following to ensure that the RC entrance criteria from the docs side are met:

To adhere to the RC entrance criteria, provide a PR by the first RC date
Open a documentation issue as soon as possible for the feature you're working on so we can load-balance effectively
For features requiring significant documentation, communicate with the doc team in advance so we can provide support
For frontend features (Dashboards), provide a video demo of the feature by the first RC date
Work on a branch in your fork instead of main to allow maintainers to push to the same PR

gaiksaya · 2024-09-19T21:41:20Z

Hi,

I tried to summarize the points mentioned in this issue under what needs to improve section on the retrospective board https://github.com/orgs/opensearch-project/projects/205/views/16?filterQuery=category%3A%22v2.17.0+Retro%22+

Please feel free to review, add and update if required.
Thanks!

gaiksaya · 2024-09-21T00:25:29Z

Thank you for all your comments and suggestions. We have created issues and added the existing ones to the action items section of the board above.
Closing this one.

github-actions bot added release untriaged Issues that have not yet been triaged v2.17.0 labels Aug 3, 2024

github-project-automation bot added this to Engineering Effectiveness Board Aug 3, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Aug 3, 2024

gaiksaya removed the untriaged Issues that have not yet been triaged label Aug 15, 2024

gaiksaya self-assigned this Aug 21, 2024

gaiksaya pinned this issue Aug 28, 2024

gaiksaya mentioned this issue Sep 3, 2024

Increase build frequency for 2.17 #4988

Merged

gaiksaya mentioned this issue Sep 16, 2024

[RELEASE] Release version 2.17.0 opensearch-project/geospatial#672

Closed

23 tasks

This was referenced Sep 17, 2024

[RELEASE] Release version 2.17.0 #4908

Closed

Add 2.17 restrospective meeting opensearch-project/project-website#3321

Merged

gaiksaya mentioned this issue Sep 18, 2024

Update the build documentation and remove duplicates #5039

Merged

gaiksaya mentioned this issue Sep 21, 2024

[Bug]: Release branch needs to be cut minutes before generating first RC #5045

Closed

gaiksaya closed this as completed Sep 21, 2024

github-project-automation bot moved this from 🏗 In progress to ✅ Done in Engineering Effectiveness Board Sep 21, 2024

gaiksaya unpinned this issue Sep 21, 2024
