Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Retrospective] Release Version 2.17.0 #4909

Closed
github-actions bot opened this issue Aug 3, 2024 · 21 comments
Closed

[Retrospective] Release Version 2.17.0 #4909

github-actions bot opened this issue Aug 3, 2024 · 21 comments
Assignees

Comments

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2024

Related release issue?

#4908

How to use this issue?

Please add comments to this issue, they can be small or large in scope. Honest feedback is important to improve our processes, suggestions are also welcomed but not required.

What will happen to this issue post release?

There will be a discussion(s) about how the release went and how the next release can be improved. Then this issue will be updated with the notes of that discussion along side action items.

@github-actions github-actions bot added release untriaged Issues that have not yet been triaged v2.17.0 labels Aug 3, 2024
@gaiksaya gaiksaya removed the untriaged Issues that have not yet been triaged label Aug 15, 2024
@gaiksaya gaiksaya self-assigned this Aug 21, 2024
@gaiksaya gaiksaya pinned this issue Aug 28, 2024
@msfroh
Copy link

msfroh commented Sep 5, 2024

I think our best bet for the future is to cut the release branch on core as the "freeze" trigger. Maybe core should freeze a couple of days before plugins?

Then backports to the release branch on core would only be for critical bug fixes. This would be in comparison to the past ~24 hours, where a flood of features get merged to main, backported to 2.x, and backported to release.

With the update to release+1 across main and 2.x, it would only disrupt merging of a small number of critical bug fixes (versus the current flood of last-minute features).

@msfroh
Copy link

msfroh commented Sep 5, 2024

(Of course we should also stop trying to squeeze features in the last 24-48 hours before a release, but try telling that to people whose paychecks depend on shipping features.)

@harshavamsi
Copy link

Coming from #4908 (comment), I think that it would be nicer to have a trigger for the manifest update job to publish artifacts for branches that have been cut. Once a branch has been cut, the expectation is that the branch is ready to build and artifacts can be published. We then increment the version on 2.x to the next minor version and bwc gets an update as well. Once 2.x is onto the next minor version and has the current minor version as bwc, gradle checks on main that trigger the bwc test expect to have build artifacts ready for the current minor version. But since the manifest update runs on a cron job, it can take some time for this to happen and PRs on main start failing. This happened adter the 2.16 branch was cut as well as on 2.17.

A better way would be to trigger a manifest update along with the branch cut so that the artifacts are available immediately.

@gaiksaya
Copy link
Member

gaiksaya commented Sep 5, 2024

Delay in merging version increment PRs causing build failures and delay in RC generation.
reportsDashboards: opensearch-project/dashboards-reporting#404

@cwperks
Copy link
Member

cwperks commented Sep 10, 2024

ISM and ISM Dashboards had failing tests in 2.16, that again got flagged in 2.17 release testing. The tests were failing due to logic in ISM tests that would cleanup test suites by deleting ISM system indices.

Originally planned in 2.16 (now 2.17), the security plugin introduced a change to identify if a request matches system indices by checking a central System Index Registry from core. The index patterns in the registry are indices registered with the SystemIndexPlugin.getSystemIndexDescriptors extension point. Before 2.17, the security plugin relied on an opensearch.yml setting to be aware of the system index patterns in the cluster. The list that is provided in the demo configuration can be found here and has never included ISM indices. As a result, ISM indices were treated as regular indices (from security plugin perspective) and regular users could access the indices as any other index. Security gives special protections to system indices and forbids regular users from having direct access. Actions like writing to the index and deleting the index are strictly forbidden for regular users. Plugins needing system index access have mechanisms for accessing the system index when needed.

ISM tests were flagged in 2.17 again because the CI Checks in ISM and ISM Dashboards repos are insufficient.

The existing ISM tests with security are only operating on a subset of integ tests. Opened a PR here to address the ISM plugin issues.

ISM Dashboards has cypress tests that are only testing w/o security for PR checks. I opened a PR to address the failing checks for ISM Dashboards, but there should be a follow-up change to add a PR check to run cypress tests with security. There was also a change made in the FTR repo where there is similar test cleanup logic that removes system indices.

@vikasvb90
Copy link
Contributor

@cwperks The real problem is not ISM not making required changes on time to support breaking change pushed by security but security making breaking changes without any campaign. We have seen other instances in case of CA certs where changes were pushed by security which led to failures in other plugins. Adding cypress test in PR is still a reactive approach where a plugin spends time investigating, figures out that issue is not related to the plugin, follows up with security and gets the issue fixed. I don't see any value in this approach although I agree on the overall AI of executing tests on PR workflow runs.

@cwperks
Copy link
Member

cwperks commented Sep 11, 2024

@vikasvb90 The ISM tests are an issue that needs to be addressed. If the ism indices are supposed to be treated like regular indices then there is no need to register them with the SystemIndexPlugin.getSystemIndexDescriptors() extension point.

An index is a system index if its important to the integrity of the cluster and would cause a cluster to enter a corrupted state if deleted. For example, the security index (.opendistro_security) is a system index that contains the security posture of a cluster. Deleting the security index would render a cluster corrupt because the nodes would not be able to read the security posture.

All plugins should run integ tests with and without security to avoid catching test failures only at release time.

For the demo cert renewal, the demo certs were initially updated to modify the SAN to include IPv6 loopback address. @DarshitChanpura updated the certs and opened PRs on all repos that maintained copies of the certs, but ISM was missed in 2.13: opensearch-project/security#4061

In a future release, I'd like to run a campaign across plugins to remove copies of demo certs and instead to refer to the certs centrally. See example of how SQL plugin pulls them from the security repo here: https://github.com/opensearch-project/sql/blob/4303a2ab755d53903094dd94a5100572677a27a1/integ-test/build.gradle#L107-L111

@vikasvb90
Copy link
Contributor

@cwperks I am not questioning the intention of the change. I do agree that change needs to be there but they need to be called out proactively as a campaign or as any other mechanism. By the time plugin finds this out and calls out, it is already too late.
Also, I understand that issues happen and we can take reactive measures to fix them but attempts should at least be made to reduce their possibility by taking proactive measures before pushing breaking changes.

@DarshitChanpura
Copy link
Member

@vikasvb90 One of the biggest pro-active measures to catch these early-on is to add a CI check that runs tests with security enabled, since we run security-enabled tests for all RCs. This would enable plugin owners to debug any failures, and/or reach out to security team to get these addressed as soon as possible.

@vikasvb90
Copy link
Contributor

vikasvb90 commented Sep 12, 2024

@DarshitChanpura That's not being pro-active. If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive. And it ends up wasting a lot of dev cycles. Tests in other plugins should just be treated as last line of defense.

@gaiksaya
Copy link
Member

#5016

@msfroh
Copy link

msfroh commented Sep 16, 2024

@vikasvb90 -- what procedural change are you suggesting? You are saying no to improved CI checks in one plugin (which seems like a ridiculous argument to be making -- "No! I don't want to add more tests to cover my plugin in more scenarios!"). What do you think would be better?

Should the security plugin have an integration test that runs all plugins' integ tests with security enabled?

If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive.

Also, what's buggy about the security plugin trying to protect system indices? That seems like a good idea and the intended behavior of the change. What's the bug?

@gaiksaya
Copy link
Member

gaiksaya commented Sep 16, 2024

Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue.

OpenSearch-Dashboards. See data here

  • Core
  • observabilityDashboards
  • securityAnalyticsDashboards
  • notificationsDashboards
  • indexManagementDashboards (Had to run several times)
  • reportsDashboards (Was fixed recently but still need to check if that was the actual cause)

OpenSearch components: See data here

  • skills
  • ml-commons
  • alerting
  • index-management
  • geospatial

@vikasvb90
Copy link
Contributor

@msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.

@msfroh
Copy link

msfroh commented Sep 17, 2024

@msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.

Yeah -- that seems like a reasonable ask. I would skip asking for sign-offs, personally. I would open issues saying "We're going to do X in a few days unless we hear back from you." Essentially, it would be opt-out instead of opt-in.

@sandeshkr419
Copy link

@gaiksaya Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....

I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.

@navneet1v
Copy link
Contributor

@gaiksaya Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....

I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.

+1 on this. Core having flaky tests have been flagged so many times. @msfroh one reason of stalled PRs in core could because of gradle checks taking forever to complete and then failing due to flaky tests. I know there has been a lot of brainstorming on this, I have one suggestion that might help reducing the time for gradle checks:

  1. I think we should split core test runs into multiple CI runs to ensure that more tests can run in parallel and also if a re-run is required we are re-running only some tests.

@navneet1v
Copy link
Contributor

k-NN team had a learning in this release where we identified that all the integTests that runs during CI runs without heap CB enabled but Jenkins pipeline uses the RC to create the cluster. This RC candidate default settings is different from integTest cluster. Heap CB not enabled was one and this lead to integTest failures for k-NN in jenkins pipeline while same tests was successful in the CIs.
More details can be found on this GH issue: opensearch-project/OpenSearch#15849
I have added the capability to override the setting to enable heap CB if required by plugins ref: opensearch-project/OpenSearch#15906. But I think this is something which all plugins should enable in their CIs.

@kolchfa-aws
Copy link

kolchfa-aws commented Sep 19, 2024

The doc team received a lot of last-minute PRs and even issues that were entered beyond first RC date. We are suggesting the following to ensure that the RC entrance criteria from the docs side are met:

  • To adhere to the RC entrance criteria, provide a PR by the first RC date
  • Open a documentation issue as soon as possible for the feature you're working on so we can load-balance effectively
  • For features requiring significant documentation, communicate with the doc team in advance so we can provide support
  • For frontend features (Dashboards), provide a video demo of the feature by the first RC date
  • Work on a branch in your fork instead of main to allow maintainers to push to the same PR

@gaiksaya
Copy link
Member

Hi,

I tried to summarize the points mentioned in this issue under what needs to improve section on the retrospective board https://github.com/orgs/opensearch-project/projects/205/views/16?filterQuery=category%3A%22v2.17.0+Retro%22+

Please feel free to review, add and update if required.
Thanks!

@gaiksaya
Copy link
Member

Thank you for all your comments and suggestion. We have created issues and added the existing ones in the actions items section in the above board.
Closing this one.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Engineering Effectiveness Board Sep 21, 2024
@gaiksaya gaiksaya unpinned this issue Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ✅ Done
Development

No branches or pull requests

9 participants