Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix part of #5312, part of #59: Introduce better script execution support #5313

Merged
merged 46 commits into from
Mar 15, 2024

Conversation

BenHenning
Copy link
Member

@BenHenning BenHenning commented Jan 19, 2024

Explanation

Fixes part of #5312
Fixes part of #59

This PR helps prepare for changes coming in #5315 and #4929 (the latter of which is the start of the main upcoming Bazel migration PR chain) by introducing one main scripts-based change: ScriptBackgroundCoroutineDispatcher: a Kotlin coroutine dispatcher for executing asynchronous tasks in scripts that also supports proper Java executor service shutdown (so that scripts don't hang). This dispatcher is multi-threaded to help simplify executing large numbers of parallel background tasks.

All scripts have been migrated over to running their primary operations within the context of this new dispatcher. Relevant script utilities have been updated to use it, including CommandExecutor (though this is mainly a placeholder change for the main executor changes which are coming in #4929).

Miscellaneous details to note:

  1. A bunch of 'e.g.' typos were fixed in GenerateMavenDependenciesList.kt and wiki/Updating-Maven-Dependencies.md. These aren't functionally needed, they were just something I noticed while developing.
  2. kotlinx-coroutines-core was updated from 1.4.1 to 1.4.3 in order to work around StackOverflow in DispatchedContinuation Kotlin/kotlinx.coroutines#2371 which was causing flakiness in one of the new dispatcher tests.
  3. testClose_pendingTaskLongerThanCloseTimeout_taskIsNotRun intentionally takes ~2 seconds to run in order to provide some assurance that, without cancellation, the task would run and the test would fail (this has been manually verified in a few different situations of the dispatcher and/or test changing; some changes won't result in a failure due to how cancellation works internally for executor service & the converted coroutine dispatcher).

Note that historically these changes were originally part of #4929, but they were split out so that they could be used by #5315 (which ended up being convenient to include prior to #4929).

Essential Checklist

  • The PR title and explanation each start with "Fix #bugnum: " (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".)
  • Any changes to scripts/assets files have their rationale included in the PR explanation.
  • The PR follows the style guide.
  • The PR does not contain any unnecessary code changes from Android Studio (reference).
  • The PR is made from a branch that's not called "develop" and is up-to-date with "develop".
  • The PR is assigned to the appropriate reviewers (reference).

For UI-specific PRs only

This PR doesn't include any user-facing changes since it only impacts scripts.

These issues were found after I started using a new development
environment.
ProfileAndDeviceIdFragmentTest had been updated to use a newer fragment
initialization pattern, but that's no longer needed and seems to be
causing what appears to be timing discrepancies between local dev and
CI.
The issue ultimately arose from test parameters being initialized after
they're needed in the launched UI. This type of change was tried earlier
in the branch, but reverted since it didn't seem necessary. It is,
however, necessary when there are environment differences (e.g. local
vs. CI) or when running certain tests individually.

Due to the difficulty in finding this issue, ActivityScenarioRule has
been added as a prohibited pattern in the static regex checks (along
with ActivityTestRule since that's deprecated and discouraged, anyway).
The test was suffering from some proto encoding inconsistencies that
seem to occur between some development machines vs. on CI. The fix
improves the test's robustness by extracting the raw encoded string,
verifying that the other outputs in the intent message correctly
correspond to that string, and that the string (as a parsed proto)
contains the correct values. As a result, the test no longer depends on
a hardcoded encoding value to be present for verification. This does
result in a bit more logic than is generally good to have in a test (and
it lengthened the test code quite a bit), but it seems necessary in this
particular case.
This is a script-specific dispatcher which will allow for better
asynchronous support in upcoming PRs (especially for command execution).
This change serves to prepare for those changes.
The main change here is ensuring that Bazel 4.0.0 is used & bzlmod
disabled in newer versions of Bazel when running operations in a test
Bazel environment.

This commit also introduces some more timing tweaks on CommandExecutor
for some tests, though these only affect very specific tests (as many
script tests directly call a script's main() function and thus don't
overwrite its executor behavior).

This commit attempted to introduce "--batch" mode to runs, but the
isolation didn't actually seem to improve stability and, instead,
substantially slowed down some of the tests.
…ipt-execution-support

Conflicts:
	scripts/src/java/org/oppia/android/scripts/ci/ComputeAffectedTests.kt
	scripts/src/javatests/org/oppia/android/scripts/ci/ComputeAffectedTestsTest.kt
	scripts/src/javatests/org/oppia/android/scripts/common/BazelClientTest.kt
Plus, actually makes use of the new script background dispatcher in
CommandExecutorImpl to make the new wiring make more sense (though its
real utility will come in a follow-up PR).
Copy link

oppiabot bot commented Jan 29, 2024

Hi @BenHenning, I'm going to mark this PR as stale because it hasn't had any updates for 7 days. If no further activity occurs within 7 days, it will be automatically closed so that others can take up the issue.
If you are still working on this PR, please make a follow-up commit within 3 days (and submit it for review, if applicable). Please also let us know if you are stuck so we can help you!

@oppiabot oppiabot bot added the stale Corresponds to items that haven't seen a recent update and may be automatically closed. label Jan 29, 2024
@oppiabot oppiabot bot closed this Feb 5, 2024
This was done by removing the //testing dependency and, instead, having
instrumentation targets depend on the direct module within //testing
that they need to build. This module & its corresponding implementation
binding (and tests) needed to be moved out of //testing and into their
own /firebase package.
Also includes some minor clean-ups within TestBazelWorkspaceTest.
Copy link

oppiabot bot commented Feb 27, 2024

Hi @BenHenning, I'm going to mark this PR as stale because it hasn't had any updates for 7 days. If no further activity occurs within 7 days, it will be automatically closed so that others can take up the issue.
If you are still working on this PR, please make a follow-up commit within 3 days (and submit it for review, if applicable). Please also let us know if you are stuck so we can help you!

@oppiabot oppiabot bot added the stale Corresponds to items that haven't seen a recent update and may be automatically closed. label Feb 27, 2024
@oppiabot oppiabot bot removed the stale Corresponds to items that haven't seen a recent update and may be automatically closed. label Mar 4, 2024
Copy link
Collaborator

@adhiamboperes adhiamboperes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BenHenning!

This was pretty clear to read. I had a question, but non-blocking. PTAL.

Copy link

oppiabot bot commented Mar 7, 2024

Hi @BenHenning, this PR is ready to be merged. Please address any remaining comments prior to merging, and feel free to merge this PR once the CI checks pass and you're happy with it. Thanks!

@BenHenning
Copy link
Member Author

Latest develop changes merged cleanly, and your comment appeared to be non-blocking so I resolved it @adhiamboperes (please let me know if you do want follow-up work done here).

@BenHenning BenHenning requested a review from seanlip March 14, 2024 23:33
@BenHenning BenHenning assigned seanlip and unassigned BenHenning Mar 14, 2024
@BenHenning
Copy link
Member Author

@seanlip PTAL for codeowners.

I've also set auto-merge in case there are no problems found.

@BenHenning BenHenning enabled auto-merge (squash) March 14, 2024 23:34
Copy link
Member

@seanlip seanlip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for scripts/assets/maven_dependencies.textproto

@oppiabot oppiabot bot unassigned seanlip Mar 15, 2024
Copy link

oppiabot bot commented Mar 15, 2024

Unassigning @seanlip since they have already approved the PR.

Copy link

oppiabot bot commented Mar 15, 2024

Hi @BenHenning, this PR is ready to be merged. Please address any remaining comments prior to merging, and feel free to merge this PR once the CI checks pass and you're happy with it. Thanks!

@BenHenning BenHenning merged commit e2f94e4 into develop Mar 15, 2024
44 checks passed
@BenHenning BenHenning deleted the introduce-better-script-execution-support branch March 15, 2024 01:43
@BenHenning BenHenning restored the introduce-better-script-execution-support branch March 19, 2024 23:59
@BenHenning BenHenning deleted the introduce-better-script-execution-support branch March 20, 2024 00:00
BenHenning added a commit that referenced this pull request Mar 20, 2024
)

## Explanation
Fixes #5312

This updates the TODO open check script to be locally runnable rather
than requiring the developer to manually download the list of issues
from GitHub to analyze. This simplifies things and allows the script to
be easily run within the ``static_checks.sh`` script. This is being done
via interacting directly with GitHub using its RESTful API (see
https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues)
in conjunction with the user's local auth token used to set up their
``gh`` tool (which needs to be set up, hence the changes to the wiki
documentation and clear error messages from the new ``GitHubClient``
utility).

To further simplify things, a ``regenerate`` mode was added to
regenerate the TODO exemptions textproto file (which is helpful for
#4929 hence why this comes before that PR).

The new command syntax to perform the TODO check is:

```sh
bazel run //scripts:todo_open_check -- <path_to_dir_root> <path_to_proto_binary> [regenerate]
```

With a specific example to just perform the check:

```sh
bazel run //scripts:todo_open_check -- $(pwd) scripts/assets/todo_open_exemptions.pb
```

And an example to also perform regeneration:

```sh
bazel run //scripts:todo_open_check -- $(pwd) scripts/assets/todo_open_exemptions.pb regenerate
```

Some other things specifically to note:
- TODO exemptions needed to be updated in this PR due to TODO utility &
test changes. The new file was created using the new regenerate
functionality.
- The TODO check has been added to the end of ``static_checks.sh``.
- The GitHub CI workflow was updated to use the new script syntax
appropriately.
- This is the first time scripts have been updated to integrate with
Retrofit, and this setup is going to be reused in the future for other
services.
- The data model for issues has been updated to better represent the
remote data structure.
- Moshi is being used along with Retrofit for an easier interaction with
GitHub as a remote endpoint. All of this has been wrapped in a new
``GitHubClient``.
- ``GitHubClient`` is designed to download all issues regardless of
length (whereas before the manual download step was limited to the first
2000 issues of the repository) using pagination.
- New tests were added to verify the regenerate flow (and properly set
up the mock OkHttp server since the script now relies on an HTTP
endpoint to download the GitHub issues itself).
- ``GitHubIssue`` is exempted from tests since it's just a basic data
structure, so there's no specific logic to test.
- ``GitHubService`` is exempted from tests since it's a template to
generate code via Retrofit's annotation processor, so there's no
specific logic to test.
- All scripts proto libraries were updated to use normal Java (rather
than Java lite) generation to provide text format support.
- The file paths included in ``TodoOpenCheck``'s output has been
simplified to be relative to the repository root rather than using
absolute paths (for parity with many other app scripts).
- Since ``GitHubClient``'s tests required interacting with ``gh``, a new
``FakeCommandExecutor`` was added (along with its own tests) to provide
support for orchestrating local utilities. This may be useful in other
tests in the future, though some of those script tests intentionally
integrate with environment commands like ``git`` and ``bazel``.

## Essential Checklist
- [x] The PR title and explanation each start with "Fix #bugnum: " (If
this PR fixes part of an issue, prefix the title with "Fix part of
#bugnum: ...".)
- [x] Any changes to
[scripts/assets](https://github.com/oppia/oppia-android/tree/develop/scripts/assets)
files have their rationale included in the PR explanation.
- [x] The PR follows the [style
guide](https://github.com/oppia/oppia-android/wiki/Coding-style-guide).
- [x] The PR does not contain any unnecessary code changes from Android
Studio
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#undo-unnecessary-changes)).
- [x] The PR is made from a branch that's **not** called "develop" and
is up-to-date with "develop".
- [x] The PR is **assigned** to the appropriate reviewers
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#clarification-regarding-assignees-and-reviewers-section)).

## For UI-specific PRs only
This is an infrastructure-only PR.

---------

Co-authored-by: Adhiambo Peres <59600948+adhiamboperes@users.noreply.github.com>
Co-authored-by: Sean Lip <sean@seanlip.org>
BenHenning added a commit that referenced this pull request May 22, 2024
## Explanation
Fix part of #59
Fix part of #3926

As part of developing downstream PRs for #59, it was discovered that PRs
which change a LOT of files (such as #4937) can run into problems where
ComputeAffectedTests simply times out trying to compute the entire list
of affected targets.

**Critical performance and compatibility fixes**

There have been past efforts to optimize the affected tests workflows
(bucketing, breaking up some of the computations), but it was discovered
that the most expensive part of the process is running the
``rbuildfiles`` query to figure out which BUILD files were affected by
changed files. It was known this was an expensive step in the past, but
it wasn't clear until this PR exactly how to address it. This PR changes
the script to now:
- Filter ``rbuildfiles`` to only run under first-party targets (since it
shouldn't be possible for third-party BUILD files to be affected by
first-party changes). This reduces the search graph.
- Introduce Bazel command sharding for this step like the script already
does for others. This breaks up the command to run on a subset of files
multiple times, combining the output rather than running a single
command with a large input. It seems that ``rbuildfiles`` does not scale
linearly with its input size, so this drastically improves the script's
performance. It's thought that this approach is also more logically
correct due to more correct sibling matching semantics, but it's a bit
hard to reason about ``bazel query`` behavior at times so I can't be
100% confident in this. Nevertheless, the existing tests pass and I
haven't seen any issues from using these changes in downstream PRs.

Separately, another issue was discovered wherein some commands
(including certain Bazel commands) can actually cause
``CommandExecutorImpl`` to soft-lock and always time out. This was due
to an issue in the previous implementation wherein it would wait to read
a command's stdout until after the timeout has been completed (i.e. it
assumed the process would finish). This isn't correct, however: stdout
is blocking I/O, and some commands are implemented to only continue
execution after their standard output is read. The new implementation
makes use of coroutine actors to consume stdout and stderr at the same
time as waiting for execution to ensure all commands can continue
execution and that they finish within the desired timeout. Note that the
new ``ScriptBackgroundCoroutineDispatcher`` was actually introduced (in
#5313) specifically to support this new ``CommandExecutorImpl``
implementation, though it has since been found to have lots of other
nice benefits by providing scripts with a reliable mechanism for
performing asynchronous operations without having to manage their own
execution dispatcher.

Command execution for Bazel commands has also been updated to time out
after 5 minutes rather than the previous 1 minute. Despite the
optimizations and robustness improvements above, some commands still
take quite some time to run for especially large and complex cases.
While this change may result in a slower failure turnaround in cases
when commands are soft-locked, it should result in better CI and script
robustness in the long-term.

**Better support for chained PRs & possibly merge queues**

ComputeAffectedTests was also updated to use a merge base commit rather
than a reference to the develop branch (where this new commit hash is
provided by the CI workflow). The idea behind this is that the merge
base commit is:
- More logically correct (as ``ComputeAffectedTests`` is meant to run in
contexts where a branch wants to be merged into a destination).
- Better compatible with chained PRs. This allows for **significantly**
better CI performance in chained PR situations since now only the tests
relevant to the child PR will be run rather than all tests for the child
& its parental hierarchy (see downstream PRs' CI runs to see this
working in action).
- Hopefully better compatibility with merge queues (#3926). This hasn't
been verified, but the flexibility in customizing the destination for
affected tests should be the main prerequisite to properly supporting
merge queues.

**Other changes**

``GitClient`` was updated to have a peace-of-mind check to ensure the
base commit (provided as explained in the previous section) matches the
merge base of the current branch. This should always be true (except
maybe in merge queues--this will need to be verified). Note that this is
only a soft warning, not an assertion failure.

``RepositoryFile`` was cleaned up slightly to be a bit more consistent
with other directory management approaches done in scripts. I can see
this being refactored more in the future. Callsite behavior isn't
expected to be affected by these changes.

Some script tests were updated to have consistent formatting (which
required updating the TODO exemptions). ``TodoIssueResolvedCheckTest``
and ``TodoOpenCheckTest`` also had some of their test file management
cleaned up a bit.

**A note on testing**

These are inherently difficult things to test. I've verified what I
could via CI and general observation, but I've also largely relied on
existing tests to catch regressions (and many were caught during changes
to the script). Since these are mainly implementation and not behavioral
changes, I'm comfortable with the level of testing that was done.

## Essential Checklist
- [x] The PR title and explanation each start with "Fix #bugnum: " (If
this PR fixes part of an issue, prefix the title with "Fix part of
#bugnum: ...".)
- [x] Any changes to
[scripts/assets](https://github.com/oppia/oppia-android/tree/develop/scripts/assets)
files have their rationale included in the PR explanation.
- [x] The PR follows the [style
guide](https://github.com/oppia/oppia-android/wiki/Coding-style-guide).
- [x] The PR does not contain any unnecessary code changes from Android
Studio
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#undo-unnecessary-changes)).
- [x] The PR is made from a branch that's **not** called "develop" and
is up-to-date with "develop".
- [x] The PR is **assigned** to the appropriate reviewers
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#clarification-regarding-assignees-and-reviewers-section)).

## For UI-specific PRs only
N/A -- This is an infrastructure-only change.

---------

Co-authored-by: Adhiambo Peres <59600948+adhiamboperes@users.noreply.github.com>
Co-authored-by: Sean Lip <sean@seanlip.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants