GitHub actions and workflows #12085

Closed
wants to merge 18 commits into from

robandpdx
Contributor

@robandpdx robandpdx commented Nov 21, 2023

This pull request converts the CircleCI workflows to GitHub Actions workflows. GitHub Actions Importer was used to convert the workflows initially, then I edited them manually to correct errors in translation. Many of the CircleCI commands ended up as actions; some were no longer needed. For example, install-cmake-on-macos is no longer needed because cmake is preinstalled on macOS GitHub runners.

The GitHub actions workflows need runner groups with larger runners configured for the organization.

The following runner groups are needed:

  • large
  • xlarge
  • 2xlarge
  • 2xlargeplus

For my testing, I have these runner groups populated with the following runners:

| Group | Runner size | OS |
| --- | --- | --- |
| large | 8-cores · 32 GB RAM · 300 GB SSD | Ubuntu Latest |
| xlarge | 16-cores · 64 GB RAM · 600 GB SSD | Ubuntu Latest |
| 2xlarge | 32-cores · 128 GB RAM · 1200 GB SSD | Ubuntu Latest |
| 2xlargeplus | 64-cores · 256 GB RAM · 2040 GB SSD | Ubuntu Latest |

Issues
There are issues with some of the workflows that someone smarter than me needs to address.

  1. facebook/rocksdb/benchmark-linux -> benchmark-linux
    This job fails because it cannot find the report.tsv file. It looks like the LOG files cannot be found either:
grep: /home/runner/work/_temp/rocksdb-benchmark-datadir/LOG: No such file or directory
grep: /home/runner/work/_temp/rocksdb-benchmark-datadir/LOG: No such file or directory
failed																							overwrite.t1.s0				
Completed overwrite (ID: ) in 1 seconds
ops_sec	mb_sec	lsm_sz	blob_sz	c_wgb	w_amp	c_mbps	c_wsecs	c_csecs	b_rgb	b_wgb	usec_op	p50	p99	p99.9	p99.99	pmax	uptime	stall%	Nstall	u_cpu	s_cpu	rss	test	date	version	job_id	githash
tail: cannot open '/home/runner/work/_temp/benchmark-results/8.9.0/report.tsv' for reading: No such file or directory
cp: cannot stat '/home/runner/work/_temp/rocksdb-benchmark-datadir/LOG*': No such file or directory
gzip: /home/runner/work/_temp/benchmark-results/8.9.0/LOG*: No such file or directory
awk: fatal: cannot open file `/home/runner/work/_temp/benchmark-results/8.9.0/report.tsv' for reading: No such file or directory
awk: fatal: cannot open file `/home/runner/work/_temp/benchmark-results/8.9.0/report.tsv' for reading: No such file or directory
Traceback (most recent call last):
  File "/home/runner/work/rocksdb/rocksdb/./tools/benchmark_ci.py", line 182, in <module>
    sys.exit(main())
  File "/home/runner/work/rocksdb/rocksdb/./tools/benchmark_ci.py", line 174, in main
    results(version_str, config)
  File "/home/runner/work/rocksdb/rocksdb/./tools/benchmark_ci.py", line 100, in results
    shutil.copyfile(
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/_temp/benchmark-results/8.9.0/report.tsv'

Fixing may require edits to tools/benchmark_ci.py and/or tools/benchmark.sh.

  1. facebook/rocksdb/job-java -> build-linux-java-static
    This seems to be some issue with the docker container evolvedbinary/rocksjava:centos6_x64-be. Maybe the runner needs to be centos also, rather than ubuntu? I actually have no idea as this is super far outside of my wheelhouse.
/usr/bin/docker exec  e73953ff5fa79612d662eecb712a488cd34785d0cbf15c8557d04928a7075be7 sh -c "cat /etc/*release | grep ^ID"
/__e/node20/bin/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.7' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.16' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /__e/node20/bin/node)
  1. facebook/rocksdb/jobs-linux-arm
    Currently, GitHub hosted runners do not come in the ARM flavor. Perhaps they will in the future. For now, if we want these jobs to work, you'll need to create some ARM based self-hosted runners.

  2. facebook/rocksdb/jobs-linux-other-checks -> build-linux-mini-crashtest
    The job fails with the following message:

stderr:
Failed setting up expected state with error: IO error: No space left on device: While appending to file: /dev/shm/rocksdb._tyd/rocksdb_crashtest_expected/.LATEST.state.tmp: No space left on device

make: *** [crash_test.mk:70: blackbox_crash_test_with_atomic_flush] Error 2

Running out of space seems crazy on a machine that has 300GB of disk. I did see a warning in the Makefile that "Parallel can fill your /dev/shm" so maybe that's what's happening. Again, way outside of my expertise here.

  1. facebook/rocksdb/jobs-linux-run-tests -> build-linux
    This job seems to fail because a unit test fails when the disk fills up:
 1 FAILED TEST
parallel: Error: Output is incomplete. Cannot append to buffer file in /dev/shm/rocksdb.eE0v. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
Warning: unable to close filehandle properly: No space left on device during global destruction.
Seq	Host	Starttime	JobRuntime	Send	Receive	Exitval	Signal	Command
2	:	1700539242.394	     0.348	0	0	1	0	t/run-db_test-DBTest.FileCreationRandomFailure >& t/log-run-db_test-DBTest.FileCreationRandomFailure || bash -c "cat t/log-run-db_test-DBTest.FileCreationRandomFailure; exit $?"
make[1]: *** [Makefile:997: check_0] Error 1
17	:	1700539242.494	     0.263	0	0	1	0	t/run-write_prepared_transaction_test-OneWriteQueue-SeqAdvanceConcurrentTest.SeqAdvanceConcurrent-2 >& t/log-run-write_prepared_transaction_test-OneWriteQueue-SeqAdvanceConcurrentTest.SeqAdvanceConcurrent-2 || bash -c "cat t/log-run-write_prepared_transaction_test-OneWriteQueue-SeqAdvanceConcurrentTest.SeqAdvanceConcurrent-2; exit $?"
make[1]: Leaving directory '/__w/rocksdb/rocksdb'
make: *** [Makefile:1053: check] Error 2
Error: Process completed with exit code 2.

Yeah, filling up 1200 GB seems loco. See my comment above about the warning in the Makefile.

  1. facebook/rocksdb/jobs-linux-run-tests -> build-linux-gcc-7-with-folly
    Another disk full thing.

  2. facebook/rocksdb/jobs-linux-run-tests -> build-linux-encrypted_env-no_compression
    Another disk full thing.

  3. facebook/rocksdb/jobs-linux-run-test-san
    The 4 jobs in this workflow all seem to fail with the disk full issue.

  4. facebook/rocksdb/jobs-macos -> build-macos-cmake even tests
    This job seems to fail due to a test failure.

  5. facebook/rocksdb/nightly -> build-format-compatible
    I have no idea what is causing this failure.

  6. facebook/rocksdb/nightly -> build-linux-arm-test-full
    Need an ARM runner.

  7. facebook/rocksdb/nightly -> build-linux-microbench
    Segmentation fault.

  8. facebook/rocksdb/nightly -> build-linux-clang-13-asan-ubsan-with-folly
    Disk full.

  9. facebook/rocksdb/nightly -> build-linux-valgrind
    Disk full.
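
For the benchmark-linux failure in the list above, the traceback ends inside shutil.copyfile, which hides the fact that the benchmark never wrote a report. A minimal sketch of an early check, assuming the copy is the first place the missing file is noticed. The function name and both parameters are hypothetical and are not taken from tools/benchmark_ci.py:

```python
import shutil
from pathlib import Path


def copy_report(results_dir, dest):
    """Copy report.tsv out of results_dir, failing early with a clear message.

    results_dir and dest are hypothetical parameters for this sketch; they are
    not the names used in tools/benchmark_ci.py.
    """
    src = Path(results_dir) / "report.tsv"
    if not src.is_file():
        # Surface the real problem (the benchmark produced no report)
        # instead of a FileNotFoundError deep inside shutil.copyfile.
        raise RuntimeError(
            f"benchmark produced no report at {src}; "
            "check the benchmark output for earlier errors"
        )
    shutil.copyfile(src, dest)
    return Path(dest)
```

This does not fix the underlying failure (the missing LOG files suggest the benchmark itself did not run), but it would make the job log point at the right step.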

Other than all that, everything is working great!
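
On the "disk full" failures: the failing paths are under /dev/shm, which on Linux is normally a RAM-backed tmpfs with its own size limit, so the runner's SSD size may not matter here. Docker also gives containers a 64 MB /dev/shm by default unless a larger size is requested. Whether that is the cause here is an assumption. A small sketch that could be run as a job step to see how much space each candidate test directory has:

```python
import shutil


def free_gib(path):
    """Return (total, free) space for the filesystem holding path, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


def report(paths=("/dev/shm", "/tmp", ".")):
    """Print capacity for each path that exists; skip the ones that do not."""
    for p in paths:
        try:
            total, free = free_gib(p)
        except OSError:
            # Path is missing or not accessible on this machine.
            print(f"{p}: not available")
            continue
        print(f"{p}: {free:.1f} GiB free of {total:.1f} GiB")
```

Calling `report()` before the tests start would show whether /dev/shm is far smaller than the disk.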

https://fburl.com/workplace/f6mz6tmw

@adamretter
Collaborator

What is the purpose of moving from CircleCI to GitHub Actions?

@bigfootjon
Member

What is the purpose of moving from CircleCI to GitHub Actions?

Meta open source projects are consolidating on GitHub Actions over the next year

@adamretter
Collaborator

adamretter commented Nov 21, 2023

Meta open source projects are consolidating on GitHub Actions over the next year

@bigfootjon Thanks for replying, I was not aware of that.

facebook/rocksdb/benchmark-linux -> benchmark-linux

This is a Continuous Benchmark system we configured for RocksDB that uses a custom CircleCI runner on dedicated hardware. Are you guys aware of that? I am not sure yet how easy it will be to port this to GitHub Actions...

facebook/rocksdb/job-java -> build-linux-java-static
This seems to be some issue with the docker container evolvedbinary/rocksjava:centos6_x64-be. Maybe the runner needs to be centos also, rather than ubuntu?

This is a Linux Docker container, so it doesn't need anything particularly special from the host; we have run it just fine under Ubuntu, CentOS, and macOS.
I have no idea about the errors you are seeing on GitHub Actions here, but it looks like for some reason it is trying to run Node.js version 20+. I think that is unrelated to the Docker container.

@pdillinger pdillinger self-requested a review November 22, 2023 16:47
@pdillinger
Contributor

@robandpdx Why don't I see any results for the proposed new configuration? What are we supposed to do to debug if we can't see any results?
Screenshot 2023-11-22 at 9 07 30 AM

@robandpdx
Contributor Author

@robandpdx Why don't I see any results for the proposed new configuration? What are we supposed to do to debug if we can't see any results?

I would recommend either creating a fork or creating a new repo in this org and adding it as a remote to your clone. Then merge this branch to main in that new repo. You'll then see the workflows run and be able to troubleshoot as needed. When you get it working, squash and push the changes to the branch I'm using in this repo.

@pdillinger
Contributor

pdillinger commented Nov 29, 2023

OK I dug deeper and found that the pull request trigger on our CircleCI jobs was not migrated. The draft PR here only had a trigger on pushes to main. I've fixed that, but now I'm getting weird failures on most jobs with no diagnostics @robandpdx :
Screenshot 2023-11-29 at 2 48 34 PM

@robandpdx
Contributor Author

OK I dug deeper and found that the pull request trigger on our CircleCI jobs was not migrated. The draft PR here only had a trigger on pushes to main. I've fixed that, but now I'm getting weird failures on most jobs with no diagnostics

These jobs require runner groups as described above...

  build-linux:
    runs-on: 
      group: 2xlarge
...

@pdillinger
Contributor

but now I'm getting weird failures on most jobs with no diagnostics

Apparently the diagnostics are on the "summary" page, where you have to scroll down with your pointer NOT over the main content (where scrolling does nothing so seems to indicate there is nothing to scroll to). Clicking individual failures does not take you to the diagnostics. And "summary" seems like the worst place to put failure details when you have produced failure pages for each job.

@pdillinger
Contributor

@adamretter @robandpdx It looks like evolvedbinary's docker image doesn't work with any reasonable version of the checkout action, as seen here: https://github.com/facebook/rocksdb/actions/runs/7227817623/job/19697909322?pr=12085 Whether it's using node16 or node20, both appear to be bound to GLIBC versions (>= 2.14) unavailable to the image that indicates it is from CentOS 6 (glibc 2.12).
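
To make that comparison concrete, here is a minimal sketch that pulls the missing symbol versions out of loader errors like the ones quoted in this thread and keeps the highest one per family. The function name is hypothetical, and the regular expression is written against the message format seen in these logs only:

```python
import re

# Matches the symbol versions in loader errors such as:
#   ... version `GLIBC_2.27' not found (required by /__e/node20/bin/node)
_VERSION_RE = re.compile(
    r"version [`'](GLIBC|GLIBCXX|CXXABI)_(\d+(?:\.\d+)*)'"
)


def required_versions(log_text):
    """Return the highest missing version per symbol family found in log_text."""
    highest = {}
    for family, version in _VERSION_RE.findall(log_text):
        # Compare versions numerically, so that 2.3 sorts below 2.27.
        parsed = tuple(int(part) for part in version.split("."))
        if parsed > highest.get(family, ()):
            highest[family] = parsed
    return highest
```

Comparing `required_versions(log)["GLIBC"]` against `(2, 12)` then shows directly whether the runner-provided node binary can load inside the image.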

I don't see any GitHub Actions documentation about these kinds of limitations on docker images. Am I missing it somewhere?

And I can't even ssh in to debug, because (a) that would only get me into the docker environment, and (b) the ssh action also fails: https://github.com/facebook/rocksdb/actions/runs/7228263740/job/19697647417

What's our next step?

@robandpdx
Contributor Author

What's our next step?

@pdillinger You could run the container locally to debug. Another option is to get a reverse shell into the container running on the actions runner using the method I have published here.

@pdillinger
Contributor

pdillinger commented Dec 18, 2023

@robandpdx

You could run the container locally to debug.

Hmm, this does not make sense to me. Obviously running the commands we want to run in the container works as expected, based on the CircleCI results. The problem is GHA trying to run what it wants to run to get things set up in the container. How do I debug that locally? I haven't found any GHA documentation that seems relevant. Best I can tell, the only way to test GHA workflows is guess-and-test (on the server side).

Another option is to get a reverse shell into the container running on the actions runner ...

I don't think the debug output from the failed jobs provides sufficient context to know how to reproduce the error seen. For example, the last command seen before the failure is /usr/bin/docker exec 2fb3041c0eded458e7b8580645da667548d5e9a7503ff0204a264a4a240f9df0 sh -c "cat /etc/*release | grep ^ID" but I don't see how this would generate the failure messages about /__e/node16/bin/node: /lib64/libc.so.6: version 'GLIBC_2.16' not found (required by /__e/node16/bin/node). What is the command line to run a node-based action?

@robandpdx
Contributor Author

@pdillinger I'll see if I can get some help internally to figure out a way forward here. I agree, troubleshooting GitHub actions and workflows is less than ideal.

@adamretter
Collaborator

@adamretter @robandpdx It looks like evolvedbinary's docker image doesn't work with any reasonable version of the checkout action, as seen here: https://github.com/facebook/rocksdb/actions/runs/7227817623/job/19697909322?pr=12085 Whether it's using node16 or node20, both appear to be bound to GLIBC versions (>= 2.14) unavailable to the image that indicates it is from CentOS 6 (glibc 2.12).

There is nothing special about our Docker Image that I am aware of. Yes, the Image very intentionally packages an older version of CentOS so that we can build a version of RocksDB that has wide glibc compatibility. We do not expect anything to be executed inside the container apart from the script java/crossbuild/docker-build-linux-centos.sh or the exactly equivalent steps; this is set up in the Makefile here: https://github.com/facebook/rocksdb/blob/v8.9.1/Makefile#L2333

When running the Docker container locally, it expects the source code to be mounted from the Host via a volume bind. That is not possible in CircleCI or GitHub Actions. While the checkout step appears to work fine in CircleCI, it seems clear that the checkout step in GitHub Actions is not compatible with older glibc versions.

For GitHub Actions, I think it should be fine to remove the checkout step and replace it with a git clone plus some interpolated arguments.
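
A rough sketch of what "git clone plus some interpolated arguments" could look like, assuming git is available inside the container. GITHUB_SERVER_URL, GITHUB_REPOSITORY and GITHUB_REF are default GitHub Actions environment variables; the function names and the `dest` directory are hypothetical:

```python
import subprocess


def checkout_commands(env, dest="src"):
    """Build git commands that stand in for the checkout action.

    env is a mapping holding the default GitHub Actions variables
    GITHUB_SERVER_URL, GITHUB_REPOSITORY and GITHUB_REF. dest is a
    hypothetical target directory chosen for this sketch.
    Each entry is (argv, working_directory).
    """
    url = f"{env['GITHUB_SERVER_URL']}/{env['GITHUB_REPOSITORY']}.git"
    return [
        (["git", "clone", url, dest], None),
        # Fetch the exact ref that triggered the run, e.g. refs/pull/<n>/merge.
        (["git", "fetch", "origin", env["GITHUB_REF"]], dest),
        # Detached checkout of whatever was just fetched.
        (["git", "checkout", "FETCH_HEAD"], dest),
    ]


def run_checkout(env, dest="src"):
    """Run the commands in order, stopping at the first failure."""
    for argv, cwd in checkout_commands(env, dest):
        subprocess.run(argv, cwd=cwd, check=True)
```

Inside a job this would be called as `run_checkout(os.environ)`. Because it only shells out to git, it avoids the node-based action entirely, which is the part that fails against the old glibc.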

@robandpdx
Contributor Author

robandpdx commented Dec 19, 2023

@adamretter I think I was able to get past the checkout issue by installing git in the docker image and using it directly to clone and checkout. My workflows are running now. I'll report back when they complete.

@adamretter
Collaborator

@robandpdx I pushed new Docker images under the same tag that have git pre-installed for you.

@pdillinger
Contributor

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Dec 19, 2023
Summary: Largely based on facebook#12085 but grouped into one large workflow
because of bad GHA UI design (see comments).

Test Plan: TODO
@adamretter
Collaborator

@pdillinger So I just checked: previously, as there was no git binary inside the Docker image, it seems that CircleCI installed its own, as I see this message in the CI log:

Either git or ssh (required by git to clone through SSH) is not installed in the image. Falling back to CircleCI's native git client but the behavior may be different from official git. If this is an issue, please use an image that has official git and ssh installed.
Cloning git repository

The Docker images now have git version 1.7.1, and it seems that an incorrect argument is being passed to that version of git:

From github.com:facebook/rocksdb
 * [new branch]      refs/pull/12153/head -> origin/pull/12153
Checking out branch
error: unknown switch `B'
usage: git checkout [options] <branch>
   or: git checkout [options] [<branch>] -- <file>...

    -q, --quiet           be quiet
    -b <new branch>       branch
    -l                    log for new branch
    -t, --track           track
    -2, --ours            stage
    -3, --theirs          stage
    -f, --force           force
    -m, --merge           merge
    --conflict <style>    conflict style (merge or diff3)
    -p, --patch           select hunks interactively


exit status 129

It looks to me like -B is only present in newer versions of git. However, the git man page for version 2.42.0 says that it is transactionally equivalent to:

$ git branch -f <branch> [<start-point>]
$ git checkout <branch>

So I think we just need to send a PR to fix the current CircleCI config to be compatible with the newer Docker Images in a similar manner to what @robandpdx has done for GitHub Actions. Would you like me to prepare such a PR @pdillinger ?
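
For a hand-rolled checkout step, the two-command form quoted above could be generated like this. This is only a sketch with a hypothetical function name; it builds the argument lists and leaves running them to the caller:

```python
def checkout_reset_commands(branch, start_point=None):
    """Expand `git checkout -B <branch> [<start-point>]` into two commands.

    This mirrors the two-step form quoted from the git manual. Run as two
    separate commands, the steps are not applied as a unit: if the checkout
    fails, the branch has already been moved by `git branch -f`.
    """
    branch_cmd = ["git", "branch", "-f", branch]
    if start_point is not None:
        branch_cmd.append(start_point)
    return [branch_cmd, ["git", "checkout", branch]]
```

For example, `checkout_reset_commands("pull/12153", "origin/pull/12153")` yields a `git branch -f` call followed by a plain `git checkout`, neither of which needs the -B switch.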

@pdillinger
Contributor

@adamretter

So I think we just need to send a PR to fix the current CircleCI config to be compatible with the newer Docker Images

The call to git is not in our CircleCI config. I'm pretty sure it's built into CircleCI's checkout step, so we'd have to roll our own to get it to work. If you can get it to work, go for it!

@robandpdx
Contributor Author

The Docker images now have git version 1.7.1, and it seems that an incorrect argument is being passed to that version of git:

git 1.7.1 is super old and I'm seeing other issues when I try to fetch the remote ref. Any way to get a newer version of git on the docker image?

@adamretter
Collaborator

Any way to get a newer version of git on the docker image?

@robandpdx That is the approach I have also been taking to fix the CircleCI builds. I have been working on it yesterday, last night, and today, and I think I am almost there. It involves compiling quite a few things from source code as part of a multi-stage Docker build. I hope to have more news shortly...

@adamretter
Collaborator

adamretter commented Dec 20, 2023

@robandpdx @pdillinger Okay, I was able to publish a new Docker Image for 'CentOS 6 RocksDB Build Environment x64' that now includes the latest version of Git (2.43.0) and its dependencies, curl and nghttp2, all built from source code.

I just re-ran the CircleCI job and it is passing again - https://app.circleci.com/pipelines/github/facebook/rocksdb/35889/workflows/05ee91a4-a4ff-46f8-9331-749319c99307/jobs/719117

So hopefully we now have something that is compatible with both CircleCI and GitHub Actions?

facebook-github-bot pushed a commit that referenced this pull request Dec 21, 2023
Summary:
* Largely based on #12085 but grouped into one large workflow because of bad GHA UI design (see comments).
* Windows job details consolidated into an action file so that those jobs can easily move between per-pr-push and nightly.
* Simplify some handling of "CIRCLECI" environment and add "GITHUB_ACTIONS" in the same places
* For jobs that we want to go in pr-jobs or nightly there are disabled "candidate" workflows with draft versions of those jobs.
* ARM jobs are disabled waiting on full GHA support.
* build-linux-java-static needed some special attention to work, due to GLIBC compatibility issues (see comments).

Pull Request resolved: #12163

Test Plan:
Nightly jobs can be seen passing between these two links:
https://github.com/facebook/rocksdb/actions/runs/7266835435/job/19799390061?pr=12163
https://github.com/facebook/rocksdb/actions/runs/7269697823/job/19807724471?pr=12163

And per-PR jobs of course passing on this PR.

Reviewed By: hx235

Differential Revision: D52335810

Pulled By: pdillinger

fbshipit-source-id: bbb95196f33eabad8cddf3c6b52f4413c80e034d
@pdillinger
Contributor

Thanks @robandpdx and @adamretter . This is now obsolete with #12163

@pdillinger pdillinger closed this Dec 21, 2023