CI tests failing with Julia nightly / Ubuntu #2187

Closed
fingolfin opened this issue Apr 2, 2023 · 14 comments
Labels
testsuite Everything concerning testsuite

Comments

@fingolfin
Member

Sometimes the failures are timeouts (with the Ubuntu run taking much longer than the macOS run), sometimes they are crashes, e.g. in this run or this one:

Test Summary:                          | Pass  Total  Time
primary decomposition of ideal sheaves |   14     14  7.7s
-> Testing AlgebraicGeometry/Schemes/IdealSheaves.jl took: runtime 23.529 seconds + compilation 13.797 seconds + recompilation 0.000 seconds, 740.78 MB

[1864] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

For yet another run we have this crash log:

Test Summary:       | Pass  Total  Time
Oscar.blow_up_chart |    6      6  4.6s

[1866] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

Perhaps there was a change in Julia master that requires a rebuild of libjulia_jll and all the JLLs made with it.

Perhaps something worse changed, sigh.

It was still passing in

The first failures I see are in:

that could narrow down which Julia changes are responsible (if any), though of course it might also be that we were just lucky those times...


It seems that for Julia's Nanosoldier the package tests are limited to 45 minutes. As a result, Oscar is currently failing Nanosoldier testing anyway. That's obviously bad for us, because it means the Julia developers won't notice if they break Oscar. I guess that's yet another strong reason to make our (default) test suite faster out of the box...

@fingolfin
Member Author

Ah, the first two failing runs I linked above used Julia built from JuliaLang/julia@8b19f5fa276 and JuliaLang/julia@1eee6ef7c83, respectively. The last good one I could locate used JuliaLang/julia@cd394eb68d4, which directly precedes JuliaLang/julia@8b19f5fa276 ...

That means the failures started with JuliaLang/julia@8b19f5fa276, which in turn comes from JuliaLang/julia#49185.

@fingolfin
Member Author

(Though this could still be a coincidence; even if Julia causes it, it could be a previous PR that introduced the problem and just doesn't always fail. Or I might have screwed up the analysis somehow)

@benlorenz
Member

benlorenz commented Apr 2, 2023

These errors look very much like a memory problem; there were also some runs where some of the schemes tests took extremely long, which might also have been due to memory starvation.

The commit you mentioned (or at least parts of it) was backported to 1.9-rc2, which was tagged yesterday but has not shown up in the GitHub Actions runs yet. So if it is to blame, we should soon see this happen for 1.9 as well.

Edit: I found the extremely long one again:

Test Summary:                                      | Pass  Total      Time
resolution of singularities of elliptic fibrations |    3      3  50m09.2s
-> Testing AlgebraicGeometry/Schemes/K3.jl took: runtime 2935.774 seconds + compilation 73.669 seconds + recompilation 0.000 seconds, 39549.45 MB
Test Summary: | Pass  Total     Time
Toric schemes |   13     13  5m51.4s
-> Testing AlgebraicGeometry/Schemes/ToricSchemes.jl took: runtime 130.279 seconds + compilation 224.928 seconds + recompilation 0.000 seconds, 1141.45 MB

In a successful one also the toric schemes file, that runs after the K3 tests, took a lot shorter:

Test Summary:                                      | Pass  Total     Time
resolution of singularities of elliptic fibrations |    3      3  7m03.1s
-> Testing AlgebraicGeometry/Schemes/K3.jl took: runtime 387.557 seconds + compilation 35.557 seconds + recompilation 0.000 seconds, 39559.17 MB
Test Summary: | Pass  Total     Time
Toric schemes |   13     13  1m50.0s
-> Testing AlgebraicGeometry/Schemes/ToricSchemes.jl took: runtime 87.058 seconds + compilation 22.924 seconds + recompilation 0.000 seconds, 964.97 MB

@benlorenz
Member

I did some tests in a memory-restricted cgroup (to about 9.5GB) and got this crash:

-> Testing AlgebraicGeometry/Schemes/WeilDivisor.jl took: runtime 6.341 seconds + compilation 9.645 seconds + recompilation 0.000 seconds, 1839.21 MB
Test Summary: | Pass  Total  Time
blowups       |    2      2  2.8s
Test Summary:       | Pass  Total  Time
Oscar.blow_up_chart |    6      6  0.6s
GC: pause 4853.59ms. collected 2331.602373MB. full
Killed

This matches one of the crashes from the CI failures, except that the error message is different; mine was killed by the cgroup OOM killer:

[23582.938724] Memory cgroup out of memory: Killed process 15123 (julia) total-vm:14826076kB, anon-rss:9201796kB, file-rss:372256kB, shmem-rss:0kB, UID:2779 pgtables:21704kB oom_score_adj:0

In CI we get Terminated messages instead, but the cause might be similar. Note that the Ubuntu runners only have 7GB of physical memory
(the macOS runners have 14GB, which might explain why we don't see this there).
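
For reference, these numbers can be double-checked from within Julia itself; a minimal sketch, not part of the Oscar test suite:

println("total memory: ", round(Sys.total_memory() / 2^30; digits = 2), " GiB")   # physical memory of the machine
println("free memory:  ", round(Sys.free_memory() / 2^30; digits = 2), " GiB")    # currently free memory
println("max RSS:      ", round(Sys.maxrss() / 2^30; digits = 2), " GiB")         # peak resident set size of this process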

@benlorenz
Member

I did some more testing:

  • The good news is that even though these commits were backported to 1.9, version 1.9.0-rc2 seems unaffected (both in my local tests and in the CI).
  • The bad news is that this means more commits are involved in triggering this, and I don't really know which ones.

With Julia 1.9-rc2 I can run the full Oscar test suite in a cgroup limited to 6GB; here are stats from close to the end of the tests:

GC live:    1440.386 MiB 
JIT:          74.373 MiB 
Max. RSS:   5675.137 MiB 
Free mem:     96.613 MiB 
...
     Testing Oscar tests passed

If I try the same with Julia master, it fails like this:

GC live:    3460.915 MiB
JIT:          19.764 MiB
Max. RSS:   5959.289 MiB
Free mem:      0.090 MiB
...
ERROR: Package Oscar errored during testing (received signal: KILL)

And even if I increase the cgroup limit to 10GB it still fails:

GC live:    4345.790 MiB
JIT:          56.893 MiB
Max. RSS:   9572.812 MiB
Free mem:     99.105 MiB
Test Summary:                | Pass  Total  Time
fraction fields of varieties |    6      6  0.3s
ERROR: Package Oscar errored during testing (received signal: KILL)

Still on Julia master, but once I revert the second commit of JuliaLang/julia#49185 the tests pass again:

GC live:    1400.914 MiB
JIT:          74.460 MiB
Max. RSS:   5743.348 MiB
Free mem:     80.625 MiB
...
     Testing Oscar tests passed

@gbaraldi

gbaraldi commented Apr 6, 2023

This is quite odd because rc2 has both commits, so maybe there are two things going on; see JuliaLang/julia#48935 (it's the last two commits there). It does look like we might be swapping or something like that.

@gbaraldi

gbaraldi commented Apr 6, 2023

@benlorenz can you try both cases with GC.enable_logging() and possibly also 1.9-rc1? Maybe even 1.8?
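
For reference, the GC pause logging can be toggled from within the session; a minimal sketch (available since Julia 1.8):

# Print a line like "GC: pause ...ms. collected ...MB. full" to stderr
# for every collection while the workload runs.
GC.enable_logging(true)
# ... run the tests or workload under investigation ...
GC.enable_logging(false)   # turn the logging off again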

@benlorenz
Member

benlorenz commented Apr 8, 2023

@gbaraldi Sorry for the delay, I have now created logs including the GC: pause ... output for master, master with revert of 450c1faf, 1.9.0-rc2, and 1.9.0-rc1:

https://gist.github.com/benlorenz/379cb72e6760d2f0d6f71e1e10fa082a
(I recommend downloading them as a zip and reading them locally; the GitHub web page seems to struggle with the large log files.)

I hope the filenames are self-explanatory, all runs that were killed by the oom-killer should have Killed in the filename and include the corresponding line from the kernel log at the end.

cgroups

The 6gb logs were run with:

echo 6000000000 >> /sys/fs/cgroup/memory/julialimit/memory.limit_in_bytes
echo 6500000000 >> /sys/fs/cgroup/memory/julialimit/memory.memsw.limit_in_bytes

The 10gb logs with 10000000000 (10500000000) as limit.
I added the memsw limit to avoid too much swapping. More swap is probably the reason why the GitHub CI jobs don't die as quickly but struggle along until they are killed after the 2.5 hour timeout. The usual runtime for the full test suite on GitHub runners is around 1-1.5 hours.

other stats output in the logs

For each file in the Oscar test suite the output contains a short memory summary printed before the include, and a line similar to @time output after the file is done.
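
A minimal sketch of how such a summary can be produced with stock Julia functions (the actual helper used for these logs may differ; Base.jit_total_bytes requires Julia 1.9 or newer):

function print_memory_summary()
    # Stats corresponding to the "GC live", "JIT", "Max. RSS" and "Free mem"
    # lines seen in the logs, all converted from bytes to MiB.
    mib(x) = round(x / 2^20; digits = 3)
    println("GC live:  ", mib(Base.gc_live_bytes()), " MiB")
    println("JIT:      ", mib(Base.jit_total_bytes()), " MiB")   # Julia >= 1.9
    println("Max. RSS: ", mib(Sys.maxrss()), " MiB")
    println("Free mem: ", mib(Sys.free_memory()), " MiB")
end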

retries for 1.9rc1 and 1.9rc2

For both 1.9 versions the tests oom-crashed during the first try with 6gb but succeeded on the second try.

failed runs on master

For master I tried 6gb and 10gb six times each and all of them failed; the 6gb runs failed quite early (around 1000 lines of log output, compared to about 3000 for a successful run).

master+revert

For the current tests I reverted only the "Fix interval for many pointers" commit, as this sufficed to avoid the issues.
I could also repeat the tests with both commits from that PR reverted, or with only the other one ("Add bit to the GC tag") reverted.

v1.8

I also tried running it on 1.8, but that did not work with my cgroup setup, as Julia 1.8 seems to use the free physical memory instead of the cgroup-constrained memory:

Free mem:  116958.266 MiB
Free pmem: 116958.266 MiB

The job is then killed after only a few minutes.
Maybe I can get it to run if I manage to switch to a proper cgroupv2 configuration instead of the unified setup my system seems to be using currently. I will try this in the next few days.

github actions

I also ran the Oscar test suite with GC.enable_logging and the memory stats on GitHub Actions here: https://github.com/oscar-system/Oscar.jl/actions/runs/4591003902/jobs/8107508082
This run did not fail directly from running out of memory, but probably because it was swapping hard and then some timing test failed (horribly) with 140 seconds instead of less than 10.
I have attached the file here to make sure the log doesn't disappear:
github-actions-oscar-julia-nightly-swapping.txt
This contains some very big pauses:

GC: pause 128272.24ms. collected 2852.374413MB. full 

Edit:

Until I can run the 1.8 tests locally, here is a run with GC logging on GitHub Actions for 1.8.5, 1.9-rc2, and nightly. (Nightly is still running with very long GC pauses...)

@gbaraldi

It seems my heuristic might be a bit too greedy, which can be adjusted, but I'm not sure why it's so much worse on master. I would look into why master is allocating so much more memory, because that seems to be a separate regression.

@fingolfin
Member Author

Well, we bisected it to that commit, but of course it is always possible that the issue we are experiencing is caused by multiple commits combined; then the commit we flagged is just the last puzzle piece, but maybe not the core cause?

@fingolfin
Member Author

Of course there is also always the possibility that there were API/ABI changes in the Julia kernel and we need to recompile all our JLLs for latest Julia (and for that we need a new libjulia_jll).

However, the fact that Julia nightly on macOS shows no regression is peculiar and kinda contradicts that theory. Hmm.

@gbaraldi

gbaraldi commented Apr 21, 2023

The change did make the GC a bit more greedy because in a specific workload it was being too aggressive; maybe here it's the opposite and it's being a bit too lazy. I will take a further look.

Regarding "GC: pause 128272.24ms. collected 2852.374413MB. full": we probably should have run a GC earlier there.
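
A pragmatic mitigation on the package side, while the heuristic is being tuned, would be to force a full collection between memory-heavy test files; a minimal sketch, not necessarily what Oscar does:

# Force a full collection instead of waiting for the heuristic to trigger one,
# and report how much the live set shrank.
before = Base.gc_live_bytes()
GC.gc(true)    # `true` requests a full collection
freed = (before - Base.gc_live_bytes()) / 2^20
println("full GC freed roughly ", round(freed; digits = 1), " MiB")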

@benlorenz
Member

The latest changes to the GC on master seem to have fixed (or at least significantly improved) the issue; maybe it was JuliaLang/julia#49315?

From last night, this one took 1h 44min

Julia Version 1.10.0-DEV.1081
Commit 499647e95d7 (2023-04-20 15:53 UTC)

while a few commits later with

Julia Version 1.10.0-DEV.1084
Commit f84fb5b26f8 (2023-04-20 16:52 UTC)

the tests ran in 59 minutes. All following runs were similar; no more crashes so far. That approximately one-hour duration matches the other Julia versions (only 1.6 is significantly slower, since it includes the doctests).

I have run it in my memory-constrained cgroup again and it also succeeded 3 times in a row with the low limit of 6GB.

@benlorenz
Member

The latest changes to the GC on master seem to have fixed (or at least significantly improved) the issue; maybe it was JuliaLang/julia#49315?

I don't understand the details of how this works, but this PR did fix the issue. I haven't seen any timeouts in the past week.
