
Pkg does not complete precompilation, randomly hangs on different packages #44338

Closed
truedichotomy opened this issue Feb 25, 2022 · 8 comments · Fixed by #44345
Labels: bug (Indicates an unexpected problem or unintended behavior), regression (Regression in behavior compared to a previous version)

@truedichotomy

Since #43990 was merged into Julia master, I've noticed that Pkg precompilation often hangs on some of the packages I have installed, with no apparent pattern. It's also possible that precompilation is just taking a very long time and only appears to hang. The first package I encountered this issue with was Parsers.jl. When this happens, pressing ^C and then repeating the precompilation usually completes it for the problematic packages. The spinner animation spins indefinitely during the hanging episodes.

Here is one set:

Precompiling project...
  Progress [=======================================> ]  476/496
  ◑ StrideArraysCore
  ◐ MLJModels
^C Interrupted: Exiting precompilation...

(@ocean) pkg> update
    Updating registry at `~/.julia/registries/General.toml`
    Updating git-repo `https://github.com/Alexander-Barth/ROMS.jl`
  No Changes to `~/.julia/environments/ocean/Project.toml`
  No Changes to `~/.julia/environments/ocean/Manifest.toml`
Precompiling project...
  Progress [======================================>  ]  19/20
  ◐ MLJ
^C Interrupted: Exiting precompilation...
  18 dependencies successfully precompiled in 1098 seconds. 476 already precompiled. 1 skipped during auto due to previous errors.
  1 dependency errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the package

julia> Pkg.precompile()
Precompiling project...
  2 dependencies successfully precompiled in 12 seconds. 494 already precompiled.

And another set:

Precompiling project...
  Progress [========================================>]  525/526
  ◓ IJulia
  ✗ JuliaFormatter
^C Interrupted: Exiting precompilation...

(@ocean) pkg> update
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/.julia/environments/ocean/Project.toml`
  No Changes to `~/.julia/environments/ocean/Manifest.toml`
Precompiling project...
  1 dependency successfully precompiled in 2 seconds. 524 already precompiled. 1 skipped during auto due to previous errors.

And another set:

(@ocean) pkg> update
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/.julia/environments/ocean/Project.toml`
  No Changes to `~/.julia/environments/ocean/Manifest.toml`
Precompiling project...
  Progress [========================================>]  515/526
  ◓ GeometryBasics
  ◑ PlotThemes
^C Interrupted: Exiting precompilation...

(@ocean) pkg> update
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/.julia/environments/ocean/Project.toml`
  No Changes to `~/.julia/environments/ocean/Manifest.toml`
Precompiling project...
  11 dependencies successfully precompiled in 28 seconds. 514 already precompiled. 1 skipped during auto due to previous errors.

versioninfo():

julia> versioninfo()
Julia Version 1.9.0-DEV.84
Commit 6b29ebda35 (2022-02-24 22:52 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
@oscardssmith added the bug and regression labels on Feb 25, 2022
@truedichotomy (Author) commented Feb 25, 2022

I'm not sure yet if this is directly related, but it seems that each time this happens there are julia instances spun up that each consume nearly 100% of a CPU core. I have to kill them manually; just quitting Julia doesn't stop them.

top:

Processes: 705 total, 5 running, 700 sleeping, 3562 threads                                    23:28:45
Load Avg: 3.04, 4.55, 6.27  CPU usage: 2.66% user, 22.81% sys, 74.52% idle
SharedLibs: 833M resident, 119M data, 62M linkedit.
MemRegions: 222332 total, 6967M resident, 369M private, 3240M shared.
PhysMem: 28G used (2535M wired), 3246M unused.
VM: 261T vsize, 3780M framework vsize, 67886(0) swapins, 74240(0) swapouts.
Networks: packets: 54475963/39G in, 43433866/36G out.
Disks: 99733137/1454G read, 49669421/1195G written.

PID    COMMAND      %CPU  TIME     #TH    #WQ  #PORT MEM    PURG   CMPRS PGRP  PPID  STATE
87535  julia        99.9  29:16.49 12/1   0    23    247M   0B     80M   87011 1     running
94782  julia        99.7  11:39.52 12/1   0    23    255M   0B     0B    94355 94355 running
0      kernel_task  11.0  13:38:08 566/11 0    0     2160K  0B     0B    0     0     running

@KristofferC added this to the 1.8 milestone on Feb 25, 2022
@fredrikekre (Member)

Do you also get errors when triggering precompilation with using instead of Pkg.precompile?
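
For reference, a minimal illustration of the two paths being contrasted (Parsers is used here only because it appears in the report above):

import Pkg
Pkg.precompile()   # precompiles every package in the environment, using parallel worker processes

using Parsers      # precompiles only Parsers and any stale dependencies, one at a time, as it is loaded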

@timholy (Member) commented Feb 25, 2022

I've set up a project with these contents:

(pkgs) pkg> st
Status `/tmp/tim/pkgs/Project.toml`
  [5c1252a2] GeometryBasics v0.4.1
  [7073ff75] IJulia v1.23.2
  [98e50ef6] JuliaFormatter v0.22.4
  [add582a8] MLJ v0.17.1
  [d491faf4] MLJModels v0.15.4
  [ccf2f8ad] PlotThemes v2.0.1
  [7792a7ef] StrideArraysCore v0.2.12

These are all the packages you show in your demos above. I've not yet been able to reproduce it. (I'm running under rr, so as soon as I reproduce it we should be able to diagnose the problem.) Can you share more details about your specific environment? (E.g., the Project.toml file.)
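
A minimal sketch of setting up that kind of throwaway reproduction environment (the path matches the Status output above and is purely illustrative):

import Pkg
Pkg.activate("/tmp/tim/pkgs")   # throwaway environment; any empty directory works
Pkg.add(["GeometryBasics", "IJulia", "JuliaFormatter", "MLJ",
         "MLJModels", "PlotThemes", "StrideArraysCore"])
Pkg.precompile()                # repeat until the hang reproduces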

@timholy (Member) commented Feb 25, 2022

I seem to have captured a trace. What I did was keep deleting all the files in ~/.julia/compiled/v1.9 and rerunning precompilation. It got stuck on Crayons, so I searched for a running julia process with output-incremental=yes in its command line and checked which PID was precompiling Crayons. Then I attached gdb to that PID and got this backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95	../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007fd47d754492 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3417
#2  0x00007fd47b64b597 in jl_getFunctionInfo_impl (frames_out=0x7fd471c888a8, pointer=140550614607693, skipC=0, noInline=0) at /home/tim/src/julia-master/src/debuginfo.cpp:1175
#3  0x00007fd47c907f80 in jl_print_native_codeloc (ip=140550614607693) at /home/tim/src/julia-master/src/stackwalk.c:636
#4  0x00007fd47c90806f in jl_print_bt_entry_codeloc (bt_entry=0x7fd47e038010) at /home/tim/src/julia-master/src/stackwalk.c:657
#5  0x00007fd47c923360 in jl_critical_error (sig=11, context=0x7fd471c88ac0, ct=0x7fd47218c010) at /home/tim/src/julia-master/src/signal-handling.c:324
#6  0x00007fd47c921684 in sigdie_handler (sig=11, info=0x7fd471c88bf0, context=0x7fd471c88ac0) at /home/tim/src/julia-master/src/signals-unix.c:234
#7  0x00007fd47c921935 in segv_handler (sig=11, info=0x7fd471c88bf0, context=0x7fd471c88ac0) at /home/tim/src/julia-master/src/signals-unix.c:358
#8  <signal handler called>
#9  _int_malloc (av=av@entry=0x7fd47daa5c40 <main_arena>, bytes=bytes@entry=4096) at malloc.c:3789
#10 0x00007fd47d7513cd in __GI___libc_malloc (bytes=4096) at malloc.c:3075
#11 0x00007fd47c974ded in ptrhash_lookup_bp_r (h=0x7fd47cd93b80 <external_mis>, key=0x7fd4740a8510, ctx=0x0) at /home/tim/src/julia-master/src/support/ptrhash.c:26
#12 0x00007fd47c974f5e in ptrhash_put_r (h=0x7fd47cd93b80 <external_mis>, key=0x7fd4740a8510, val=0x7fd4740a8510, ctx=0x0) at /home/tim/src/julia-master/src/support/ptrhash.c:26
#13 0x00007fd47c975213 in ptrhash_put (h=0x7fd47cd93b80 <external_mis>, key=0x7fd4740a8510, val=0x7fd4740a8510) at /home/tim/src/julia-master/src/support/ptrhash.c:26
#14 0x00007fd47c8d1ae7 in queue_external_mis (list=0x7fd46c52e910 <jl_system_image_data+74229520>) at /home/tim/src/julia-master/src/dump.c:307
#15 0x00007fd47c8dc12b in ijl_save_incremental (fname=0x564191c155a0 "/home/tim/.julia/compiled/v1.9/Crayons/jl_JgafZ8", worklist=0x7fd47445d810) at /home/tim/src/julia-master/src/dump.c:2659
#16 0x00007fd47c9028f8 in jl_write_compiler_output () at /home/tim/src/julia-master/src/precompile.c:65
#17 0x00007fd47c8c6ac6 in ijl_atexit_hook (exitcode=0) at /home/tim/src/julia-master/src/init.c:207
#18 0x00007fd47c920c48 in jl_repl_entrypoint (argc=1, argv=0x7ffe8b0260d8) at /home/tim/src/julia-master/src/jlapi.c:707
#19 0x00007fd47d4abbe0 in jl_load_repl (argc=14, argv=0x7ffe8b0260d8) at /home/tim/src/julia-master/cli/loader_lib.c:271
#20 0x000056419176099e in main (argc=14, argv=0x7ffe8b0260d8) at /home/tim/src/julia-master/cli/loader_exe.c:59
(gdb) f 14
#14 0x00007fd47c8d1ae7 in queue_external_mis (list=0x7fd46c52e910 <jl_system_image_data+74229520>) at /home/tim/src/julia-master/src/dump.c:307
307	                            ptrhash_put(&external_mis, mi, mi);

It looks like there's a crash (a segfault) inside malloc, and the crash handler's attempt to print a backtrace then allocates again and blocks in __lll_lock_wait_private, so the process ends up hanging (for me the CPU is not running at high load, consistent with it waiting on lock acquisition).

I don't understand the fundamental issue, but there seems to be a pretty easy workaround: use the C equivalent of sizehint! to ensure there's enough space in that hash table from the start. I have implemented this and am about to test it.
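
For readers unfamiliar with sizehint!, the idea in Julia terms is sketched below; this is only an analogy, since the actual workaround lives in the C runtime's hash table (external_mis in src/dump.c), not in Julia code:

# Reserve capacity up front so the table never has to grow in the middle
# of a bulk-insertion loop.
d = Dict{Any,Any}()
sizehint!(d, 10_000)   # analogous to pre-sizing the C hash table before it is filled
for i in 1:10_000
    d[i] = i           # insertions should not trigger a rehash now
end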

@truedichotomy (Author)

I realize a fix is already being tested, but just to address the earlier question, I have the following in my Project.toml:

[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
Clustering = "aaaa29a8-35af-508c-8bc3-b662a17a0fe5"
ColorSchemes = "35d6a980-a343-548e-a6ea-1d62b119f2f4"
Colors = "5ae59095-9a9b-59fe-a467-6f913c188581"
Conda = "8f4d0f93-b110-5947-807f-2305c1781a2d"
Dash = "1b08a953-4be3-4667-9a23-3db579824955"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DifferentialEquations = "0c46a032-eb83-5123-abaf-570d42b7fbaa"
DimensionalData = "0703355e-b756-11e9-17c0-8b28908087d0"
Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
FITSIO = "525bcba6-941b-5504-bd06-fd0dc1a4d2eb"
Feather = "becb17da-46f6-5d3c-ad1b-1c5fe96bc73c"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
Flux3D = "432009dd-59a1-4b72-8c93-6462ce9b220f"
Formatting = "59287772-0a20-5a39-b81b-1366585eb4c0"
GDAL = "add2ef01-049f-52c4-9ee2-e494f65e021a"
GMT = "5752ebe1-31b9-557e-87aa-f909b540aa54"
GRIB = "b16dfd50-4035-11e9-28d4-9dfe17e6779b"
Genie = "c43c736e-a2d1-11e8-161f-af95117fbd1e"
GeoDatasets = "ddc7317b-88db-5cb5-a849-8449e5df04f9"
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
GeoJSON = "61d90e0f-e114-555e-ac52-39dfb47a3ef9"
Geodesy = "0ef565a4-170c-5f04-8de2-149903a85f3d"
GitHub = "bc5e4493-9b4d-5f90-b8aa-2b2bcaad7a26"
Glob = "c27321d9-0574-5035-807b-f59d2c89b15c"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
Interpolations = "a98d9a8b-a2ab-59e6-89dd-64a1c18fca59"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
LLVM = "929cbde3-209d-540e-8aea-75f648917ca0"
LibSerialPort = "a05a14c7-6e3b-5ba9-90a2-45558833e1df"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
Measurements = "eff96d63-e80a-5855-80a2-b1b0885c5ab7"
Measures = "442fdcdd-2543-5da2-b0f3-8c86c306513e"
Metalhead = "dbeba491-748d-5e0e-a39e-b530a07fa0cc"
Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
NCDatasets = "85f8d34a-cbdd-5861-8df4-14fed0d494ab"
NaNMath = "77ba4419-2d1f-58cd-9bb1-8ffee604a2e3"
NetCDF = "30363a11-5582-574a-97bb-aa9a979735b9"
Oceananigans = "9e8cae18-63c1-5223-a75c-80ca9d6e9a09"
OnlineStats = "a15396b6-48d5-5d58-9928-6d29437db91e"
Optim = "429524aa-4258-5aef-a3af-852621145aeb"
Pandas = "eadc2687-ae89-51f9-a5d9-86b5a6373a9c"
Parsers = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
PhysOcean = "3725be50-bbbd-5592-92c3-2f0e82159c3e"
PlotlyBase = "a03496cd-edff-5a9b-9e67-9cda94a718b5"
PlotlyJS = "f0f68f2c-4968-5e81-91da-67840de0976a"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
Pluto = "c3e4b0f8-55cb-11ea-2926-15256bba5781"
Preferences = "21216c6a-2e73-6563-6e65-726566657250"
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
Quaternions = "94ee1d12-ae83-5a48-8b1c-48b8ff168ae0"
ROMS = "da6b2582-fac1-11ea-b7c4-a321ee60f94e"
RemoteS = "26020ff4-e92a-488e-a723-09c28e89e04a"
ScikitLearn = "3646fa90-6ef7-5e7e-9f22-8aca16db6324"
ShallowWaters = "56019723-2d87-4a65-81ff-59d5d8913e3c"
Shapefile = "8e980c4a-a4fe-5da2-b3a7-4b4b0353a2f4"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TimeZones = "f269a46b-ccf7-5d73-abea-4c690281aa53"
WebIO = "0f1e0344-ec1d-5b48-a673-e5cf874b6c29"
XGBoost = "009559a3-9522-5dbb-924b-0b6ed2b22bb9"
Zarr = "0a941bbe-ad1d-11e8-39d9-ab76183a1d99"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"

[preferences.NetCDF_jll]
libnetcdf_path = "/Users/gong/miniforge3/lib/libnetcdf.19.dylib"

I think the number of packages is large enough that when I precompile them the bug is almost always triggered at some point, usually toward the end of precompilation.

@timholy (Member) commented Feb 25, 2022

This is very helpful. While I can't quite precompile all those packages (NetCDF and GMT seem to require some extra configuration, and I'm not bothering with that), with #44345 I don't get the hang in 5 runs. So I think this will be fixed once we merge that PR.

@truedichotomy (Author)

I can confirm that the runaway-CPU condition is directly associated with the failed/hung precompilations. When I kill those racing instances, precompilation marks the packages it was hung on as failed (cross mark) and moves on to the next package.

@IanButterworth (Member) commented Feb 25, 2022

In case it's helpful to report, I just hit this after wiping the compile cache dir

% julia
[ Info: Precompiling Revise [295af30f-e4ad-537b-8983-00126c2a3abe]
julia(84886,0x114fad600) malloc: Incorrect checksum for freed object 0x7fa3ba9132d8: probably modified after being freed.
Corrupt value: 0x3
julia(84886,0x114fad600) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at none:0

signal (4): Illegal instruction: 4
in expression starting at none:0
┌ Warning: Failed to precompile Revise [295af30f-e4ad-537b-8983-00126c2a3abe] to /Users/ian/.julia/compiled/v1.9/Revise/jl_Gxk3FC.
└ @ Main ~/.julia/config/startup.jl:8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.93 (2022-02-25)
 _/ |\__'_|_|_|\__'_|  |  Commit a2e4226d2e (0 days old master)
|__/                   |

julia> 
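
Both reproduction attempts above involve wiping the compile cache so that everything has to precompile again. A hedged sketch of that step, assuming the default depot location (note that it deletes every cached .ji file for this Julia version):

import Pkg
compiled = joinpath(DEPOT_PATH[1], "compiled", "v$(VERSION.major).$(VERSION.minor)")
rm(compiled; recursive=true, force=true)   # e.g. ~/.julia/compiled/v1.9
Pkg.precompile()                           # force a full precompile from scratch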
