Error when running distributed processes #3

Open
TragerJoswig-Jones opened this issue Oct 18, 2021 · 3 comments
Labels
bug Something isn't working

TragerJoswig-Jones commented Oct 18, 2021

When running distributed functions on certain computers, an EXCEPTION_ACCESS_VIOLATION or ReadOnlyMemoryError occurs on one or more worker processes. This seems to occur when running the line

Distributed.@everywhere using OPFLearn

with multiple workers that have not already had OPFLearn initialized.
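
One way to check whether simultaneous loading is the trigger would be to load the package on one worker at a time instead of on all of them at once. The following is a minimal sketch only: `load_sequentially` is a hypothetical helper, not part of OPFLearn, and nothing in this issue confirms that it avoids the crash.

```julia
using Distributed

# Hypothetical helper (not part of OPFLearn): evaluate `using <pkg>` on each
# process in turn, waiting for one to finish before starting the next.
function load_sequentially(pkg::Symbol, pids = workers())
    # Build the expression `using <pkg>` explicitly.
    ex = Expr(:using, Expr(:., pkg))
    for p in pids
        # Core.eval runs the expression in Main on process p;
        # remotecall_wait blocks until that process has finished.
        remotecall_wait(Core.eval, p, Main, ex)
    end
    return pids
end

# Usage (assumes OPFLearn is installed in the active environment):
# addprocs(4)
# load_sequentially(:OPFLearn)
```

If the sequential load succeeds where `Distributed.@everywhere using OPFLearn` fails, that would point at concurrent initialization rather than at the later distributed work.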

Example Error:

julia> include("dist_test.jl")
      From worker 8:
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 8:    unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 8:    unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 8:
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 8:
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 8:
      From worker 8:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 8:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 8:    in expression starting at none:1
      From worker 7:
      From worker 7:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 7:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 7:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 7:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 7:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 7:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 7:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 7:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 7:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 7:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 7:    in expression starting0000none:1
      From worker 7:    �0000)
      From worker 7:    �n expression starting at none:1
      From worker 7:    unknown function (ip: 00000000000unknown function (ip: 0000000000000000)
      From worker 7:    in expression starting at none:1
      From worker 7:    �0000)
      From worker 7:    in expression starting at none:1
      From worker 7:    in expression starting at none:1
      From worker 6:
      From worker 6:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 6:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 6:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 6:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 6:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 6:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 --
      From worker 6:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 6:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
      From worker 6:    in expression starting at none:1
      From worker 6:    unknown function (ip: 0000000000000000)
      From worker 6:    in expression starting at none:1
      From worker 6:    �0000)
      From worker 6:    in expression starting at none:1
      From worker 6:    in expression starting at none:1
Worker 8 terminated.
Worker 7 terminated.
Worker 6 terminated.
ERROR: LoadError: Distributed.ProcessExitedException(6)

...and 2 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:369
 [2] macro expansion
   @ .\task.jl:388 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\macros.jl:223
 [4] top-level scope
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\macros.jl:207
 [5] include(fname::String)
   @ Base.MainInclude .\client.jl:444
 [6] top-level scope
   @ REPL[2]:1
in expression starting at C:\Users\joswi\Documents\SULI_Sum21\test_env\dist_test.jl:11
TragerJoswig-Jones commented Oct 18, 2021

This seems to be due to IPOPT or other packages in use not being thread-safe. Currently, the test suite uses Sys.CPU_THREADS - 1 as the number of worker processes to create and use during distributed processing.

Testing on a PC with an AMD FX(tm)-8350 Eight-Core Processor, 4000 MHz, 4 Core(s), 8 Logical Processor(s), I can reliably run the distributed sample creation functions with nproc specified as 5 (this results in 4 workers being used). Increasing this number by one resulted in a ReadOnlyMemoryError().

On a laptop with an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, 2112 MHz, 4 Core(s), 8 Logical Processor(s), nproc=8 works consistently, even though the processor has the same number of cores.

Note that using fewer workers (at most the number of physical cores) does not appear to prevent the same error when dist_create_samples is run from the test suite, as opposed to running the commands manually.
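
For experimenting with different worker counts, the relationship described above (nproc counts the master process, so nproc = 5 gives 4 workers) can be written down as a small helper. This is a sketch under assumptions: `worker_count` and its `cap` keyword are hypothetical names, and the default cap simply mirrors the Sys.CPU_THREADS - 1 value the test suite currently uses.

```julia
using Distributed

# Hypothetical helper: number of workers for a given nproc, where nproc
# counts the master process plus the workers. The result is kept between
# 1 and `cap` (the cap itself is treated as at least 1).
function worker_count(nproc::Integer; cap::Integer = Sys.CPU_THREADS - 1)
    return clamp(nproc - 1, 1, max(cap, 1))
end

# Usage: start only as many workers as the cap allows.
# addprocs(worker_count(5; cap = 4))
```

Lowering `cap` one step at a time on an affected machine would show the largest worker count that still runs reliably there.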

TragerJoswig-Jones commented Nov 12, 2021

Possibly relevant when trying to automatically set up distributed processes: https://github.com/lanl-ansi/PowerModelsSecurityConstrained.jl/blob/master/src/scripts/distributed.jl
