Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DictChannel not defined error #16091

Closed
tkelman opened this issue Apr 28, 2016 · 31 comments
Closed

DictChannel not defined error #16091

tkelman opened this issue Apr 28, 2016 · 31 comments
Assignees
Labels
bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@tkelman
Copy link
Contributor

tkelman commented Apr 28, 2016

@tkelman tkelman added the test This change adds or pertains to unit tests label Apr 28, 2016
@amitmurthy
Copy link
Contributor

The change seems harmless.....I would suspect a race condition. Any recent changes to the code loading logic for Windows?

@amitmurthy
Copy link
Contributor

Is there a way to search all AV logs under Julia for "LoadError: UndefVarError: DictChannel not defined" to find the earliest logged instance on this error?

@tkelman
Copy link
Contributor Author

tkelman commented Apr 28, 2016

I believe so, there is an API reference at http://www.appveyor.com/docs/api/projects-builds though our history before we switched from the StefanKarpinski account to the JuliaLang account was unfortunately deleted.

@tkelman
Copy link
Contributor Author

tkelman commented May 1, 2016

might have something to do with heavy load, it happened on an osx buildbot: https://build.julialang.org/builders/build_osx10.9-x64/builds/705/steps/shell_2/logs/stdio

@amitmurthy
Copy link
Contributor

Unless code loading has an asynchronous part that completes after include returns, I don't see an obvious reason for this error to occur. I propose merging #16095 and seeing if it helps identify the cause.

@vtjnash
Copy link
Member

vtjnash commented May 2, 2016

I tried to queue a few more builds, but it looks like Appveyor's configuration is broken right now

@tkelman
Copy link
Contributor Author

tkelman commented May 2, 2016

Doesn't include do something funny now in case the file gets modified while running? Nothing should be getting modified here but maybe the workaround for that is causing issues.

tkelman added a commit that referenced this issue May 3, 2016
@vtjnash
Copy link
Member

vtjnash commented May 3, 2016

@vtjnash vtjnash changed the title DictChannel not defined error on AppVeyor DictChannel not defined error ~on AppVeyor~ May 3, 2016
@vtjnash vtjnash changed the title DictChannel not defined error ~on AppVeyor~ DictChannel not defined error May 3, 2016
@tkelman
Copy link
Contributor Author

tkelman commented May 3, 2016

should've merged #16095 a little earlier, that merge commit was on 0821d57 so just missed it

@vtjnash
Copy link
Member

vtjnash commented May 4, 2016

I suspect this may be caused by the same issue as #15923

vtjnash added a commit that referenced this issue May 4, 2016
this made it unreliable for the WeakKeyDict these are typically put in (client_refs)
to have trouble finding them to cleanup the dictionary later
since their hash and identity changed

fixes #15923
reverts workaround #14456 (doesn't break #14295 due to previous commit)
may also fix #16091
@vtjnash vtjnash closed this as completed in 7348cfe May 9, 2016
amitmurthy added a commit that referenced this issue May 9, 2016
amitmurthy added a commit that referenced this issue May 9, 2016
* Revert "one more check"

This reverts commit 48e969c.

* Revert "detect cause of #16091"

This reverts commit 6359f1d.

* move remote ref finalizer test into regular CI
@vtjnash vtjnash reopened this May 11, 2016
@tkelman
Copy link
Contributor Author

tkelman commented May 11, 2016

did you see this again? link?

@vtjnash
Copy link
Member

vtjnash commented May 12, 2016

yeah, locally

@tkelman tkelman added the heisenbug This bug occurs unpredictably label May 12, 2016
@tkelman
Copy link
Contributor Author

tkelman commented Jun 27, 2016

@ranjanan saw this today in https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.3375/job/tjrdj3erbwok88bm, looks like the new form this can sometimes take is MethodError: no method matching Base.Serializer.__deserialized_types__.##269()

@vtjnash
Copy link
Member

vtjnash commented Jun 27, 2016

Ah, this error makes far more sense. The old bug is probably that this conditional is sometimes wrong:

julia/base/serialize.jl

Lines 747 to 752 in a543e0c

elseif isdefined(mod, name)
tn = getfield(mod, name).name
# TODO: confirm somehow that the types match
name = tn.name
mod = tn.module
makenew = false

I spent a while trying to reproduce this, before I realized that Amit included a patch for this in

elseif mod !== __deserialized_types__ && isdefined(mod, name)
which makes it far harder to trigger manually. This change was also my first thought to correct this, until I saw it was already being done that way and still failing.

Upon reflecting further, I realized this may only be half of the problem.

There seem to be a few candidates for the other half. For one, deserialize_cycle appears to be implemented incorrectly for TypeName (called too late), which could corrupt the backref list. For another, the assignment to known_object_data appears to also happen much too late, which might result in the creation of multiple TypeNames for a Type.

(A third issue, not encountered yet here due to the aforementioned bugs, but encountered previously by dump.c, the general deserializer for DataType.types could encounter instances of DataType before the call to jl_new_datatype, resulting in either the error error("Expected object in cache. Not found.") or equivalently, an Undef access of typename.primary)

@vtjnash vtjnash removed heisenbug This bug occurs unpredictably test This change adds or pertains to unit tests labels Jul 27, 2016
@vtjnash
Copy link
Member

vtjnash commented Jul 27, 2016

https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.4678/job/o2l19ya94nreab0j

min repro (in a fresh julia session):

julia> f = ()->1
(::#1) (generic function with 1 method)

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> fetch(@spawnat 2 @eval f = ()->2)
(::#1) (generic function with 1 method)

julia> @spawnat 2 f()
Future(2,1,5,Nullable{Any}())

julia> fetch(ans) # should return 1
2

@amitmurthy
Copy link
Contributor

This example is expected behavior. You have 2 different definitions for f on master and 2.

@JeffBezanson
Copy link
Member

I think what's happening is that it tries to send over the local f, but the target machine sees it already has a function with the same name #1 so uses that instead. This might be the wrong thing to do for anonymous functions in Main.

@vtjnash
Copy link
Member

vtjnash commented Jul 27, 2016

While I used the same variable name, note that these are anonymous functions, not the function f

@amitmurthy
Copy link
Contributor

I don't understand, isn't f bound to two different functions on master and 2?

@spawnat 2 f() actually executes an anonymous function ()->f() on 2, and it expects f to be already defined on 2, which it is.

@vtjnash
Copy link
Member

vtjnash commented Jul 27, 2016

Might be easier to see if you call them f1 and f2

@JeffBezanson
Copy link
Member

@spawnat captures f from its environment. This example is clearer:

julia> f()
1

julia> remotecall_fetch(x->x(), 2, f)
2

I tried to explicitly send my f to processor 2, and it didn't work.

@amitmurthy
Copy link
Contributor

It is even more confusing now.

On master:

julia> whos()
                          Base  36383 KB     Module
                          Core  12471 KB     Module
                          Main  43285 KB     Module
                           ans      0 bytes  Base.#whos
                            f1      0 bytes  ##1#2

On 2:

julia>  From worker 2:                            Base  34723 KB     Module
    From worker 2:                            Core  12352 KB     Module
    From worker 2:                            Main  41369 KB     Module
    From worker 2:                              f2      0 bytes  ##1#2

How can @spawnat 2 f1() ever return 2? It seems like it is being evaluated locally. remotecall_fetch has the correct behavior.

remotecall_fetch(()->f1(), 2)
ERROR: On worker 2:
UndefVarError: f1 not defined

@amitmurthy
Copy link
Contributor

OTOH

julia> remotecall_fetch(f1, 2)
2

@amitmurthy
Copy link
Contributor

@spawnat captures f from its environment.

My understanding was that globals are not captured. Guess the behavior is different for functions vs data?

@vtjnash
Copy link
Member

vtjnash commented Jul 27, 2016

I posted the min repro after seeing what needs to be added to my PR to fix this. The conditional at https://github.com/JuliaLang/julia/pull/17619/files#diff-aec7b7bfc3115e923e4137ad38876951R741
should have the following corrected branch structure:

elseif mod !== __deserialized_types__ && mod !== Main
    @assert isdefined(mod, name)
    ...
else
    mod = __deserialized_types__
    ....
end

@JeffBezanson
Copy link
Member

My understanding was that globals are not captured.

julia> macroexpand(:(@spawnat 2 f()))
:((Base.spawnat)(2,(((f,)->(()->begin  # multi.jl, line 1658:
                    f()
                end)))(f)))

This behavior is a bit controversial though.

should have the following corrected branch structure

+1, though I think there's a chance we should only exclude Main for anonymous functions.

@vtjnash
Copy link
Member

vtjnash commented Jul 27, 2016

Right. That brings up the other possibility, which is that the serializer should redirect them into __deserialized_types__ instead of repeating the ishidden test in the deserializer. Any thoughts / preference?

@amitmurthy
Copy link
Contributor

Ah! Just to reconfirm, this is due to Base.localize_vars localizing all variables, correct?

@JeffBezanson
Copy link
Member

this is due to Base.localize_vars localizing all variables, correct

Yes.

@JeffBezanson
Copy link
Member

serializer should redirect them into deserialized_types instead of repeating the ishidden test in the deserializer

Sounds like either would work.

vtjnash added a commit that referenced this issue Jul 28, 2016
also reduce code duplication for easier maintenance

ref #16091
vtjnash added a commit that referenced this issue Jul 28, 2016
should_send_whole_type was sending too many regular type that weren't functions
and deserialize was trying to hide that error,
but was then not deserializing things that were anonymous functions

fix #16091
vtjnash added a commit that referenced this issue Jul 29, 2016
also reduce code duplication for easier maintenance

ref #16091
vtjnash added a commit that referenced this issue Aug 4, 2016
also use this `client_refs.lock` to protect other data-structures from
being interrupted by finalizers, in the multi.jl logic

we may want to start indicating which mutable data-structures are safe
to call from finalizers, since generally that isn't possible

to make a finalizer API gc-safe, that code should observe the standard
thread-safe restrictions (there's no guarantee of which thread it'll run on),

plus, if the data-structures uses locks for synchronization,
use the `islocked` pattern (demonstrated herein) in the `finalizer`
to re-schedule the finalizer when the mutable data-structure is not
available for mutation.
this ensures that the lock cannot be acquired recursively,
and furthermore, this pattern will continue to work if finalizers
get moved to their own separate thread.

close #14445
fix #16550
reverts workaround #14456 (shouldn't break #14295, due to new locks)
should fix #16091 (with #17619)
vtjnash added a commit that referenced this issue Aug 5, 2016
also use this `client_refs.lock` to protect other data-structures from
being interrupted by finalizers, in the multi.jl logic

we may want to start indicating which mutable data-structures are safe
to call from finalizers, since generally that isn't possible

to make a finalizer API gc-safe, that code should observe the standard
thread-safe restrictions (there's no guarantee of which thread it'll run on),

plus, if the data-structures uses locks for synchronization,
use the `islocked` pattern (demonstrated herein) in the `finalizer`
to re-schedule the finalizer when the mutable data-structure is not
available for mutation.
this ensures that the lock cannot be acquired recursively,
and furthermore, this pattern will continue to work if finalizers
get moved to their own separate thread.

close #14445
fix #16550
reverts workaround #14456 (shouldn't break #14295, due to new locks)
should fix #16091 (with #17619)
tkelman pushed a commit that referenced this issue Aug 11, 2016
also use this `client_refs.lock` to protect other data-structures from
being interrupted by finalizers, in the multi.jl logic

we may want to start indicating which mutable data-structures are safe
to call from finalizers, since generally that isn't possible

to make a finalizer API gc-safe, that code should observe the standard
thread-safe restrictions (there's no guarantee of which thread it'll run on),

plus, if the data-structures uses locks for synchronization,
use the `islocked` pattern (demonstrated herein) in the `finalizer`
to re-schedule the finalizer when the mutable data-structure is not
available for mutation.
this ensures that the lock cannot be acquired recursively,
and furthermore, this pattern will continue to work if finalizers
get moved to their own separate thread.

close #14445
fix #16550
reverts workaround #14456 (shouldn't break #14295, due to new locks)
should fix #16091 (with #17619)

(cherry picked from commit cd8be65)
ref #16204
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
intended to help with anonymous functions that don't have names

and JuliaLang#16091
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
should_send_whole_type was sending too many regular type that weren't functions
and deserialize was trying to hide that error,
but was then not deserializing things that were anonymous functions

fix JuliaLang#16091
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
also use this `client_refs.lock` to protect other data-structures from
being interrupted by finalizers, in the multi.jl logic

we may want to start indicating which mutable data-structures are safe
to call from finalizers, since generally that isn't possible

to make a finalizer API gc-safe, that code should observe the standard
thread-safe restrictions (there's no guarantee of which thread it'll run on),

plus, if the data-structures uses locks for synchronization,
use the `islocked` pattern (demonstrated herein) in the `finalizer`
to re-schedule the finalizer when the mutable data-structure is not
available for mutation.
this ensures that the lock cannot be acquired recursively,
and furthermore, this pattern will continue to work if finalizers
get moved to their own separate thread.

close JuliaLang#14445
fix JuliaLang#16550
reverts workaround JuliaLang#14456 (shouldn't break JuliaLang#14295, due to new locks)
should fix JuliaLang#16091 (with JuliaLang#17619)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants