
A large number of @spawnat calls hangs #15923

Closed
amitmurthy opened this issue Apr 18, 2016 · 7 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
Milestone: 0.5.0

Comments

@amitmurthy
Contributor

Recent changes have rendered workaround #14456 moot.

The test below (run under JULIA_TESTFULL=1) hangs:

for n = 1:10^6
    fetch(@spawnat myid() myid())
end

A Ctrl-C prints

^CERROR: InterruptException:
 in &(::Int64, ::Int64) at ./int.jl:131
 in rehash!(::Dict{Any,Any}, ::Int64) at ./dict.jl:551
 in rehash!(::Dict{Any,Any}, ::Int64) at ./dict.jl:562 (repeats 2 times)
 in _setindex!(::Dict{Any,Any}, ::Bool, ::WeakRef, ::Int64) at ./dict.jl:691
 in setindex!(::Dict{Any,Any}, ::Bool, ::WeakRef) at ./dict.jl:708
 in setindex!(::WeakKeyDict{Any,Any}, ::Bool, ::Future) at ./dict.jl:876
 in test_existing_ref(::Future) at ./multi.jl:509
 in Future(::Int64, ::Int64, ::Int64) at ./multi.jl:477
 in Future(::Int64) at ./multi.jl:545
 in remotecall(::Any, ::LocalProcess) at ./multi.jl:799
 in remotecall(::Any, ::Int64) at ./multi.jl:811
 [inlined code] from ./REPL[1]:2
 in anonymous at ./<no file>:4294967295
 in eval(::Module, ::Any) at ./boot.jl:236

The underlying cause is probably related to #14445 (comment).

@amitmurthy
Contributor Author

It is an issue with Dicts, most probably due to memory corruption.

The looping is here:

julia/base/dict.jl

Lines 550 to 552 in fb81034

while slots[index] != 0
index = (index & (newsz-1)) + 1
end

I tested with the following modification of that block:

            lc = 0
            while slots[index] != 0
                index = (index & (newsz-1)) + 1
                lc = lc+1
                if lc > 10000
                    @assert lc < 10000
                elseif lc > 1000
                    println("Looping...index[", index, "], newsz[", newsz, "]")
                end
            end

and it loops around the following values:

Looping...index[4499], newsz[65536]
Looping...index[4500], newsz[65536]
Looping...index[4501], newsz[65536]
Looping...index[3773], newsz[65536]  <---- This shouldn't happen
Looping...index[3774], newsz[65536]
Looping...index[3775], newsz[65536]

which leads me to suspect a corruption somewhere.
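
For reference, a minimal sketch of the probe arithmetic in that loop (hashindex mirrors base/dict.jl; probe_sequence is a hypothetical helper added only for illustration): within one probe the index advances a single slot per iteration and wraps from newsz back to 1, so a backward jump between consecutive iterations would indeed point at corrupted slot data.

# hashindex mirrors base/dict.jl; probe_sequence is a hypothetical helper.
hashindex(key, sz) = ((hash(key) % Int) & (sz - 1)) + 1

function probe_sequence(key, newsz, nsteps)
    index = hashindex(key, newsz)
    seq = Int[]
    for _ in 1:nsteps
        push!(seq, index)
        index = (index & (newsz - 1)) + 1  # same step as the rehash! loop: +1, wrapping newsz -> 1
    end
    return seq
end

probe_sequence("k", 65536, 3)  # three consecutive slot indices, wrapping at newsz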

cc: @yuyichao

@amitmurthy
Contributor Author

Disabling gc allows the above test to run fine. The odd thing, however, is that re-enabling gc does not release the memory allocated previously during the run.
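
For reference, a minimal sketch of that reproduction with the collector disabled, assuming the 0.5-era gc_enable / gc API (on current Julia these live under GC.enable and GC.gc):

gc_enable(false)                   # disable the collector
for n = 1:10^6
    fetch(@spawnat myid() myid())  # now completes without hanging
end
gc_enable(true)                    # re-enable the collector
gc()                               # oddly, does not free the memory allocated during the run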

@amitmurthy
Contributor Author

This is a regression and is due to an interplay between the Dict code and GC, which I don't fully understand. Tagging as 0.5.

@amitmurthy amitmurthy added the regression Regression in behavior compared to a previous version label Apr 29, 2016
@amitmurthy amitmurthy added this to the 0.5.0 milestone Apr 29, 2016
@vtjnash
Member

vtjnash commented May 4, 2016

seems like it's a performance regression? adding a typeassert in hashindex seems to fix the performance bottleneck in rehash!, although calling isequal on such a large dictionary of WeakRef objects remains a huge performance bottleneck:

diff --git a/base/dict.jl b/base/dict.jl
index 4ac15ae..d9bf6b1 100644
--- a/base/dict.jl
+++ b/base/dict.jl
@@ -9,15 +9,7 @@ haskey(d::Associative, k) = in(k,keys(d))
 function in(p::Pair, a::Associative, valcmp=(==))
     v = get(a,p[1],secret_table_token)
     if !is(v, secret_table_token)
-        if valcmp === is
-            is(v, p[2]) && return true
-        elseif valcmp === (==)
-            ==(v, p[2]) && return true
-        elseif valcmp === isequal
-            isequal(v, p[2]) && return true
-        else
-            valcmp(v, p[2]) && return true
-        end
+        valcmp(v, p[2]) && return true
     end
     return false
 end
@@ -515,7 +507,7 @@ function convert{K,V}(::Type{Dict{K,V}},d::Associative)
 end
 convert{K,V}(::Type{Dict{K,V}},d::Dict{K,V}) = d

-hashindex(key, sz) = ((hash(key)%Int) & (sz-1)) + 1
+hashindex(key, sz) = (((hash(key)%Int) & (sz-1)) + 1)::Int

 isslotempty(h::Dict, i::Int) = h.slots[i] == 0x0
 isslotfilled(h::Dict, i::Int) = h.slots[i] == 0x1
@@ -616,7 +608,7 @@ function ht_keyindex{K,V}(h::Dict{K,V}, key)
         if isslotempty(h,index)
             break
         end
-        if !isslotmissing(h,index) && isequal(key,keys[index])
+        if !isslotmissing(h,index) && (key === keys[index] || isequal(key,keys[index]))
             return index
         end

@@ -650,7 +642,7 @@ function ht_keyindex2{K,V}(h::Dict{K,V}, key)
                 # in case "key" already exists in a later collided slot.
                 avail = -index
             end
-        elseif isequal(key, keys[index])
+        elseif key === keys[index] || isequal(key, keys[index])
             return index
         end

@@ -957,15 +949,7 @@ function in(key_value::Pair, dict::ImmutableDict, valcmp=(==))
     key, value = key_value
     while isdefined(dict, :parent)
         if dict.key == key
-            if valcmp === is
-                is(value, dict.value) && return true
-            elseif valcmp === (==)
-                ==(value, dict.value) && return true
-            elseif valcmp === isequal
-                isequal(value, dict.value) && return true
-            else
-                valcmp(value, dict.value) && return true
-            end
+            valcmp(value, dict.value) && return true
         end
         dict = dict.parent
     end
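
As a rough illustration of the fast path in that diff (keys_match is a hypothetical helper, not Base code): for WeakRef keys stored in a Dict{Any,Any}, an identity check is a cheap pointer comparison, whereas isequal on Any-typed values goes through dynamic dispatch, so trying === first short-circuits most comparisons.

# Hypothetical helper illustrating the ===-before-isequal pattern from the diff above.
function keys_match(key, stored)
    key === stored && return true  # cheap identity fast path
    return isequal(key, stored)    # generic fallback, as before
end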

I think this may have been due to changes to the multi.jl code? After finishing, we see that the dictionary was 256x larger:

v0.4:
julia> Base.client_refs|>length
12139
julia> Base.client_refs.ht.keys|>length
65536

master:
julia> Base.client_refs|>length
829344
julia> Base.client_refs.ht.keys|>length
16777216

Furthermore, upon calling gc(), on v0.4 client_refs drops to 0 after a very brief pause. On master, client_refs drops only slightly after a very long pause (further calls to gc() have no effect).

@vtjnash vtjnash added the performance Must go faster label May 4, 2016
vtjnash added a commit that referenced this issue May 4, 2016
this made it unreliable for the WeakKeyDict these are typically put in (client_refs)
to have trouble finding them to cleanup the dictionary later
since their hash and identity changed

fixes #15923
reverts workaround #14456 (doesn't break #14295 due to previous commit)
may also fix #16091
@amitmurthy
Contributor Author

I think this may have been due to changes to the multi.jl code? After finishing, we see that the dictionary was 256x larger:

The number of entries in Base.client_refs should not be different on master vs 0.4. I think it is differences in gc behavior between master and 0.4 resulting in slower collection of objects.

@vtjnash
Member

vtjnash commented May 5, 2016

The number of entries in Base.client_refs should not be different in master vs 0.4

yes, agreed – both should be 0

I think it is differences in gc behavior between master and 0.4

it appears that the order of finalizers is less predictable on master than it was on v0.4, yes

resulting in a slower collection of objects.

not necessarily – I think master may actually be faster at collecting them; the bug was in the finalizer (see 8d1970e)
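
For context, a rough sketch of the client_refs bookkeeping that the finalizer takes part in (register_ref and the anonymous finalizer are simplified illustrations, not the actual multi.jl code; the 0.5-era finalizer(obj, f) argument order is assumed):

const client_refs = WeakKeyDict{Any,Any}()

function register_ref(r)
    client_refs[r] = true
    # The finalizer must find `r` in client_refs again to remove it; if the
    # object's hash/identity changes before the finalizer runs (the bug fixed
    # by 8d1970e), the lookup fails and the entry is never cleaned up.
    finalizer(r, x -> delete!(client_refs, x))
    return r
end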

@vtjnash vtjnash closed this as completed in 7348cfe May 9, 2016
@amitmurthy
Contributor Author

Thanks!
