Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

julep/rfc: proposal for unified Dict get/insert/has interface #12157

Open
vtjnash opened this issue Jul 15, 2015 · 26 comments
Open

julep/rfc: proposal for unified Dict get/insert/has interface #12157

vtjnash opened this issue Jul 15, 2015 · 26 comments
Labels
collections Data structures holding multiple items, e.g. sets julep Julia Enhancement Proposal

Comments

@vtjnash
Copy link
Member

vtjnash commented Jul 15, 2015

problem statement:

a common idiom when working with Associative objects is to do a set of has / insert / lookup operations:

d = Dict()
if !(key in keys(d))
  d[key] = initial_value
end
value = d[key]

this syntax has the advantage of only using basic operations, for clarity, but it's also 3x slower than necessary since the dict lookup gets repeated three times. that concern has lead to the introduction of a get! method that accepts either an object or function, to insert a value if the lookup fails, so that the above code can be rewritten as:

d = Dict()
value = get!(d, key, initial_value)

or

d = Dict()
value = get!(d, key, () -> initial_value)

however, this implementation is still not necessarily faster, since it involves precomputing initial_value (fine it it's just a constant, but bad if it expensive to compute or creates unnecessary garbage) or invoking a lambda.

the code to support this one code pattern isn't exactly short, requires a fair amount of code duplication, and even a macro definition is provided (https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L632-L690)

proposal

define the Ref operation on a Dict to return an intelligent indexer object. the above code pattern could then be written as:

d = Dict()
bp = Ref(d, key)
if isempty(bp)
  bp[] = initial_value
end
value = bp[]

implementation sketch

type RefDict{K, V, D <: Dict} <: Ref{V}
  d::D
  k::K
  last_htindex::Int
  last_state::Int
end
function RefDict{K,V}(d::Dict{K,V}, k)
  key = convert(K, k)
  isequal(key, k) || throw(ArgumentError("k is not a usable key"))
  return RefDict{K,V,typeof(d)}(d, k, 0, 0)
end
function isempty(r::RefDict)
  if (d.last_state != d.changecounter)
    r.last_htindex = ht_keyindex2(d, key)
    r.last_state = d.changecounter
  end
  return r.last_htindex > 0
end

ref #12035 (comment)

@yuyichao
Copy link
Contributor

Is this the one you briefly mentioned on saturday night during the JuliaCon?

@carnaval
Copy link
Contributor

Yep. And yes please (again) !

(btw s/in/haskey/ in your first example)

@yuyichao
Copy link
Contributor

Also would this still has benefit over the get! version if we have fast anonymous function?

@vtjnash
Copy link
Member Author

vtjnash commented Jul 15, 2015

Is this the one you briefly mentioned on saturday night during the JuliaCon?

yes. i figured i should finally try to see what it would look like

@carnaval
Copy link
Contributor

Also would this still has benefit over the get! version if we have fast anonymous function?

It is more general since you can use the hash table before the insertion. Like

if isempty(bp)
    println("Inserting into ", d)
    bp[] = x
end

We could also still have the lambda version for compactness I guess.

@carnaval
Copy link
Contributor

(well in this example I guess you could capture d in the lambda, but my point is you can carry the ref around for some time before committing to the insertion)

@JeffBezanson
Copy link
Member

Good idea!

I think my main concern would be allocating an extra object for each dictionary operation. It could get quite expensive. Also unfortunately the code for the common idiom is not shorter.

@yuyichao
Copy link
Contributor

(well in this example I guess you could capture d in the lambda,

Was just going to say that.

but my point is you can carry the ref around for some time before committing to the insertion)

I see. (not sure if it is good to bring up but I guess it should be as useful as c++ reference.....)

@JeffBezanson
Copy link
Member

Related: #2108

@kmsquire
Copy link
Member

I'll point out that @StephenVavasis implemented a Token interface for SortedDicts in DataStructures.jl, which plays a similar roll. This one sounds better, however.

(He even started to use Ref in a similar way, before I discouraged him because it wasn't used yet in Base (sorry, Stephen).)

@kmsquire
Copy link
Member

my point is you can carry the ref around for some time before committing to the insertion

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

@yuyichao
Copy link
Contributor

We might need to be careful about asynchronous updates to the hash table causing it to be resized, especially if we ever get threading.

I think in the example above ref[] = value basically fall back to ref.dict[ref.key] = value if it detects that the dictionary have changed so we should just be as careful as doing regular dict assignment?

There might be one additional race possibility but IMHO it shouldn't need more care than the dict itself.

@carnaval
Copy link
Contributor

Yes, Jameson's proposal handles the single thread rehash case by graceful performance degradation. Multi threaded modification of the same data structure will likely be defined as a race condition => undefined behavior. We'll have to implement a family of concurrent data structures. (the java std lib has state of the art implementation of those things, probably good inspiration).

@StephenVavasis
Copy link
Contributor

Since Kevin pinged me on this topic, I guess I can add my two cents about tokens and Ref.

(1) The primary purpose of tokens in my code is to provide a mechanism for stepping through a SortedDict (or SortedSet or SortedMultiSet) in the sort-order of the keys. Obviously, this is not an issue for Dict.

(2) A second purpose of a token is to provide a handle to an entry that can be dereferenced in O(1) time. Again, this is not so much an issue for Dict because the hash-lookup is already supposed to be O(1).

(3) Using a Ref object to point to a spot in the structure where an item would go if it would be inserted later seems like a reasonable idea; as mentioned earlier, there is a performance hit because the Ref object is heap-allocated. However, I am thinking of changing my balanced-tree implementation so that the most recent look-up is cached because often the next lookup is nearby. Couldn't Dict do the same with caching?

(4) Going off on a tangent, I have an issue with the equivalence between haskey(d,k) and in(k, keys(d)) ... see my recent posting here:

JuliaCollections/DataStructures.jl#103

(5) Finally, with regard to the problem of the awkward if-block versus the need to evaluate 'initial_value' in a get! function, wouldn't it be OK to use a macro and to ask anyone who writes an Associative type to support a macro version of get! that solves this problem?

@StefanKarpinski
Copy link
Member

I don't really like the syntax much. I've proposed adding a syntax for get! like d[k] ?= v. However, that begs the question of what we should lower that to. If we lower it to the form that uses a closure, then we haven't solved the performance issue (although I'm not entirely convinced that this matters since it will be fast eventually, right?), so maybe it could lower to something like what you're proposing. There certainly are cases within the implementation of Dict where having a more usable "intelligent index" type would be helpful in implementing higher level operations.

@dbeach24
Copy link
Contributor

dbeach24 commented Aug 29, 2016

I have an alternate proposed solution to this which I (unwittingly) opened in a separate ticket #18282.

TLDR: What about building a function that handles the "update this value in a dictonary" case, as follows:

update!(f::Function, a::Associative, key, default) = (a[key] = f(get(a, key, default)))

(except that it would be specialized to avoid recomputation of the hash/slot.)

This would allow code like:

update!(x -> x+1, my_counters, key, 0)  # where 0 is the default value for a counter, or
update!(x -> x+value, my_totals, key, 0.0) # where value is accumulated

@KristofferC
Copy link
Member

Sounds like #15630

@KristofferC
Copy link
Member

@dbeach24
Copy link
Contributor

dbeach24 commented Aug 29, 2016

@KristofferC Pardon my ignorance. Do LinearSlow AbstractArrays and Associatives share a common interface / behavior pattern?

EDIT: I think I understand a bit better now from seeing UpdateIndex.jl

@ViralBShah ViralBShah added the julep Julia Enhancement Proposal label Aug 30, 2016
@KristofferC
Copy link
Member

It is similar since you want to "hold on" to the result of the lookup thas has to be made when indexing.

@dbeach24
Copy link
Contributor

dbeach24 commented Aug 30, 2016

@KristofferC Thanks for sharing these links. FWIW, I agree that Julia would benefit from a way to directly overload the various ?= operators. (Python has this capability, and numpy leverages it heavily. Why would Julia want to skip over this functionality?) However, I suppose that discussion belongs on #15630.

I think the dictionary case differs slightly because, unlike arrays, dictionaries are not guaranteed to have values over any particular range of keys, and there is no default value that can be assumed for missing keys, at least not in the general case.

In your proposal:

updateindex!(A, op, s, i...)
  • op is a binary operator, and
  • s is the RHS operand.

Could these be combined into a single function, f(x) = op(x, s)? By placing this first, we can permit "do" syntax, and this makes our proposals look very similar, indeed.

updateindex!(f, A, i...)
update!(f, dict, key, default)

Not sure how to deal with the fact that arrays need multiple indices and dicts have just one. Can the array indices be grouped into a tuple, or will this create overhead?

@eschnett
Copy link
Contributor

In general, creating the new entry (i.e. what is called default here) might be expensive, or might be a transaction that should only occur when the new value is actually needed. get has a form that accepts a function instead of a default value, and that function is only called when needed. This function is also the first argument. So we can do this:

update!(op, mkdefault, coll, ids...)

where both op and mkdefault are functions, coll is a collection, and ids are indices or keys.

@dbeach24
Copy link
Contributor

dbeach24 commented Aug 30, 2016

@eschnett That's a good point.

  1. Is it confusing to have update! accept two functions?
  2. Which one of them should be available for "do" syntax.
  3. When the default value is not supplied by a function, it typically comes last. Should the two functions really be the first two arguments?

Current proposal would look like this:

update!(init_counter, counters, key) x do
    x + 1
end

What about moving the default function to the end?

update!(counters, key, init_counter) x do
   x + 1
end

Edit: I like the way update!(counters...) reads in this last example. It clearly states which collection is being updated. (But of course that only holds if you use the "do" syntax.)

Edit 2: Could we support both cases where the default is an object or a function? i.e.

update!(f::Function, coll::Associative, key, default)
update!(f::Function, coll::Associative, key, default::Function)

...or is that an abuse of overloading?

@martinholters
Copy link
Member

Could we support both cases where the default is an object or a function

For a dict holding functions as values, that might give surprising/undesired behavior.

@dbeach24
Copy link
Contributor

@martinholters Thank you for speaking up, I was looking for someone who might disagree.

Does that use case compel us to move the default function to another argument position? (arguably less clear and less consistent?)

Of course, if someone was keeping a dict of functions, they could still write:

update!(mydict, key, () -> my_default_value_which_is_a_function) do x
    decorate_function(x)
end

@algorithmx
Copy link

See also https://github.com/KristofferC/UpdateIndex.jl

Hi Kristofer, are there any follow-up improvements on this ? I am hitting a performance barrier now. The bottleneck is to count the number of occurrence of various patterns in a "string". I used a Dict{pattern_type,Int} to do the job, and I wish to make it faster. What is the state-of-art technique for that ? Otherwise I'll just patch my code with this one.

@nsajko nsajko added the collections Data structures holding multiple items, e.g. sets label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets julep Julia Enhancement Proposal
Projects
None yet
Development

No branches or pull requests