Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: Map method suggestions #21199

Closed
pcostanza opened this issue Jul 28, 2017 · 20 comments
Closed

sync: Map method suggestions #21199

pcostanza opened this issue Jul 28, 2017 · 20 comments

Comments

@pcostanza
Copy link

pcostanza commented Jul 28, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.9rc1

What operating system and processor architecture are you using (go env)?

macOS, x86_64

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

This is not an error report, but a report that I am convinced there are methods that still need to be added to the sync.Map data structure.

  1. A general-purpose Modify method is missing that allows modifying a value that is already stored in the map to a new value that depends on the old value. For example:
m.Modify(key, func(value interface{}, ok bool) (newValue interface{}, store bool) {
  if ok {
    newValue = value.(int) + 1
  } else {
    newValue = 1
  }
  store = true // false means: delete the current entry
  return
})

It's important that the key-value-pair currently stored in the map is not concurrently changed while the user-supplied function is being executed.

  1. Similarly, a DeleteOrStore method is missing:

actual, loaded := m.DeleteOrStore(key, value interface{})
// loaded = true means there was already a value in the map, which is now deleted
// loaded = false means there wasn't a value in the map, and value is now stored (== actual)

DeleteOrStore is useful if you are looking for pairs of entries that share the same key in a stream of entries: When the first entry arrives, it is stored in the map because no entry in the map with the same value already existed; when the second entry arrives, the first entry is retrieved and deleted from the map, and the pair of entries can now be processed outside of the map. A real-world use case is the processing of read pairs in DNA sequencing pipelines. (I can provide more details and code if there is interest.)

  1. I also believe that LoadOrCompute could be useful:

m.LoadOrCompute(key, func() interface{} { return some computed value })
This could be useful if computing a new value is expensive for some reason. Likewise, a DeleteOrCompute is potentially useful.

I don't know good use cases for LoadOrCompute and DeleteOrCompute, though, unlike for Modify and DeleteOrStore.

What did you expect to see?

I expected to see at least methods Modify and DeleteOrStore on sync.Map

What did you see instead?

These methods are not defined.

@mvdan mvdan changed the title sync.Map methods incomplete sync: Map method suggestions Jul 28, 2017
@mvdan
Copy link
Member

mvdan commented Jul 28, 2017

Note that there have been previous discussions on adding methods to Map: #20680 #20884

I also seem to recall that functions were avoided for a reason, but I can't find that now.

CC @bcmills

@pcostanza
Copy link
Author

OK, my apologies for not having checked the previous discussions. I understand your motivation not to include something like Modify, but DeleteOrStore is something we actually really need...

@bcmills
Copy link
Contributor

bcmills commented Jul 28, 2017

LoadOrStore is part of the discussion in #20884. (See #20884 (comment) in particular for the technical concerns.)

Modify seems like a generalization of that that, but is an even more complex and more generalized API. I think the same concerns would apply more strongly to it, and I would also be concerned that it's too hard to use correctly. (What happens if you return a non-nil newValue with store = false?)

@bcmills
Copy link
Contributor

bcmills commented Jul 28, 2017

So let's focus on DeleteOrStore. It seems like a really esoteric function to me; could you explain a bit more about your use-case? If I understand correctly, you want it to return the removed element, so on that assumption I'm going to call it RemoveOrStore. (To me, Delete implies that you don't care about the thing being deleted.)

I suspect it's more likely that we would add the more general CompareAndSwap and CompareAndDelete (see the discussion starting at #18177 (comment)), which could be used to implement RemoveOrStore as long as the values you have in mind are comparable and do not include the nil interface{} value:

func RemoveOrStore(m *sync.Map, k, v interface{}) (prev interface{}, deleted bool) {
  for {
    prev, ok := m.Load(k)
    if ok {
      if m.CompareAndDelete(k, prev) {
        return prev, true
      }
    } else {
      if m.CompareAndSwap(k, nil, v) {
        return nil, false
      }
    }
  {
}

That said, sync.Map is not really optimized for sets of keys that can change (see #21032 and #21035). Do you expect to be toggling back and forth between states for the same keys repeatedly? If so, you can store a pointer to a value with its own synchronization (e.g. atomic.CompareAndSwapPointer) and update that rather than storing and deleting it.

@josharian
Copy link
Contributor

It's quite late to have this thought, but I wonder whether we should rename sync.Map to sync.Cache or something else narrower. That might reduce these common feature requests and also leaves open the door for a future sync.Map with a different feature set.

@bcmills
Copy link
Contributor

bcmills commented Jul 28, 2017

@josharian The sync.Map API mostly follows the regular map API, and I think the same API can be made efficient for a range of use-cases well beyond just caching.

I do want to try to encourage more-specialized maps outside the standard library, though: the range of "concurrent map" use-cases is very wide, and sync.Map can't be ideal for all of them. It's important that the Map in the standard library address the use-cases in the standard library, and it's nice if it can address general use-cases beyond that, but there's really very little harm if folks want to fork and adapt it for specialized needs.

@josharian
Copy link
Contributor

there's really very little harm if folks want to fork and adapt it for specialized needs.

Yep. In fact, that'd be really helpful, to get a sense for what the set of common needs are, for future stdlib/language additions.

@pcostanza
Copy link
Author

pcostanza commented Jul 30, 2017

What happens if you return a non-nil newValue with store = false?)

That's not so different from having a function with a primary return value, and a secondary error value. If the secondary value is not nil, the primary return value is probably meaningless.

To me, Delete implies that you don't care about the thing being deleted.

I'm not sure I understand this remark correctly. I care that the storage can be reclaimed by the garbage collector, so the map shouldn't keep a reference to the value that prevents this from happening. (If there are other references to the value being held by the code that invoked DeleteOrStore, that's fine, it should just be in control of that code.)

This is also why a solution based on some compare-and-swap functionality is not ideal, because compare-and-swap cannot observe concurrent deletes, it can only observe swaps to some conventional "value-missing" value which doesn't allow the value to be garbage collected.

Sidenote: I can live with the functionality not being added to sync.Map, because it's not too hard to define a custom Map that implements what is needed. However, my understanding was that sync.Map is supposed to be a concurrent version of the Go's built-in map type, and then I believe it should add the functionality I describe above to be complete. Maybe it's easier to rename sync.Map to avoid the false impression?

@pcostanza
Copy link
Author

Here are more details about the use case: DNA sequencers use chemical processes to create smaller DNA sequences, which can then be read and stored in files. Each "read" consists of two fragments, one at the left end and one at the right end of the read, which are initially stored in FASTQ files, and eventually in SAM files.

For some processing steps in DNA sequencing pipelines, it is important to find the two fragments that belong to a single read. They can be identified by two fields, most importantly the QNAME field (see section 1.4 in the SAM file format). It is known that a particular QNAME only occurs twice in a SAM file, so one way to identify a pair of fragments is by using the QNAME as key in a map and store it the first time you encounter it, and retrieve and delete it again the second time you encounter it. By that time you have both fragments that belong to the pair.

A typical SAM file for human DNA samples can be around 50 GB compressed and 500 GB uncompressed data, so ensuring that values that are not used anymore can be gc'ed is quite important.

@bcmills
Copy link
Contributor

bcmills commented Jul 31, 2017

sync.Map is supposed to be a concurrent version of the Go's built-in map type, and then I believe it should add the functionality I describe above to be complete.

The built-in map type supports indexing (Load), assignment (Store), range loops (Range), deletion (Delete), and length (#20680).

sync.Map currently supports all of those except for length, because it isn't obvious that Len can be implemented in a way that is both correct and efficient. The built-in map type does not have any operation analogous to Modify or LoadOrCompute as you describe: those are compositions of map operations overlaid with a particular synchronization pattern.

sync.Map does add a LoadOrStore operation on top of the basic map operations: there was a recurring pattern of usage in the standard library supporting it. Arguments for similar higher-level methods need to identify similar patterns.

@bcmills
Copy link
Contributor

bcmills commented Jul 31, 2017

It is known that a particular QNAME only occurs twice in a SAM file, so one way to identify a pair of fragments is by using the QNAME as key in a map and store it the first time you encounter it, and retrieve and delete it again the second time you encounter it

If you know that the QNAME only occurs twice, LoadOrStore and Delete should suffice:

func Lookup(qname string, offset int64) (start, end int64, ok bool) {
  prev, ok := m.LoadOrStore(qname, offset)
  if !ok {
    return -1, -1, false
  } 
  m.Delete(qname)

  if prev > offset {
    start, end = offset, prev
  } else {
    start, end = prev, offset
  }
  return start, end, true
}

If you've already seen the second occurrence, then you know that there will not be any subsequent Store for the Delete to race with.

@glycerine
Copy link

sync.Map currently supports all of those except for length, because it isn't obvious that Len can be implemented in a way that is both correct and efficient.

Just out of curiosity, what would happen if one did an atomic.AddUint64(&length, 1) on add, and atomic.AddUint64(&length, -1) upon delete? Seems quite efficient. Return the length when queried with an atomic read. Would that not be correct?

@pcostanza
Copy link
Author

@bcmills Yes, this solution would work for this particular use case. I'd still be in favor to add DeleteOrStore, if not for performance reasons (potentially less synchronization necessary). But I have to admit I'm running out of good arguments.

@glycerine An issue with a length method, or similar, on concurrent data structures is that the information is potentially obsolete once you have it, so it's difficult to defend it's usefulness.

@bcmills
Copy link
Contributor

bcmills commented Aug 14, 2017

@glycerine It is not obvious to me that atomic.AddUint64 would even be correct. Do you update the counter before storing the pointer for the value? If so, you may over-count. But if you update the counter after storing the pointer, an immediate Delete of that same key could cause the count to go negative. So it seems like you would need some additional synchronization.

Even if you could find a correct algorithm, every Store or Delete that changes the set of keys would need to update the same counter in memory, reintroducing the very same cache contention problem that sync.Map is designed to avoid (albeit only for a subset of operations). That seems like a step in the wrong direction.

@glycerine
Copy link

Correctness: everyone knows that length is at a single snapshot in time, that it is subject to races, and that it may be stale, but this is the right of the user to decide how useful this is. For many applications this snapshot may be absolutely correct and exactly what is needed.

Efficiency: What's the alternative? When one needs the length, one must iterate through all available keys of the sync.Map and count them manually? An atomic store would be radically faster that all that data copying. Reading all stored keys from all shards will incur the cost of much more cache coherency logic than copying a single number, or even a single number per shard. Assuming you did a lazy update of the global length, and only updated shard lengths upon insert/delete, this would be a "only-pay-if-you-use-it" design for Len().

@ianlancetaylor
Copy link
Member

Your efficiency argument assumes that you need the length. It's clear from the current uses that many uses of this data structure do not need the length. You need to compare, not the efficiency of obtaining the length one way or another, but the efficiency of obtaining the length versus not obtaining the length.

For example, people who need the length could always insert and delete values using a helper function that maintains the length.

@bcmills
Copy link
Contributor

bcmills commented Aug 14, 2017

Correctness: everyone knows that length is at a single snapshot in time,

That's my point, though: without additional synchronization, it is not obviously a valid snapshot in time. If you have two concurrent Store calls and two concurrent Delete calls for the same keys, the caller should never observe the length as -2 even if the reported length eventually stabilizes at 0.

Similarly, if you have two concurrent Store calls with the same key, the caller should never observe the length as 2 even if it eventually stabilizes at the correct 1.

It is not obvious to me how you intend to maintain that property without additional locking (and the associated cache contention).

@bcmills
Copy link
Contributor

bcmills commented Aug 14, 2017

Assuming you did a lazy update of the global length, and only updated shard lengths upon insert/delete,

The current implementation does not use sharding at all (see https://golang.org/issue/20360#issuecomment-301372644).

(Or, you can equivalently think of it as having a shard per key, in which case your "lazy update of the global length" is exactly what you get from using Range to count the length.)

@andeya
Copy link

andeya commented Oct 13, 2017

From #22247

func (m *Map) Len() int

The following is a reference:

https://github.com/henrylee2cn/goutil/blob/master/map.go#L541

@ianlancetaylor
Copy link
Member

We aren't going to make any changes here, so closing. Please comment if you disagree, but please address the above comments.

@golang golang locked and limited conversation to collaborators Mar 30, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants