nsqd: channel sampling #223

elubow · 2013-06-13T18:59:18Z

As referenced in #219, creates a sampling channel that will discard messages on the server side if the sample threshold is met.

mreiferson · 2013-06-14T15:39:59Z

nsqd/channel.go

+		return int32(0)
+	}
+
+	return int32(0)


instead of adding this extra return you can just drop the last else case... cleaner

mreiferson · 2013-06-14T15:40:47Z

gonna write up some high level thoughts shortly...

mreiferson · 2013-06-14T17:53:08Z

Nice work with your first Go pull request @elubow :)

At a high level, I think I'm in favor of adding the concept of "sampling". My thoughts are fixated entirely around where this property should be applied.

I think it's useful to keep the property specification as part of the channel name like you've done. This avoids creating a situation where it needs to be added to every utility (nsq_to_http, nsq_tail, etc) as a command line option and client library as a configuration parameter. It also sets the precedent for how potentially similar concepts could be applied.

However, just because it's part of the channel name this doesn't require the property to be applied to the channel...

IMO "sampling" feels like a property of the client... for instance N clients (subscribed to the same topic/channel) should be able to sample at different rates. As implemented this doesn't support that.

I'm proposing for us to retain the channel name property specification you've implemented, but use the values internally in a different way.

ephemeral=true is a channel property and would continue to function in the way it has always been implemented
sample=[0-100] is a client property and would be parsed out in the SUB code path (ie. before the channel name is passed on for creation)

I must admit that going this route, I have reservations around the confusion and inconsistency it may create in the fact that this one (client) property needs to be set via the channel name spec. There are alternatives, but they have tradeoffs.

cc @jehiah

devdazed · 2013-06-14T18:13:33Z

The purpose behind putting sampling at the producer/channel level was to limit the amount of network traffic going to the consumer. If the consumer specifies the that there is a "sample" but does all the work of throwing out messages then a) there is no need to tell the channel it is sampled in the first place, and b) it is contradictory to the problem we were trying to solve in the first place.

mreiferson · 2013-06-14T18:16:08Z

That's not what I meant. More specifically, I think it makes (more) sense as a server-side property of a client rather than a server-side property of a channel.

It only really impacts where (in the server) you're discarding messages.

devdazed · 2013-06-14T18:29:43Z

i see, is there a concept of client level options on the channel, or will this have to be built in? and how will this affect the compatibility of the drivers?

mreiferson · 2013-06-14T18:40:16Z

technically no, but there is a concept of client level options that are specified via IDENTIFY that effectively impacts how data is consumed and clients are treated by nsqd.

with either implementation (with or without my suggested changes) the external API would remain the same and should not impact forwards/backwards compatibility of any existing utility or client library (good).

devdazed · 2013-06-14T18:49:01Z

ok, well i'm on board

devdazed · 2013-06-14T22:45:43Z

The primary issue I see with adding it to the client side is that if one client connects expecting 100% of the messages and another connects expecting 25%, then the one expecting 100% is actually going to get 37.5% of the messages and the other will get 12.5%. This is because each client gets only 50% of the messages (as there are 2) and then they will decide if it needs to be sent across the wire by using the sample rate.

When it is at the channel level, the 100% client receives all messages and the 25% receives 25%. Which is what I believe is expected (and what we are attempting to do in the pull request).

It's essentially creating a new channel for each different sample rate.

mreiferson · 2013-06-14T22:48:41Z

nsqd/channel.go

+}
+
+// Need to validate the sampleRate passed in on channel instantiation
+func (c *Channel) validateSampleRate(dirtySampleRate string) int32 {


two things

I think sample rate should be an int where 1 <= sampleRate <= 99, so we dont have to deal with floats on the server side

this should return an error too so that we can propagate back up the chain if things weren't correct (and fail fatally). unfortunately, this will need a bit of refactoring because we currently assume that NewChannel cannot fail (and it would need to propagate the error as well)

This is making the assumption that everyone will want to sample at a resolution of 1%. What about the possibility of someone wanting to sample at 33.33%? This makes send when people are doing billions of messages, as that extra decimal can make the difference of hundreds of thousands of messages. I would lean toward only allowing floats and not allowing the 0-100.

As it was implemented before it was the same resolution (it converted to int on the way out).

I personally don't care that much... a percentage is a percentage is a percentage... losing fractions of a percent granularity is meh IMO.

elubow · 2013-06-15T12:34:51Z

In order to propagate an error on NewChannel, we would need to make the clients (consumers) aware of a failure on channel creation. That is why I (initally) took the route of making a channel creation not fail on a bad sampling rate. Although I removed the create anyway scenario, before I go down the road of bubbling up a channel creation failure, are we planning on modifying every client to handle this?

mreiferson · 2013-06-15T12:55:31Z

after reviewing the code, perhaps putting this new validation into IsValidChannelName is more correct

mreiferson · 2013-06-28T00:22:47Z

Here are my thoughts after sitting on this for a bit.

I'm leaning towards treating this as a feature negotiated by each client as part of the IDENTIFY
handshake (like max_rdy_count and soon tls_v1, deflate, see #227).

My biggest concern about the channel name approach is that it sets a precedent I'm not fully
convinced we want.

I've considered @devdazed's point re: the potential confusion this can create in determining what %
of messages a given client will receive, but I don't agree that it isn't straightforward to
calculate. I think it boils down to a documentation issue. We must suggest best practices for
grouping clients that enable sampling (particularly suggesting that you include the % in the channel
name, for clarity).

Also, in terms of the approach we choose, I don't believe that the burden of updating client
libraries and packaged utilities should weigh as heavily as correctness.

The reason I prefer this approach is because it is consistent with the way other client features are
negotiated. Additionally, it closely represents the way in which one would implement sampling right
now (as a client-side discard). It simply moves the discarding into the server (for the
obvious efficiency benefit, which is all we're really asking for).

At a high level, the recommended approach would look like:

"clicks" (topic)
   |
   +---- "data_science_s25" (channel)
   |              |
   |              +---------- worker_a IDENTIFY { "sample_rate": 25 }
   |              |
   |              +---------- worker_b IDENTIFY { "sample_rate": 25 }
   |              |
   |              +---------- worker_c IDENTIFY { "sample_rate": 25 }
   |
   +---- "more_science_s70" (channel)
                  |
                  +---------- worker_a IDENTIFY { "sample_rate": 70 }
                  |
                  +---------- worker_b IDENTIFY { "sample_rate": 70 }
                  |
                  +---------- worker_c IDENTIFY { "sample_rate": 70 }

I can also imagine client library functionality that improves the developer experience (like
automatically creating helpful channel names based on sample rates).

In terms of actual changes, we would:

introduce sample_rate to the IDENTIFY payload (with the same semantics you've already
implemented.)
move the discard code into the client's messagePump() rather than the channels.

A couple other points...

I honestly don't care if it has floating pt precision. I've already explained my position, feel
free to go back to floats if you feel like that's important.
There's an interesting question of whether or not requeued messages should bypass the sampling
logic. For this initial implementation I think we should treat all messages as equal and simply
filter whatever comes down the pipe.

What are the downsides? The most obvious is that it is arguably easier to configure yourself into a
situation where you have consumers subscribed to the same channel with different sampling rates.

devdazed · 2013-06-28T14:34:16Z

I am on board with everything outlined. I would like to make it a point that I believe that the channel should hold meta-information about the sampling that will warn a client that attempts to connect to it with a differing sample rate. This will remove and confusion about the usage and allow a person with a use-case for it, to continue and ignore the warning.

mreiferson · 2013-12-15T23:26:35Z

Closing in favor of #286

elubow added 3 commits June 13, 2013 11:56

Adding channel sampling

ba160ed

Adding validation for possible sampling rates

6a18428

Go 1.0.3 can't detect shit

30e9eca

mreiferson reviewed Jun 14, 2013
View reviewed changes

elubow added 4 commits June 14, 2013 15:50

Removing else in favor of return

d79afe8

Adding vim files to .gitignore for the serious developers

6f4dfa9

Break out the channel parameter setting into it's own method

271e077

param => property AND terminology correction of string to slice

a75d82e

mreiferson reviewed Jun 14, 2013
View reviewed changes

elubow added 2 commits June 15, 2013 08:07

Slice => String for properties

9d0066d

Remove sampleRate's float'ing ability

9fe4b6d

mreiferson mentioned this pull request Jun 19, 2013

Create sampling stream using channel name modifier #219

Closed

elubow mentioned this pull request Dec 14, 2013

Allow for more types of channel names nsqio/go-nsq#20

Closed

elubow referenced this pull request in elubow/nsq Dec 15, 2013

Foundation code for adding sampling

041fdb3

mreiferson closed this Dec 15, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nsqd: channel sampling #223

nsqd: channel sampling #223

elubow commented Jun 13, 2013

mreiferson Jun 14, 2013

mreiferson commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson Jun 14, 2013

devdazed Jun 15, 2013

mreiferson Jun 15, 2013

elubow commented Jun 15, 2013

mreiferson commented Jun 15, 2013

mreiferson commented Jun 28, 2013

devdazed commented Jun 28, 2013

mreiferson commented Dec 15, 2013

nsqd: channel sampling #223

nsqd: channel sampling #223

Conversation

elubow commented Jun 13, 2013

mreiferson Jun 14, 2013

Choose a reason for hiding this comment

mreiferson commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson commented Jun 14, 2013

devdazed commented Jun 14, 2013

devdazed commented Jun 14, 2013

mreiferson Jun 14, 2013

Choose a reason for hiding this comment

devdazed Jun 15, 2013

Choose a reason for hiding this comment

mreiferson Jun 15, 2013

Choose a reason for hiding this comment

elubow commented Jun 15, 2013

mreiferson commented Jun 15, 2013

mreiferson commented Jun 28, 2013

devdazed commented Jun 28, 2013

mreiferson commented Dec 15, 2013