Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: channel sampling #223

Closed
wants to merge 9 commits into from
Closed

nsqd: channel sampling #223

wants to merge 9 commits into from

Conversation

elubow
Copy link
Contributor

@elubow elubow commented Jun 13, 2013

As referenced in #219, creates a sampling channel that will discard messages on the server side if the sample threshold is met.

return int32(0)
}

return int32(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding this extra return you can just drop the last else case... cleaner

@mreiferson
Copy link
Member

gonna write up some high level thoughts shortly...

@mreiferson
Copy link
Member

Nice work with your first Go pull request @elubow :)

At a high level, I think I'm in favor of adding the concept of "sampling". My thoughts are fixated entirely around where this property should be applied.

I think it's useful to keep the property specification as part of the channel name like you've done. This avoids creating a situation where it needs to be added to every utility (nsq_to_http, nsq_tail, etc) as a command line option and client library as a configuration parameter. It also sets the precedent for how potentially similar concepts could be applied.

However, just because it's part of the channel name this doesn't require the property to be applied to the channel...

IMO "sampling" feels like a property of the client... for instance N clients (subscribed to the same topic/channel) should be able to sample at different rates. As implemented this doesn't support that.

I'm proposing for us to retain the channel name property specification you've implemented, but use the values internally in a different way.

  1. ephemeral=true is a channel property and would continue to function in the way it has always been implemented
  2. sample=[0-100] is a client property and would be parsed out in the SUB code path (ie. before the channel name is passed on for creation)

I must admit that going this route, I have reservations around the confusion and inconsistency it may create in the fact that this one (client) property needs to be set via the channel name spec. There are alternatives, but they have tradeoffs.

cc @jehiah

@devdazed
Copy link

The purpose behind putting sampling at the producer/channel level was to limit the amount of network traffic going to the consumer. If the consumer specifies the that there is a "sample" but does all the work of throwing out messages then a) there is no need to tell the channel it is sampled in the first place, and b) it is contradictory to the problem we were trying to solve in the first place.

@mreiferson
Copy link
Member

That's not what I meant. More specifically, I think it makes (more) sense as a server-side property of a client rather than a server-side property of a channel.

It only really impacts where (in the server) you're discarding messages.

@devdazed
Copy link

i see, is there a concept of client level options on the channel, or will this have to be built in? and how will this affect the compatibility of the drivers?

@mreiferson
Copy link
Member

technically no, but there is a concept of client level options that are specified via IDENTIFY that effectively impacts how data is consumed and clients are treated by nsqd.

with either implementation (with or without my suggested changes) the external API would remain the same and should not impact forwards/backwards compatibility of any existing utility or client library (good).

@devdazed
Copy link

ok, well i'm on board

@devdazed
Copy link

The primary issue I see with adding it to the client side is that if one client connects expecting 100% of the messages and another connects expecting 25%, then the one expecting 100% is actually going to get 37.5% of the messages and the other will get 12.5%. This is because each client gets only 50% of the messages (as there are 2) and then they will decide if it needs to be sent across the wire by using the sample rate.

When it is at the channel level, the 100% client receives all messages and the 25% receives 25%. Which is what I believe is expected (and what we are attempting to do in the pull request).

It's essentially creating a new channel for each different sample rate.

}

// Need to validate the sampleRate passed in on channel instantiation
func (c *Channel) validateSampleRate(dirtySampleRate string) int32 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two things

  1. I think sample rate should be an int where 1 <= sampleRate <= 99, so we dont have to deal with floats on the server side
  2. this should return an error too so that we can propagate back up the chain if things weren't correct (and fail fatally). unfortunately, this will need a bit of refactoring because we currently assume that NewChannel cannot fail (and it would need to propagate the error as well)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is making the assumption that everyone will want to sample at a resolution of 1%. What about the possibility of someone wanting to sample at 33.33%? This makes send when people are doing billions of messages, as that extra decimal can make the difference of hundreds of thousands of messages. I would lean toward only allowing floats and not allowing the 0-100.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it was implemented before it was the same resolution (it converted to int on the way out).

I personally don't care that much... a percentage is a percentage is a percentage... losing fractions of a percent granularity is meh IMO.

@elubow
Copy link
Contributor Author

elubow commented Jun 15, 2013

In order to propagate an error on NewChannel, we would need to make the clients (consumers) aware of a failure on channel creation. That is why I (initally) took the route of making a channel creation not fail on a bad sampling rate. Although I removed the create anyway scenario, before I go down the road of bubbling up a channel creation failure, are we planning on modifying every client to handle this?

@mreiferson
Copy link
Member

after reviewing the code, perhaps putting this new validation into IsValidChannelName is more correct

@mreiferson
Copy link
Member

Here are my thoughts after sitting on this for a bit.

I'm leaning towards treating this as a feature negotiated by each client as part of the IDENTIFY
handshake (like max_rdy_count and soon tls_v1, deflate, see #227).

My biggest concern about the channel name approach is that it sets a precedent I'm not fully
convinced we want.

I've considered @devdazed's point re: the potential confusion this can create in determining what %
of messages a given client will receive, but I don't agree that it isn't straightforward to
calculate
. I think it boils down to a documentation issue. We must suggest best practices for
grouping clients that enable sampling (particularly suggesting that you include the % in the channel
name, for clarity).

Also, in terms of the approach we choose, I don't believe that the burden of updating client
libraries and packaged utilities should weigh as heavily as correctness.

The reason I prefer this approach is because it is consistent with the way other client features are
negotiated. Additionally, it closely represents the way in which one would implement sampling right
now
(as a client-side discard). It simply moves the discarding into the server (for the
obvious efficiency benefit, which is all we're really asking for).

At a high level, the recommended approach would look like:

"clicks" (topic)
   |
   +---- "data_science_s25" (channel)
   |              |
   |              +---------- worker_a IDENTIFY { "sample_rate": 25 }
   |              |
   |              +---------- worker_b IDENTIFY { "sample_rate": 25 }
   |              |
   |              +---------- worker_c IDENTIFY { "sample_rate": 25 }
   |
   +---- "more_science_s70" (channel)
                  |
                  +---------- worker_a IDENTIFY { "sample_rate": 70 }
                  |
                  +---------- worker_b IDENTIFY { "sample_rate": 70 }
                  |
                  +---------- worker_c IDENTIFY { "sample_rate": 70 }

I can also imagine client library functionality that improves the developer experience (like
automatically creating helpful channel names based on sample rates).

In terms of actual changes, we would:

  • introduce sample_rate to the IDENTIFY payload (with the same semantics you've already
    implemented.)
  • move the discard code into the client's messagePump() rather than the channels.

A couple other points...

  • I honestly don't care if it has floating pt precision. I've already explained my position, feel
    free to go back to floats if you feel like that's important.
  • There's an interesting question of whether or not requeued messages should bypass the sampling
    logic. For this initial implementation I think we should treat all messages as equal and simply
    filter whatever comes down the pipe.

What are the downsides? The most obvious is that it is arguably easier to configure yourself into a
situation where you have consumers subscribed to the same channel with different sampling rates.

@devdazed
Copy link

I am on board with everything outlined. I would like to make it a point that I believe that the channel should hold meta-information about the sampling that will warn a client that attempts to connect to it with a differing sample rate. This will remove and confusion about the usage and allow a person with a use-case for it, to continue and ignore the warning.

@mreiferson
Copy link
Member

Closing in favor of #286

@mreiferson mreiferson closed this Dec 15, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants