
add new algorithm for systematic sampling #34

Open · wants to merge 4 commits into master
Conversation

@Rowan441 Rowan441 commented Apr 26, 2022

Kyle has requested a new algorithm that can reduce the difference in participant counts between each alternative.

From the original ticket: (https://github.com/clio/themis/issues/88629)

The current randomization mechanism during cohorting is a simple coin flip...

There is one major downside to doing it this way:

  • We end up with skewed group sizes. If we need 500 samples each in control or alternative, we often end up with 550 in one group and 500 in the other by the time the experiment completes.
  • That leads to higher variance in many of the experiment metrics, and makes the data harder to analyze overall.

Kyle’s suggestion is to use “segments” rather than coin flips. In our implementation, the Split framework could store a “segment” array in Redis for each experiment. If the segment size was 10, then the array might be:
[0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
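A minimal sketch (plain Ruby, names hypothetical, not part of this PR) of how such a stored segment could drive cohorting: each new participant takes the next slot in the array, so every full pass through the segment leaves the groups exactly balanced.

```ruby
# Hypothetical sketch: 0 = control, 1 = alternative. The stored segment
# is consumed by cycling through it with the running participant count.
SEGMENT = [0, 1, 1, 1, 0, 0, 1, 0, 0, 1]

def alternative_for(participant_count)
  SEGMENT[participant_count % SEGMENT.length].zero? ? 'control' : 'alternative'
end

# The first 10 participants land exactly 5 in each group:
(0...10).map { |i| alternative_for(i) }.tally
# => {"control"=>5, "alternative"=>5}
```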

See more information about the feature request here: https://github.com/clio/themis/issues/88629

Test Plan

  • Download Robin's Split testing tool
  • Download and checkout this branch of the split gem
  • Set split-rails-test gemfile to point to the downloaded path of your local split gem (ex: gem 'split', path: "../../clio/split", require: 'split/dashboard')
  • Go to the experiments.yml and add the following experiments:
exp999:
  alternatives:
    - control
    - alternativeone
    - alternativetwo
  resettable: true
  algorithm: Split::Algorithms::SystematicSampling
  cohorting_block_seed: 1234
  cohorting_block_magnitude: 2
exp888:
  alternatives:
    - control
    - alternative
  resettable: true
  algorithm: Split::Algorithms::SystematicSampling
  • Run bundle install
  • Run rails s
  • Go to http://localhost:3000/split
  • Add exp999 to the board from the dropdown and start it
  • Navigate to the following urls to cohort users:
    • http://localhost:3000/test/index?user=1&test=true&experiment=exp999
    • Cohorts user 1 into 'alternativetwo'
    • http://localhost:3000/test/index?user=2&test=true&experiment=exp999
    • Cohorts user 2 into 'alternativeone'
    • http://localhost:3000/test/index?user=3&test=true&experiment=exp999
    • Cohorts user 3 into 'alternativetwo'
    • http://localhost:3000/test/index?user=4&test=true&experiment=exp999
    • Cohorts user 4 into 'control'
    • http://localhost:3000/test/index?user=5&test=true&experiment=exp999
    • Cohorts user 5 into 'alternativeone'
    • http://localhost:3000/test/index?user=6&test=true&experiment=exp999
    • Cohorts user 6 into 'control'
  • Add exp888 to the board from the dropdown and start it
  • Cohort 2 users with the following urls:
    • http://localhost:3000/test/index?user=10&test=true&experiment=exp888
    • http://localhost:3000/test/index?user=11&test=true&experiment=exp888
  • Refresh http://localhost:3000/split/ and verify that there is one user in 'control' and one user in 'alternative'
  • Again, cohort 2 users with the following urls:
    • http://localhost:3000/test/index?user=12&test=true&experiment=exp888
    • http://localhost:3000/test/index?user=13&test=true&experiment=exp888
  • Refresh http://localhost:3000/split/ and verify that there are two users in 'control' and two in 'alternative'
  • Cohort one more user in exp888: http://localhost:3000/test/index?user=14&test=true&experiment=exp888
  • Press "delete experiment data" on the split dashboard for exp888
  • Again, cohort 2 users with the following urls:
    • http://localhost:3000/test/index?user=100&test=true&experiment=exp888
    • http://localhost:3000/test/index?user=200&test=true&experiment=exp888
  • Refresh http://localhost:3000/split/ and verify that there is now one user in 'control' and one in 'alternative'

Tips

If Ruby 3.0.1 is not installed, run:

brew update && brew upgrade ruby-build
rbenv install 3.0.1

If redis is not installed, run:

brew install redis
brew tap homebrew/services
brew services start redis

@@ -17,6 +18,7 @@ class Experiment
DEFAULT_OPTIONS = {
:resettable => true,
:retain_user_alternatives_after_reset => false,
:cohorting_block => ["control", "alternative"]
Rowan441 (Author):

I put a default value of ["control", "alternative"] here; we can set this to whatever we want. The problem currently is that if someone is using alternatives other than 'control' and 'alternative', the default won't work and they will have to override it.


Hmm not sure how I feel about storing specific metadata for an algorithm at the experiment level without a more generic way of handling/storing it.

@Rowan441 Rowan441 marked this pull request as ready for review April 26, 2022 20:28

@robin-phung robin-phung left a comment


General question and comment.

Additionally, if we are storing a single block or segment and using it to determine the cohort for each upcoming chunk based on participant size, wouldn't that make the algorithm deterministic as opposed to random? Without being given the block, I would technically be able to determine the "next" cohorted account/user after enough tries, as a pattern would clearly have emerged. With a block size of 2, we are essentially toggling back and forth between the control and alternative.

I would imagine a better approach would be to configure a block size and allow Split to keep track of the "current" block, then proceed to generate new ones as blocks are "filled".

Just playing around with random ideas: I imagine we can achieve the same effect with the existing Split block randomization by creating multiple alternatives, e.g. 5 control and 5 alternative, and toggling the experience based on whether the user is in one of the alternative groups. That said, this would look pretty ugly in our dashboards and analysis, as it would likely mean we'd have to aggregate the data.


def self.choose_alternative(experiment)
  block = experiment.cohorting_block
  raise ArgumentError, "Experiment configuration is missing cohorting_block array" unless block
  index = experiment.participant_count % block.length


[Q] Are there any concerns/considerations that need to be taken here with experiment.participant_count and concurrency/race conditions, say when 2 accounts are being cohorted at the same time and both retrieve the same participant_count?

Rowan441 (Author):

Yeah, that's definitely a concern; the participant count is incremented in choose!() after this algorithm returns a value. I think the solution would be to store our own counter field in Redis; then, at the beginning of the algorithm, we can call INCRBY on that value and use its return value as the index.
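A sketch of that counter idea (key name and method name hypothetical, not part of this PR; `redis` stands for any client responding to `#incr`, such as the redis gem's client):

```ruby
# Hypothetical sketch: Redis's INCR is atomic, so two concurrent
# cohorting requests can never observe the same counter value.
# `redis` is any object responding to #incr with INCR semantics
# (e.g. Redis.new from the redis gem in production).
def next_cohort_index(redis, experiment_name)
  # INCR returns the post-increment value; subtract 1 for a zero-based index.
  redis.incr("#{experiment_name}:cohort_counter") - 1
end
```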

@Rowan441 Rowan441 commented Apr 27, 2022

Reply to @robin-phung 's comment

General question and comment.

Additionally, if we are storing a single block or segment and using it to determine the cohort for each upcoming chunk based on participant size, wouldn't that make the algorithm deterministic as opposed to random? Without being given the block I would technically be able to determine the "next" cohorted account/user after so many tries as a pattern clearly would have emerged. When you have a block size of 2 we are essentially toggling back and forth between the control and alternative.

I think that's a valid concern. The idea was that if the block size is large enough, then in a situation where, for example, every other account was being cohorted, the block size would counteract that and keep results even between cohorts.

But I don't know if that's a great solution, because a block of:
[control, alt, alt, alt, control, control, control, alt, control, alt, control, alt] would still give a 5:1 ratio of alternative participants if every other account were cohorted.

I would imagine a better approach would be to configure a block size and allow split to keep track of the "current" block and then proceed to generate new ones as blocks are "filled".

Just playing around with random ideas I imagine we can achieve the same effect with the existing Split block randomization by creating multiple alternatives ie: 5 control and 5 alternative for example and toggling the experience based off whether its in one of the alternative groups. With that said this would look pretty ugly in our dashboards and analysis as this would likely mean we'll have to aggregate the data.

That could work. In that case we would need some sort of mutex lock when one of the accounts is generating a new block, OR we could seed the random generation of new blocks 🤔

We could also make an algorithm basically the same as block_randomization, but instead randomly pick an alternative from the subset of the alternatives that have participant count <= smallest_alternative_participant_count + 3.
We would need no extra Redis fields, and race conditions could probably be ignored because the algorithm would self-correct any outlier alternatives as more users are cohorted.
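That self-correcting idea could look roughly like this (all names and the imbalance threshold are hypothetical, not part of this PR):

```ruby
# Hypothetical sketch: pick randomly among the alternatives whose
# participant count is within MAX_IMBALANCE of the smallest count,
# so any outlier alternative self-corrects as more users are cohorted.
MAX_IMBALANCE = 3

def choose_balanced(counts)
  min = counts.values.min
  eligible = counts.select { |_alt, c| c <= min + MAX_IMBALANCE }
  eligible.keys.sample
end

# 'alternative' is 4 over the minimum, so only 'control' is eligible:
choose_balanced('control' => 10, 'alternative' => 14)
# => "control"
```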

module Algorithms
  module SystematicSampling
    def self.choose_alternative(experiment)
      count = experiment.next_cohorting_block_index
Rowan441 (Author):

This is the Redis call that increments a counter tracking the number of users that have been run through this algorithm so far.

It returns an integer from Redis that is the next value of the counter; because this value is supplied by Redis, there shouldn't be any race conditions where multiple users get the same count variable.


next_cohorting_block_index is more like a cohort count? The word index leads to a misunderstanding: it suggests the index within a block, which shouldn't be larger than the size of the block.

@@ -17,6 +19,7 @@ class Experiment
DEFAULT_OPTIONS = {
:resettable => true,
:retain_user_alternatives_after_reset => false,
:cohorting_block_magnitude => 1
Rowan441 (Author):

Instead of block length, I chose to use cohorting_block_magnitude, which is the number of times each alternative repeats in a single block.

e.g. cohorting_block_magnitude = 3 with two alternatives would produce blocks of length 6 like: [control, alt, control, control, alt, alt]

I did this because, since there are two (or more) alternatives, some block lengths would be invalid (e.g. 5), as the blocks have to be balanced between alternatives.

@@ -43,6 +46,11 @@ def set_alternatives_and_options(options)
self.metadata = options_with_defaults[:metadata]
self.friendly_name = options_with_defaults[:friendly_name] || @name
self.retain_user_alternatives_after_reset = options_with_defaults[:retain_user_alternatives_after_reset]

if self.algorithm == Split::Algorithms::SystematicSampling
self.cohorting_block_seed = options_with_defaults[:cohorting_block_seed] || self.name.to_i(36)
@Rowan441 Rowan441 commented Apr 28, 2022

Seed is used so any instance of Rails can generate the blocks and they are all the same

Default seed if none is specified in YAML is based on the name of the experiment: self.name.to_i(36)

I think something like:

Digest::MD5.hexdigest('exp893_solo').to_i(16)
=> 214402640715795185602270768195163573728

would be better, but I wasn't sure if importing the Digest module would be overkill.


Just want to point out, to_i doesn't support special characters, and it may also end up with a really long integer. How about something like https://apidock.com/ruby/v2_5_5/String/sum ?
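To make the trade-off concrete in plain Ruby (both are core `String` methods; the example name is just an illustration):

```ruby
# String#sum adds the byte values of the whole string, giving a small,
# deterministic integer seed for any experiment name:
'link_highlight'.sum   # => 1477

# By contrast, treating the name as a base-36 number can produce a
# very large integer even for short names:
'exp999'.to_i(36)      # => 903132189
```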

Rowan441 (Author):

That is perfect for our use case, thanks!

Comment on lines 8 to 13
block_length = experiment.cohorting_block_magnitude * experiment.alternatives.length
block_num, index = count.divmod block_length

r = Random.new(block_num + experiment.cohorting_block_seed)
block = (experiment.alternatives*experiment.cohorting_block_magnitude).shuffle(random: r)
block[index]
@Rowan441 Rowan441 commented Apr 28, 2022

What we do here to create any block with an equal number of each alternative:

take the list of alternatives:

['red', 'blue', 'green']

repeat it cohorting_block_magnitude times:

['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue', 'green']

shuffle it (seeded by cohorting_block_seed and the current block #)

['green', 'red', 'blue', 'red', 'blue', 'green', 'red', 'green', 'blue']
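The block generation described above can be sketched as a standalone snippet (method name hypothetical): because each Rails instance seeds Random with the block number plus the experiment's seed, they all derive identical blocks.

```ruby
# Sketch of the seeded block generation: repeat the alternatives
# `magnitude` times, then shuffle with a Random seeded by the block
# number plus the experiment's cohorting_block_seed.
def block_for(alternatives, magnitude, seed, block_num)
  r = Random.new(block_num + seed)
  (alternatives * magnitude).shuffle(random: r)
end

alts = ['red', 'blue', 'green']

# Any instance computing the same (seed, block number) pair
# generates an identical block:
block_for(alts, 3, 1234, 0) == block_for(alts, 3, 1234, 0)  # => true

# And every block stays balanced: each alternative appears 3 times.
block_for(alts, 3, 1234, 7).tally.values  # => [3, 3, 3]
```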


@christophercpang christophercpang left a comment


Looks good to me!

results[Split::Algorithms::SystematicSampling.choose_alternative(experiment).name] += 1
end

expect(experiment.cohorting_block_magnitude * experiment.alternatives.length).to eq(6)


Are you trying to make sure choose_alternative won't change the value of cohorting_block_magnitude? I didn't quite get what you want to verify here...

end

let(:seeded_experiment2) do
Split::Experiment.new('link_highlight',


A new line for exp_name would make it easier to compare with the one above.


@ericmchan ericmchan left a comment


@subhashree-clio

QA Passed 🚀
