Library for analyzing interleaved search A/B tests to determine preference between competing ranking functions

bearloga/interleaved-python

interleaved

Library for analyzing interleaved search A/B tests to determine preference between competing ranking functions

Installing

pip install --upgrade git+https://github.com/bearloga/interleaved-python.git@main

Usage

from interleaved import load_example_data

data = load_example_data(preference='a') # alternatively: 'none' or 'b'

data.head()
                  timestamp   search_id  event  position ranking_function
0 2018-08-01 00:01:31+00:00  p2tvgm3clu   serp       NaN              NaN
1 2018-08-01 00:04:09+00:00  p2tvgm3clu  click      14.0                A
2 2018-08-01 00:04:29+00:00  p2tvgm3clu  click       4.0                A
3 2018-08-01 00:06:10+00:00  p2tvgm3clu  click       1.0                A
4 2018-08-01 00:06:42+00:00  p2tvgm3clu  click       7.0                B
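
from interleaved import Experiment

# Keep only click events; 'serp' rows have no ranking_function.
clicks = data[data['event'] == 'click']

ex = Experiment(
    queries=clicks['search_id'].to_numpy(),        # search each click belongs to
    clicks=clicks['ranking_function'].to_numpy(),  # ranker credited with each click
)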
ex.bootstrap(seed=42)

print(ex.summary(ranker_labels=['New Algorithm', 'Old Algorithm'], rescale=True))
 In this interleaved search experiment, 906 searches were used to determine whether the
results from ranker 'New Algorithm' or 'Old Algorithm' were preferred by users (based on
their clicks to the results from those rankers interleaved into a single search result
set).

 The preference statistic, as defined by Chapelle et al. (2012), was estimated to be 74.3%
with a 95% (bootstrapped) confidence interval of (70.0%, 77.9%) on [-100%, 100%] scale
with -100% indicating total preference for 'Old Algorithm', 100% indicating total
preference for 'New Algorithm', and 0% indicating complete lack of preference between the
two -- indicating that the users had preference for ranker 'New Algorithm'.

Quite a strong preference for that new algorithm!
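
For readers who want to see where a number like this comes from, here is a minimal sketch of the preference statistic in the form given by Chapelle et al. (2012): the share of searches won by ranker A, with ties counted as half a win, minus 0.5. The function name, the hard-coded 'A' and 'B' labels, and the per-search win rule (more clicks wins) are assumptions made for illustration; the library's internals may differ.

```python
from collections import Counter, defaultdict


def preference_statistic(queries, clicks, rescale=False):
    """Preference for ranker 'A' over ranker 'B' (illustrative, not the library's code).

    queries: one search ID per click
    clicks:  the ranker ('A' or 'B') credited with each click
    """
    # Count clicks per ranker within each search.
    per_search = defaultdict(Counter)
    for search_id, ranker in zip(queries, clicks):
        per_search[search_id][ranker] += 1
    if not per_search:
        raise ValueError("at least one click is required")

    # A search is "won" by the ranker that received more of its clicks.
    wins_a = sum(1 for c in per_search.values() if c['A'] > c['B'])
    wins_b = sum(1 for c in per_search.values() if c['B'] > c['A'])
    ties = len(per_search) - wins_a - wins_b

    # Ties count as half a win; subtracting 0.5 centers the statistic at 0.
    stat = (wins_a + 0.5 * ties) / (wins_a + wins_b + ties) - 0.5

    # Optionally stretch [-0.5, 0.5] to [-1, 1].
    return 2 * stat if rescale else stat
```

With three searches won by A and one by B, the statistic is 3/4 - 0.5 = 0.25, or 0.5 (that is, 50%) after rescaling.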

Additional methods:

  • .distribution(rescale=False) returns the bootstrapped distribution of the preference statistic (useful for visualization)
  • .preference_statistic(rescale=False) returns the estimated preference statistic
  • .conf_int(conf_level=0.95, rescale=False) returns the confidence interval based on the bootstrapped distribution

Note: rescale=True rescales the preference statistic from the [-0.5, 0.5] scale to the [-1, 1] scale (reported as [-100%, 100%] in the summary), which may make the results easier to interpret.
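
To make the bootstrapped distribution and its confidence interval more concrete, the sketch below resamples searches with replacement and takes a percentile interval. It is an illustration only: the helper names, the choice of searches as the resampling unit, the number of resamples, and the percentile method are all assumptions, and the library may do any of these differently.

```python
import random
from collections import Counter, defaultdict


def search_outcomes(queries, clicks):
    """One outcome per search: 1.0 if 'A' got more clicks, 0.0 if 'B' did, 0.5 for a tie."""
    per_search = defaultdict(Counter)
    for search_id, ranker in zip(queries, clicks):
        per_search[search_id][ranker] += 1
    return [
        1.0 if c['A'] > c['B'] else 0.0 if c['B'] > c['A'] else 0.5
        for c in per_search.values()
    ]


def bootstrap_distribution(outcomes, n_resamples=1000, seed=None):
    """Recompute the preference statistic on resampled sets of searches."""
    if not outcomes:
        raise ValueError("at least one search is required")
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(outcomes)
    # The mean outcome minus 0.5 equals (wins_A + 0.5 * ties) / n - 0.5.
    return [sum(rng.choices(outcomes, k=n)) / n - 0.5 for _ in range(n_resamples)]


def percentile_interval(samples, conf_level=0.95):
    """Nearest-rank percentile interval from bootstrap samples."""
    ordered = sorted(samples)
    alpha = (1 - conf_level) / 2
    last = len(ordered) - 1
    return ordered[round(alpha * last)], ordered[round((1 - alpha) * last)]


if __name__ == "__main__":
    # Tiny made-up click log: 'A' wins q1 and q2, 'B' wins q3, q4 is a tie.
    queries = ['q1', 'q1', 'q2', 'q3', 'q4', 'q4']
    clicks = ['A', 'A', 'A', 'B', 'A', 'B']
    dist = bootstrap_distribution(search_outcomes(queries, clicks), seed=42)
    print(percentile_interval(dist))
```

Resampling whole searches rather than individual clicks keeps each search's clicks together, which matches the per-search definition of a win.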

References
