GitHub - amadeus-pinto/forecasting-daily-fantasy-gambler-picks-based-on-the-news: models of athlete market shares in a daily fantasy field...

#Forecasting Daily Fantasy Gamblers' Picks Based on the News

##Introduction

In October of 2015, an employee of the daily fantasy sports (DFS) gambling site DraftKings leveraged his site's user data to win $350,000 on a competitor site, FanDuel, resulting in allegations of insider trading and concern over the potential for further abuses committed by an unregulated industry.

In a zero-sum game, knowledge of opponents' positions (the field's picks) ahead of market (the contest) represents a serious advantage (or serious abuse when applied by the same people making the market). The expected value of pick I, V(pick_I), is a product of the probability P of pick I's success and its associated payout p:

V(pick_I) = sum_J P(S_J)*p(S_J;w_I).

P(S_J) is the probability of pick I's score J, S_J, and is independent of gamblers' perceptions, whereas payout p depends on pick I's score and gamblers' collective valuation of I, where w_I represents the "market share" (also "weight" or "ownership") in pick I. {w_I} is the quantity of interest here, and models projecting field ownerships ahead of market are the goal of this project.

Apart from the intrinsic neatness of explaining/predicting the decisions a collection of people make given incomplete information and perceived utility, knowledge of athlete market shares can potentially help one accurately project outcomes of entire fantasy contests themselves... Specifically, if one has a "good" estimate of the covariance matrix of athlete fantasy output, one can collect statistics from an ensemble of Monte-Carlo-sampled contests, each one represented in a set of tickets giving rise to the predicted market. A basic application of such a simulation would be ranking the tickets by probability of profitability, etc.

##The Long and Short of DFS Mechanics

"Fantasy owners" (the gamblers) submit one or multiple ticket(s) of athletes' names together satisfying a set of constraints (e.g., on the total fantasy salary of the athletes, fantasy positions, etc.) to an online site (a.k.a. the bookie - FanDuel, DraftKings, etc.) which takes a rake and pays out a subset of tickets as a function of ticket score, determined by athletes' accumulated game stats after bets are locked.

Broadly, there are two (very different) contest types: a "tournament" game pays out the top ~20%, with payout odds growing ~exponentially from 1:1 at the ~20th percentile line (a typical first place ticket's return is ~1000:1); a "cash" game pays out the top ~50% fixed 1:1 odds. Naturally, these payouts force fantasy owners to favor higher-variance players (riskier positions) in "tournaments" and lower-variance players (safer positions) in "cash" games. This causes lower-mean/higher-variance distributions of ticket scores in the former and higher-mean/lower-variance distributions in the latter. See this FanDuel tutorial for more (or less).

##Project Specifics

What causes gamblers to make the choices they make with the information they have, and how accurately can I predict the field of wagers in a given contest?

I set out to answer these questions using scraped contest records of field ownerships in past FanDuel NBA contests (thanks to @brainydfs), historic athlete performance data from sportsdatabase.com, the news leading up to the particular contest (e.g., "industry" fantasy output projections from BasketballMonster and FantasyCrunchers, and available injury/roster reports), and elements of the DFS game mechanics presumed to impact gamblers' decisions. (These and others are detailed in the Models section below.)

I trained and validated ~400 player-centered models (estimators include ridge, lasso, random forest, and gradient-boosted regressors) on over 700 "tournament" contests from the 2016 season and the first two months of the 2017 season, holding out January 2017 slates for tests.

Results

training/validating player-centered models

Below is a plot of athletes' average ownerships against their standard deviations. Colors correspond to groups coming out of a k-Means decomposition (k=5) of the matrix of player models by respective model coefficients. Clustering of player models (only implicitly included in the model parameters) indicates similar model composition among athletes in close proximity on the mean-standard deviation plane. In other words, it turns out that common factors drive (un)popular athletes' market share. High-mean, high-spread clusters (colored light blue and light green) include Russell Westbrook, James Harden, and LeBron James ownership models, among others you could probably name... Their ownership models depend on, among others, the slate_size variable similarly (steeply inversely; see the Models section below).

The predictive power of an ownership model increases with mean ownership (and also with the spread around ownership):

This means ownerships of athletes who typically dominate the market share are predicted more accurately. On the other hand, athletes with higher-variance ownerships (specifically, lower mean-to-standard deviation ratio types) are more difficult to predict.

Find your favorite player's stat line here:

name	mean	std	RMSE(train)	RMSE(val)	R2
Russell Westbrook	49.5	20.15	8.14	9.04	0.8
James Harden	45.22	23.94	11.11	10.57	0.81
Kevin Durant	42.23	23.19	9.88	10.9	0.78
LeBron James	41.97	23.53	10.66	11.83	0.75
Kawhi Leonard	41.37	25.48	11.74	10.83	0.82
Stephen Curry	40.24	22.59	9.53	10.43	0.79
Anthony Davis	37.68	22.3	11.15	11.15	0.75
Blake Griffin	37.61	25.23	11.52	10.31	0.83
Damian Lillard	37.33	24.86	7.89	9.48	0.85
Jimmy Butler	36.19	22.47	15.65	9.17	0.83
Giannis Antetokounmpo	35.35	22.95	13.83	10.12	0.81
Draymond Green	34.39	22.28	10.42	10.8	0.77
DeMar DeRozan	34.19	24.89	11.81	11.36	0.79
C.J. McCollum	33.88	23.32	10.01	9.92	0.82
Chris Paul	33.79	24.77	10.56	11.76	0.77
DeMarcus Cousins	33.33	21.54	10.21	10.17	0.78
Kristaps Porzingis	32.98	20.91	9.17	12.11	0.66
Gordon Hayward	32.92	24.52	9.06	9.13	0.86
Klay Thompson	32.69	18.24	9.45	9.69	0.72
Kyle Lowry	31.99	23.55	9.62	8.04	0.88
Carmelo Anthony	31.7	25.05	9.96	13.58	0.71
Paul George	31.32	22.45	9.1	13.68	0.63
John Wall	29.64	20.28	7.89	8.53	0.82
Julius Randle	29.01	23.68	8.55	11.71	0.76
Paul Millsap	28.97	23.04	10.44	10.42	0.8
Rajon Rondo	28.3	19.58	10.05	11.75	0.64
Harrison Barnes	28.01	18.66	7.49	8.36	0.8
Eric Bledsoe	27.92	21.65	9.02	9.66	0.8
Andrew Wiggins	27.76	20.35	8.47	9.14	0.8
Kyrie Irving	27.45	18.39	9.27	9.53	0.73
Kevin Love	27.11	17.16	8.24	9.31	0.71
Isaiah Thomas	27.01	21.75	10.5	8.21	0.86
Danilo Gallinari	26.92	21.31	12.39	9.51	0.8
Rudy Gay	26.89	21.91	11.96	10.83	0.76
Chris Bosh	26.79	21.79	8.56	8.05	0.86
LaMarcus Aldridge	26.64	22.28	10.22	11.15	0.75
Devin Booker	25.45	19.65	7.32	9.69	0.76
Thaddeus Young	24.78	18.89	11.6	9.81	0.73
Victor Oladipo	24.75	18.88	8.74	9.19	0.76
Khris Middleton	24.73	16.92	7.29	9.74	0.67
Kemba Walker	24.52	19.64	10.02	7.39	0.86
Nicolas Batum	24.17	16.75	7.77	11.08	0.56
Will Barton	24.17	19.54	8.07	10.75	0.7
Dwyane Wade	23.81	17.76	11.29	10.57	0.65
Derrick Favors	23.72	21.14	11.44	10.69	0.74
Tim Frazier	22.58	18.73	9.87	12.11	0.58
Serge Ibaka	22.37	16.21	8.03	8.53	0.72
Otto Porter	22.21	18.36	11.26	9.88	0.71
Al-Farouq Aminu	22.11	20.29	9.29	9.59	0.78
Jabari Parker	22.0	19.15	9.0	9.65	0.75
Joel Embiid	21.78	18.35	9.91	13.86	0.43
Avery Bradley	21.67	19.92	8.55	7.97	0.84
Wilson Chandler	21.64	20.28	19.6	9.57	0.78
Jae Crowder	21.57	17.99	13.26	11.12	0.62
Brandon Knight	21.41	18.4	11.47	8.9	0.77
Tobias Harris	21.17	17.2	8.67	7.82	0.79
Chandler Parsons	21.05	23.22	13.89	13.11	0.68
Bradley Beal	20.9	14.94	6.35	9.25	0.62
Hassan Whiteside	20.81	16.11	10.34	9.1	0.68
Dirk Nowitzki	20.69	18.67	10.86	11.33	0.63
Reggie Jackson	20.66	18.18	6.92	9.44	0.73
Rodney Hood	20.56	17.25	9.05	10.25	0.65
Karl-Anthony Towns	20.4	15.84	7.26	8.75	0.69
Jared Sullinger	20.31	17.33	10.18	5.99	0.88
Zach LaVine	20.27	16.72	7.92	8.07	0.77
Lou Williams	20.18	17.05	9.95	8.44	0.75
Goran Dragic	20.16	16.54	9.8	8.8	0.72
Andre Drummond	20.04	15.53	8.45	8.0	0.73
Kentavious Caldwell-Pope	19.98	17.09	6.76	8.1	0.78
T.J. Warren	19.93	21.48	9.53	9.89	0.79
Kenneth Faried	19.6	19.09	9.61	10.14	0.72
Evan Fournier	19.58	18.15	11.59	11.87	0.57
Robert Covington	19.36	16.61	8.71	9.33	0.68
Omri Casspi	19.32	16.43	10.34	9.28	0.68
Nerlens Noel	19.22	19.32	8.55	10.56	0.7
J.J. Redick	18.92	15.54	9.13	7.78	0.75
Mike Conley	18.92	17.24	10.37	9.17	0.72
Tyreke Evans	18.78	17.5	13.88	11.75	0.55
Pau Gasol	18.74	16.41	10.56	8.87	0.71
DeAndre Jordan	18.68	16.16	7.28	9.16	0.68
Trevor Ariza	18.59	15.06	7.24	7.06	0.78
Michael Carter-Williams	18.53	15.43	6.9	8.4	0.7
Marcus Morris	18.5	16.49	8.52	7.52	0.79
Jeff Teague	18.48	16.2	9.98	9.07	0.69
Derrick Rose	18.14	15.85	8.33	8.2	0.73
Kent Bazemore	17.82	16.6	7.04	8.96	0.71
Ricky Rubio	17.77	15.92	7.09	7.94	0.75
Monta Ellis	17.77	15.12	8.12	8.76	0.66
Ryan Anderson	17.68	14.97	10.55	8.81	0.65
Emmanuel Mudiay	17.48	16.29	8.28	7.25	0.8
Markieff Morris	17.44	14.68	12.02	8.15	0.69
Gorgui Dieng	17.22	16.42	8.15	9.03	0.7
Zach Randolph	17.04	15.67	7.55	8.36	0.72
Sergio Rodriguez	16.72	12.69	8.12	4.5	0.87
Taj Gibson	16.6	14.29	7.66	8.47	0.65
Shelvin Mack	16.49	16.64	8.85	8.84	0.72
Wesley Matthews	16.15	14.93	5.89	7.57	0.74
Sean Kilpatrick	16.06	18.16	17.78	7.43	0.83
Eric Gordon	15.99	13.21	4.89	6.3	0.77
Nikola Mirotic	15.95	15.03	12.18	6.92	0.79
Myles Turner	15.72	14.95	13.19	10.55	0.5
Jamal Crawford	15.64	15.25	5.2	8.85	0.66
Jordan Clarkson	15.38	14.98	7.11	8.51	0.68
Nikola Jokic	15.35	16.74	7.21	9.24	0.7

player models in hold-out

I held out contests from January 2017 for testing. Predicted versus true ownerships for each ownership observation are plotted below, with colors corresponding to athlete position, and where perfect predictions would lie on the line predicted ownership=true ownership. Compare these predictions with the null model, which guesses the player's mean ownership each observation:

bundled predictions of whole-tournament means

Player models are doing great, but player models are hardly useful in a vacuum... A useful question to ask is: How well do market share predictions taken together predict the market? A contest's average fantasy score <S> is an appropriate quantity for this purpose since it couples the {w} explicitly,

    <S> = sum_K w_K*S_K.

S_K and w_K are athlete fantasy scores and ownerships, respectively. In the limit predicted minus true {w} go to zero, all that's left to analytically determine the exact contest mean fantasy score are the {S}, which depend on NBA players alone. Below are predicted versus true contest means for unseen January 2017. Each point is a contest mean computed from the collection of independent player ownerships (actual and predicted) belonging to that contest. Perfect contest mean predictions would lie on the line predicted < contest score > = true < contest score >.

Substituting null-model ownerships in the tournament mean equation, predicted means are wildly unrealistic.

Outlook

In view of their performances in hold-out tests, player ownership models and bundled contest predictions seem to enjoy a great degree of predictive power...

Efforts to improve accuracy might focus on introducing more features, and/or linearly-combining models of player ownerships.

To the first point, given the relatively high impact of "industry" player valuations on ownership models, inclusion of daily projections from more sources, e.g., RotoGrinders, fb-ninja, and any number of others, preferably in order of decreasing volume of their subscription base (after all, the same people buying into industry projections are the same people using these numbers to gamble in contests, and the same people generating the dependent variable modeled here) should increase accuracy.

To the second point, lower-variance ownership predictions are probably obtainable via stacked regression. I trained four estimators: lasso (L1), ridge (L2), random forest, and gradient-boosted regressors. The correlation matrix of their test-set (January hold-out) residuals is plotted below. Surprisingly, lasso and ridge regressors aren't too positively correlated, which isn't the case with the ensemble models. A preliminary investigation of a solution via stacked regression might check errors of an equally-weighted composite model against individual models' errors.

##Models

Feature Definitions

Follow the links below to view distributions of regressors for LeBron James' contest observations. (Bear in mind that different variables are important for different athlete models).

"industry" valuation

proj_fc: fantasycrunchers fantasy score projection
proj_mo: basketballmonster fantasy score projection
v_fc: fantasycrunchers value (projection /salary)
v_mo: basketballmonster value (projection /salary)

vegas quantities

```line```: sportsdatabase matchup line

```total```: sportsdatabase matchup total

momentum (rolling mean - abbreviated below as "rm" - of previous Y=1-,5-game windows; computed with sportsdatabase queries) -

```
```rm.Y.score```: recent score        
```

```rm.Y.salary```: recent salary

```
```rm.Y.value```: recent value
```

```rm.Y.val_exceeds.X```: recent value has exceeded X=4,5,6

```rm.Y.opp_total_score```: recent opponent total  score

```rm.Y.opp_off_score```: recent opponent total offensive score

```rm.Y.opp_def_score```: recent opponent total defensive score

```rm.Y.team_total_score```: recent team total score

```rm.Y.team_def_score```: recent team total defensive score

```rm.Y.team_off_score```: recent team total offensive score

value over replacement (standard score within X=salary,position class) -

```z.X.proj_fc```: z-score of fantasycruncher projections within class X

```z.X.proj_mo```: z-score of basketballmonster projections within class X

```z.X.v_fc```: z-score of fantasycruncher value within class X

```z.X.v_mo```: z-score of basketballmonster value within class X

game specs/mechanics -

```total_entries```: total contest tickets

```max_user_frac```: maximum tickets per user / total contest tickets

```slate_size```: number of NBA games in slate

```log.slate_size```: log( number of NBA games in slate)

fictituous gambler portfolios (X=worldview,Y=max overlap w/previous solution,Z=number of tickets) -

This type of feature is arguably the most interesting, and without a doubt among the most predictive. It is constructed as follows: For each contest, initialize a set of "fictitious gamblers", each with a specified "worldview" (tuning bias), some measure of "risk tolerance" (tuning variance), and number of bets. For each fictitious gambler, solve the integer programming problem of constructing a number of unique bets, each maximizing projected fantasy points (enumerated by worldview) subject to feasibility constraints (FanDuel's salary cap, position requirements, etc.) and risk tolerance constraints (maximum number of athelete overlaps with previous integer programming solution in fictitous gambler's portfolio). For each athlete in the slate, construct a feature set of fictitious holdings, each entry of which is the proportion of the fictitious portfolio in that athlete.

```gpp.fict.proj_X.Y.Z```: X=(fantasycrunchers,basketballmonster,their average),Y=(2,4,6); Z=25

Feature Importance

To better understand which features are important for ownership models, I computed the percentage of time a feature is in the top 5 over all player models. ('top 5' is defined for ensemble models as the 5 features with the highest gini importance, and for linear models, as the 5 features with the highest absolute-magnitude coefficients). They're tabulated for each estimator below.

rfr feature	%time_in_top_5
gpp.fict.proj_mu.04.025.006	50.9
gpp.fict.proj_mo.04.025.006	41.0
gpp.fict.proj_fc.04.025.006	33.2
gpp.fict.proj_mu.06.025.006	29.1
z.proj_mo	23.0
z.proj_fc	22.3
slate_size	22.0
z.salary	21.5
log.slate_size	21.5
gpp.fict.proj_fc.06.025.006	20.8
gpp.fict.proj_mo.06.025.006	20.5
gpp.fict.proj_mu.08.025.006	18.5
z.sbin.proj_mo	17.0
max_user_frac	16.5
z.sbin.proj_fc	15.2
gpp.fict.proj_mo.08.025.006	13.4
gpp.fict.proj_fc.08.025.006	11.9
rm.01.score	9.1
rm.05.value	8.4

gbr feature	%time_in_top_5
z.salary	52.2
z.proj_fc	32.2
z.sbin.proj_mo	30.4
z.proj_mo	29.6
max_user_frac	27.3
z.v_fc	24.1
total_entries	22.5
z.v_mo	21.3
z.sbin.proj_fc	20.8
z.sbin.v_mo	20.0
z.sbin.v_fc	20.0
slate_size	15.7
log.slate_size	15.2
gpp.fict.proj_mu.04.025.006	13.9
gpp.fict.proj_mo.04.025.006	13.7
gpp.fict.proj_fc.04.025.006	13.4
rm.01.value	9.4
rm.01.score	8.9
rm.05.value	7.8

lasso feature	%time_in_top_5
log.slate_size	55.4
z.salary	43.5
rm.05.val_exceeds.06	22.8
rm.05.val_exceeds.05	20.5
z.proj_mo	18.7
status	18.2
rm.01.val_exceeds.06	16.5
rm.05.val_exceeds.04	15.4
rm.01.val_exceeds.05	15.4
rm.05.salary	14.9
z.proj_fc	13.7
v_mo	13.2
rm.01.score	12.7
v_fc	12.2
rm.01.val_exceeds.04	11.4
proj_mo	10.9
gpp.fict.proj_mo.04.025.006	10.6
rm.01.salary	10.6
gpp.fict.proj_fc.04.025.006	8.6

ridge feature	%time_in_top_5
z.salary	48.6
rm.05.val_exceeds.06	45.3
log.slate_size	44.1
rm.05.val_exceeds.05	41.3
rm.05.val_exceeds.04	33.2
max_user_frac	27.8
rm.05.salary	22.3
rm.01.salary	18.7
status	17.7
z.proj_mo	15.9
rm.01.val_exceeds.06	15.4
rm.01.score	10.6
gpp.fict.proj_mo.04.025.006	9.9
rm.01.val_exceeds.05	9.6
z.proj_fc	9.6
gpp.fict.proj_mo.08.025.006	9.4
gpp.fict.proj_fc.04.025.006	9.1
gpp.fict.proj_mo.06.025.006	8.9
gpp.fict.proj_fc.08.025.006	6.3

Name		Name	Last commit message	Last commit date
Latest commit History 163 Commits
ANALYSIS		ANALYSIS
DATA/parsed_raw		DATA/parsed_raw
PYTHON		PYTHON
docs		docs
dou_MYMODELS/test_SLATE_PREDS		dou_MYMODELS/test_SLATE_PREDS
gpp_MYMODELS/test_SLATE_PREDS		gpp_MYMODELS/test_SLATE_PREDS
raw		raw
.gitignore		.gitignore
README.md		README.md
repo_anatomy.txt		repo_anatomy.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Results

Outlook

Feature Definitions

Feature Importance

About

Releases

Packages

Languages

amadeus-pinto/forecasting-daily-fantasy-gambler-picks-based-on-the-news

Folders and files

Latest commit

History

Repository files navigation

Results

Outlook

Feature Definitions

Feature Importance

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages