---
title: Alerts Over Time
description: Does Google Alerts return fewer results each year? A statistical investigation
created: 1 July 2013
tags: statistics
status: finished
confidence: likely
importance: 4
...
> Has Google Alerts been sending fewer results the past few years? Yes. Responding to rumors of its demise, I investigate the number of results in my personal Google Alerts notifications 2007-2013, and find no overall trend of decline until I look at a transition in [mid-2011](#it-was-mid-2011) where the results fall dramatically. I speculate about [the cause](#panda) and [implications](#conclusion) for Alerts's future.
While researching my [Google shutdowns]() essay on how long Google products survive before being killed (inspired by Reader), I came across speculation that [Google Alerts](!Wikipedia), a service which runs search queries on your behalf & emails you about any new matching webpages (extremely useful for keeping abreast of topics and one of the oldest Google services), [had](http://searchengineland.com/google-alerts-arent-working-148642 "Dear Google Alerts: Why Aren't You Working?") [broken](http://searchengineland.com/google-alerts-still-broken-152444 "Google Alerts: Still Broken") [badly](http://thefinancialbrand.com/28346/google-alerts-broken/ "An Open Letter to Google: Google Alerts Broken, Now Useless To Financial Marketers") in 2012. When I saw this, I remembered thinking that my own alerts did not seem to be as useful as they once were, but I hadn't been sure if this was Alerts's fault or if my particular keywords were just less active than in the past. Google's official comments on the topic have been minimal[^official-reply].
[^official-reply]: ["What's Wrong With Google Alerts? The small but useful service seems to be dying. One researcher uses empirical research to answer the questions that Google won't."](http://www.buzzfeed.com/justinesharrock/whats-wrong-with-google-alerts), _BuzzFeed_:
> Google has refused to shed light on the decline. Today, a Google spokesperson told BuzzFeed, "we're always working to improve our products - we'll continue making updates to Google Alerts to make it more useful for people." In other words, a polite non-answer.
Alerts dying would be a problem for me as I have used Alerts extensively since 28 January 2007 (2347 days) with 23 current Alerts (and many more in the past) - of my 501,662 total emails, 3,815 were Alert emails - and there did not seem to be any usable alternatives^[I have seen some alternative services, [Yahoo! Search Alerts](http://www.ghacks.net/2013/06/29/yahoo-search-alerts-a-google-alerts-alternative/), [talkwalker](http://talkwalker.com/en) & [Mention](https://en.mention.net/) suggested, but have not used them; the latter 2 do well in [a comparison with Google Alerts](http://moz.com/ugc/google-alerts-vs-mention-vs-talkwalker "Google Alerts VS Mention VS Talkwalker").]. Troublingly, Alerts's RSS feeds were unavailable between [July](http://googlesystem.blogspot.com/2013/07/google-alerts-drops-rss-feeds.html "Google Alerts Drops RSS Feeds") & [September 2013](http://thenextweb.com/google/2013/09/11/google-alerts-regains-rss-delivery-option-it-lost-after-google-readers-demise/ "Google Alerts regains RSS delivery option it lost after Google Reader's demise").
As it happened, the survival model suggested that Alerts had a good chance of surviving a long time, and I put it out of mind until I remembered that since I had used Alerts for so many years and had so many emails, I could easily check the claims empirically - *did* Alerts abruptly stop returning many hits? This is a straightforward question to answer: extract the subject/date/number of links from each Alerts email, stratify by unique alert, and regress over time. So I did.
# Data
As part of my backup procedures, I fetch daily my Gmail emails using [`getmail4`](http://pyropus.ca/software/getmail/) into a [maildir](!Wikipedia). Alerts uses an unchanging subject line like `Subject: Google Alert - "Frank Herbert" -mason`, so it is easy to find all its emails and separate them out.
~~~{.Bash}
$ find ~/mail/ -type f -exec fgrep -l 'Google Alert -' {} \;
/home/gwern/mail/new/1282125775.M532208P12203Q683Rb91205f53b0fec0d.craft
/home/gwern/mail/new/1282125789.M55800P12266Q737Rd98db4aa1e58e9ed.craft
...
$ find ~/mail/ -type f -exec fgrep -l 'Google Alert -' {} \; > alerts.txt
$ mkdir 2013-09-25-gwern-googlealertsemails/
$ mv `cat alerts.txt` 2013-09-25-gwern-googlealertsemails/
~~~
<!-- find 2013-09-25-gwern-googlealertsemails/ -type f -exec grep -l aaktgeb {} \; | xargs rm -->
I deleted emails from a few alerts which were private; the remaining 72M of emails are available at [`2013-09-25-gwern-googlealertsemails.tar.xz`](/docs/2013-09-25-gwern-googlealertsemails.tar.xz). Then a loop & ad hoc shell-scripting extracts the subject-line, the date, and how many instances of "http://" there are in each email:
~~~{.Bash}
cd 2013-09-25-gwern-googlealertsemails/
echo "Search,Date,Links" >> alerts.csv # set up the header
for EMAIL in *.craft *.elan; do
    # strip the constant prefix, leaving just the search query
    SUBJECT="`egrep '^Subject: Google Alert - ' $EMAIL | sed -e 's/Subject: Google Alert - //'`"
    # fields 3-5 of the Date header give day/month/year
    DATE="`egrep '^Date: ' $EMAIL | cut -d ' ' -f 3-5 | sed -e 's/<b>..*/ /'`"
    # crude proxy for the result count: one per 'http://' occurrence
    COUNT="`fgrep --no-filename --count 'http://' $EMAIL`"
    echo $SUBJECT,$DATE,$COUNT >> alerts.csv
done
~~~
The scripting isn't perfect and I had to delete several spurious lines before I could read it into R and format it into a clean CSV:
~~~{.R}
alerts <- read.csv("alerts.csv", quote=c(), colClasses=c("character","character","integer"))
alerts$Date <- as.Date(alerts$Date, format="%d %b %Y")
write.csv(alerts, file="2013-09-25-gwern-googlealerts.csv", row.names=FALSE)
~~~
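Those deletions were done by hand; here is a minimal sketch of how the spurious lines could be flagged automatically (assuming no search term itself contains a comma):

~~~{.R}
# rows in a clean 3-column CSV should contain exactly 2 separators
raw <- readLines("alerts.csv")
commas <- vapply(gregexpr(",", raw, fixed=TRUE),
                 function(m) sum(m > 0), integer(1))
raw[commas != 2] # the spurious lines to inspect & delete
~~~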
# Analysis
## Descriptive
~~~{.R}
alerts <- read.csv("https://www.gwern.net/docs/2013-09-25-gwern-googlealerts.csv",
colClasses=c("factor","Date","integer"))
summary(alerts)
Search Date Links
wikipedia : 255 Min. :2007-01-28 Min. : 0.0
Neon Genesis Evangelion: 247 1st Qu.:2008-12-29 1st Qu.: 10.0
"Gene Wolfe" : 246 Median :2011-02-07 Median : 22.0
"Nick Bostrom" : 224 Mean :2010-10-06 Mean : 37.9
modafinil : 186 3rd Qu.:2012-06-15 3rd Qu.: 44.0
"Frank Herbert" -mason : 184 Max. :2013-09-25 Max. :563.0
(Other) :2585
# So many because I have deleted many topics I am no longer interested in,
# and refined the search criteria of others
length(unique(alerts$Search))
[1] 68
plot(Links ~ Date, data=alerts)
~~~
![Links in each email, graphed over time](/images/google/alerts/linksperemail.png)
The first thing I notice is that it looks like the number of links per email is going up over time, with a spike in mid-2010. The second is that there's quite a bit of variation from email to email - while most are around 0, some are as high as 300. The third is that there's a weird early anomaly where emails are recorded as having 0 links; looking at those emails, they are [base64](!Wikipedia) encoded, for no apparent reason, and then all subsequent emails are in more sensible HTML/text formats. An ill-fated experiment by Google? I have no idea. The highest number is 563, which isn't very big; so despite the skew, I didn't bother to log-transform `Links`.
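Both oddities are quick to inspect (assuming the `alerts` data frame loaded above):

~~~{.R}
# the anomalous zero-link emails: how many, and when?
zeros <- alerts[alerts$Links == 0,]
nrow(zeros); range(zeros$Date)
# the skew which argued against a log-transform
hist(alerts$Links, breaks=50, main="Links per email", xlab="Links")
~~~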
## Linear model
The spikiness prompts me to adjust for variation in sending rate and whatnot by bucketing emails into months, and overlay a linear regression:
~~~{.R}
library(lubridate)
alerts$Date <- floor_date(alerts$Date, "month")
alerts <- aggregate(Links ~ Search + Date, alerts, "sum")
# a simple linear model agrees with *small* monthly increase, but notes that there is a tremendous
# amount of unexplained variation
lm <- lm(Links ~ Date, data=alerts); summary(lm)
...
Residuals:
Min 1Q Median 3Q Max
-175.8 -110.0 -60.5 43.3 992.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.13e+02 1.10e+02 -3.76 0.00018
Date 3.73e-02 7.38e-03 5.06 4.9e-07
Residual standard error: 175 on 1046 degrees of freedom
Multiple R-squared: 0.0239, Adjusted R-squared: 0.023
F-statistic: 25.6 on 1 and 1046 DF, p-value: 4.89e-07
plot(Links ~ Date, data=alerts)
abline(lm)
~~~
![Total links in each search, by month](/images/google/alerts/linkspermonth.png)
No big difference with the original plot: still a generic increasing trend. Here is a basic problem with this regression: does this increase reflect an increase in the *number of alerts* I am subscribed to, tweaks to each alert to make each return more hits (shifting from old alerts to *new* alerts), or an increase in links per *unique* alert? It is only the last claim we are interested in, but any of these or other phenomena could produce an increase.
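The first confound, at least, is easy to eyeball; a quick sketch counting how many alerts are active in each month:

~~~{.R}
# if the count of active alerts rises over time, the aggregate trend
# may reflect subscriptions rather than links per unique alert
active <- aggregate(Search ~ Date, alerts, function(s) length(unique(s)))
plot(Search ~ Date, data=active, ylab="Active alerts per month")
~~~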
### Per alert
We could try treating each alert separately, running a linear regression on each, and comparing with the linear model fit to all the data indiscriminately:
~~~{.R}
library(ggplot2)
qplot(Date, Links, color=Search, data=alerts) +
stat_smooth(method="lm", se=FALSE, fullrange=TRUE, size=0.2) +
geom_abline(aes(intercept=lm$coefficients[1], slope=lm$coefficients[2], color=c()), size=1) +
ylim(0,1130) +
theme(legend.position = "none")
~~~
![Splitting data by alert, regressing individually](/images/google/alerts/monthlylinks-individuallinearregression.png)
The result is chaotic. Individual alerts are pointing every which way. Regressing on every alert together confounds issues, and regressing on individual alerts produces no agreement. We want some intermediate approach which respects that alerts have different behavior, but yields a meaningful overall statement.
## Multi-level model
What we want is to look at each unique alert, estimate its increase/decrease over time, and perhaps summarize all the slopes into a single grand slope. There is a hierarchical structure to the data: the overall Google-wide trend influences the slope of each alert, which in turn influences the distribution of the data points around that slope.
We can do this with a [multi-level model](!Wikipedia), using [`lme4`](http://cran.r-project.org/web/packages/lme4/index.html).
We'll start by fitting & comparing 2 models:
1. only the intercept varies between each alert, but all alerts increase or decrease at the same rate
2. the intercept varies between each alert, and also alerts differ in their waxing or waning
~~~{.R}
library(lme4)
mlm1 <- lmer(Links ~ Date + (1|Search), alerts); mlm1
Random effects:
Groups Name Variance Std.Dev.
Search (Intercept) 28943 170
Residual 12395 111
Number of obs: 1048, groups: Search, 68
Fixed effects:
Estimate Std. Error t value
(Intercept) 427.93512 139.76994 3.06
Date -0.01984 0.00928 -2.14
Correlation of Fixed Effects:
(Intr)
Date -0.988
mlm2 <- lmer(Links ~ Date + (1+Date|Search), alerts); mlm2
Random effects:
Groups Name Variance Std.Dev. Corr
Search (Intercept) 6.40e+06 2529.718
Date 2.78e-02 0.167 -0.998
Residual 8.36e+03 91.446
Number of obs: 1048, groups: Search, 68
Fixed effects:
Estimate Std. Error t value
(Intercept) 295.5469 420.0588 0.70
Date -0.0090 0.0278 -0.32
Correlation of Fixed Effects:
(Intr)
Date -0.998
# compare the models: does model 2 buy us anything?
anova(mlm1, mlm2)
mlm1: Links ~ Date + (1 | Search)
mlm2: Links ~ Date + (1 + Date | Search)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mlm1 4 13070 13089 -6531 13062
mlm2 6 12771 12801 -6379 12759 303 2 <2e-16
~~~
Model 2 is better on both simplicity/fit criteria, so we'll look at it more closely:
~~~{.R}
coef(mlm2)
$Search
(Intercept) Date
763.48 -0.046763
adult iodine supplementation (IQ OR intelligence OR cognitive) 718.80 -0.043157
AMD pacifica virtualization -836.63 0.062123
(anime OR manga) (half-Japanese OR hafu OR half-American) 956.52 -0.059438
caloric restriction -2023.10 0.153667
"Death Note" (script OR live-action OR Parlapanides) 866.63 -0.051314
"dual n-back" 4212.59 -0.266879
dual n-back 1213.85 -0.073265
electric sheep screensaver -745.78 0.055937
"Frank Herbert" -93.28 0.013636
"Frank Herbert" -mason 10815.19 -0.676188
freenet project -1154.14 0.087199
"Gene Wolfe" 496.01 -0.026575
Gene Wolfe 1178.36 -0.072681
...
wikileaks -3080.74 0.227583
WikiLeaks -388.34 0.031441
wikipedia -1668.94 0.133976
Xen 390.01 -0.017209
~~~
`Date` is stored by R as a count of days, so the slope is in links per day: when Xen has a slope of -0.017, that means it loses roughly 6 links every year.
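(A one-line check of that conversion, assuming `mlm2` from above:)

~~~{.R}
# per-day slope x days per year = links gained/lost per year
coef(mlm2)$Search["Xen", "Date"] * 365.25
~~~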
~~~{.R}
max(abs(coef(mlm2)$Search$Date))
[1] 0.6762
~~~
Which comes from the `"Frank Herbert" -mason` search, probably reflecting how relatively new the search is or the effectiveness of the filter I added to the original `"Frank Herbert"` search. In general, the slopes are very similar, there seem to be as many positive slopes as there are negative, and the overall summary slope is a tiny negative slope in the second model (-0.01); but most of the searches' slopes exclude zero in the caterpillar plot:
![`qqmath(ranef(mlm2, postVar=TRUE))`](/images/google/alerts/mlm2-slopes.png)
This says to me that there is no large change over time happening within each alert, as the original claims went, but there does seem to be something going on. When we plot the overall regression and the per-alert regressions, we see:
~~~{.R}
fixParam <- fixef(mlm2)
ranParam <- ranef(mlm2)$Search
params <- cbind(ranParam[1]+fixParam[1], ranParam[2]+fixParam[2])
p <- qplot(Date, Links, color=Search, data=alerts)
p +
geom_abline(aes(intercept=`(Intercept)`, slope=Date, color=rownames(params)), data=params, size=0.2) +
geom_abline(aes(intercept=fixef(mlm2)[1], slope=fixef(mlm2)[2], color=c()), size=1) +
ylim(0,1130) +
theme(legend.position = "none")
~~~
![Multi-level regression, grand and individual fits](/images/google/alerts/monthlylinks-individualmlm.png)
This clearly makes more sense than regressing each alert separately, as we avoid crazily steep slopes when there are just a few emails to use and their regressions get shrunk to the overall regression. We also see no evidence for any large or statistically-significant change over time for alerts in general: some alerts do increase over time but some alerts also decrease over time, and there is only a small decrease which we might blame on internal Google problems.
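The shrinkage can be made concrete by comparing one alert's standalone OLS slope with its partially-pooled slope (a minimal sketch, using `Xen` as an arbitrary example):

~~~{.R}
# sparse alerts get pulled hardest toward the grand slope
ols <- lm(Links ~ Date, data=subset(alerts, Search == "Xen"))
coef(ols)["Date"]                 # unpooled estimate
coef(mlm2)$Search["Xen", "Date"]  # partially-pooled estimate
~~~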
<!--
### Robustness
How reliable *is* the slight negative overall correlation that we saw? We can do a posterior check of how much variation there is in the inferred model by randomly generating data from that model and seeing how often the overall `Date` fixed effect varies from positive to negative:
~~~{.R}
library(arm)
mlm2.sim <- sim(mlm2, n.sims = 100000)
fixef.mlm2.sim <- fixef(mlm2.sim)
quantile(fixef.mlm2.sim[,2], probs = c(0, 0.025, 0.975, 1))
0% 2.5% 97.5% 100%
-0.12879 -0.04986 0.06343 0.12596
hist(fixef.mlm2.sim[,2], main="Change in hits per month, 100k simulations", xlab="Coefficient")
~~~
![Distribution of slopes estimated from 100k simulation runs](/images/google/alerts/mlm2-simulation.png)
-->
## What about the fall?
Having done all this, I thought I was finished until I remembered that the original bloggers didn't complain about a *steady* deterioration over time, but an abrupt one starting somewhere in 2012. What happens when I do a binary split and compare 2010/2011 to 2012/2013?
~~~{.R}
alertsRecent <- alerts[year(alerts$Date)>=2010,]
alertsRecent$Recent <- year(alertsRecent$Date) >= 2012
wilcox.test(Links ~ Recent, conf.int=TRUE, data=alertsRecent)
Wilcoxon rank sum test with continuity correction
data: Links by Recent
W = 71113, p-value = 6.999e-10
alternative hypothesis: true location shift is not equal to 0
95% confidence interval:
34 75
sample estimates:
difference in location
53
~~~
I avoided a normality-based test like [`t.test`](!Wikipedia "Student's t-test") and instead used [Wilcoxon](!Wikipedia "Mann-Whitney U") because there's no reason to expect the number of links per month to follow a normal distribution. Regardless of the details, there's a big difference between the two time periods: 219 vs 140 links! A fall of 36% is certainly a serious decline, and it cannot be waved away as due to my Alerts settings (I always use "All results" and never "Only the best results").
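(As a quick sanity check on that choice of test, a minimal sketch assuming the `alertsRecent` data frame from above:)

~~~{.R}
# the monthly link counts are heavily right-skewed, far from the
# normality a t-test would assume
qqnorm(alertsRecent$Links); qqline(alertsRecent$Links)
shapiro.test(alertsRecent$Links)
~~~

Nor, as we'll see now, can the fall be waved away as a confound like the possibilities that motivated the multi-level model: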
~~~{.R}
R> alerts$Recent <- year(alerts$Date) >= 2012
R> mlm3 <- lmer(Links ~ Date + Recent + (1+Date|Search), alerts); mlm3
Random effects:
Groups Name Variance Std.Dev. Corr
Search (Intercept) 9.22e+03 9.60e+01
Date 9.52e-05 9.75e-03 -0.164
Residual 1.18e+04 1.09e+02
Number of obs: 1048, groups: Search, 68
Fixed effects:
Estimate Std. Error t value
(Intercept) -440.1540 175.3630 -2.51
Date 0.0413 0.0121 3.42
RecentTRUE -102.2273 13.3224 -7.67
Correlation of Fixed Effects:
(Intr) Date
Date -0.993
RecentTRUE 0.630 -0.647
R> anova(mlm1, mlm2, mlm3)
Models:
mlm1: Links ~ Date + (1 | Search)
mlm2: Links ~ Date + (1 + Date | Search)
mlm3: Links ~ Date + Recent + (1 + Date | Search)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mlm1 4 13070 13089 -6531 13062
mlm2 6 12771 12801 -6379 12759 303 2 <2e-16
mlm3 7 13015 13050 -6500 13001 0 1 1
~~~
### It was mid-2011
A new model treating pre-2012 as different turns in a superior fit. Can we do better? A [`changepoint`](http://cran.r-project.org/web/packages/changepoint/index.html) analysis fingers May/June 2011 as the culprit, giving a larger difference in means (254 vs 147):
~~~{.R}
library(changepoint)
plot(cpt.meanvar(alertsRecent$Links), ylab="Links")
~~~
![Link count 2010-2013, depicting a regime transition in May/June 2011](/images/google/alerts/changepoint.png)
With this new changepoint, the test is even more significant:
~~~{.R}
alertsRecent <- alerts[year(alerts$Date)>=2010,]
alertsRecent$Recent <- alertsRecent$Date > "2011-05-01"
wilcox.test(Links ~ Recent, conf.int=TRUE, data=alertsRecent)
Wilcoxon rank sum test with continuity correction
data: Links by Recent
W = 63480, p-value = 4.61e-12
alternative hypothesis: true location shift is not equal to 0
95% confidence interval:
62 112
sample estimates:
difference in location
87
~~~
And the fit improves by a large amount:
~~~{.R}
R> alerts$Recent <- alerts$Date > "2011-05-01"
R> mlm4 <- lmer(Links ~ Date + Recent + (1+Date|Search), alerts); mlm4
Random effects:
Groups Name Variance Std.Dev. Corr
Search (Intercept) 8.64e+03 9.30e+01
Date 9.28e-05 9.63e-03 -0.172
Residual 1.11e+04 1.05e+02
Number of obs: 1048, groups: Search, 68
Fixed effects:
Estimate Std. Error t value
(Intercept) -1.11e+03 1.87e+02 -5.91
Date 8.86e-02 1.30e-02 6.83
RecentTRUE -1.65e+02 1.44e+01 -11.43
Correlation of Fixed Effects:
(Intr) Date
Date -0.994
RecentTRUE 0.709 -0.725
R> anova(mlm1, mlm2, mlm3, mlm4)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mlm1 4 13070 13089 -6531 13062
mlm2 6 12771 12801 -6379 12759 302.7 2 <2e-16
mlm3 7 13015 13050 -6500 13001 0.0 1 1
mlm4 7 12948 12983 -6467 12934 66.8 0 <2e-16
~~~
#### Robustness
<!--
Given the foregoing, it's not surprising that the posterior simulation never spits out an estimate for `Recent` anywhere close to zero:
~~~{.R}
mlm4.sim <- sim(mlm4, n.sims = 100000)
fixef.mlm4.sim <- fixef(mlm4.sim)
quantile(fixef.mlm4.sim[,3], probs = c(0, 0.025, 0.975, 1))
0% 2.5% 97.5% 100%
-186.74 -152.69 -101.37 -69.05
~~~
The posterior check is good, so I take a look at the other direction:
-->
Is the fall robust against different samples of my data, using [bootstrapping](!Wikipedia "Bootstrapping (statistics)")? The answer is yes, and the Wilcoxon test even turns out to have given us a pretty good confidence interval earlier:
~~~{.R}
library(boot)
recentEstimate <- function(dt, indices) {
d <- dt[indices,] # allows boot to select subsample
mlm4 <- lmer(Links ~ Date + Recent + (1+Date|Search), d)
return(fixef(mlm4)[3])
}
bs <- boot(data=alerts, statistic=recentEstimate, R=10000, parallel="multicore", ncpus=4); bs
...
Bootstrap Statistics :
original bias std. error
t1* -164.8 34.06 17.44
boot.ci(bs)
...
Intervals :
Level Normal Basic
95% (-233.0, -164.7 ) (-228.2, -156.7 )
Level Percentile BCa
95% (-172.9, -101.4 ) (-211.8, -156.7 )
~~~
A confidence interval of (-173, -101) is both statistically-significant in this context and also an effect size to be reckoned with. It seems this mid-2011 fall is real. I'm surprised to find such a precise, localized drop in my Alerts quantities. I did expect to find a decline, but I expected it to be a gradual incremental process as Google's search algorithms gradually excluded more and more links. I didn't expect to be able to say something like "in this month, results dropped by more than a third".
### Panda?
I don't know of any changes announced to Google Alerts in May/June 2011, and the emails can't tell us directly what happened. But I can speculate.
There is one culprit that comes to mind for what may have changed in early 2011 and then led to a fall in collated links (a fall which would accumulate to statistical-significance by June 2011): the pervasive change to webpage rankings called [Google Panda](!Wikipedia). It affected many websites & searches, had teething problems, reportedly boosted social networking sites (of which I generally see very few in my own alerts), and was rolled out globally in April 2011 - just in time to trigger a change in May/June (with continuous [changes through 2011](http://moz.com/google-algorithm-change#2011 "Google Algorithm Change History")).
(We'll probably never know the true reason: Google is notoriously uncommunicative about many of its internal technical decisions and changes.)
# Conclusion
So where does this leave us?
Well, the overall linear regressions turned out not to answer the question, but they were still educational in demonstrating the considerable diversity between alerts and the trickiness of understanding what question exactly we were asking; the variability and differences between alerts remind us not to be fooled by randomness and to look for big effects & the big picture - if someone says their alerts seem a little down, they may have been fooled by selective memory, but when they say their alerts went from 20 links an email to 3, then we should avoid unthinking skepticism and look more carefully.
When we investigated the claim directly, we didn't *quite* find the claim: there was no changepoint anywhere in 2012 as the bloggers claimed - they seem to have been half a year off from when the change occurred in my own alerts. What's going on there? It's hard to say. Google sometimes rolls out changes to users over long periods of time, so perhaps I was hit early by some change drastically reducing links. Or perhaps it simply took time for people to become certain that there were fewer links (in which case I have given them too little credit). Or perhaps separate SEO-related changes hit their searches after mine.
Is Alerts "broken"? Well, it's taken a clear hit: the number of found links is down, and my own impression is that the returned links are not such gems that they make up for their gem-like rarity. And it's certainly not good that the problem is now 2 years old without any discernible improvement.
But on closer inspection, the hit seems to have been a one-time deal, and if my Panda speculation is correct, it does not reflect any neglect or contempt by Google but simply more important factors - Search remaining high-quality will always be a higher priority than Alerts, because Search is the dog that wags the tail. My survival model may yet have the last laugh and Alerts outlast its more famous brethren.
I suppose it depends on whether you see the glass as half-full or half-empty: if half-full, then this is good news because it means that Alerts isn't in as bad shape as it looks and may not soon be following Reader into the great Recycle Bin in the sky; if half-empty, then this is another example of how Google does not communicate with its users, makes changes unilaterally and invisibly, will degrade one service for another more profitable service, and how users are helpless in the face of its technical supremacy (who else can do as good a job of spidering the Internet for new content matching keywords?).
# See also
- [Fanfiction reviewer survival curves](hpmor "A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters")
- [Weather]()
# External links
- [Hacker News discussion](https://news.ycombinator.com/item?id=6445270)
<!-- For the update:
0. fix the 0-link anomalous emails early in the sample
1. use a MLM with a Poisson family - matches the underlying count process more accurately
2. add an additional level of nesting for unique subject lines about the same subject (email subject-lines nested under general topics, e.g. all the Gene Wolfe searches should cluster together)
3. add Google Scholar & Pubmed alerts
4. add a nesting level for Google (Alerts & Scholar), and nesting for all alerts (Google & Pubmed) -->