- stanford linear systems & optimization | convex optimization I:
http://see.stanford.edu/see/courseinfo.aspx?coll=2db7ced4-39d1-4fdb-90e8-364129597c87
- finite element method
- KKT conditions
- <http://www.geometer.org/mathcircles/graphprobs.pdf>
signals
=======
- fourier, z, laplace, wavelet, hilbert xforms
- butterworth filter
statistics
==========
- frequentist statistics
- Point estimates trade off bias and variance
- Interval estimates have the goal of achieving nominal coverage and the goal
of being informative
- Tests have the goals of calibration and power
- frequentists express prior info not in dist but in choice of methods---may
still be too restrictive
- prospective vs retrospective studies
- prospective: cohort is random sample of population; watch them; may be very
skewed (eg if studying rare disease)
- retrospective: already know cases & controls; sample both evenly
- ascertainment: just means sampling (ascertainment bias, ascertainment method)
- TODO mixed effects
- characteristic function = fourier transform of the distribution
- TODO <http://itunes.apple.com/us/itunes-u/statistics-110-introduction/id495213607>
- bootstrap hypothesis testing: for 2 groups, merge them, generate a bunch of bootstrap samples, and see how frequently they generate means as different as what's seen (the p-value); see the sketch at the end of this list
- areas: descriptive, inferential
- moments
- positive skew: lumped to the left (counter-intuitive)
- positive kurtosis: pointier peak
- sample variance: same as var but divide by $n-1$ instead of $n$
- mean deviation: the mean absolute deviation from the mean
- harder to eval in closed form
- used by TCP to calculate large deviations from RTT by calcing an EWMA deviation
- weak law of large numbers (prove using Chebyshev's ineq):
$\lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1$
- strong law of large numbers (more complex proof):
$P( \lim_{n \rightarrow \infty} \bar{X_n} = \mu ) = 1$
- central limit theorem: mean/sum of sufficiently many independent random
variables, each with finite mean & variance, is approximately normal
- as sample size increases, sample mean distribution approaches normal
- standard error of the mean: standard deviation of the sampling distribution
of the sample mean ("sampling distribution of the sample mean" is the
"distribution of the mean of samples")
- confidence interval: at a certain confidence level (eg 95%), the long-run fraction of repeated experiments whose computed interval contains the true pop param
- "Were this procedure to be repeated on multiple samples, the calculated
confidence interval (which would differ for each sample) would encompass
the true population parameter 95% of the time."
- *not* for predicting prob of true pop param being in interval; pop param
has actual value, not a prob dist
- instead, indicates *reliability of estimate*
- find *lowest* value of param (e.g. population mean or population binomial
proportion) such that estimate (e.g. sample mean or sample binomial
proportion) is still in innermost 95% of the sampling distribution of the
estimate
- innermost means excluding lowest/highest 2.5%
- sampling distribution:
- for normal data $\sim N(\mu, \sigma)$, it's $\sim N(\mu, \sigma / \sqrt{n})$, where $\mu$ is unknown
- for bernoulli data $\sim Bern(p)$, it's $\sim Bin(n,p)$, where $p$ is unknown
- do the same for *highest* such value of param
- closely related to significance testing: testing null hyp $\theta=0$ can be
done by seeing if confidence interval for $\theta$ includes 0 (w confidence
level $y=1-\alpha$)
- intervals calculated from (functions of) sample, so they're statistics,
functions of RVs, hence RVs themselves
- margin of error: radius (half) of confidence interval
- statistical error: deviation of a sample from unobservable true/theoretical
value (eg person's height vs population mean)
- residual: deviation of a sample from observable estimated value (eg person's
height vs sample mean)
- survival function: complement of CDF
- 2 time series are _cointegrated_ if linear combo is stationary
- 2 time series are cointegrated if they share common _stochastic drift_
- eg drunkard and his dog: both randomly constantly moving but periodically
error-correcting; cointegrated of order zero
- eg stock market index and its futures contract, pairs trading
- eg, R-squared fails when dealing w non-stationary vars, leading to
_spurious regressions_ suggesting relationships when there's none
- Engle-Granger test: simplest detection approach TODO
- order of integration $I(p)$: min num diffs required to obtain a stationary
series
- eg 1,1,1: already stationary
- eg 1,2,3: diff w lag-1 is stationary
- eg 1,4,9,16,25: diff w lag-1 gives 3,5,7,9; second diff is stationary
- similar to calculus integration/differentiation
- simpson's paradox: a correlation present in diff groups is reversed when
groups are combined; often in social-/med-sciences when frequency data
hastily given causal interps
- TODO
- (paired) student t-test
- distributions
- student t-distribution: dist of the standardized sample mean for normal data when the variance is estimated from the sample; heavier tails than the normal, approaches it as sample size grows
- chi-square distribution
- Gaussian distribution, Cauchy, Exponential, Gamma, Generalized Extreme
Value, Rician, Wrapped Cauchy, Von Mises, Binomial, and Beta
- $\mathrm{Correlation}(X, Y) = \frac{ \mathrm{Covariance}(X, Y) }{ \sigma_X
\sigma_Y }$ where $\sigma$ is the SD
- seq of RVs has _heteroscedasticity_ if they have diff variances
- usu. error terms vary w each obs, often true in cross-sectional/time-series
- _copula_, Gaussian copula: TODO
- Kendall tau rank correlation coefficient $\tau$
- similarity of orderings of data
- TODO uses p-value somewhere
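A minimal sketch (in Python, with made-up data) of the bootstrap two-sample test described above: pool the groups, resample groups of the original sizes, and count how often the resampled difference of means is at least as extreme as the observed one.
```python
import random

def bootstrap_two_sample_pvalue(xs, ys, n_boot=10_000, seed=0):
    """Bootstrap test of 'same mean': pool the groups, resample two
    groups of the original sizes with replacement, and see how often
    the resampled difference of means is at least as extreme as the
    observed one."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    extreme = 0
    for _ in range(n_boot):
        bx = [rng.choice(pooled) for _ in range(len(xs))]
        by = [rng.choice(pooled) for _ in range(len(ys))]
        if abs(sum(bx) / len(bx) - sum(by) / len(by)) >= observed:
            extreme += 1
    return extreme / n_boot

# toy usage with made-up data
print(bootstrap_two_sample_pvalue([5.1, 4.9, 5.3, 5.0], [5.9, 6.1, 5.8, 6.2]))
```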
bayesians vs frequentists
- views of probability
- frequentist: $P(H)=.5$ turns up heads .5 of the time
- bayesian: coin w $P(H)=.5$ deserves even-odds betting
- views of stats
- frequentist: ask what's $P(H=h)$ given the null hyp that true odds are .5
- bayesian: estimate $P(\theta=.5)$ vs $P(\theta\ne.5)$ given #H & prior
- frequentist
- believes pop mean is real, but unknown and unknowable
- frequentist estimator
- find sample mean; construct confidence interval around it
- can't say "95% prob that true mean is in interval"; must say "95% prob over
intervals that an interval contains true mean", i.e. "if we were to
repeatedly sample data, find mean, and construct confidence interval, 95%
of intervals would contain the true mean"
- frequentist tester
- form hypothesis: "mean is blah"
- accept or reject hypothesis with test to significance level
- bayesian
- only data is real; pop mean is an abstraction; construct a credible
interval, centered near the sample mean and tempered by prior beliefs
(sometimes very non-informative)
- can say "95% prob that this interval contains the mean"
- <http://www.statisticalengineering.com/frequentists_and_bayesians.htm>
- <http://www.rasmusen.org/x/2007/09/25/bayesian-vs-frequentist-statistical-theory/>
conjugate priors
- conjugate priors produce posteriors of the same algebraic form
- an algebraic convenience that yields closed-form posteriors
- beta dist is conjugate prior for binomial & bernoulli
- dirichlet dist is generalization of beta and conjugate prior for multinomial
& categorical
- TODO
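A minimal sketch of the beta-binomial conjugacy above; the prior parameters and data here are just illustrative.
```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugacy: a Beta(alpha, beta) prior plus binomial data with
    `successes` and `failures` gives a Beta(alpha + successes,
    beta + failures) posterior -- same algebraic form, just updated
    counts."""
    return alpha + successes, beta + failures

# start from Beta(.5, .5) (the Jeffreys prior), observe 7 heads in 10 flips
a, b = beta_binomial_update(0.5, 0.5, successes=7, failures=3)
print(a, b)            # posterior is Beta(7.5, 3.5)
print(a / (a + b))     # posterior mean ~= 0.68
```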
non-informative aka objective priors
- not nec. flat/uniform priors
- eg for bernoulli/binomial it's `beta(.5,.5)`
- construct according to Jeffreys' rule: prior proportional to the square root of the Fisher information (invariant under reparameterization)
- <http://www.amstat.org/publications/jse/v12n2/zhu.pdf>
probability vs statistics
- probabilists: hypothesis to data
- very pure, "subjunctive"; strict if-thens; if die is fair, then $P(1)=1/6$;
"is die fair?" -> "not my dept"
- "in 100 tosses of fair coin, what's $P(H=40)$?
- statistician: data to hypothesis
- deal w not just if-then, but whether die fair, what the bias is
- more science-y; nature must be observed to get at the underlying rules
- "if 40 of 100 tosses were heads, what's $P(\theta=.5)$?": much harder
question, such that no agreement among math/science/philos/prob/stats
- TODO finish
- <http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/Metrics%2FStatistics/Bayesian-Likelihood-and-Frequentist-Approaches-to-/ArticleLong/Article/detail/84710>
test accuracy
- precision = relevant & retrieved / retrieved
- recall = relevant & retrieved / relevant
- F1 score aka F-measure: `F = 2 * (precision * recall) / (precision + recall)`
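A minimal sketch of these three measures on sets of item ids (toy data):
```python
def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall, and F1 from sets of item ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy example: 3 of the 4 retrieved items are relevant, 1 relevant item missed
print(precision_recall_f1(relevant={1, 2, 3, 4}, retrieved={1, 2, 3, 9}))
```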
statistical significance
- standard score: number of SDs; aka z-value, z-score, normal score
- this is frequentist hypothesis testing; Bayesian hypothesis testing is
Bayesian inference
- alt hyps
- Fisherian testing: no alt hyp, only accept/reject null hyp
- Neyman-Pearson: alt hyps
- in practice: alt hyp is just negation of null hyp
- p-value: P(observation or more extreme|null hypothesis); compared against the chosen significance level
- assuming null hyp (no underlying relationship or diff btwn the variables),
what's prob we'll see a diff at least as extreme as the one observed just
by chance?
- not same as saying, given that the alt hyp *is* true, we can replicate
the results 5% of the time or 95% of the time; that's called the
_statistical power_ of the design
- the lower, the less likely the result if the null hyp is true, thus the
more statistically significant the result is
- if < .05 or .01, corresponding to a 5% or 1% chance of rejecting the null
hyp when it's actually true (type I error)
- 5% is a typical value (border-line acceptability); in the end this is all
arbitrary, no way to say something is/isn't "really" significant
- eg P(14+ heads of 20 flips|fair coin)
- simple hypothesis: of the form $\theta = \theta_0$; specifies pop dist
completely
- composite hypothesis: of the form $\theta > \theta_0$; doesn't specify
population dist completely
- two-tail: p-value is $P(X>a \vee X<b | \theta)$
- $H_0 : \theta = \theta_0$ vs $H_1 : \theta \ne \theta_0$
- one-tail: p-value is $P(X>x|\theta)$
- $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$ (or vice versa)
- problem is that the $\theta$ is unknown (just something under $\theta_0$); TODO how do we justify the p-value, then, which must assume some value of $\theta$?
- E-value: expected num tests until a certain observation, assuming null hyp
- how to efficiently calc prob of *at least* $h$ H in $n$ flips? (see the sketch at the end of this list)
- type I error: false pos; rej null hyp when it's actually true
- type II error: false neg; acc null hyp when it's actually false
- mustn't stop experiment as soon as significant results are detected
- <http://www.evanmiller.org/how-not-to-run-an-ab-test.html>
- common "general format" of significance tests
- they're some ratio of (some measure of the differentiation common in the
variables) / (overall differentiation of those vars)
- eg: (part of overall differentiation of the WCC scores) / (overall
differentiation of the WCC scores)
- called ratio of "explained variation" to "total variation"
- can only get to P(null hypothesis|observation) if we're Bayesian and have a
prior
- Fisher vehemently opposed the concept of an alt hyp
- many different and widespread interpretation pitfalls
- quantile regression: estimate median or other quantile instead of mean
- good for risk models
- uses linear programming
- robust/resistant statistics: around for 20+ yrs; not widely used; p-values
reign for many non-ideal real-world reasons
- z-test: appropriate only if the population SD is known, else use t-test
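A sketch answering the "at least $h$ heads in $n$ flips" question from the list above, by summing the exact binomial PMF; it reproduces the one-tailed p-value for the 14-of-20 coin example.
```python
from math import comb

def binom_tail(n, h, p=0.5):
    """P(at least h heads in n flips of a coin with P(H)=p):
    sum the exact binomial PMF over k = h..n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(h, n + 1))

# one-tailed p-value for the example in the notes: 14+ heads out of 20
print(binom_tail(20, 14))   # ~= 0.0577
```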
statistical tests
- many different statistical measures
- _power_ of a test is its probability of correctly rejecting null hyp
(complement of false negative rate, $\beta$)
- power function $\beta(\theta) = P_\theta(X \in R)$, where $R$ is rejection
region (this is a diff $\beta$)
- size/significance level of a test ($\alpha$): prob of incorrectly rejecting
null hyp (false positive rate), or upper bound of prob of rejecting the null
hyp over all cases covered by the null hyp
- always want to choose the _most powerful_ test (highest power under $H_1$)
for a given size/significance level
- finding most powerful tests is hard; most powerful tests don't exist for
many cases
- 4 widely used tests: Wald, $\chi^2$, permutation, likelihood ratio
- contingency tables: used by many statistical tests
- test selection decision table: <http://www.graphpad.com/www/book/choose.htm>
- the complexity here is a critique of the frequentist approach
- chi-square: typically Pearson's chi-square
- Fisher's exact test: for small categorical data; compares two groups
- a type of exact permutation test; based on hypergeom dist
- for 2x2 contingency tables (1 DOF, since margins are held fixed)
- for each table with the same margins, calculate $p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}$
- sum the $p$ of the tables at least as extreme as the observed one to get the $p$-value (see the sketch at the end of this list)
- use fisher if possible; easy w computers; chi square is approximation
- <http://www.graphpad.com/faq/viewfaq.cfm?faq=1114>
- <http://en.wikipedia.org/wiki/Exact_test>
- <http://www.itl.nist.gov/div898/handbook/prc/section3/prc33.htm>
- Pearson's chi-square: for large categorical data
- assesses _goodness of fit_ and _tests of independence_ (TODO)
- $X^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$, where
- $X^2$ is Pearson's cumulative test statistic which asymptotically
approaches a $\chi^2$ distribution
- $O_i$ is observed freq
- $E_i$ is expected (theoretical) freq
- $n$ is number of cells in contingency table
- goodness of fit: compare group against theoretical expectation
- $\chi^2$ test approximates Fisher
- $X^2$ becomes $(ad-bc)^2 (a+b+c+d) / ((a+b)(c+d)(b+d)(a+c))$
- degrees of freedom
- for goodness of fit, it's number of terms in $X^2$ expr - 1
- for indep test, it's (num cols - 1)(num rows - 1)
- for 1DOF, chi square dist = gaussian, so same as "equal proportions z test"
(z test over the difference of 2 binomial random variables)
- <http://stats.stackexchange.com/questions/2391/what-is-the-relationship-between-a-chi-square-test-and-test-of-equal-proportions>
- <http://stattrek.com/AP-Statistics-4/Test-Difference-Proportion.aspx?Tutorial=AP>
- <http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html>
- <http://math.hws.edu/javamath/ryan/ChiSquare.html>
- binomial test: for small categorical data
- compare group against theoretical value (i.e. with-replacement; see
binomial dist vs hypergeom dist)
- non-parametric tests TODO: mann-whitney, wilcoxon
- Kolmogorov-Smirnov (K-S) test: nonparam test for equality of continuous 1D
prob dists; often used to test for GOF
- "The two-sample KS test is one of the most useful and general
nonparametric methods for comparing two samples, as it is sensitive to
differences in both location and shape of the empirical cumulative
distribution functions of the two samples."
- sequential analysis: sample size isn't fixed in advance; data is evaluated
during collection; may conclude exp earlier
- alt is bayesian exp design
- Mann-Whitney-Wilcoxon RankSum test: useful to determine if 2 distributions
are significantly different or not. Unlike the t-test, the RankSum test does
not assume that the data are normally distributed, potentially providing a
more accurate assessment of the data sets.
- Chow test: whether 2 linear regressions on diff data sets are equal
- also (in econometrics) for whether 2 time series over different times
exhibit a structural break
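A minimal sketch of Fisher's exact test from the list above: enumerate all 2x2 tables with the observed margins, compute each table's hypergeometric probability, and (one common two-sided convention) sum the probabilities of tables no more probable than the observed one. The example table is made up.
```python
from math import comb

def fisher_exact_pvalue(a, b, c, d):
    """Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
    with margins fixed, each table's probability is hypergeometric,
        P = C(a+b, a) * C(c+d, c) / C(n, a+c).
    Two-sided p-value: sum the probabilities of every table with the
    same margins that is no more probable than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):  # probability of the table with x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# toy 2x2 table: 8 of 10 treated improved vs 2 of 10 controls
print(fisher_exact_pvalue(8, 2, 2, 8))   # ~= 0.023
```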
data
- censoring: limiting values to be within bounds
- truncation: omitting values that are out of bounds
- eg: if policyholders subject to policy limit $u$ and deductible $d$, then
losses are left-truncated (unreported if under $d$) and right-censored
(limited to $u-d$)
time series (TODO also cover math)
- stationary process: stochastic process whose joint prob dist doesn't change
when shifted in time/space, so mean/variance/autocorr don't change
- white noise is stationary; cymbal clash variance diminishes over time
- i.e. "flat", not trending over time
- for a non-stationary signal, either fitting a curve and then taking
residuals or differencing may give stationary data
- autocorrelation: cross-correlation of signal w itself; a lagged self-correlation; to detect non-randomness; works if fixed intervals (see the sketch at the end of this list)
- IID implies 0 autocorrelation (but not conversely)
- Durbin-Watson test: conventional test to detect autocorrelation
- 0 to 4; 2 means no autocorrelation
- cross-correlation: sliding dot product of two signals; like convolution but
no reversal
- linear prediction: future values of discrete-time signal estimated as linear
function of previous samples
- autoregressive (AR) model: random process used to model natural phenomena; a
type of linear prediction formula
- vector autoregressive (VAR) model: popular in econometrics
- Dickey Fuller test: whether unit root is present in autoregressive model
- TODO: (partial) autocorrelation plots, ARMA, ARIMA, Ljung-Box for zero
autocorr, unit root test for coint, Granger-causality, whiteness (iid-ness) &
normality, VAR impulse response analysis and forecast error variance decomp,
- TODO filtering: Hodrick-Prescott (HP), Baxter-King, Christiano-Fitzgerald
(all popular in finance)
- TODO Bayesian dynamic linear models (DLM)
- repetition
- seasonal aka periodic: fixed-length cycles
- cyclic: variable-length cycles
- cyclic/seasonal models
- TODO cyclic vs seasonal vs periodic <http://robjhyndman.com/researchtips/cyclicts/>
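A minimal sketch of the lag-$k$ sample autocorrelation mentioned above, normalizing by the overall variance (the usual time-series convention); the toy series is made up.
```python
def autocorrelation(xs, lag):
    """Sample autocorrelation at a given lag: correlation of the
    series with a lagged copy of itself (normalized by the overall
    variance, the usual time-series convention)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var

# a strongly periodic toy series has large autocorrelation at its period
series = [0, 1, 0, -1] * 25
print(autocorrelation(series, 4))   # close to 1
print(autocorrelation(series, 2))   # close to -1
```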
statistical inference
- analysis of variance (ANOVA or AOV): TODO
- canonical (variate) analysis: TODO
- 1 predictor, 1 criterion: coef of corr $r$, coef of det $r^2$, coef $\beta$
- $n$ predictors, 1 criterion: multiple corr $R$, multiple coef of det $R^2$,
weights $\beta$
- $n$ predictors, $m$ criteria: canonical corrs $\rho_1, \dots$, canonical
weights $C,D$
- find linear combo of predictors and linear combo of criteria w max corr
- kernel density estimator: basically place a small kernel everywhere you have a sample, resulting in something much smoother than a histogram (where bins can make all the difference in interpretation); see the sketch at the end of this list
- mean absolute percentage error (MAPE): $M = \frac{1}{n} \sum_{t=1}^n
\left\vert \frac{A_t - F_t}{A_t} \right\vert$
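A minimal sketch of a Gaussian kernel density estimate as described above; the bandwidth is an arbitrary choice here, not a recommendation.
```python
import math

def gaussian_kde(samples, bandwidth):
    """Kernel density estimate: put a small Gaussian bump at every
    sample and average them; smoother than a histogram and with no
    bin-placement artifacts."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# toy bimodal data; evaluate the smoothed density at a few points
data = [1.0, 1.2, 0.9, 1.1, 4.0, 4.2, 3.9]
f = gaussian_kde(data, bandwidth=0.4)
print([round(f(x), 3) for x in (1.0, 2.5, 4.0)])
```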
relationship btwn mean squared error and gaussian dist? (TODO cs229 notes)
outlier elimination
- chauvenet's criterion
- peirce's criterion: "if you need rigor, generality, and the ability to reject more than one data point"
- process involves multiple steps and lookup tables
- <https://web.chemistry.ohio-state.edu/~coe/research/documents/error_distributions_new2.pdf>
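A sketch of Chauvenet's criterion under its usual statement: flag a point if the expected number of observations as far from the mean ($n$ times the two-sided normal tail probability) falls below 1/2. The data set is made up.
```python
import math
import statistics

def chauvenet_outliers(xs):
    """Chauvenet's criterion: flag x if the expected number of
    observations as far from the mean (n * two-sided normal tail
    probability) is less than 1/2."""
    n = len(xs)
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    flagged = []
    for x in xs:
        z = abs(x - mean) / sd
        tail = math.erfc(z / math.sqrt(2))   # P(|Z| >= z) for a standard normal
        if n * tail < 0.5:
            flagged.append(x)
    return flagged

# toy data with one wild reading
print(chauvenet_outliers([9.9, 10.1, 10.0, 10.2, 9.8, 10.0, 14.0]))  # [14.0]
```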
exponential smoothing for time series analysis
- _exponential moving average (EMA)_ aka _exponentially weighted moving average
(EWMA)_ aka _single exponential smoothing (SES)_: does _not_ spot trends
- _double exponential moving average (DEMA)_ aka _linear exponential smoothing (LES)_: incorporates trend
- _triple exponential moving average (TEMA)_ aka _triple exponential smoothing (TES)_: incorporates trend & seasonality (see the sketch below)
- <http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc43.htm>
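A minimal sketch of single (EWMA) and double (Holt-style) exponential smoothing; the smoothing constants are arbitrary.
```python
def single_exp_smoothing(xs, alpha=0.3):
    """EWMA / single exponential smoothing: level only, no trend."""
    level = xs[0]
    out = [level]
    for x in xs[1:]:
        level = alpha * x + (1 - alpha) * level
        out.append(level)
    return out

def double_exp_smoothing(xs, alpha=0.3, beta=0.1):
    """Holt's linear (double) exponential smoothing: level + trend."""
    level, trend = xs[0], xs[1] - xs[0]
    out = [level]
    for x in xs[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        out.append(level)
    return out

# on a trending series, single smoothing lags behind while double tracks it
trend_series = [float(t) for t in range(10)]
print(single_exp_smoothing(trend_series)[-1])   # well below 9
print(double_exp_smoothing(trend_series)[-1])   # close to 9
```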
likelihood
- likelihoods for continuous distributions use PDFs instead of probabilities
(since $P(X=x) = 0$); this is justified, see
<http://en.wikipedia.org/wiki/Likelihood_function#Likelihoods_for_continuous_distributions>
misc
- benford's law: distribution of leading digits is non-uniform (exp decay)
pitfalls
--------
- unbiased estimators can be terrible
- drug X results may have $p=.04$ and drug Y results may have $p=.06$ but
difference may well NOT be significant (diff btwn sig & insig may well be
insig)
- $p$-value is not prob of hyp being right; to compute this (posterior), more
info is needed (prior)
- statistical significance != significance
- data dredging: many sig tests will inevitably result in false pos results
- example takedown: <http://www.americanscientist.org/issues/id.14344,y.0,no.,content.true,page.1,css.print/issue.aspx>
regression tips
---------------
- from andrew gelman <http://andrewgelman.com/2010/12/what_do_practit/>
- (wouldn't take all these as gospel)
- pre-process your vars; discretize numeric, avg several related
- usu fine to just treat discrete outcomes (eg 1-5 scale) as numeric; don't
worry about ordered logit/probit/etc., just run the reg already
- take the 2 most important input vars in your reg & throw in their interaction
- The key assumptions of a regression model are validity and additivity. Except
when you’re focused on predictions, don’t spend one minute worrying about
distributional issues such as normality or equal variance of the errors.
surveys
-------
- Likert scale aka rating scale: psychometric scale (5/7-point)
- ordinal data, but not interval/linear data; requires nonparam tests
experiments
-----------
- factorial design: 2+ factors, ea w discrete values/levels; experimental units
take on all combos
- eg compare 2 motors at 2 speeds (4 combos)
combinatorics
=============
- permutations without replacement: P(n,r) = n! / (n-r)!
- permutations with replacement: n^r
- combinations without replacement (partitioning sets): C(n,k) = n! / (k! (n-k)!)
- this is the _binomial coefficient_
- _multinomial coefficient_: partitioning into multiple sets
- binomial is actually C(n; k, n-k)
- C(n; k1, k2, ..., kp) = n! / (k1! k2! ... kp!)
- combinations with replacement (multisets/bags):
C(n-1 + k; n-1, k) = (n-1+k)! / (k! (n-1)!)
- proof using "stars and bars"
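A quick sketch of the four counts above using only the standard library:
```python
from math import comb, factorial, prod

def permutations_wo_repl(n, r):      # P(n, r) = n! / (n-r)!
    return factorial(n) // factorial(n - r)

def multinomial(*ks):                # C(n; k1, ..., kp) = n! / (k1! ... kp!)
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

def multiset_combinations(n, k):     # combinations with replacement
    return comb(n - 1 + k, k)        # "stars and bars"

print(permutations_wo_repl(5, 2))    # 20
print(comb(5, 2))                    # 10, the binomial coefficient
print(multinomial(2, 3))             # 10 == C(5; 2, 3)
print(multiset_combinations(3, 2))   # 6 multisets of size 2 from 3 symbols
```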
mathematics
===========
- abstract algebra: study of algebraic structures such as groups, rings,
fields, modules, vector spaces, and algebras
- topology: extension of geometry that focuses on structure
- algebraic topology: uses abstract algebra
- topological graph theory: topology + graph theory; studies embeddings of
graphs in surfaces
- number theory
- computational number theory, algorithmic number theory: applications in
crypto
\begin{equation}
e^x = 1 + x \left(
1 + x \left(
\frac{1}{2} + x \left(
\frac{1}{6} + x \left(
\frac{1}{24} + x \left( \dots \right) \right) \right) \right) \right)
\end{equation}
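A sketch evaluating that nested (Horner) form of the truncated series; `terms` is an arbitrary cutoff.
```python
import math

def exp_taylor_horner(x, terms=20):
    """Evaluate the truncated Taylor series of e^x using the nested
    (Horner) form above: start from the innermost 1/(terms-1)!
    coefficient and work outward."""
    acc = 1.0 / math.factorial(terms - 1)
    for k in range(terms - 2, -1, -1):
        acc = 1.0 / math.factorial(k) + x * acc
    return acc

print(exp_taylor_horner(1.0), math.exp(1.0))   # both ~= 2.718281828
```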
geometry
========
- Pythagorean theorem
- easy proof
Interesting questions that came up during ICFPC08
- find the tangent lines (up to four) between a pair of circles
- find the tangent lines between a circle and a point
- if we roll a circle around the outside of an ellipse and trace the path of
its center, is the resulting oval an ellipse?
- circumcenter of a polygon = point equidistant to all vertices = center of circumscribed circle = intersection of the perpendicular bisectors of the edges (exists only for cyclic polygons)
linear algebra
==============
inverting matrices
- Gaussian elim yields row echelon form (0s in lower-left)
- Gauss-Jordan elim yields reduced row echelon form ($I$)
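A minimal Gauss-Jordan inversion sketch (augment with $I$, reduce the left half to reduced row echelon form, read the inverse off the right half); partial pivoting only, no attention to numerical niceties.
```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination: augment
    with the identity, reduce the left half to I, and read the
    inverse off the right half. Uses partial pivoting; raises on a
    singular matrix."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

print(invert([[4.0, 7.0], [2.0, 6.0]]))   # ~[[0.6, -0.7], [-0.2, 0.4]]
```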
From <http://en.wikipedia.org/wiki/Sparse_matrix>:
> Conceptually, sparsity corresponds to systems which are loosely coupled.
> Consider a line of balls connected by springs from one to the next; this is a
> sparse system. By contrast, if the same line of balls had springs connecting
> every ball to every other ball, the system would be represented by a dense
> matrix. The concept of sparsity is useful in combinatorics and application
> areas such as network theory, of a low density of significant data or
> connections.
eigenvalues and eigenvectors
- treating $M$ as linear xform, which vectors would keep same direction (but
possibly scaled)? ($Mx=\lambda x$)
- for symmetric $M$: $M$ xforms unit sphere $x^T x = 1$ into ellipsoid where evecs are axes, evals are radii (for general $M$, the axes/radii are singular vectors/values)
- if $M$ has (centered) data pts as row vecs, covariance matrix $M^T M$ has evecs as principal components, evals as their variances
- 2D examples (counting real eigendirections): shear: 1 evec; uniform scale: $\infty$ evecs; non-uniform scale: 2 evecs; rotate: 0 evecs
- _power method_ or _power iteration_: eigenvalue algo
- given mat $A$, find scalar $\lambda$ (eigenvalue) and non-0 vector $v$
(eigenvector) s.t. $A v = \lambda v$
- doesn't compute matrix decomposition so suitable for large sparse A
- but will only find one eigenvalue (with max val, aka _dominant eigenpair_)
and may converge slowly
- start with random vector and iteratively multiply by $A$ and normalize (see the sketch after this list)
- used in pagerank, HITS
- <http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf>
- for transition matrix $P$ of a markov chain, the stationary dist $\pi$ is the eigenvalue-1 left evec: $\pi P = \pi$
- for ratings matrix (eg rows are users, cols are movies), evecs of $M M^T$ are
row features, of $M^T M$ are col features
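A minimal sketch of the power method described above (random start, multiply by $A$, normalize), using numpy; the test matrix is made up.
```python
import numpy as np

def power_iteration(A, iters=1000, seed=0):
    """Power method: repeatedly apply A and renormalize; converges to
    the dominant eigenpair (largest |eigenvalue|), assuming it is
    unique and the start vector has a component along it."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ A @ v           # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
val, vec = power_iteration(A)
print(val)                            # ~= 3.618, the largest eigenvalue
print(np.allclose(A @ vec, val * vec))
```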
symmetric positive definite, positive semi-definite, negative definite, indefinite, etc.
- $M$ is _positive definite_ iff any of:
- $z^T M z > 0$ for all nonzero vector $z$
- all evals of $M$ are pos
- all leading principal minors are pos
- i.e. upper left $k \times k$ submatrix of $M$ has pos det for all $k =
1,2,\dots,n$
- eg: if $M$ is diagonal, then $M$ is pos def iff all diag elts are pos
(diag elts are evals)
- $M$ is neg def iff any of:
- $-M$ is pos def, i.e. $x^T M x < 0$ for all nonzero vector $x$
- all evals of $M$ are neg
- leading principal minors alternate in sign starting negative: upper left $1 \times 1$ submatrix has neg det, upper left $2 \times 2$ submatrix has pos det, upper left $3 \times 3$ submatrix has neg det, etc.
- $M$ is pos semi-def iff any of:
- $z^T M z \ge 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-neg
- note: if $M$ is pos semi-def, then all upper left $k \times k$ submatrices
have non-neg dets, but this is not a sufficient condition any more; same
for neg semi-def below
- $M$ is neg semi-def iff any of:
- $z^T M z \le 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-pos
- $M$ is indef iff any of:
- $M$ is neither pos semi-def nor neg semi-def
- $M$ has pos and neg evals
definitions
- vectors linearly dependent iff
- some vector is a linear combo of the others (redundancy)
- some linear combo with coefs not all 0 equals 0
- ($n$ vectors in $\Re^n$) they fail to span $\Re^n$
- ($n$ vectors in $\Re^n$) determinant of the matrix with the vectors as cols is 0
- basis of vector space: linearly indep spanning set of vectors
- subspace of $\Re^n$: set of vectors closed under addition and scalar mult (hence must incl. origin)
- nullspace of $A$ is set of vectors $x$ s.t. $Ax=0$; a valid subspace
- a line in >2D is usually represented parametrically ($t$); implicitly it requires intersecting multiple hyperplanes
- A matrix $A$ is diagonalizable if there exists a non-singular matrix $B$ and
a diagonal matrix $D$ such that $A = BDB^{-1}$.
- _singular value decomposition (SVD)_: important factorization of a
rectangular real or complex matrix
- many applications in signal processing, statistics
- eg computing the pseudoinverse, least squares fitting of data, matrix
approximation, and determining the rank, range and null space of a matrix
- singularity
- matrix is singular iff it has no inverse iff some eigenvalue is zero iff determinant is 0 (0 is a root of the characteristic polynomial)
- matrix is nonsingular iff invertible iff all eigenvalues are non-zero iff determinant is non-0
- characteristic polynomial: $p(\lambda) = \det(A - \lambda I)$
- eigenvalues are roots of characteristic polynomial
- eigenspace: for a given eigenvalue, linear subspace spanned by its
eigenvectors
- if $M$ is pos def then $z^* M z > 0$ for any nonzero vector z
- a pos def matrix has pos def inverse
- all eigenvalues of pos def are positive
- an invertible real sym matrix has a real sym inverse
- _kernel_ of a mapping: elts that map to the zero elt
- _kernel_ aka _null space_ of a matrix $A$: the set of vectors $x$ with $Ax=0$; its dimension is the nullity of $A$
foundations
- _vector space_: eg $\mathbf{R}^2$ (set of all 2-dim vecs)
- _subspace_: subset of vector space closed under linear combination
- subspaces of $\mathbf{R}^2$: $\mathbf{R}^2$, any line through origin, and
$\{(0,0)\}$
- subspaces of $\mathbf{R}^3$: $\mathbf{R}^3$, any plane through origin, any line through origin, and $\{(0,0,0)\}$
- _column space_ $C(A)$ is span of $A$'s col vecs
- if not fully linearly indep, then fewer dimensions spanned than #columns
- $Ax=b$ can be solved iff $b \in C(A)$
- _kernel_ aka _null space_ $N(A)$ is set of all solutions to $Ax=0$
- solutions $x$ always form a subspace, i.e. linear combos of solutions are also solutions
- $N(A)=\{0\}$ means col vecs of $A$ lin indep (they form a basis of $C(A)$)
- TODO
- stochastic matrix
- eigen*, singular, etc.
- http://www.cut-the-knot.org/arithmetic/algebra/eigens.shtml
- spectral decomposition aka eigendecomposition
- spectral theorem
norms/distance functions
- taxicab geometry aka rectilinear dist aka L1 distance aka $\ell_1$ norm aka
city block dist aka manhattan dist aka manhattan length
- sum of absolute diffs in coordinates
- hamming dist is manhattan dist on the unit hypercube (since hamming dist components are either 0 or 1; hamming dist is # positions that are diff)
- euclidean dist aka $L^2$/$\ell^2$ dist/norm
- $p$ norm aka $L^p$ norm
- $( \sum_i |x_i|^p )^{1/p}$
- maximum norm aka infinity norm aka uniform norm aka supremum norm
- $\max( |x_1|, \dots, |x_n| )$
- triangle inequality holds for $n$-dim vectors
cryptography
============
RSA (TODO REVIEW!)
- choose primes p, q
- primality testing: _wheel factorization_ is certain; there are also
probabilistic tests
- find $e$ such that $gcd(e, (p-1)(q-1)) = 1$
- i.e., find $a$ such that $ae + b(p-1)(q-1) = 1$. This is a _diophantine equation_ and can be solved using the _extended Euclidean algorithm_.
- exponentiation: find $M^e \mod n$
- $M$ is data; $(n, e)$ is the pub key; the priv key is the exponent $d$ (the $a$ above), and decryption is $C^d \mod n$
- _square and multiply algorithm_
- _Chinese Remainder Theorem (CRT)_ for more complex, faster algos
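A toy-sized RSA sketch tying the steps above together: the extended Euclidean algorithm solves the diophantine equation for the private exponent, and Python's built-in `pow(base, exp, mod)` does modular square-and-multiply. Real RSA needs large primes and padding; the numbers here are illustrative only.
```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

# toy primes (real RSA uses primes of ~1024+ bits and proper padding)
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
e = 17                                # public exponent, gcd(e, phi) == 1
g, d, _ = extended_gcd(e, phi)        # solve e*d + phi*k == 1 (diophantine)
assert g == 1
d %= phi                              # private exponent

msg = 1234
cipher = pow(msg, e, n)               # square-and-multiply is built into pow
print(pow(cipher, d, n) == msg)       # decrypting recovers the message
```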
How To Share A Secret (Shamir 79)
- split into $n$ pieces such that any $k$ can reconstruct but $k-1$ can't
- <http://research.swtch.com/2008/01/i-could-tell-you-but.html>
signals, coding, information theory
===================================
Audio & Video Compression (Keith Winstein's SIPB cluedump 2008/11/06)
- huffman code: recurse: the two least freq symbols get codewords of the same len, differing only in the last bit (see the sketch below)
- keithw: slides
- entropy coding
- arithmetic coding can get us flatter?
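A minimal Huffman-code sketch matching the recursion above (repeatedly merge the two least frequent subtrees and prepend a distinguishing bit); the input string is arbitrary.
```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code from symbol frequencies: repeatedly merge
    the two least frequent subtrees; the two merged subtrees get a
    distinguishing bit, so the two rarest symbols end up with
    equal-length codewords differing only in the last bit."""
    freqs = Counter(text)
    # heap entries: (frequency, tiebreak, {symbol: codeword-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code("abracadabra")
print(code)                                  # 'a' gets the shortest codeword
print(sum(len(code[ch]) for ch in "abracadabra"), "bits total")
```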
general
- Shannon's theorem: describes _channel capacity_, the max information rate at
which reliable comm is possible over a channel that has a certain error prob
or SNR
- only existential, not constructive, hence the rate actually achieved depends on the coding algo
maximum entropy
- given constraints ("testable information"), the dist that best represents the
current state of knowledge is the one w largest entropy
- i.e., choose the _least informative_ (and thus most generalizable) dist
given what we know, since we don't know anything else
- also, many physical systems tend to move toward maxent configs over time
- makes explicit our freedom in using different forms of prior information
- applications
- obtain priors for bayesian inference
- maxent models: observed data itself is the testable information
- examples
- given just mean and sd, use normal
- given just the support [a,b], use uniform over [a,b]
- exponential dist w mean $1/\lambda$ is maxent dist over all dists on $[0,\infty)$ w mean $1/\lambda$
shannon entropy
- $H(X) = E(I(X)) = \sum_i p(x_i) I(x_i) = - \sum_i p(x_i) \log p(x_i)$,
assuming $X$ has possible values $x_i$
- $I(X)$ is the _information content_ of $X$
- if $p_i = 0$, then $0 \log 0$ is taken to be 0 (see the sketch at the end of this list)
- eg binary entropy: [$H(X)$ for varying
$P(X=1)$](http://en.wikipedia.org/wiki/File:Binary_entropy_plot.svg)
- distinguish btwn _entropy_ of a distribution of messages/symbols and the
_entropy rate_ of a stochastic process
- eg need 7 bits to distinguish the first 128 symbols generated in the
fibonacci sequence
- the fib sequence formula itself has a much lower entropy and applies to any
length fib seq
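A minimal sketch of the entropy formula above, in bits:
```python
import math
from collections import Counter

def shannon_entropy(probs):
    """H(X) = -sum p_i log2 p_i, in bits; 0*log(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))          # 1 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))          # ~0.469 bits: a biased coin
print(shannon_entropy([0.25] * 4))          # 2 bits: 4 equally likely symbols

# empirical entropy of a symbol stream
counts = Counter("aaabbc")
total = sum(counts.values())
print(shannon_entropy([c / total for c in counts.values()]))  # ~1.459 bits
```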
Fisher information
- measures the amount of info an observation carries about an unknown parameter
- expected squared score, i.e. depends on the PDF & its first derivative wrt the parameter
- FIM is the curvature of the KL divergence
minimum description length (MDL)
- best hyp yields best compression; occam's razor
- minimum message length (MML) is very similar
probability
===========
stochastic processes
- bernoulli process: stream of discrete & independent events, usu binary;
discrete-time, unlike poisson
- poisson process: stream of continuous & independent events occurring at const
avg rate; continuous-time (infinite-granularity) bernoulli process
- characterized by rate param $\lambda$
distributions
- bernoulli: heads or tails
- binomial: number of successes in $n$ draws with replacement (multiple
bernoulli trials)
- hypergeometric: number of successes in $n$ draws *without* replacement
(unlike binomial)
- poisson: # events over some amount of time according to poisson process;
infinite-granularity binomial
- exponential: time btwn events in poisson process
- gaussian/normal
- log-normal distribution: dist of a RV whose log is normal
- in Black-Scholes model, changes in the log of exchange rates, price
indices, and stock market indices are assumed normal
- Cauchy: occurs in spectroscopy
- gamma: sum of fixed # exponentially dist vars
- eg total time for some # ppl to arrive
- great reference illustrating relationships among distributions:
<http://www.johndcook.com/distribution_chart.html>
misc
- law of total expectation: $E(E(X|Y)) = E(X)$
- law of total variance: $V(X) = V(E(X|Y)) + E(V(X|Y))$
- <http://anyall.org/totvar/>
- $P(X < \min(Y,Z)) = P(X > \max(Y, Z)) + 1 - P(X > Y) - P(X > Z)$
- Chebyshev's inequality: $P(|X-\mu| \ge k \sigma) \le \frac{1}{k^2}$ for any real $k>0$
- Jensen's inequality: If $f$ is convex and $X$ is a random variable, $f(E(X))
\le E(f(X))$.
problems
- expected # flips before HTH > that of HTT, because HTH overlaps with itself
- sketch: with a million flips, same # HTH and HTT, but HTH are more clumped,
so more space btwn them
- st. petersburg's paradox <http://mindyourdecisions.com/blog/2010/06/07/the-st-petersburg-paradox-a-flimsy-critique-of-expectation-theory-by-people-who-dont-know-math-or-economics/>
- game: toss coin until H; payoff is $2 for 0T, $4 for 1T, $8 for 2T, ...
- expected payout is $1/2 * 2 + 1/4 * 4 + 1/8 * 8 + ... = \infty$
- but must consider realistic payouts (nobody has $\infty$), utility of
payout (e.g. log scale), and weight-vs-risk (why you don't play the
lottery)
- kelly criterion: optimal fraction to bet in a series of bets to maximize the _expected log growth rate_ of the bankroll (equivalently, long-run compound growth); see the sketch below
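A sketch of the Kelly fraction for one standard special case: a bet that pays $b$ times the stake with probability $p$ and loses the stake otherwise, giving $f^* = p - (1-p)/b$; the simulation parameters are made up.
```python
import random

def kelly_fraction(p, b):
    """Optimal bet fraction for a bet that wins b times the stake with
    probability p (loses the stake otherwise): f* = p - (1 - p) / b.
    This maximizes the expected log of the bankroll."""
    return p - (1 - p) / b

def simulate(fraction, p, b, rounds=1000, seed=0):
    rng = random.Random(seed)
    bankroll = 1.0
    for _ in range(rounds):
        stake = fraction * bankroll
        bankroll += stake * b if rng.random() < p else -stake
    return bankroll

f_star = kelly_fraction(p=0.6, b=1.0)      # even-odds bet, 60% win prob
print(f_star)                               # ~0.2
# betting the Kelly fraction tends to beat over-betting in the long run
print(simulate(f_star, 0.6, 1.0), simulate(0.8, 0.6, 1.0))
```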
game theory
===========
- shapley shubik power index: measure power of players in weighted voting game
- of all possible permutations, when votes in favor happen in order, which
player tilts the vote? rank them by the frequency this applies
- pareto efficiency aka pareto optimality: a possibly local optimum
- can't improve one person without worsening another
- nash equilibrium: nobody can improve by changing only own strategy
unilaterally
- assumes each player knows equilibrium strategies of others
- doesn't mean there aren't strategies that are better for all
- eg competing businesses forming cartel to increase profits
- may appear irrational to 3rd party, since not necessarily pareto optimal
puzzles
- pirates
- prisoner's dilemma
- if A testifies against B and B is silent, A goes free and B gets full 10yr
- if both silent, both get 6mo
- if both testify, both get 5yr
- regardless of what opponent chooses, you always are better off betraying
- nash equilibrium not nec pareto-optimal
- non-0-sum game
- hotelling's game
- 2 competing hot dog stands $(i,j)$ on a beach that ranges over $[-1, 1]$
- customers go to nearest stand
- stands will naturally go to $(0,0)$, even though social opt is $(-.5,.5)$
number theory
=============
- fermat's last theorem: no 3 pos ints $a,b,c$ can satisfy $a^n+b^n=c^n$ for
any int $n>2$
- long-standing conjecture; proven in 1995
misc
====
- set is _well-ordered_ iff has total ordering and any subset has least elt
- harmonic series: $\sum_{n=1}^{\infty}\frac{1}{n}=\infty$
- for density function $f(x)$ with one hump, choose $a,b$ on opposite sides of the hump with $f(a) = f(b)$, then $[a, b]$ is the shortest interval containing that probability mass
- useful in statistics and more
- <http://www.johndcook.com/blog/2009/02/23/finding-shortest-confidence-interval/>
queueing theory
===============
- little's law: $L=\lambda W$ where $L$ is avg number of customers in system,
$\lambda$ is avg arrival rate, $W$ is avg time each customer spends in system
- Kendall's notation: A/B/C where A is arrival process, B is service time dist,
C is # servers
- M/M/1 queue: Poisson arrivals, exp'l service time, 1 server
- <http://en.wikipedia.org/wiki/Kendall%27s_notation>
measure theory
==============
- studies ways of generalizing the notions of length/area/volume
- measure theory asks questions like:
- How do we define a measure on our space? (Jordan measure and Lebesgue
measure are two different options in Euclidean space.)
- What properties does our measure satisfy? (For example, does it satisfy
translational invariance, rotational invariance, additivity?)
- Which objects are measurable/which objects can we say it's okay not to
measure in order to preserve nice properties of our measure? (The
Banach-Tarski ball can be rigidly reassembled into two copies of the same
shape and size as the original, so we don't want it to be measurable, since
then we would lose additivity properties.)
- TODO URL
sequences and series
====================
- TODO name: $(1-1/x)^k \approx e^{-k/x}$, $(1-k/m)^n \approx e^{-kn/m}$ (last
seen <http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf>)
graph theory
============
- a branch of combinatorics
- eulerian paths
- eg bridges of Konigsberg
- poly-time algos
- euler trail aka eulerian path aka eulerian walk: visit each edge once; exists iff connected and exactly 0 or 2 nodes have odd degree
- eulerian circuit aka eulerian cycle aka eulerian tour: also return to start; exists iff connected and all nodes have even degree
- hamiltonian path: visit each vertex once; NP-complete
- overview: <http://arxiv.org/abs/condmat/0303516>
- spectral graph theory
- study of graphs and the characteristic polynomials, evals, and evecs of $A$
or $L$
- digraph: directed graph
- sparse graph: # edges close to min # edges
- dense graph: # edges close to max # edges
- finite graph: finite # edges and # nodes
- locally finite graph: each node has finite degree
- _maximal clique_: clique that's not a subset of any other clique
- _maximum clique_: biggest clique(s) in the graph
- multigraph: a graph with multiple edges between same node pairs
- spectrum of a graph: spectrum (eigenvalues) of graph's adj mat $A$
- isomorphic graphs have same spectrum
- $d$-regular graph is connected iff $\lambda_1 > \lambda_2$
- $d$-regular graph has largest eval $d$
- laplacian matrix of a graph (of n nodes) $L = D - A$ where
- $D = diag(d_1, ..., d_n)$, where $d_i$ = degree of node i
- $A$ = adj mat
- spectral embedding: plotting a graph using some of its spectral eigenvectors
as coordinates
- spectral gap: difference between the two largest evals, $\lambda_1 - \lambda_2$
- expander graph: sparse graph with high connectivity
- name: subsets "expand" quickly
- standard constraint: $d$-regular
- expansion parameters: edge expansion, vertex expansion, spectral expansion
- hard to compute (exponentially many subsets)
- lower- and upper-bounded by expressions in the spectral gap of the graph
- many apps involve random walks on expanders
- markov transition matrix is just proportional to the adj mat
- applications
- fault tolerance: networks with guaranteed connectivity (removing random
edges does not reduce property of an expander by much)
- more surprisingly: ECCs, PRNGs, crypto hashes, PCP theorem
- $k$-regular graph: each node has degree $k$
- max degree is an eval
- _clustering coefficient_ of some node is (# links between neighbors / #
possible); best to refer to wikipedia illustration
- intuitively: how close a node + neighbors are to being a clique
- _centrality_: importance of a node; various measures
- _degree centrality_: just the node degree
- nodes that occur on many shortest paths between pairs of nodes in the graph have a high _betweenness centrality_ score
- _eigenvector centrality_: proportional to sum of centrality scores of
neighbors
- calc eigendecomp of adj matrix, select evec w biggest eval (see the sketch at the end of this section)
- elt $i$ in evec gives centrality of node $i$
- complex networks: exhibit non-trivial topological features
- e.g.:
- heavy tail in degree distribution
- high clustering coefficient TODO
- assortativity: preference for a network's nodes to attach to others that
are similar or different in some way
- clustering at many scales
- hierarchical structure
- scale-free network: describes many "real-world networks"
- some nodes are highly connected hubs (high degree)
- most nodes are of low degree
- degree distribution follows power law
- structure and dynamics are independent of # nodes
- preferential attachment
- fault-tolerance: assuming uniformly random node failures, unlikely that a
hub fails
- hypergraph: generalization of a graph, where edges can connect any number of
vertices
- de Bruijn graph: with n dimensions and m symbols, this is a directed graph of
m^n vertices, one for each length-n sequence of symbols
- rule for edges is more complicated; captures overlaps between sequences of
symbols
- butterfly graph: recursive criss-crosses
- found in fft
- butterfly network: a network that's almost partitioned, connected only by a
few edges
- different from butterfly graph
- network theory: application of graph theory to networks, which are graphs
that represent relations between objects
- network coding: "combining" values to maximize information flow
- Euclidean graph: edge weights = Euclidean distance
- random graph: graph generated by randomly adding edges to a node set
- there are different random processes used
- eg: start w all disconnected nodes, then create an edge btwn each pair w some prob
- Erdos number: collaborative distance to ("degrees away from") Paul Erdos
- small world networks
- randomly adding few edges to high-diameter graph drops diameter drastically
- typical properties:
- high clustering coefficient
- short avg path length
- over-abundance of hub nodes
- many local links and few long range "shortcuts"
- create ring of nodes, each connected to $k$ nearest neighbors; then w
some prob, rewire each edge to an existing dst node
- planar graph
- $O(n)$ to determine planarity
- sparse: $O(v)$ edges, asymptotically smaller than max ($O(v^2)$)
- polyhedra can be "unrolled" into planar graphs
- Euler characteristic: number that describes an aspect of a topological space's shape or structure
- Euler's formula: for finite, connected, planar graphs, $v - e + f = 2$, where
$f$ is # faces
- i.e., Euler characteristic is 2
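A minimal sketch of eigenvector centrality as described earlier in this section (power iteration on the adjacency matrix); the toy graph is made up.
```python
import numpy as np

def eigenvector_centrality(adj, iters=1000):
    """Eigenvector centrality: dominant eigenvector of the adjacency
    matrix (found here by power iteration); each node's score is
    proportional to the sum of its neighbors' scores."""
    A = np.asarray(adj, dtype=float)
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# small undirected graph: node 0 is connected to everyone else (a hub)
adj = [[0, 1, 1, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 0, 0, 0]]
print(eigenvector_centrality(adj).round(3))   # node 0 scores highest
```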
category theory
===============