- stanford linear systems & optimization | convex optimization I:
http://see.stanford.edu/see/courseinfo.aspx?coll=2db7ced4-39d1-4fdb-90e8-364129597c87
- finite element method
- KKT conditions
- <http://www.geometer.org/mathcircles/graphprobs.pdf>
signals
=======
- fourier, z, laplace, wavelet, hilbert xforms
- butterworth filter
statistics
==========
- frequentist statistics
- Point estimates trade off bias and variance
- Interval estimates have the goal of achieving nominal coverage and the goal
of being informative
- Tests have the goals of calibration and power
- frequentists express prior info not in dist but in choice of methods---may
still be too restrictive
- prospective vs retrospective studies
- prospective: cohort is random sample of population; watch them; may be very
skewed (eg if studying rare disease)
- retrospective: already know cases & controls; sample both evenly
- ascertainment: just means sampling (ascertainment bias, ascertainment method)
- TODO mixed effects
- characteristic function = fourier transform of the distribution
- TODO <http://itunes.apple.com/us/itunes-u/statistics-110-introduction/id495213607>
- bootstrap hypothesis testing: for 2 groups, merge them, generate a bunch of bootstrap samples, and see how frequently they generate means as different as what's seen (the p-value); see the sketch at the end of this list
- areas: descriptive, inferential
- moments
- positive skew: lumped to the left (counter-intuitive)
- positive kurtosis: pointier peak
- sample variance: same as var but divide by $n-1$ instead of $n$
- mean deviation: the mean absolute deviation from the mean
- harder to eval in closed form
- used by TCP to calculate large deviations from RTT by calcing an EWMA deviation
- weak law of large numbers (prove using Chebyshev's ineq):
$\lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1$
- strong law of large numbers (more complex proof):
$P( \lim_{n \rightarrow \infty} \bar{X_n} = \mu ) = 1$
- central limit theorem: mean/sum of sufficiently many independent random
variables, each with finite mean & variance, is approximately normal
- as sample size increases, sample mean distribution approaches normal
- standard error of the mean: standard deviation of the sampling distribution
of the sample mean ("sampling distribution of the sample mean" is the
"distribution of the mean of samples")
- confidence interval: at a certain confidence level (eg 95%), the long-run fraction of repeated experiments whose computed interval contains the true pop param
- "Were this procedure to be repeated on multiple samples, the calculated
confidence interval (which would differ for each sample) would encompass
the true population parameter 95% of the time."
- *not* for predicting prob of true pop param being in interval; pop param
has actual value, not a prob dist
- instead, indicates *reliability of estimate*
- find *lowest* value of param (e.g. population mean or population binomial
proportion) such that estimate (e.g. sample mean or sample binomial
proportion) is still in innermost 95% of the sampling distribution of the
estimate
- innermost means excluding lowest/highest 2.5%
- sampling distribution:
- for normal data $\sim N(\mu, \sigma)$, it's $\sim N(\mu, \sigma / \sqrt{n})$, where $\mu$ is unknown
- for bernoulli data $\sim Bern(p)$, it's $\sim Bin(n,p)$, where $p$ is unknown
- do the same for *highest* such value of param
- closely related to significance testing: testing null hyp $\theta=0$ can be
done by seeing if confidence interval for $\theta$ includes 0 (w confidence
level $y=1-\alpha$)
- intervals calculated from (functions of) sample, so they're statistics,
functions of RVs, hence RVs themselves
- margin of error: radius (half) of confidence interval
- statistical error: deviation of a sample from unobservable true/theoretical
value (eg person's height vs population mean)
- residual: deviation of a sample from observable estimated value (eg person's
height vs sample mean)
- survival function: complement of CDF
- 2 time series are _cointegrated_ if linear combo is stationary
- 2 time series are cointegrated if they share common _stochastic drift_
- eg drunkard and his dog: both randomly constantly moving but periodically
error-correcting; cointegrated of order zero
- eg stock market index and its futures contract, pairs trading
- eg, R-squared fails when dealing w non-stationary vars, leading to
_spurious regressions_ suggesting relationships when there's none
- Engle-Granger test: simplest detection approach TODO
- order of integration $I(p)$: min num diffs required to obtain a stationary
series
- eg 1,1,1: already stationary
- eg 1,2,3: diff w lag-1 is stationary
- eg 1,4,9,16,25: diff w lag-1 gives 3,5,7,9; second diff is stationary
- similar to calculus integration/differentiation
- simpson's paradox: a correlation present in diff groups is reversed when
groups are combined; often in social-/med-sciences when frequency data
hastily given causal interps
- TODO
- (paired) student t-test
- distributions
- student t-distribution: dist of the standardized sample mean for normal data when the variance is estimated from the sample; heavier tails than the normal, approaches it as sample size grows
- chi-square distribution
- Gaussian distribution, Cauchy, Exponential, Gamma, Generalized Extreme
Value, Rician, Wrapped Cauchy, Von Mises, Binomial, and Beta
- $\mathrm{Correlation}(X, Y) = \frac{ \mathrm{Covariance}(X, Y) }{ \sigma_X
\sigma_Y }$ where $\sigma$ is the SD
- seq of RVs has _heteroscedasticity_ if they have diff variances
- usu. error terms vary w each obs, often true in cross-sectional/time-series
- _copula_, Gaussian copula: TODO
- Kendall tau rank correlation coefficient $\tau$
- similarity of orderings of data
- TODO uses p-value somewhere
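A minimal sketch (in Python, with made-up data) of the bootstrap two-sample test described above: pool the groups, resample groups of the original sizes, and count how often the resampled difference of means is at least as extreme as the observed one.
```python
import random

def bootstrap_two_sample_pvalue(xs, ys, n_boot=10_000, seed=0):
    """Bootstrap test of 'same mean': pool the groups, resample two
    groups of the original sizes with replacement, and see how often
    the resampled difference of means is at least as extreme as the
    observed one."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    extreme = 0
    for _ in range(n_boot):
        bx = [rng.choice(pooled) for _ in range(len(xs))]
        by = [rng.choice(pooled) for _ in range(len(ys))]
        if abs(sum(bx) / len(bx) - sum(by) / len(by)) >= observed:
            extreme += 1
    return extreme / n_boot

# toy usage with made-up data
print(bootstrap_two_sample_pvalue([5.1, 4.9, 5.3, 5.0], [5.9, 6.1, 5.8, 6.2]))
```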
bayesians vs frequentists
- views of probability
- frequentist: $P(H)=.5$ turns up heads .5 of the time
- bayesian: coin w $P(H)=.5$ deserves even-odds betting
- views of stats
- frequentist: ask what's $P(H=h)$ given the null hyp that true odds are .5
- bayesian: estimate $P(\theta=.5)$ vs $P(\theta\ne.5)$ given #H & prior
- frequentist
- believes pop mean is real, but unknown and unknowable
- frequentist estimator
- find sample mean; construct confidence interval around it
- can't say "95% prob that true mean is in interval"; must say "95% prob over
intervals that an interval contains true mean", i.e. "if we were to
repeatedly sample data, find mean, and construct confidence interval, 95%
of intervals would contain the true mean"
- frequentist tester
- form hypothesis: "mean is blah"
- accept or reject hypothesis with test to significance level
- bayesian
- only data is real; pop mean is an abstraction; construct a credible
interval, centered near the sample mean and tempered by prior beliefs
(sometimes very non-informative)
- can say "95% prob that this interval contains the mean"
- <http://www.statisticalengineering.com/frequentists_and_bayesians.htm>
- <http://www.rasmusen.org/x/2007/09/25/bayesian-vs-frequentist-statistical-theory/>
conjugate priors
- conjugate priors produce posteriors of the same algebraic form
- an algebraic convenience that yields closed-form posteriors
- beta dist is conjugate prior for binomial & bernoulli
- dirichlet dist is generalization of beta and conjugate prior for multinomial
& categorical
- TODO
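A minimal sketch of the beta-binomial conjugacy above; the prior parameters and data here are just illustrative.
```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugacy: a Beta(alpha, beta) prior plus binomial data with
    `successes` and `failures` gives a Beta(alpha + successes,
    beta + failures) posterior -- same algebraic form, just updated
    counts."""
    return alpha + successes, beta + failures

# start from Beta(.5, .5) (the Jeffreys prior), observe 7 heads in 10 flips
a, b = beta_binomial_update(0.5, 0.5, successes=7, failures=3)
print(a, b)            # posterior is Beta(7.5, 3.5)
print(a / (a + b))     # posterior mean ~= 0.68
```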
non-informative aka objective priors
- not nec. flat/uniform priors
- eg for bernoulli/binomial it's `beta(.5,.5)`
- construct according to Jeffreys' rule: prior proportional to the square root of the Fisher information (invariant under reparameterization)
- <http://www.amstat.org/publications/jse/v12n2/zhu.pdf>
probability vs statistics
- probabilists: hypothesis to data
- very pure, "subjunctive"; strict if-thens; if die is fair, then $P(1)=1/6$;
"is die fair?" -> "not my dept"
- "in 100 tosses of fair coin, what's $P(H=40)$?
- statistician: data to hypothesis
- deal w not just if-then, but whether die fair, what the bias is
- more science-y; nature must be observed to get at the underlying rules
- "if 40 of 100 tosses were heads, what's $P(\theta=.5)$?": much harder
question, such that no agreement among math/science/philos/prob/stats
- TODO finish
- <http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/Metrics%2FStatistics/Bayesian-Likelihood-and-Frequentist-Approaches-to-/ArticleLong/Article/detail/84710>
test accuracy
- precision = relevant & retrieved / retrieved
- recall = relevant & retrieved / relevant
- F1 score aka F-measure: `F = 2 * (precision * recall) / (precision + recall)`
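A minimal sketch of these three measures on sets of item ids (toy data):
```python
def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall, and F1 from sets of item ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy example: 3 of the 4 retrieved items are relevant, 1 relevant item missed
print(precision_recall_f1(relevant={1, 2, 3, 4}, retrieved={1, 2, 3, 9}))
```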
statistical significance
- standard score: number of SDs; aka z-value, z-score, normal score
- this is frequentist hypothesis testing; Bayesian hypothesis testing is
Bayesian inference
- alt hyps
- Fisherian testing: no alt hyp, only accept/reject null hyp
- Neyman-Pearson: alt hyps
- in practice: alt hyp is just negation of null hyp
- p-value: P(observation or more extreme|null hypothesis); compared against the chosen significance level
- assuming null hyp (no underlying relationship or diff btwn the variables),
what's prob we'll see a diff at least as extreme as the one observed just
by chance?
- not same as saying, given that the alt hyp *is* true, we can replicate
the results 5% of the time or 95% of the time; that's called the
_statistical power_ of the design
- the lower, the less likely the result if the null hyp is true, thus the
more statistically significant the result is
- if < .05 or .01, corresponding to a 5% or 1% chance of rejecting the null
hyp when it's actually true (type I error)
- 5% is a typical value (border-line acceptability); in the end this is all
arbitrary, no way to say something is/isn't "really" significant
- eg P(14+ heads of 20 flips|fair coin)
- simple hypothesis: of the form $\theta = \theta_0$; specifies pop dist
completely
- composite hypothesis: of the form $\theta > \theta_0$; doesn't specify
population dist completely
- two-tail: p-value is $P(X>a \vee X<b | \theta)$
- $H_0 : \theta = \theta_0$ vs $H_1 : \theta \ne \theta_0$
- one-tail: p-value is $P(X>x|\theta)$
- $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$ (or vice versa)
- problem is that the $\theta$ is unknown (just something under $\theta_0$); TODO how do we justify the p-value, then, which must assume some value of $\theta$?
- E-value: expected num tests until a certain observation, assuming null hyp
- how to efficiently calc prob of *at least* $h$ H in $n$ flips? (see the sketch at the end of this list)
- type I error: false pos; rej null hyp when it's actually true
- type II error: false neg; acc null hyp when it's actually false
- mustn't stop experiment as soon as significant results are detected
- <http://www.evanmiller.org/how-not-to-run-an-ab-test.html>
- common "general format" of significance tests
- they're some ratio of (some measure of the differentiation common in the
variables) / (overall differentiation of those vars)
- eg: (part of overall differentiation of the WCC scores) / (overall
differentiation of the WCC scores)
- called ratio of "explained variation" to "total variation"
- can only get to P(null hypothesis|observation) if we're Bayesian and have a
prior
- Fisher vehemently opposed the concept of an alt hyp
- many different and widespread interpretation pitfalls
- quantile regression: estimate median or other quantile instead of mean
- good for risk models
- uses linear programming
- robust/resistant statistics: around for 20+ yrs; not widely used; p-values
reign for many non-ideal real-world reasons
- z-test: appropriate only if the population SD is known, else use t-test
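A sketch answering the "at least $h$ heads in $n$ flips" question from the list above, by summing the exact binomial PMF; it reproduces the one-tailed p-value for the 14-of-20 coin example.
```python
from math import comb

def binom_tail(n, h, p=0.5):
    """P(at least h heads in n flips of a coin with P(H)=p):
    sum the exact binomial PMF over k = h..n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(h, n + 1))

# one-tailed p-value for the example in the notes: 14+ heads out of 20
print(binom_tail(20, 14))   # ~= 0.0577
```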
statistical tests
- many different statistical measures
- _power_ of a test is its probability of correctly rejecting null hyp
(complement of false negative rate, $\beta$)
- power function $\beta(\theta) = P_\theta(X \in R)$, where $R$ is rejection
region (this is a diff $\beta$)
- size/significance level of a test ($\alpha$): prob of incorrectly rejecting
null hyp (false positive rate), or upper bound of prob of rejecting the null
hyp over all cases covered by the null hyp
- always want to choose the _most powerful_ test (highest power under $H_1$)
for a given size/significance level
- finding most powerful tests is hard; most powerful tests don't exist for
many cases
- 4 widely used tests: Wald, $\chi^2$, permutation, likelihood ratio
- contingency tables: used by many statistical tests
- test selection decision table: <http://www.graphpad.com/www/book/choose.htm>
- the complexity here is a critique of the frequentist approach
- chi-square: typically Pearson's chi-square
- Fisher's exact test: for small categorical data; compares two groups
- a type of exact permutation test; based on hypergeom dist
- for 2x2 contingency tables (1 DOF, since margins are held fixed)
- for each table with the same margins, calculate $p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}$
- sum the $p$ of the tables at least as extreme as the observed one to get the $p$-value (see the sketch at the end of this list)
- use fisher if possible; easy w computers; chi square is approximation
- <http://www.graphpad.com/faq/viewfaq.cfm?faq=1114>
- <http://en.wikipedia.org/wiki/Exact_test>
- <http://www.itl.nist.gov/div898/handbook/prc/section3/prc33.htm>
- Pearson's chi-square: for large categorical data
- assesses _goodness of fit_ and _tests of independence_ (TODO)
- $X^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$, where
- $X^2$ is Pearson's cumulative test statistic which asymptotically
approaches a $\chi^2$ distribution
- $O_i$ is observed freq
- $E_i$ is expected (theoretical) freq
- $n$ is number of cells in contingency table
- goodness of fit: compare group against theoretical expectation
- $\chi^2$ test approximates Fisher
- $X^2$ becomes $(ad-bc)^2 (a+b+c+d) / ((a+b)(c+d)(b+d)(a+c))$
- degrees of freedom
- for goodness of fit, it's number of terms in $X^2$ expr - 1
- for indep test, it's (num cols - 1)(num rows - 1)
- for 1DOF, chi square dist = gaussian, so same as "equal proportions z test"
(z test over the difference of 2 binomial random variables)
- <http://stats.stackexchange.com/questions/2391/what-is-the-relationship-between-a-chi-square-test-and-test-of-equal-proportions>
- <http://stattrek.com/AP-Statistics-4/Test-Difference-Proportion.aspx?Tutorial=AP>
- <http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html>
- <http://math.hws.edu/javamath/ryan/ChiSquare.html>
- binomial test: for small categorical data
- compare group against theoretical value (i.e. with-replacement; see
binomial dist vs hypergeom dist)
- non-parametric tests TODO: mann-whitney, wilcoxon
- Kolmogorov-Smirnov (K-S) test: nonparam test for equality of continuous 1D
prob dists; often used to test for GOF
- "The two-sample KS test is one of the most useful and general
nonparametric methods for comparing two samples, as it is sensitive to
differences in both location and shape of the empirical cumulative
distribution functions of the two samples."
- sequential analysis: sample size isn't fixed in advance; data is evaluated
during collection; may conclude exp earlier
- alt is bayesian exp design
- Mann-Whitney-Wilcoxon RankSum test: useful to determine if 2 distributions
are significantly different or not. Unlike the t-test, the RankSum test does
not assume that the data are normally distributed, potentially providing a
more accurate assessment of the data sets.
- Chow test: whether 2 linear regressions on diff data sets are equal
- also (in econometrics) for whether 2 time series over different times
exhibit a structural break
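A minimal sketch of Fisher's exact test from the list above: enumerate all 2x2 tables with the observed margins, compute each table's hypergeometric probability, and (one common two-sided convention) sum the probabilities of tables no more probable than the observed one. The example table is made up.
```python
from math import comb

def fisher_exact_pvalue(a, b, c, d):
    """Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
    with margins fixed, each table's probability is hypergeometric,
        P = C(a+b, a) * C(c+d, c) / C(n, a+c).
    Two-sided p-value: sum the probabilities of every table with the
    same margins that is no more probable than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):  # probability of the table with x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# toy 2x2 table: 8 of 10 treated improved vs 2 of 10 controls
print(fisher_exact_pvalue(8, 2, 2, 8))   # ~= 0.023
```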
data
- censoring: limiting values to be within bounds
- truncation: omitting values that are out of bounds
- eg: if policyholders subject to policy limit $u$ and deductible $d$, then
losses are left-truncated (unreported if under $d$) and right-censored
(limited to $u-d$)
time series (TODO also cover math)
- stationary process: stochastic process whose joint prob dist doesn't change
when shifted in time/space, so mean/variance/autocorr don't change
- white noise is stationary; cymbal clash variance diminishes over time
- i.e. "flat", not trending over time
- for a non-stationary signal, either fitting a curve and then taking
residuals or differencing may give stationary data
- autocorrelation: cross-correlation of signal w itself; a lagged self-correlation; to detect non-randomness; works if fixed intervals (see the sketch at the end of this list)
- IID implies 0 autocorrelation (but not conversely)
- Durbin-Watson test: conventional test to detect autocorrelation
- 0 to 4; 2 means no autocorrelation
- cross-correlation: sliding dot product of two signals; like convolution but
no reversal
- linear prediction: future values of discrete-time signal estimated as linear
function of previous samples
- autoregressive (AR) model: random process used to model natural phenomena; a
type of linear prediction formula
- vector autoregressive (VAR) model: popular in econometrics
- Dickey Fuller test: whether unit root is present in autoregressive model
- TODO: (partial) autocorrelation plots, ARMA, ARIMA, Ljung-Box for zero
autocorr, unit root test for coint, Granger-causality, whiteness (iid-ness) &
normality, VAR impulse response analysis and forecast error variance decomp,
- TODO filtering: Hodrick-Prescott (HP), Baxter-King, Christiano-Fitzgerald
(all popular in finance)
- TODO Bayesian dynamic linear models (DLM)
- repetition
- seasonal aka periodic: fixed-length cycles
- cyclic: variable-length cycles
- cyclic/seasonal models
- TODO cyclic vs seasonal vs periodic <http://robjhyndman.com/researchtips/cyclicts/>
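A minimal sketch of the lag-$k$ sample autocorrelation mentioned above, normalizing by the overall variance (the usual time-series convention); the toy series is made up.
```python
def autocorrelation(xs, lag):
    """Sample autocorrelation at a given lag: correlation of the
    series with a lagged copy of itself (normalized by the overall
    variance, the usual time-series convention)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var

# a strongly periodic toy series has large autocorrelation at its period
series = [0, 1, 0, -1] * 25
print(autocorrelation(series, 4))   # close to 1
print(autocorrelation(series, 2))   # close to -1
```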
statistical inference
- analysis of variance (ANOVA or AOV): TODO
- canonical (variate) analysis: TODO
- 1 predictor, 1 criterion: coef of corr $r$, coef of det $r^2$, coef $\beta$
- $n$ predictors, 1 criterion: multiple corr $R$, multiple coef of det $R^2$,
weights $\beta$
- $n$ predictors, $m$ criteria: canonical corrs $\rho_1, \dots$, canonical
weights $C,D$
- find linear combo of predictors and linear combo of criteria w max corr
- kernel density estimator: basically place a small kernel everywhere you have a sample, resulting in something much smoother than a histogram (where bins can make all the difference in interpretation); see the sketch at the end of this list
- mean absolute percentage error (MAPE): $M = \frac{1}{n} \sum_{t=1}^n
\left\vert \frac{A_t - F_t}{A_t} \right\vert$
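A minimal sketch of a Gaussian kernel density estimate as described above; the bandwidth is an arbitrary choice here, not a recommendation.
```python
import math

def gaussian_kde(samples, bandwidth):
    """Kernel density estimate: put a small Gaussian bump at every
    sample and average them; smoother than a histogram and with no
    bin-placement artifacts."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# toy bimodal data; evaluate the smoothed density at a few points
data = [1.0, 1.2, 0.9, 1.1, 4.0, 4.2, 3.9]
f = gaussian_kde(data, bandwidth=0.4)
print([round(f(x), 3) for x in (1.0, 2.5, 4.0)])
```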
relationship btwn mean squared error and gaussian dist? (TODO cs229 notes)
outlier elimination
- chauvenet's criterion
- peirce's criterion: "if you need rigor, generality, and the ability to reject more than one data point"
- process involves multiple steps and lookup tables
- <https://web.chemistry.ohio-state.edu/~coe/research/documents/error_distributions_new2.pdf>
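A sketch of Chauvenet's criterion under its usual statement: flag a point if the expected number of observations as far from the mean ($n$ times the two-sided normal tail probability) falls below 1/2. The data set is made up.
```python
import math
import statistics

def chauvenet_outliers(xs):
    """Chauvenet's criterion: flag x if the expected number of
    observations as far from the mean (n * two-sided normal tail
    probability) is less than 1/2."""
    n = len(xs)
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    flagged = []
    for x in xs:
        z = abs(x - mean) / sd
        tail = math.erfc(z / math.sqrt(2))   # P(|Z| >= z) for a standard normal
        if n * tail < 0.5:
            flagged.append(x)
    return flagged

# toy data with one wild reading
print(chauvenet_outliers([9.9, 10.1, 10.0, 10.2, 9.8, 10.0, 14.0]))  # [14.0]
```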
exponential smoothing for time series analysis
- _exponential moving average (EMA)_ aka _exponentially weighted moving average
(EWMA)_ aka _single exponential smoothing (SES)_: does _not_ spot trends
- _double exponential moving average (DEMA)_ aka _linear exponential smoothing (LES)_: incorporates trend
- _triple exponential moving average (TEMA)_ aka _triple exponential smoothing (TES)_: incorporates trend & seasonality (see the sketch below)
- <http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc43.htm>
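A minimal sketch of single (EWMA) and double (Holt-style) exponential smoothing; the smoothing constants are arbitrary.
```python
def single_exp_smoothing(xs, alpha=0.3):
    """EWMA / single exponential smoothing: level only, no trend."""
    level = xs[0]
    out = [level]
    for x in xs[1:]:
        level = alpha * x + (1 - alpha) * level
        out.append(level)
    return out

def double_exp_smoothing(xs, alpha=0.3, beta=0.1):
    """Holt's linear (double) exponential smoothing: level + trend."""
    level, trend = xs[0], xs[1] - xs[0]
    out = [level]
    for x in xs[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        out.append(level)
    return out

# on a trending series, single smoothing lags behind while double tracks it
trend_series = [float(t) for t in range(10)]
print(single_exp_smoothing(trend_series)[-1])   # well below 9
print(double_exp_smoothing(trend_series)[-1])   # close to 9
```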
likelihood
- likelihoods for continuous distributions use PDFs instead of probabilities
(since $P(X=x) = 0$); this is justified, see
<http://en.wikipedia.org/wiki/Likelihood_function#Likelihoods_for_continuous_distributions>
misc
- benford's law: distribution of leading digits is non-uniform (exp decay)
pitfalls
--------
- unbiased estimators can be terrible
- drug X results may have $p=.04$ and drug Y results may have $p=.06$ but
difference may well NOT be significant (diff btwn sig & insig may well be
insig)
- $p$-value is not prob of hyp being right; to compute this (posterior), more
info is needed (prior)
- statistical significance != significance
- data dredging: many sig tests will inevitably result in false pos results
- example takedown: <http://www.americanscientist.org/issues/id.14344,y.0,no.,content.true,page.1,css.print/issue.aspx>
regression tips
---------------
- from andrew gelman <http://andrewgelman.com/2010/12/what_do_practit/>
- (wouldn't take all these as gospel)
- pre-process your vars; discretize numeric, avg several related
- usu fine to just treat discrete outcomes (eg 1-5 scale) as numeric; don't
worry about ordered logit/probit/etc., just run the reg already
- take the 2 most important input vars in your reg & throw in their interaction
- The key assumptions of a regression model are validity and additivity. Except
when you’re focused on predictions, don’t spend one minute worrying about
distributional issues such as normality or equal variance of the errors.
surveys
-------
- Likert scale aka rating scale: psychometric scale (5/7-point)
- ordinal data, but not interval/linear data; requires nonparam tests
experiments
-----------
- factorial design: 2+ factors, ea w discrete values/levels; experimental units
take on all combos
- eg compare 2 motors at 2 speeds (4 combos)
combinatorics
=============
- permutations without replacement: P(n,r) = n! / (n-r)!
- permutations with replacement: n^r
- combinations without replacement (partitioning sets): C(n,k) = n! / (k! (n-k)!)
- this is the _binomial coefficient_
- _multinomial coefficient_: partitioning into multiple sets
- binomial is actually C(n; k, n-k)
- C(n; k1, k2, ..., kp) = n! / (k1! k2! ... kp!)
- combinations with replacement (multisets/bags):
C(n-1 + k; n-1, k) = (n-1+k)! / (k! (n-1)!)
- proof using "stars and bars"
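A quick sketch of the four counts above using only the standard library:
```python
from math import comb, factorial, prod

def permutations_wo_repl(n, r):      # P(n, r) = n! / (n-r)!
    return factorial(n) // factorial(n - r)

def multinomial(*ks):                # C(n; k1, ..., kp) = n! / (k1! ... kp!)
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

def multiset_combinations(n, k):     # combinations with replacement
    return comb(n - 1 + k, k)        # "stars and bars"

print(permutations_wo_repl(5, 2))    # 20
print(comb(5, 2))                    # 10, the binomial coefficient
print(multinomial(2, 3))             # 10 == C(5; 2, 3)
print(multiset_combinations(3, 2))   # 6 multisets of size 2 from 3 symbols
```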
mathematics
===========
- abstract algebra: study of algebraic structures such as groups, rings,
fields, modules, vector spaces, and algebras
- topology: extension of geometry that focuses on structure
- algebraic topology: uses abstract algebra
- topological graph theory: topology + graph theory; studies embeddings of
graphs in surfaces
- number theory
- computational number theory, algorithmic number theory: applications in
crypto
\begin{equation}
e^x = 1 + x \left(
1 + x \left(
\frac{1}{2} + x \left(
\frac{1}{6} + x \left(
\frac{1}{24} + x \left( \dots \right) \right) \right) \right) \right)
\end{equation}
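A sketch evaluating that nested (Horner) form of the truncated series; `terms` is an arbitrary cutoff.
```python
import math

def exp_taylor_horner(x, terms=20):
    """Evaluate the truncated Taylor series of e^x using the nested
    (Horner) form above: start from the innermost 1/(terms-1)!
    coefficient and work outward."""
    acc = 1.0 / math.factorial(terms - 1)
    for k in range(terms - 2, -1, -1):
        acc = 1.0 / math.factorial(k) + x * acc
    return acc

print(exp_taylor_horner(1.0), math.exp(1.0))   # both ~= 2.718281828
```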
geometry
========
- Pythagorean theorem
- easy proof
Interesting questions that came up during ICFPC08
- find the tangent lines (up to four) between a pair of circles
- find the tangent lines between a circle and a point
- if we roll a circle around the outside of an ellipse and trace the path of
its center, is the resulting oval an ellipse?
- circumcenter of a polygon = point equidistant to all vertices = center of circumscribed circle = intersection of the perpendicular bisectors of the edges (exists only for cyclic polygons)
linear algebra
==============
inverting matrices
- Gaussian elim yields row echelon form (0s in lower-left)
- Gauss-Jordan elim yields reduced row echelon form ($I$)
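A minimal Gauss-Jordan inversion sketch (augment with $I$, reduce the left half to reduced row echelon form, read the inverse off the right half); partial pivoting only, no attention to numerical niceties.
```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination: augment
    with the identity, reduce the left half to I, and read the
    inverse off the right half. Uses partial pivoting; raises on a
    singular matrix."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

print(invert([[4.0, 7.0], [2.0, 6.0]]))   # ~[[0.6, -0.7], [-0.2, 0.4]]
```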
From <http://en.wikipedia.org/wiki/Sparse_matrix>:
> Conceptually, sparsity corresponds to systems which are loosely coupled.
> Consider a line of balls connected by springs from one to the next; this is a
> sparse system. By contrast, if the same line of balls had springs connecting
> every ball to every other ball, the system would be represented by a dense
> matrix. The concept of sparsity is useful in combinatorics and application
> areas such as network theory, of a low density of significant data or
> connections.
eigenvalues and eigenvectors
- treating $M$ as linear xform, which vectors would keep same direction (but
possibly scaled)? ($Mx=\lambda x$)
- for symmetric $M$: $M$ xforms unit sphere $x^T x = 1$ into ellipsoid where evecs are axes, evals are radii (for general $M$, the axes/radii are singular vectors/values)
- if $M$ has (centered) data pts as row vecs, covariance matrix $M^T M$ has evecs as principal components, evals as their variances
- 2D examples (counting real eigendirections): shear: 1 evec; uniform scale: $\infty$ evecs; non-uniform scale: 2 evecs; rotate: 0 evecs
- _power method_ or _power iteration_: eigenvalue algo
- given mat $A$, find scalar $\lambda$ (eigenvalue) and non-0 vector $v$
(eigenvector) s.t. $A v = \lambda v$
- doesn't compute matrix decomposition so suitable for large sparse A
- but will only find one eigenvalue (with max val, aka _dominant eigenpair_)
and may converge slowly
- start with random vector and iteratively multiply by $A$ and normalize (see the sketch after this list)
- used in pagerank, HITS
- <http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf>
- for transition matrix $P$ of a markov chain, the stationary dist $\pi$ is the eigenvalue-1 left evec: $\pi P = \pi$
- for ratings matrix (eg rows are users, cols are movies), evecs of $M M^T$ are
row features, of $M^T M$ are col features
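A minimal sketch of the power method described above (random start, multiply by $A$, normalize), using numpy; the test matrix is made up.
```python
import numpy as np

def power_iteration(A, iters=1000, seed=0):
    """Power method: repeatedly apply A and renormalize; converges to
    the dominant eigenpair (largest |eigenvalue|), assuming it is
    unique and the start vector has a component along it."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ A @ v           # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
val, vec = power_iteration(A)
print(val)                            # ~= 3.618, the largest eigenvalue
print(np.allclose(A @ vec, val * vec))
```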
symmetric positive definite, positive semi-definite, negative definite, indefinite, etc.
- $M$ is _positive definite_ iff any of:
- $z^T M z > 0$ for all nonzero vector $z$
- all evals of $M$ are pos
- all leading principal minors are pos
- i.e. upper left $k \times k$ submatrix of $M$ has pos det for all $k =
1,2,\dots,n$
- eg: if $M$ is diagonal, then $M$ is pos def iff all diag elts are pos
(diag elts are evals)
- $M$ is neg def iff any of:
- $-M$ is pos def, i.e. $x^T M x < 0$ for all nonzero vector $x$
- all evals of $M$ are neg
- leading principal minors alternate in sign starting negative: upper left $1 \times 1$ submatrix has neg det, upper left $2 \times 2$ submatrix has pos det, upper left $3 \times 3$ submatrix has neg det, etc.
- $M$ is pos semi-def iff any of:
- $z^T M z \ge 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-neg
- note: if $M$ is pos semi-def, then all upper left $k \times k$ submatrices
have non-neg dets, but this is not a sufficient condition any more; same
for neg semi-def below
- $M$ is neg semi-def iff any of:
- $z^T M z \le 0$ for all vector $z$ (usual definition)
- all evals of $M$ are non-pos
- $M$ is indef iff any of:
- $M$ is neither pos semi-def nor neg semi-def
- $M$ has pos and neg evals
definitions
- vectors linearly dependent iff
- some vector is a linear combo of the others (redundancy)
- some linear combo with coefs not all 0 equals 0
- ($n$ vectors in $\Re^n$) they fail to span $\Re^n$
- ($n$ vectors in $\Re^n$) determinant of the matrix with the vectors as cols is 0
- basis of vector space: linearly indep spanning set of vectors
- subspace of $\Re^n$: set of vectors closed under addition and scalar mult (hence must incl. origin)
- nullspace of $A$ is set of vectors $x$ s.t. $Ax=0$; a valid subspace
- a line in >2D is usually represented parametrically ($t$); implicitly it requires intersecting multiple hyperplanes
- A matrix $A$ is diagonalizable if there exists a non-singular matrix $B$ and
a diagonal matrix $D$ such that $A = BDB^{-1}$.
- _singular value decomposition (SVD)_: important factorization of a
rectangular real or complex matrix
- many applications in signal processing, statistics
- eg computing the pseudoinverse, least squares fitting of data, matrix
approximation, and determining the rank, range and null space of a matrix
- singularity
- matrix is singular iff it has no inverse iff some eigenvalue is zero iff determinant is 0 (0 is a root of the characteristic polynomial)
- matrix is nonsingular iff invertible iff all eigenvalues are non-zero iff determinant is non-0
- characteristic polynomial: $p(\lambda) = \det(A - \lambda I)$
- eigenvalues are roots of characteristic polynomial
- eigenspace: for a given eigenvalue, linear subspace spanned by its
eigenvectors
- if $M$ is pos def then $z^* M z > 0$ for any nonzero vector z
- a pos def matrix has pos def inverse
- all eigenvalues of pos def are positive
- an invertible real sym matrix has a real sym inverse
- _kernel_ of a mapping: elts that map to the zero elt
- _kernel_ aka _null space_ of a matrix $A$: the set of vectors $x$ with $Ax=0$; its dimension is the nullity of $A$
foundations
- _vector space_: eg $\mathbf{R}^2$ (set of all 2-dim vecs)
- _subspace_: subset of vector space closed under linear combination
- subspaces of $\mathbf{R}^2$: $\mathbf{R}^2$, any line through origin, and
$\{(0,0)\}$
- subspaces of $\mathbf{R}^3$: $\mathbf{R}^3$, any plane through origin, any line through origin, and $\{(0,0,0)\}$
- _column space_ $C(A)$ is span of $A$'s col vecs
- if not fully linearly indep, then fewer dimensions spanned than #columns
- $Ax=b$ can be solved iff $b \in C(A)$
- _kernel_ aka _null space_ $N(A)$ is set of all solutions to $Ax=0$
- solutions $x$ always form a subspace, i.e. linear combos of solutions are also solutions
- $N(A)=\{0\}$ means col vecs of $A$ lin indep (they form a basis of $C(A)$)
- TODO
- stochastic matrix
- eigen*, singular, etc.
- http://www.cut-the-knot.org/arithmetic/algebra/eigens.shtml
- spectral decomposition aka eigendecomposition
- spectral theorem
norms/distance functions
- taxicab geometry aka rectilinear dist aka L1 distance aka $\ell_1$ norm aka
city block dist aka manhattan dist aka manhattan length
- sum of absolute diffs in coordinates
- hamming dist is manhattan dist on the unit hypercube (since hamming dist components are either 0 or 1; hamming dist is # positions that are diff)
- euclidean dist aka $L^2$/$\ell^2$ dist/norm
- $p$ norm aka $L^p$ norm
- $( \sum_i |x_i|^p )^{1/p}$
- maximum norm aka infinity norm aka uniform norm aka supremum norm
- $\max( |x_1|, \dots, |x_n| )$
- triangle inequality holds for $n$-dim vectors
cryptography
============
RSA (TODO REVIEW!)
- choose primes p, q
- primality testing: _wheel factorization_ is certain; there are also
probabilistic tests
- find $e$ such that $gcd(e, (p-1)(q-1)) = 1$
- i.e., find $a$ such that $ae + b(p-1)(q-1) = 1$. This is a _diophantine equation_ and can be solved using the _extended Euclidean algorithm_.
- exponentiation: find $M^e \mod n$
- $M$ is data; $(n, e)$ is the pub key; the priv key is the exponent $d$ (the $a$ above), and decryption is $C^d \mod n$
- _square and multiply algorithm_
- _Chinese Remainder Theorem (CRT)_ for more complex, faster algos
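A toy-sized RSA sketch tying the steps above together: the extended Euclidean algorithm solves the diophantine equation for the private exponent, and Python's built-in `pow(base, exp, mod)` does modular square-and-multiply. Real RSA needs large primes and padding; the numbers here are illustrative only.
```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

# toy primes (real RSA uses primes of ~1024+ bits and proper padding)
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
e = 17                                # public exponent, gcd(e, phi) == 1
g, d, _ = extended_gcd(e, phi)        # solve e*d + phi*k == 1 (diophantine)
assert g == 1
d %= phi                              # private exponent

msg = 1234
cipher = pow(msg, e, n)               # square-and-multiply is built into pow
print(pow(cipher, d, n) == msg)       # decrypting recovers the message
```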
How To Share A Secret (Shamir 79)
- split into $n$ pieces such that any $k$ can reconstruct but $k-1$ can't
- <http://research.swtch.com/2008/01/i-could-tell-you-but.html>
signals, coding, information theory
===================================
Audio & Video Compression (Keith Winstein's SIPB cluedump 2008/11/06)
- huffman code: recurse: the two least freq symbols get codewords of the same len, differing only in the last bit (see the sketch below)
- keithw: slides
- entropy coding
- arithmetic coding can get us flatter?
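A minimal Huffman-code sketch matching the recursion above (repeatedly merge the two least frequent subtrees and prepend a distinguishing bit); the input string is arbitrary.
```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code from symbol frequencies: repeatedly merge
    the two least frequent subtrees; the two merged subtrees get a
    distinguishing bit, so the two rarest symbols end up with
    equal-length codewords differing only in the last bit."""
    freqs = Counter(text)
    # heap entries: (frequency, tiebreak, {symbol: codeword-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code("abracadabra")
print(code)                                  # 'a' gets the shortest codeword
print(sum(len(code[ch]) for ch in "abracadabra"), "bits total")
```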
general
- Shannon's theorem: describes _channel capacity_, the max information rate at
which reliable comm is possible over a channel that has a certain error prob
or SNR
- only existential, not constructive, hence the rate actually achieved depends on the coding algo
maximum entropy
- given constraints ("testable information"), the dist that best represents the
current state of knowledge is the one w largest entropy
- i.e., choose the _least informative_ (and thus most generalizable) dist
given what we know, since we don't know anything else
- also, many physical systems tend to move toward maxent configs over time
- makes explicit our freedom in using different forms of prior information
- applications
- obtain priors for bayesian inference
- maxent models: observed data itself is the testable information
- examples
- given just mean and sd, use normal
- given just the support [a,b], use uniform over [a,b]
- exponential dist w mean $1/\lambda$ is maxent dist over all dists on $[0,\infty)$ w mean $1/\lambda$
shannon entropy
- $H(X) = E(I(X)) = \sum_i p(x_i) I(x_i) = - \sum_i p(x_i) \log p(x_i)$,
assuming $X$ has possible values $x_i$
- $I(X)$ is the _information content_ of $X$
- if $p_i = 0$, then $0 \log 0$ is taken to be 0 (see the sketch at the end of this list)
- eg binary entropy: [$H(X)$ for varying
$P(X=1)$](http://en.wikipedia.org/wiki/File:Binary_entropy_plot.svg)
- distinguish btwn _entropy_ of a distribution of messages/symbols and the
_entropy rate_ of a stochastic process
- eg need 7 bits to distinguish the first 128 symbols generated in the
fibonacci sequence
- the fib sequence formula itself has a much lower entropy and applies to any
length fib seq
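A minimal sketch of the entropy formula above, in bits:
```python
import math
from collections import Counter

def shannon_entropy(probs):
    """H(X) = -sum p_i log2 p_i, in bits; 0*log(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))          # 1 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))          # ~0.469 bits: a biased coin
print(shannon_entropy([0.25] * 4))          # 2 bits: 4 equally likely symbols

# empirical entropy of a symbol stream
counts = Counter("aaabbc")
total = sum(counts.values())
print(shannon_entropy([c / total for c in counts.values()]))  # ~1.459 bits
```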
Fisher information
- measures the amount of info an observation carries about an unknown parameter
- expected squared score, i.e. depends on the PDF & its first derivative wrt the parameter
- FIM is the curvature of the KL divergence
minimum description length (MDL)
- best hyp yields best compression; occam's razor
- minimum message length (MML) is very similar
probability
===========
stochastic processes
- bernoulli process: stream of discrete & independent events, usu binary;
discrete-time, unlike poisson
- poisson process: stream of continuous & independent events occurring at const
avg rate; continuous-time (infinite-granularity) bernoulli process
- characterized by rate param $\lambda$
distributions
- bernoulli: heads or tails
- binomial: number of successes in $n$ draws with replacement (multiple
bernoulli trials)
- hypergeometric: number of successes in $n$ draws *without* replacement
(unlike binomial)
- poisson: # events over some amount of time according to poisson process;
infinite-granularity binomial
- exponential: time btwn events in poisson process
- gaussian/normal
- log-normal distribution: dist of a RV whose log is normal
- in Black-Scholes model, changes in the log of exchange rates, price
indices, and stock market indices are assumed normal
- Cauchy: occurs in spectroscopy
- gamma: sum of fixed # exponentially dist vars
- eg total time for some # ppl to arrive
- great reference illustrating relationships among distributions:
<http://www.johndcook.com/distribution_chart.html>
misc
- law of total expectation: $E(E(X|Y)) = E(X)$
- law of total variance: $V(X) = V(E(X|Y)) + E(V(X|Y))$
- <http://anyall.org/totvar/>
- $P(X < \min(Y,Z)) = P(X > \max(Y, Z)) + 1 - P(X > Y) - P(X > Z)$
- Chebyshev's inequality: $P(|X-\mu| \ge k \sigma) \le \frac{1}{k^2}$ for any real $k>0$
- Jensen's inequality: If $f$ is convex and $X$ is a random variable, $f(E(X))
\le E(f(X))$.
problems
- expected # flips before HTH > that of HTT, because HTH overlaps with itself
- sketch: with a million flips, same # HTH and HTT, but HTH are more clumped,
so more space btwn them
- st. petersburg's paradox <http://mindyourdecisions.com/blog/2010/06/07/the-st-petersburg-paradox-a-flimsy-critique-of-expectation-theory-by-people-who-dont-know-math-or-economics/>
- game: toss coin until H; payoff is $2 for 0T, $4 for 1T, $8 for 2T, ...
- expected payout is $1/2 * 2 + 1/4 * 4 + 1/8 * 8 + ... = \infty$
- but must consider realistic payouts (nobody has $\infty$), utility of
payout (e.g. log scale), and weight-vs-risk (why you don't play the
lottery)
- kelly criterion: optimal fraction to bet in a series of bets to maximize the _expected log growth rate_ of the bankroll (equivalently, long-run compound growth); see the sketch below
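A sketch of the Kelly fraction for one standard special case: a bet that pays $b$ times the stake with probability $p$ and loses the stake otherwise, giving $f^* = p - (1-p)/b$; the simulation parameters are made up.
```python
import random

def kelly_fraction(p, b):
    """Optimal bet fraction for a bet that wins b times the stake with
    probability p (loses the stake otherwise): f* = p - (1 - p) / b.
    This maximizes the expected log of the bankroll."""
    return p - (1 - p) / b

def simulate(fraction, p, b, rounds=1000, seed=0):
    rng = random.Random(seed)
    bankroll = 1.0
    for _ in range(rounds):
        stake = fraction * bankroll
        bankroll += stake * b if rng.random() < p else -stake
    return bankroll

f_star = kelly_fraction(p=0.6, b=1.0)      # even-odds bet, 60% win prob
print(f_star)                               # ~0.2
# betting the Kelly fraction tends to beat over-betting in the long run
print(simulate(f_star, 0.6, 1.0), simulate(0.8, 0.6, 1.0))
```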
game theory
===========
- shapley shubik power index: measure power of players in weighted voting game
- of all possible permutations, when votes in favor happen in order, which
player tilts the vote? rank them by the frequency this applies
- pareto efficiency aka pareto optimality: a possibly local optimum
- can't improve one person without worsening another
- nash equilibrium: nobody can improve by changing only own strategy
unilaterally
- assumes each player knows equilibrium strategies of others
- doesn't mean there aren't strategies that are better for all
- eg competing businesses forming cartel to increase profits
- may appear irrational to 3rd party, since not necessarily pareto optimal
puzzles
- pirates
- prisoner's dilemma
- if A testifies against B and B is silent, A goes free and B gets full 10yr
- if both silent, both get 6mo
- if both testify, both get 5yr
- regardless of what opponent chooses, you always are better off betraying
- nash equilibrium not nec pareto-optimal
- non-0-sum game
- hotelling's game
- 2 competing hot dog stands $(i,j)$ on a beach that ranges over $[-1, 1]$
- customers go to nearest stand
- stands will naturally go to $(0,0)$, even though social opt is $(-.5,.5)$
number theory
=============
- fermat's last theorem: no 3 pos ints $a,b,c$ can satisfy $a^n+b^n=c^n$ for
any int $n>2$
- long-standing conjecture; proven in 1995
misc
====
- set is _well-ordered_ iff has total ordering and any subset has least elt
- harmonic series: $\sum_{n=1}^{\infty}\frac{1}{n}=\infty$
- for density function $f(x)$ with one hump, choose $a,b$ on opposite sides of the hump with $f(a) = f(b)$, then $[a, b]$ is the shortest interval containing that probability mass
- useful in statistics and more
- <http://www.johndcook.com/blog/2009/02/23/finding-shortest-confidence-interval/>
queueing theory
===============
- little's law: $L=\lambda W$ where $L$ is avg number of customers in system,
$\lambda$ is avg arrival rate, $W$ is avg time each customer spends in system
- Kendall's notation: A/B/C where A is arrival process, B is service time dist,
C is # servers
- M/M/1 queue: Poisson arrivals, exp'l service time, 1 server
- <http://en.wikipedia.org/wiki/Kendall%27s_notation>
measure theory
==============
- studies ways of generalizing the notions of length/area/volume
- measure theory asks questions like:
- How do we define a measure on our space? (Jordan measure and Lebesgue
measure are two different options in Euclidean space.)
- What properties does our measure satisfy? (For example, does it satisfy
translational invariance, rotational invariance, additivity?)
- Which objects are measurable/which objects can we say it's okay not to
measure in order to preserve nice properties of our measure? (The
Banach-Tarski ball can be rigidly reassembled into two copies of the same
shape and size as the original, so we don't want it to be measurable, since
then we would lose additivity properties.)
- TODO URL
sequences and series
====================
- TODO name: $(1-1/x)^k \approx e^{-k/x}$, $(1-k/m)^n \approx e^{-kn/m}$ (last
seen <http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf>)
graph theory
============
- a branch of combinatorics
- eulerian paths
- eg bridges of Konigsberg
- poly-time algos
- euler trail aka eulerian path aka eulerian walk: visit each edge once; exists iff connected and exactly 0 or 2 nodes have odd degree
- eulerian circuit aka eulerian cycle aka eulerian tour: also return to start; exists iff connected and all nodes have even degree
- hamiltonian path: visit each vertex once; NP-complete
- overview: <http://arxiv.org/abs/condmat/0303516>
- spectral graph theory
- study of graphs and the characteristic polynomials, evals, and evecs of $A$
or $L$
- digraph: directed graph
- sparse graph: # edges close to min # edges
- dense graph: # edges close to max # edges
- finite graph: finite # edges and # nodes
- locally finite graph: each node has finite degree
- _maximal clique_: clique that's not a subset of any other clique
- _maximum clique_: biggest clique(s) in the graph
- multigraph: a graph with multiple edges between same node pairs
- spectrum of a graph: spectrum (eigenvalues) of graph's adj mat $A$
- isomorphic graphs have same spectrum
- $d$-regular graph is connected iff $\lambda_1 > \lambda_2$
- $d$-regular graph has largest eval $d$
- laplacian matrix of a graph (of n nodes) $L = D - A$ where
- $D = diag(d_1, ..., d_n)$, where $d_i$ = degree of node i
- $A$ = adj mat
- spectral embedding: plotting a graph using some of its spectral eigenvectors
as coordinates
- spectral gap: difference between the two largest evals, $\lambda_1 - \lambda_2$
- expander graph: sparse graph with high connectivity
- name: subsets "expand" quickly
- standard constraint: $d$-regular
- expansion parameters: edge expansion, vertex expansion, spectral expansion
- hard to compute (exponentially many subsets)
- lower- and upper-bounded by expressions in the spectral gap of the graph
- many apps involve random walks on expanders
- markov transition matrix is just proportional to the adj mat
- applications
- fault tolerance: networks with guaranteed connectivity (removing random
edges does not reduce property of an expander by much)
- more surprisingly: ECCs, PRNGs, crypto hashes, PCP theorem
- $k$-regular graph: each node has degree $k$
- max degree is an eval
- _clustering coefficient_ of some node is (# links between neighbors / #
possible); best to refer to wikipedia illustration
- intuitively: how close a node + neighbors are to being a clique
- _centrality_: importance of a node; various measures
- _degree centrality_: just the node degree
- nodes that occur on many shortest paths between pairs of nodes in the graph have a high _betweenness centrality_ score
- _eigenvector centrality_: proportional to sum of centrality scores of
neighbors
- calc eigendecomp of adj matrix, select evec w biggest eval (see the sketch at the end of this section)
- elt $i$ in evec gives centrality of node $i$
- complex networks: exhibit non-trivial topological features
- e.g.:
- heavy tail in degree distribution
- high clustering coefficient TODO
- assortativity: preference for a network's nodes to attach to others that
are similar or different in some way
- clustering at many scales
- hierarchical structure
- scale-free network: describes many "real-world networks"
- some nodes are highly connected hubs (high degree)
- most nodes are of low degree
- degree distribution follows power law
- structure and dynamics are independent of # nodes
- preferential attachment
- fault-tolerance: assuming uniformly random node failures, unlikely that a
hub fails
- hypergraph: generalization of a graph, where edges can connect any number of
vertices
- de Bruijn graph: with n dimensions and m symbols, this is a directed graph of
m^n vertices, one for each length-n sequence of symbols
- rule for edges is more complicated; captures overlaps between sequences of
symbols
- butterfly graph: recursive criss-crosses
- found in fft
- butterfly network: a network that's almost partitioned, connected only by a
few edges
- different from butterfly graph
- network theory: application of graph theory to networks, which are graphs
that represent relations between objects
- network coding: "combining" values to maximize information flow
- Euclidean graph: edge weights = Euclidean distance
- random graph: graph generated by randomly adding edges to a node set
- there are different random processes used
- eg: start w all disconnected nodes, then create an edge btwn each pair w some prob
- Erdos number: collaborative distance to ("degrees away from") Paul Erdos
- small world networks
- randomly adding few edges to high-diameter graph drops diameter drastically
- typical properties:
- high clustering coefficient
- short avg path length
- over-abundance of hub nodes
- many local links and few long range "shortcuts"
- create ring of nodes, each connected to $k$ nearest neighbors; then w
some prob, rewire each edge to an existing dst node
- planar graph
- $O(n)$ to determine planarity
- sparse: $O(v)$ edges, asymptotically smaller than max ($O(v^2)$)
- polyhedra can be "unrolled" into planar graphs
- Euler characteristic: number that describes an aspect of a topological space's shape or structure
- Euler's formula: for finite, connected, planar graphs, $v - e + f = 2$, where
$f$ is # faces
- i.e., Euler characteristic is 2
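A minimal sketch of eigenvector centrality as described earlier in this section (power iteration on the adjacency matrix); the toy graph is made up.
```python
import numpy as np

def eigenvector_centrality(adj, iters=1000):
    """Eigenvector centrality: dominant eigenvector of the adjacency
    matrix (found here by power iteration); each node's score is
    proportional to the sum of its neighbors' scores."""
    A = np.asarray(adj, dtype=float)
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# small undirected graph: node 0 is connected to everyone else (a hub)
adj = [[0, 1, 1, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 0, 0, 0]]
print(eigenvector_centrality(adj).round(3))   # node 0 scores highest
```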
category theory
===============