TODO bertsekas' auction algorithm (max weighting in bipartite graph)
TODO merge in notes from CS170
TODO http://www.catonmat.net/blog/summary-of-mit-introduction-to-algorithms/
TODO http://cstheory.stackexchange.com/questions/189/algorithms-from-the-book
TODO
- basics
- the chinese remainder theorem (really an algo)
- the euclidean algorithm
- gaussian elimination
- linear programming
- primality testing
- fast matrix product
- tsp
- integer factoring
- combinatorial optimization
- minimum spanning tree
- shortest path in graphs
- ford-fulkerson max-flow algorithm
- gale-shapley stable marriage
- general matching in graphs
- randomized algos
- primality testing
- volume of convex bodies
- counting matchings
- min-cut of graphs
- string matching
- hashing
- gibbs sampling
- markov chain monte carlo (MCMC)
- heuristic algos
- local search
- shotgun sequencing algorithm
- simulated annealing
- real-world impact
- fast fourier transform
- rsa encryption
- miller-rabin primality test
- reed-solomon codes
- lempel-ziv compression
- page rank of google
- consistent hashing of akamai
- viterbi and hidden markov models
- smith-waterman
- spectral low rank approximation algorithms
- binary decision diagrams
- info theory
- digital fountain codes
- cdma
- compressive sensing of images
- <http://rjlipton.wordpress.com/2009/10/27/highlights-of-focs-theory-day/>
strategies
- if you need to find say an $O(n)$-time algo, think about how to make use of
$O(n)$ space (dynamic programming); consider cases where the reuse is not
exactly straightforward, e.g. palindrome
- look for ways to simplify the problem by eliminating entire classes of stuff
  from the solution space, e.g. max $j-i$ where $A_j > A_i$ (sketch below)
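a sketch of the second strategy on the max $j-i$ example (illustrative helper,
not from the source; suffix maxima let us discard whole classes of candidate
pairs at once):

    def max_index_gap(A):
        """Largest j-i with A[j] > A[i], in O(n) via suffix maxima."""
        suf = A[:]
        for i in range(len(A) - 2, -1, -1):
            suf[i] = max(suf[i], suf[i + 1])   # suf[i] = max(A[i:])
        i = j = best = 0
        while i < len(A) and j < len(A):
            if suf[j] > A[i]:
                best = max(best, j - i)        # some A[j'], j' >= j, beats A[i]
                j += 1
            else:
                i += 1                         # no later elt beats A[i]; drop it
        return best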
randomized algorithms
- las vegas: always gives correct results; runtime resources may vary
- eg quicksort (rand pivots)
- monte carlo: deterministic runtime; output may be incorrect w certain prob
- eg min cut
misc terms
- _binary decision diagram (BDD)_: binary tree representation of truth table;
each layer switches on one input
bit tricks
- <http://graphics.stanford.edu/~seander/bithacks.html>
core concepts
- TODO: Range counting, predecessor, voronoi diagrams, dynamic optimality, planar point location, nearest neighbor, partial sums, connectivity
- skim adv algos
- <http://mybiasedcoin.blogspot.com/2009/10/core-tcs.html>
hash tables
- direct chaining: list per bucket; optional move-to-front heuristic
- open addressing: on collision, try other slots
  - linear probing: try $h$, $h+1$, $h+2$, ...
  - double hashing: try $h_1$, $h_2$, $h_3$, ...
- cuckoo hashing: $k$ (usu. 2) hash fn's (sketch after this list)
- insert: into one of the 2 locs, displacing any resident and repeat the
insertion of the resident, possibly evicting multiple keys
- lookup: inspect 2 locs
- inserts succeed in expected const time, even amortizing table rebuilds
(growths), as long as load factor <50%; with 3 fn's, up to <91% load
- also consider $n$-way buckets; with 2 keys per bucket, up to <80% load
- much faster than chained hashing for small, cache-resident hash tables on
modern processors
- $n$-way buckets faster than conventional methods for large tables
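a minimal cuckoo-hashing sketch (2 tables; `_h` is a stand-in for two
independent hash fns; a production version rebuilds with fresh fns when an
eviction cycle is hit):

    import random

    class Cuckoo:
        def __init__(self, m=64):
            self.m = m
            self.t = [[None] * m, [None] * m]
            self.seeds = (random.random(), random.random())
        def _h(self, i, key):
            return hash((self.seeds[i], key)) % self.m
        def lookup(self, key):                 # inspect exactly 2 locations
            return any(self.t[i][self._h(i, key)] == key for i in (0, 1))
        def insert(self, key, maxloop=32):
            i = 0
            for _ in range(maxloop):
                slot = self._h(i, key)
                if self.t[i][slot] is None:
                    self.t[i][slot] = key
                    return True
                key, self.t[i][slot] = self.t[i][slot], key  # displace resident
                i = 1 - i                      # re-insert evictee in other table
            return False                       # eviction cycle: rebuild/grow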
hash functions
- _rotating hashes_: for open addressing, avoid computing a bunch of different
hash functions; just compute one long hash value and apply sliding window on
it to get a bunch of shorter hash values
- implementations
- second-generation: murmurhash
- third-generation: cityhash, spookyhash; faster, greater ILP
bloom filter
- TODO
- TODO scalable bloom filters
- $m = -n \ln p / (\ln 2)^2$, where $m$ is # bits given $n$ elts, desired
  false-positive prob $p$
- $k = (m/n) \ln 2 \approx 0.7\,m/n$, where $k$ is # hash fns (sketch below)
- sometimes bits are shared by all hash fns; sometimes partition up into
disjoint ranges for each fn; same asymptotic behavior
- [Network Applications of Bloom Filters: A Survey](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&rep=rep1&type=pdf)
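plugging the two formulas above into code (a sketch; the rounding choices are
mine):

    import math

    def bloom_params(n, p):
        """# bits m and # hash fns k for n elts at false-positive rate p."""
        m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        k = max(1, round(m / n * math.log(2)))  # ln 2 ~ 0.693, the 0.7*m/n rule
        return m, k                             # e.g. (9586, 7) for n=1000, p=0.01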
prime sieve
given n, return all primes up through n
    def primes_upto(n):
        isprime = [True] * (n + 1)
        for i in range(2, int(n ** 0.5) + 1):
            if isprime[i]:                    # composites' multiples already marked
                for j in range(i * i, n + 1, i):
                    isprime[j] = False
        return [i for i in range(2, n + 1) if isprime[i]]
gcd
more concise (yay mod operator):
    gcd(a,b) =
      if b = 0, a
      else, gcd(b, a % b)
explicit:
    gcd(a,b) =
      if b = 0, a
      if a < b, gcd(b, a) // swap
      else, gcd(b, a % b)
self-balancing binary search trees
- red-black
- leaf nodes don't contain data; sometimes use single sentinel
- root, leaf nodes are black
- red nodes have black children
- every simple path from any node to any descendant leaf contains same number
of black nodes
- AVL
- splay: tree-rotate ("zig-zig/zig-zag") recently accessed node to root
string search
- knuth morris pratt
- boyer-moore
- boyer-moore-horspool
hashing
- locality sensitive hashing
- perfect hashing
- universal hashing
- linear probing, linear hashing
spatial indexing
- quadtree: 4-ary tree (child per quadrant)
- geohashing: treat the quadtree as a trie, addressing quadrants as 00, 01,
  10, and 11; a leaf's key is the concatenated path, e.g. `00 11 01 10 00`
  (sketch after this list)
- to query region intersection, can just choose leaves for tightest bound,
but that's slow/numerous
- instead, specify max number of regions, then recursively choose the
quadrant that when subdivided will remove the most area while staying under
max
- visit selected areas in "Z" order; its discontinuous jumps limit locality
- hilbert curves: "U" order; better locality behavior [for convex hulls?]
- a type of 1D fractal known as a _space-filling curve_
- can translate (x,y) to hilbert curve pos
- <http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves>
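a minimal geohash-style sketch (unit square, fixed depth; names are mine):
address the quadtree leaf containing a point by concatenating 2-bit quadrant
labels:

    def quadkey(x, y, depth=5):
        key, x0, x1, y0, y1 = [], 0.0, 1.0, 0.0, 1.0
        for _ in range(depth):
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            bx, by = int(x >= xm), int(y >= ym)     # which quadrant?
            key.append("%d%d" % (by, bx))
            x0, x1 = (xm, x1) if bx else (x0, xm)   # recurse into that quadrant
            y0, y1 = (ym, y1) if by else (y0, ym)
        return " ".join(key)                        # e.g. "00 11 01 10 00"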
bin packing
- NP-hard
- items of diff sizes must be packed into identical bins; minimize # bins
- formally: given bin size $V$ and item sizes $a_1,\dots,a_n$, find int $B$ and
  $B$-partition $S_1 \cup \dots \cup S_B$ of $\{1,\dots,n\}$ such that $\sum_{i
  \in S_k} a_i \le V$ for all $k \in 1,\dots,B$
integer linear programming aka integer programming
- NP-hard
- maximize $\mathbf{c}^T \mathbf{x}$ subject to $A \mathbf{x} \le \mathbf{b}$
linear programming
- simplex algorithm: simplest; usu efficient but poor worst-case behavior
- both simplex-based methods and interior point methods have similar efficiency
for routine linear programs
- in P with ellipsoid algorithm or interior point methods
- dual problem/duality: complement to the primal problem
- linear case
- primal objective function combines $n$ vars w $m$ upper-bound constraints
- dual function combines $m$ vars w $n$ lower-bound constraints
- all LP have _strong duality_: optimal values of primal & dual are equal
- lagrange duality/lagrange multiplier: TODO
priority queues
- binary heap: maintain invariant that each node is $\ge$ children
- add to bottom and take from top; bubble up/down as necessary
- binomial heap: heap that supports fast (log-time) merges, but log-time
  (non-const) min-peeks
- special tree structure: a binomial tree is defined recursively
- order 0 tree is single node
- order $k$ tree's children: roots of trees of orders $k-1,k-2,\dots,1,0$
- a heap is a collection of $O(\log n)$ trees
- to find min, inspect all roots
- merges merge trees of same order (each is $O(1)$ op)
- inserts are merges of order-0 trees, $O(1)$ amortized
- Fibonacci heap: $O(1)$ insert/decrease/merge; log-time find/del-min
- TODO
- for heaps implemented as linked objects, keeping pointers to the nodes
enables things like deleting specific nodes or relaxing their weights
randomized
==========
sampling
- reservoir sampling: sample $n$ elts from stream of unknown len in 1 pass
  (sketch after this list)
- keep buffer (_reservoir_) of $n$ elts, where elts are $x_i$
- $\forall x_i, i>n$, with prob $n/i$ replace random buffered elt with $x_i$
- prob $n/i$ is the prob that any particular elt of the first $i$ (incl.
  $x_i$) lands in a uniform random sample of $n$ of them
- weighted reservoir sampling: each elt has weight
- maintain total seen weight and total reservoir weight
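a sketch of the unweighted version (standard algorithm, my variable names):

    import random

    def reservoir(stream, n):
        buf = []                               # the reservoir
        for i, x in enumerate(stream, 1):
            if i <= n:
                buf.append(x)
            elif random.random() < n / i:      # keep x_i with prob n/i
                buf[random.randrange(n)] = x   # evict a uniform random resident
        return buf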
random variate generation
- walker's alias method: sample non-uniform discrete dist in constant time
  (sketch below)
- <http://www.keithschwarz.com/darts-dice-coins/>
- optimal BST: simply use huffman coding; complexity is entropy
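a sketch of the alias method in Vose's formulation (names are mine): O(n)
setup splits the scaled distribution into $n$ unit-mass slots, each holding at
most two symbols, so a draw is one uniform pick plus one biased coin:

    import random

    def alias_table(probs):
        n = len(probs)
        scaled = [p * n for p in probs]
        small = [i for i, w in enumerate(scaled) if w < 1]
        large = [i for i, w in enumerate(scaled) if w >= 1]
        prob, alias = [1.0] * n, list(range(n))
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l   # slot s: sym s w.p. scaled[s], else l
            scaled[l] -= 1 - scaled[s]         # l donated the rest of slot s
            (small if scaled[l] < 1 else large).append(l)
        return prob, alias

    def alias_draw(prob, alias):               # O(1): one uniform slot + one coin
        i = random.randrange(len(prob))
        return i if random.random() < prob[i] else alias[i]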
sort
====
TODO heap, insertion, selection, bubble, radix, quick
in-place quicksort (c.a.r. hoare)
- select a pivot, possibly doing so using sampling
- rather than building lists less/equal/greater, walk the list from both ends
  and swap as necessary (to construct sublists in-place); sketch below
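a sketch (mid-element pivot for brevity; the notes suggest sampling):

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                      # walk inward from both ends
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]    # swap the out-of-place pair
                i, j = i + 1, j - 1
        quicksort(a, lo, j)                # sublists were built in-place
        quicksort(a, i, hi)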
timsort
- adaptive, stable, natural mergesort; hybrid with insertion sort
- supernatural perf on many kinds of partially ordered/reversed arrays
- fewer than lg(N!) cmp's needed; as few as N-1
- as fast as Python's previous highly tuned samplesort hybrid on random arrays
- pseudocode: TODO
- use
- python's default list.sort
- java 7's new Arrays.sort
- <http://svn.python.org/projects/python/trunk/Objects/listsort.txt>
- <http://en.wikipedia.org/wiki/Timsort>
- <http://stackoverflow.com/questions/1733073/grokking-timsort>
dynamic programming
===================
dynamic time-warping
- general term for matching up two sequences that are varying in time/speed
- one simple formulation for sequences of discrete symbols is edit distance but
only allowed ops are insertions and deletions
edit distance
- levenshtein distance: the standard way of measuring distance; ins/del/upd
  (DP sketch below)
- damerau-levenshtein: also transposition of two chars
- hamming distance: considers only substitutions in same-length strings
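a standard DP sketch for levenshtein distance (rolling a single row, so
O(|t|) space):

    def levenshtein(s, t):
        d = list(range(len(t) + 1))         # distances from "" to prefixes of t
        for i, cs in enumerate(s, 1):
            prev, d[0] = d[0], i            # prev holds the old d[j-1]
            for j, ct in enumerate(t, 1):
                prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                       d[j - 1] + 1,        # insertion
                                       prev + (cs != ct))   # substitution
        return d[-1]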
subsequence sum
- given $x_i$, $A_i$ is max subsequence sum up to and including $x_i$; the
  answer is $\max_i A_i$ (sketch below)
$$A_i = \max(A_{i-1}, 0) + x_i$$
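the recurrence as code (Kadane's algorithm):

    def max_subseq_sum(xs):
        best = a = xs[0]
        for x in xs[1:]:
            a = max(a, 0) + x          # A_i = max(A_{i-1}, 0) + x_i
            best = max(best, a)        # answer = max_i A_i
        return best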
compression
===========
- huffman coding
- variable length coding: more freq symbols given shorter codes
- optimal (but not trivial to show)
- pseudocode: build tree bottom-up (heap-based sketch after this list)
- $O(n)$ assuming you have symbols sorted by frequency
- start w ea symbol having its own root node; node freq = freq of symbol
- iterate until one root node
- merge two nodes with min freqs; new parent freq is sum of child freqs
- doesn't yield ordered trees aka optimal BSTs aka alphabetic trees; need
hu-tucker algo for that
- LZ77
- sliding window compression, replacing strings with backrefs
((length,offset) pairs)
- strings must be within window size apart
- LZ78: like LZ77 but looking forward as well; less popular
- lempel-ziv-welch (LZW): patent-encumbered modification of LZ78
    def lzw_encode(s):
        table = {chr(code): code for code in range(256)}
        out, w = [], ""
        for c in s:
            if w + c in table: w += c      # grow the max-len match W
            else:
                out.append(table[w])       # emit W's dict code
                table[w + c] = len(table)  # W + next char gets next code
                w = c
        return out + ([table[w]] if w else [])
- deflate
- intended to be patent-unencumbered replacement for LZW
- LZ77 then huffman coding, but uses one huffman tree for char literals and
length codes, and another for backwards offsets
- zip
- permits multiple compression algos, but deflate dominant
- can take collections of files, but compresses each individually, so can't
exploit redundancy across files
- LZMA: lempel-ziv markov chain
- used in 7z
- burrows-wheeler transform (BWT)
- sort all rotations of a string (padded with start and end) and take last
col
- reversible: last col gives all chars; sort to get first col; last and first
give all pairs of successive chars
- bzip2: BWT + MTF transform + Huffman coding; parallelizable; slower than gzip
- move-to-front (MTF) transform: decrease entropy; usu useful only after BWT
- prediction by partial matching (PPM): smaller/slower than bzip2
- lempel-ziv-oberhumer (LZO): very fast decompression
- block-based (parallelizable)
- similar compression speed as deflate
- added to hadoop by twitter
- zippy: google's fast compression algo
- zippy: 300MB/s encode, 600MB/s decode, 2-4X compression
- gzip: 25MB/s encode, 200MB/s decode, 4-6X compression
- <http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf>
- implementations:
- gzip: same as zlib internally but diff headers/trailers
- zlib: deflate; compresses single files (freq. tar); not parallelizable
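a heap-based sketch of the huffman build above (O(n log n); the two-queue
trick gets the O(n) the notes mention when symbols arrive pre-sorted by freq):

    import heapq

    def huffman_codes(freqs):
        """freqs: {symbol: freq}; returns {symbol: bitstring}."""
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)                    # tiebreaker keeps tuples comparable
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)  # merge the two min-freq roots;
            f2, _, c2 = heapq.heappop(heap)  # parent freq = sum of child freqs
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, count, merged))
            count += 1
        return heap[0][2]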
[On Reducing the Size of Compressed Javascript (by up to 20%)](http://timepedia.blogspot.com/2009/08/on-reducing-size-of-compressed.html)
- browser only supports gzip; implementing lzma/ppm in js too slow
- gzip = LZ77 + huffman; huffman uses separate huffman tree for backwards
offset
- arrange for the backwards offsets to be the same
- sort functions to cluster by similarity, eg tf-idf, edit distance
- impl: use edit distance, then recursively combine best-matching pairs
- other tricks:
- larger-base charset for var renames
- rename from bottom-up scopes to reuse vars
- stable ordering of renames (so always get f(a,b,c))
high-perf linear algebra
========================
TODO singular value decomposition (SVD)
- many apps in signal processing and statistics
- important factorization of rectangular real or complex matrix
TODO fast-fourier transform (FFT)
- compute (inverse) discrete FT
- FFT on columns
- multiply by twiddle factors
- FFT on rows
- transposes
- butterfly diagram: used to describe FFT; portion of compute that combines
smaller DFTs into larger DFT
TODO parallel HPC algorithms
coding/information theory
=========================
error detection coding
- repetition codes: just repeat the original signal
- checksum: modular sum of words/bytes in msg
- parity bit: simplest 1-bit checksum
- even parity: bit chosen so total # of 1s is even; odd parity analogous
- special case of CRC, where 1-bit CRC is generated by polynomial $x+1$
- cyclic redundancy check: non-secure hash fn based on _cyclic codes_
- characterized by generator polynomial
- there's a table of commonly used/standard CRCs (bitwise sketch after this
  list)
- crypto hash function: infeasible to find input with same hash
- eg: Ethernet has CRC-32, IPv4 has 2B checksum over header, IPv6 has none, UDP
has checksum over UDP/IP headers (optional over IPv4), TCP has checksum over
TCP/IP headers with ARQ via triple-ack or timeout
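a bitwise CRC-32 sketch (reflected Ethernet/zlib polynomial; table-driven and
slicing-by-8 variants are the fast versions):

    def crc32(data, poly=0xEDB88320):
        crc = 0xFFFFFFFF
        for byte in data:                  # data: bytes
            crc ^= byte
            for _ in range(8):             # divide bit-by-bit by the polynomial
                crc = (crc >> 1) ^ (poly if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF            # matches zlib.crc32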
checksums
- internet checksum: 2 bytes; used in IPv4, TCP, UDP headers; very weak (amazon outage)
- adler-32: 4 bytes; stronger; still efficient to compute; weak for short msgs
- fletcher's checksum: slightly stronger than adler-32; slightly slower to compute
- CRC: good middle ground; studied by mathematicians; easily implemented in HW
- CRC32C: particularly strong for communicating apps; implemented in SSE4.2
- slicing-by-8: best SW algo for CRC32C
- <http://evanjones.ca/crc32c.html>
error correcting code (ECC): usu. means FEC
- automatic repeat request (ARQ) aka backward error correction: use error
detection codes so receiver can request re-xmits; more of a protocol
- hybrid ARQ (HARQ): combine FEC for minor errors & ARQ for major errors
- forward error correction (FEC): always include ECC
- convolutional codes: bit-by-bit; esp. suitable for HW
- Viterbi decoder: optimal decoding
- block codes: block-by-block
- early, inefficient
- repetition codes
- Hamming codes: single bit errors
- multi-dimensional parity-check (MDPC) codes: put msg in grid & calc
parity for each row & col
- efficient
- Reed-Solomon codes: popular; corrects $d$ errors with $2d$ check
symbols
- turbo codes: near-optimal
- low-density parity-check (LDPC) codes: near-optimal
- BCH codes: handles random and burst errors
- fountain codes aka rateless erasure codes
- "potentially limitless sequence of encoding symbols can be generated
from a given set of source symbols such that the original source
symbols can ideally be recovered from any subset of the encoding
symbols of size equal to or only slightly larger than the number of
source symbols"
- LT codes: first practical realization; simplest
- for each generated block, combine (xor) random sample of source
blocks
- sample size is sampled from a special distribution, the (robust)
soliton dist
misc
====
feed-forward bloom filters (dga)
- high-level idea: for 2-phase exact pattern matching on some target, first
filter the target with bloom filter built from the patterns, then
traditional-grep smaller datasets
- assumes that the matches are all of the same len (else what are the target
extracts that you should be hashing against the bloom filter?)
- take the min len; if too small then handle those patterns separately
- feed-forward bloom filters are regular bloom filters but where each positive
test flips on bits in a second "shadow" array (initially zeroed), which can
then be used to filter out irrelevant *patterns*
- paper analysis shows high prob that there will be only few patterns in second
array that don't get any matches
cardinality estimation
- linear probabilistic counter
- same as bloom filter with just 1 hash fn
- count is $-m \ln r$, where $m$ is array size, $r$ is fraction of bits still 0
- <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.8821>
- (super) log log counter (simplified sketch after this list)
- assume uniform hash space dist; est the density
- if $r$ is max rank, then $2^r$ is cardinality est
- rank is # leading zeros; a fully dense set will have max rank (min hash)
be 000...001
- but this gives us at best a power-of-2 cardinality
- very space efficient: only need 5 bits to record max # 0s in 32-bit nums
- hence name, need $\log \log N$ bits where $N$ is max cardinality (eg
  $2^{32}$)
- improvements
- use multiple indep hash fns, but slow
- equiv but faster: stochastic avging
- use part of its output to split vals into many buckets, each to get an
indep estimate
- eg for 1024 estimates, use first 10 bits, rest to count leading 0s
- avg the estimates from all buckets
- avg error for $m$ buckets is $1.3/\sqrt{m}$; for 1024 buckets, ~4%
- <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.8821>
- super log log
- discard outliers by discarding 30% buckets w largest values
- avg error down to $1.05/\sqrt{m}$
- hyper log log counter (Flajolet et al)
- use harmonic mean instead of geometric
- avg error down to $1.04/\sqrt{m}$
- complete algo a bit more complex
- <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.4286>
- log log algo buckets are parallelizable
- <http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html>
- <http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/>
- <http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation>
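a simplified stochastic-averaging log log sketch (omits the bias-correction
constant that real (super/hyper) log log applies, so it overestimates):

    import hashlib

    def loglog(items, b=10):
        m = 1 << b                                 # 2^b buckets
        maxrank = [0] * m
        for x in items:
            h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
            bucket = h >> (64 - b)                 # first b bits pick the bucket
            rest = h & ((1 << (64 - b)) - 1)
            rank = (64 - b) - rest.bit_length() + 1   # leading zeros + 1
            maxrank[bucket] = max(maxrank[bucket], rank)
        return m * 2 ** (sum(maxrank) / m)         # avg the per-bucket estimates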
cardinality estimation: $k$ minimum values (KMV) sketch
- simple & intuitive: remember just $k$ minimum hashes and extrapolate
cardinality from interval between hashes
- can do set unions/intersects; eg hyper log log doesn't support intersects
- compression: can discard high order bits (state shrinks as set grows)
- expansion to multisets (counts duplicates): AKMV
- add counter to each hash; eg hyper log log doesn't support these
- <http://blog.aggregateknowledge.com/2012/07/09/sketch-of-the-day-k-minimum-values/>
count-min sketch
- similar to bloom filter, but instead of 1 array of bits, have $k$ arrays of
counters (one for each of the $k$ hash fn's), and each element is a number
not a bit (representing counts)
- basically an approximate histogram / summarizes large vectors (sketch after
  this list)
- operations
- add: increment the elements that the hash functions point to
- remove: decrement the elements that the hash functions point to
- count: return the min of all the elements that the hash functions point to
- can optionally implement _conservative add_: only increment the array with
the min counter value
- but decrement operation is no longer valid
- simple descrip / used in statistical counting of passwords in MSR paper
<http://research.microsoft.com/pubs/132859/popularityISeverything.pdf>
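a minimal count-min sketch (seeded `hash()` stands in for the $k$ hash fns):

    import random

    class CountMin:
        def __init__(self, k=4, w=2048):
            self.rows = [[0] * w for _ in range(k)]    # k arrays of counters
            self.seeds = [random.randrange(1 << 30) for _ in range(k)]
            self.w = w
        def _slots(self, x):
            return [hash((seed, x)) % self.w for seed in self.seeds]
        def add(self, x, c=1):
            for row, i in zip(self.rows, self._slots(x)):
                row[i] += c
        def count(self, x):                            # overestimates, never under
            return min(row[i] for row, i in zip(self.rows, self._slots(x)))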
count-mean-min sketch
- count-min: good for high skew, bad at low/med skew (many collisions)
- use more careful correction/estimation from same data structure TODO
- <http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/>
stream-summary
- TODO
- <http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/>
universal hash families
- if $H$ is a family of functions $h:M \rightarrow N$ where
  $M=\{0,1,\dots,m-1\}$, $N=\{0,1,\dots,n-1\}$, $m \ge n$:
- _weak universal hash family_ aka _weak 2-universal hash family_ aka
_2-universal hash family_: $\forall x,y \in M, x \ne y, P[h(x)=h(y)] \le
1/n$
- _strongly 2-universal hash family_ aka _(strongly) 2-independent hash
family_: $\forall x,y \in M, a,b \in N, P[h(x)=a \wedge h(y)=b] \le 1/n^2$
- _(strongly) $k$-independent universal hash family_: $\forall x_1, x_2,
\dots, x_k \in M, a_1, a_2, \dots, a_k \in N, P[h(x_1) = a_1 \wedge h(x_2)
= a_2 \wedge \dots] \le 1/n^k$
- "weak" commonly elided from "weak hash function" bc 2-universal hash
families tend to also be 2-independent
- carter and wegman's hash family: $h_{a,b}(x) = ((ax+b) \bmod p) \bmod n$
  (sketch below)
- if $n=m=p$ is prime, this is strongly 2-universal
- nobody actually uses mod in practice
- cormode's impl: <http://www.cs.rutgers.edu/~muthu/massdal-code-index.html>
- multiply-carry: fastest 2-universal hash family for ints
- is it strongly 2-universal?
- <http://blog.ezyang.com/2010/11/is-multiply-carry-strongly-universal/>
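a sketch of the mod-based carter-wegman family (parameter choices are mine; a
Mersenne prime modulus lets real impls replace mod with shifts/adds):

    import random

    p = (1 << 61) - 1                      # prime modulus >= universe size
    a, b = random.randrange(1, p), random.randrange(p)

    def h(x, n):
        """h_{a,b}(x) = ((a*x + b) mod p) mod n."""
        return ((a * x + b) % p) % n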
XOR linked list
- replace prev/next ptrs w a single field = prev xor next
- starting traversal reqs knowing addrs of 2 nodes, not just 1
skip lists
- for storing sorted list of items, like trees, but probabilistic
- insertion, removal, contains: $O(\log n)$; enumeration: $O(n)$
- insert: decide # lists this node will be a part of
- prob .5: make node part of lowest level only
- prob .25: make node part of lowest 2 lists
- ...
- delete: remove node from all lists it's in
- contains: traverse lists forward and down (sketch after this list)
- <http://igoro.com/archive/skip-lists-are-fascinating/>
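a minimal skip-list sketch (coin-flip node heights, fixed max height):

    import random

    class SkipNode:
        def __init__(self, val, height):
            self.val, self.next = val, [None] * height

    class SkipList:
        MAX = 16
        def __init__(self):
            self.head = SkipNode(None, self.MAX)   # sentinel in every list
        def insert(self, val):
            height = 1
            while height < self.MAX and random.random() < 0.5:
                height += 1                        # prob 2^-i of reaching level i+1
            node, cur = SkipNode(val, height), self.head
            for i in reversed(range(self.MAX)):    # walk forward, then down
                while cur.next[i] and cur.next[i].val < val:
                    cur = cur.next[i]
                if i < height:                     # splice into lowest `height` lists
                    node.next[i] = cur.next[i]
                    cur.next[i] = node
        def __contains__(self, val):
            cur = self.head
            for i in reversed(range(self.MAX)):
                while cur.next[i] and cur.next[i].val < val:
                    cur = cur.next[i]
            return cur.next[0] is not None and cur.next[0].val == val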
tries
- trie: ordered tree where keys usu strings/radixes and all descendants of a
node have a common prefix of the key w that node
- patricia trie aka radix tree: space-optimized trie where each node w only one
child is merged w its child
- thus every internal node has at least 2 children
- hash array mapped trie (HAMT): persistent version popularized in clojure
- basically, high-fanout (very flat - 32x) radix tree keyed by hashing
- use bitmap so nodes w <32 items can allocate only as many words as nec and
bitmap identifies which items are present (population count aka hamming wt)
binary space partitioning (BSP) trees
- TODO recursively partition space; apps in graphics
- kd-tree aka $k$-dimensional tree: special case
- organizes $k$-dim *points* (this is important)
- binary tree, each internal node is a hyperplane splitting some dimension
- remember though that each node also corresponds to a point!
- think decision tree recursively splitting different continuous dimensions
- useful for range searches, NN searches (sketch after this list)
- NN search
- go down tree as if inserting query point
- once at leaf, save it as current best
- walk back up tree until back at root
- if current node closer, update current best
- any points in the other (sibling) subtree that could be closer? (does
  the hypersphere with radius from query to current best intersect the
  other subtree's half-space?)
- if so, recurse down to leaf of other subtree
- various construction algos
- canonical construction
- as we move down the tree, we cycle through the dimensions; eg for 3D
tree, level 1 splits $x$, 2 splits $y$, 3 splits $z$, 4 splits $x$,
etc.
- split on median point
- not efficient for high dim; number of data points $n$ should be much larger
than $2^k$
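a sketch of the canonical construction plus NN search (tuple nodes, my
layout):

    def build_kd(points, depth=0):
        if not points:
            return None
        axis = depth % len(points[0])              # cycle through the dimensions
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2                     # split on the median point
        return (points[mid], axis,
                build_kd(points[:mid], depth + 1),
                build_kd(points[mid + 1:], depth + 1))

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def nearest(node, q, best=None):
        if node is None:
            return best
        point, axis, left, right = node
        if best is None or dist2(q, point) < dist2(q, best):
            best = point                           # current node may be closer
        near, far = (left, right) if q[axis] < point[axis] else (right, left)
        best = nearest(near, q, best)              # descend as if inserting q
        if (q[axis] - point[axis]) ** 2 < dist2(q, best):
            best = nearest(far, q, best)           # hypersphere crosses the plane
        return best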
BK Tree
- useful for spell correction/fuzzy searches (sketch after this list)
- structure
- one node per word
- each word has edges such that edge $i$ links to sub-tree of words all
(Levenshtein) distance $i$ from current word
- querying: given term $q$ and max dist $m$
- start at root node and repeat the following at each traversed node
- let $d$ be distance btwn current node and $q$
- use triangle inequality: follow only edges $d-m \dots d+m$
- building: start w arbitrary root
- recursively traverse tree using distance until hit leaf, then insert
- properties
- N-ary and irregular but generally well-balanced
- searching w dist 1 traverses <5-8% of tree, 2 traverses <17-25% of tree
- not a great explanation but not many out there
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>
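a minimal BK-tree sketch (reuses the `levenshtein` sketch from the dynamic
programming section):

    class BKNode:
        def __init__(self, word):
            self.word, self.kids = word, {}        # edge i -> subtree at distance i
        def insert(self, word):
            d = levenshtein(self.word, word)
            if d == 0:
                return                             # already present
            if d in self.kids:
                self.kids[d].insert(word)          # descend until a free edge
            else:
                self.kids[d] = BKNode(word)
        def query(self, q, maxdist, out):
            d = levenshtein(self.word, q)
            if d <= maxdist:
                out.append(self.word)
            for i in range(d - maxdist, d + maxdist + 1):   # triangle inequality
                if i in self.kids:
                    self.kids[i].query(q, maxdist, out)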
shotgun genome assembly
- de bruijn graphs commonly used in genome sequencing
- abstraction: reassemble full string from (2B) overlapping fragments
- preparation: $k$-shingle fragments, store shingles into bloom filter
- reassembly: start w some string, and recursively try appending every char to
it and verifying it exists in the BF
    def extensions(prefix):
        """Yields extensions of prefix reachable via the de Bruijn graph (via
        k-shingle checks against the bloom filter), or the prefix itself if
        no more extensions."""
        word = prefix[-(k - 1):]              # word + c is a k-shingle
        cs = [c for c in allchars if word + c in bf]
        for c in cs:
            for ext in extensions(prefix + c):
                yield ext
        if not cs:
            yield prefix
- exponential search tree but should cut off dead-ends quickly?
- can't directly assemble w this (false connections, also many convenient graph
properties), but can use data structure to grok graph properties and
eliminate/break up data before feeding into other programs
- eliminate small graphs
- find and shard out disconnected partitions
- local graph complexity reduction & error/artifact trimming
- <http://www.slideshare.net/c.titus.brown/pycon-2011-talk-may-not-be-final-note>
similarity with shingling
- `jaccard(set(shingles(stream1)), set(shingles(stream2)))`
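spelled out (hypothetical helpers matching the one-liner):

    def shingles(s, k=4):
        return {s[i:i + k] for i in range(len(s) - k + 1)}  # all k-char substrings

    def jaccard(a, b):
        return len(a & b) / len(a | b)                      # |A&B| / |A|B|-style overlap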
find majority item in stream
- pseudocode from
<http://unmaintainable.wordpress.com/2010/02/21/finding-majority-in-stream/>:
    maj, n = None, 0
    for x in xs:
        if x == maj: n += 1
        elif n == 0: maj, n = x, 1
        else: n -= 1
    return maj
- assumes that there *is* a majority item; requires 2nd pass to verify this if
unknown
- can generalize to find all $k$ items that exceed $1/k$ fraction of all items;
useful for finding outliers in power law dist
automatic differentiation
- not symbolic (slow)/numeric (approximate); differentiates at specific points
- add _infinitesimal_ $e$ similar to imaginary $i$ but $e^2=0$
- <http://code.google.com/p/ceres-solver/source/browse/include/ceres/jet.h>
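a dual-number sketch of the idea (forward mode; only + and * shown):

    class Dual:
        """v + d*e with e^2 = 0; d carries the derivative along."""
        def __init__(self, v, d=0.0):
            self.v, self.d = v, d
        def __add__(self, o):
            return Dual(self.v + o.v, self.d + o.d)
        def __mul__(self, o):
            return Dual(self.v * o.v, self.d * o.v + self.v * o.d)  # product rule

    x = Dual(3.0, 1.0)        # seed dx/dx = 1
    y = x * x + x             # y.v == 12.0; y.d == 7.0 == d/dx(x^2+x) at x=3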