Note: These lecture notes were slightly modified from the ones posted on the 6.858 course website from 2014.
- Historically, worried about EM signals leaking. NSA TEMPEST.
- Broadly, systems may need to worry about many unexpected ways in which
  information can be revealed.
Example setting: a server (e.g., Apache) has an RSA private key.
- Server uses RSA private key (e.g., decrypt message from client).
- Something about the server's computation is leaked to the client.
Many information leaks have been looked at:
- How long it takes to decrypt.
- How decryption affects shared resources (cache, TLB, branch predictor).
- Emissions from the CPU itself (RF, audio, power consumption, etc).
Side-channel attacks don't have to be crypto-related.
- E.g., operation time relates to which character of password was incorrect.
- Or time related to how many common friends you + some user have on Facebook.
- Or how long it takes to load a page in browser (depends if it was cached).
- Or recovering printed text based on sound from dot-matrix printer.
- But attacks on passwords or keys are usually the most damaging.
Adversary can analyze information leaks, use it to reconstruct private key.
- Currently, side-channel attacks on systems described in the paper are rare.
- E.g., Apache web server running on some Internet-connected machine.
- Often some other vulnerability exists and is easier to exploit.
- Slowly becoming a bigger concern: new side-channels (VMs), better attacks.
- Side-channel attacks are more commonly used to attack trusted/embedded hw.
- E.g., chip running cryptographic operations on a smartcard.
- Often these have a small attack surface, not many other ways to get in.
- As paper mentions, some crypto coprocessors designed to avoid this attack.
What's the "Remote timing attacks are practical" paper's contribution?
- Timing attacks known for a while.
- This paper: possible to attack standard Apache web server over the network.
- Uses lots of observations/techniques from prior work on timing attacks.
- To understand how this works, first let's look at some internals of RSA..
- Pick two random primes,
p
andq
.- Let
n = p*q
.
- Let
- A reasonable key length, i.e.,
|n|
or|d|
, is 2048 bits today. - Euler's function
phi(n)
: number of elements ofZ_n^*
relatively prime ton
.- Theorem [no proof here]:
a^(phi(n)) = 1 mod n
, for alla
andn
.
- Theorem [no proof here]:
- So, how to encrypt and decrypt?
  - Pick two exponents d and e, such that m^(e*d) = m (mod n), which
    means e*d = 1 mod phi(n).
  - Encryption will be c = m^e (mod n); decryption will be m = c^d (mod n).
- How to get such e and d?
  - For n=pq, phi(n) = (p-1)(q-1).
  - Easy to compute d = 1/e mod phi(n), if we know phi(n).
    [Extended Euclidean algorithm.]
  - In practice, pick small e (e.g., 65537), to make encryption fast.
- Public key is (n, e).
- Private key is, in principle, (n, d).
  - Note: p and q must be kept secret!
    - Otherwise, adversary can compute d from e, as we did above.
  - Knowing p and q also turns out to be helpful for fast decryption.
  - So, in practice, the private key includes (p, q) as well.
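The scheme above can be exercised end-to-end with toy numbers (a sketch only; the primes here are far too small for real use, and real messages are padded first):

```python
# Toy end-to-end RSA with tiny primes (illustrative only).
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # phi(n) = (p-1)(q-1)
e = 17                         # small public exponent, gcd(e, phi) = 1
d = pow(e, -1, phi)            # d = 1/e mod phi(n), via extended Euclid

m = 65                         # message, must be < n
c = pow(m, e, n)               # encrypt: c = m^e mod n
assert pow(c, d, n) == m       # decrypt: m = c^d mod n
```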
RSA is tricky to use "securely" -- be careful if using RSA directly!
- Ciphertexts are multiplicative: E(a)*E(b) = a^e * b^e = (ab)^e.
  - Can allow adversary to manipulate encryptions, generate new ones.
- RSA is deterministic
- Encrypting the same plaintext will generate the same ciphertext each time.
- Adversary can tell when the same thing is being re-encrypted.
- Typically solved by "padding" messages before encryption (e.g., OAEP).
  - Take plaintext message bits, add padding bits before and after plaintext.
  - Encrypt the combined bits (must be less than |n| bits total).
  - Padding includes randomness, as well as fixed bit patterns.
- Helps detect tampering (e.g. ciphertext multiplication).
- Key problem: fast modular exponentiation.
- In general, quadratic complexity.
- Multiplying two 1024-bit numbers is slow.
- Computing the modulus for 1024-bit numbers is slow (1024-bit division).
- Recall what the CRT says:
  - If x == a1 (mod p) and x == a2 (mod q), where p and q are relatively
    prime, then there's a unique solution x == a (mod pq).
  - And, there's an efficient algorithm for computing a.
- Suppose we want to compute m = c^d (mod pq).
  - Can compute m1 = c^d (mod p), and m2 = c^d (mod q).
  - Then use CRT to compute m = c^d (mod n) from m1, m2; it's unique and fast.
  - Computing m1 (or m2) is ~4x faster than computing m directly (~quadratic).
  - Computing m from m1 and m2 using CRT is ~negligible in comparison.
  - So, roughly a 2x speedup.
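The CRT speedup can be sketched with toy numbers (Python's built-in pow stands in for the big-number routines; reducing the exponents mod p-1 and q-1 is the standard Fermat shortcut used together with CRT):

```python
# A sketch of CRT-based RSA decryption with toy numbers (real keys use
# 512-bit primes or larger).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))

def decrypt_crt(c):
    # Two half-size exponentiations, with exponents reduced mod p-1 and
    # q-1 (Fermat's little theorem)...
    m1 = pow(c, d % (p - 1), p)     # m1 = c^d mod p
    m2 = pow(c, d % (q - 1), q)     # m2 = c^d mod q
    # ...then CRT-combine (Garner's formula) to recover m mod n.
    h = (pow(q, -1, p) * (m1 - m2)) % p
    return m2 + q * h

c = pow(42, e, n)
assert decrypt_crt(c) == pow(c, d, n) == 42
```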
- Naive approach to computing c^d: multiply c by itself, d times.
- Better approach, called repeated squaring:
    c^(2x)   = (c^x)^2
    c^(2x+1) = (c^x)^2 * c
  - To compute c^d, first compute c^(floor(d/2)), then use above for c^d.
  - Recursively apply until the computation hits c^0 = 1.
  - Number of squarings: |d| (the number of bits needed to represent d).
  - Number of multiplications: number of 1 bits in d.
- Better yet (sometimes), called sliding window:
    c^(2x)    = (c^x)^2
    c^(32x+1) = (c^x)^32 * c
    c^(32x+3) = (c^x)^32 * c^3
    ...
    c^(32x+z) = (c^x)^32 * c^z, generally (where z <= 31)
  - Can pre-compute a table of all necessary c^z powers, store in memory.
  - The choice of power-of-2 constant (e.g., 32) depends on usage.
    - Costs: extra memory, extra time to pre-compute powers ahead of time.
    - Note: only pre-compute odd powers of c (use first rule for even).
    - OpenSSL uses 32 (table with 16 pre-computed entries).
- Reducing mod p each time (after square or multiply) is expensive.
  - Typical implementation: do long division, find remainder.
  - Hard to avoid reduction: otherwise, value grows exponentially.
- Idea (by Peter Montgomery): do computations in another representation.
  - Shift the base (e.g., c) into a different representation upfront.
  - Perform modular operations in this representation (will be cheaper).
  - Shift numbers back into the original representation when done.
  - Ideally, savings from reductions outweigh the cost of shifting.
- Montgomery representation: multiply everything by some factor R.
    a mod q        <->  aR mod q
    b mod q        <->  bR mod q
    c = a*b mod q  <->  cR mod q = (aR * bR)/R mod q
  - Each mul (or sqr) in Montgomery-space requires division by R.
- Why is modular multiplication cheaper in Montgomery representation?
  - Choose R so division by R is easy: R = 2^|q| (2^512 for 1024-bit keys).
  - Because we divide by R, we will often not need to do mod q:
      |aR| = |q|
      |bR| = |q|
      |aR * bR| = 2|q|
      |aR * bR / R| = |q|
- How do we divide by R cheaply?
  - Only works if the lower bits are zero.
  - Observation: since we care about the value mod q, multiples of q
    don't matter.
  - Trick: add multiples of q to the number being divided by R, to make
    the low bits 0.
    - For example, suppose R = 2^4 (10000), q = 7 (111); divide
      x = 26 (11010) by R.
        x + 2q      = (binary)  101000
        x + 2q + 8q = (binary) 1100000
    - Now, can easily divide by R: result is binary 110 (i.e., 6).
  - Generally, always possible:
    - Low bit of q is 1 (q is prime), so can "shoot down" any bits.
    - To "shoot down" bit k, add 2^k * q.
    - To shoot down the low-order bits l, add q*(l*(-q^-1) mod R).
    - Then, dividing by R means simply discarding the low zero bits.
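The shoot-down trick can be sketched with the toy numbers from the example above (q = 7, R = 2^4); Python ints stand in for the multi-word arithmetic:

```python
# A minimal sketch of Montgomery reduction: add the right multiple of q
# to zero out the low bits, then "divide" by R with a plain right shift.
q = 7
Rbits = 4
R = 1 << Rbits                 # R = 2^4 = 16
q_neg_inv = pow(-q, -1, R)     # -q^-1 mod R, precomputed once per key

def mont_reduce(x):
    # x + t*q is a multiple of R, so the shift below is exact division.
    t = (x * q_neg_inv) % R
    r = (x + t * q) >> Rbits
    # Conditional subtract when the result lands above q: this is the
    # "extra reduction" whose timing the attack exploits.
    return r - q if r >= q else r

# The worked example: x = 26 becomes 96 after adding 10q, and 96/16 = 6.
assert mont_reduce(26) == 6
```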
- One remaining problem: the result will be < R, but might be > q.
  - If the result happens to be greater than q, need to subtract q.
  - This is called the "extra reduction".
  - When computing x^d mod q, Pr[extra reduction] = (x mod q) / 2R.
    - Here, x is assumed to be already in Montgomery form.
    - Intuition: as we multiply bigger numbers, will overflow more often.
- How to multiply 512-bit numbers?
- Representation: break up into 32-bit values (or whatever hardware supports).
- Naive approach: pair-wise multiplication of all 32-bit components.
- Same as if you were doing digit-wise multiplication of numbers on paper.
- Requires O(nm) time if the two numbers have n and m components,
  respectively; O(n^2) if the two numbers are close in size.
- Karatsuba multiplication: assumes both numbers have same number of components.
  - O(n^log_2(3)) = O(n^1.585) time.
  - Split both numbers (x and y) into two components (x1, x0 and y1, y0):
      x = x1 * B + x0
      y = y1 * B + y0
    - E.g., B = 2^32 when splitting 64-bit numbers into 32-bit components.
  - Naive: x*y = x1y1 * B^2 + x0y1 * B + x1y0 * B + x0y0
    - Four multiplies: O(n^2).
  - Faster: x*y = x1y1 * (B^2+B) - (x1-x0)(y1-y0) * B + x0y0 * (B+1)
                = x1y1 * B^2 + ( -(x1-x0)(y1-y0) + x1y1 + x0y0 ) * B + x0y0
    - Just three multiplies, and a few more additions.
  - Recursively apply this algorithm to keep splitting into more halves.
    - Sometimes called "recursive multiplication".
  - Meaningfully faster (no hidden big constants):
    - For 1024-bit keys, "n" here is 16 (512/32):
        n^2 = 256 vs. n^1.585 = 81 multiplies.
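The three-multiply identity can be sketched on Python ints, splitting at powers of two as above (a sketch; real implementations recurse over arrays of 32-bit limbs rather than Python ints):

```python
# One level of Karatsuba per recursion: split at B = 2^(bits/2),
# then combine three sub-products instead of four.
def karatsuba(x, y, bits=64):
    if bits <= 32:                    # base case: hardware-sized multiply
        return x * y
    half = bits // 2
    B = 1 << half
    x1, x0 = x >> half, x % B         # x = x1*B + x0
    y1, y0 = y >> half, y % B         # y = y1*B + y0
    z2 = karatsuba(x1, y1, half)                        # x1*y1
    z0 = karatsuba(x0, y0, half)                        # x0*y0
    mid = z2 + z0 - karatsuba(x1 - x0, y1 - y0, half)   # x1y0 + x0y1
    return z2 * B * B + mid * B + z0  # three recursive multiplies total

a, b = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
assert karatsuba(a, b) == a * b
```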
- Multiplication algorithm needs to decide when to use Karatsuba vs. Naive.
- Two cases matter: two large numbers, and one large + one small number.
- OpenSSL: if equal number of components, use Karatsuba, otherwise Naive.
- In some intermediate cases, Karatsuba may win too, but OpenSSL ignores
  it, according to this paper.
- Server's SSL certificate contains public key.
- Server must use private key to prove its identity.
- Client sends random bits to server, encrypted with server's public key.
- Server decrypts client's message, uses these bits to generate session key.
- In reality, server also verifies message padding.
- However, can still measure time until server responds in some way.
Figure of decryption pipeline on the server:
* CRT * To Montgomery * Modular exp
--> * c_0 = c mod q --> * c'_0 = c_0*R mod q --> * m'_0 = (c'_0)^d mod q
/
/ * Use sliding window for bits
/ * of the exponent d
c * Karatsuba if c'_0 and q have
* same number of 32-bit parts
\
\ * Extra reductions proportional
\ * to ((c'_0)^z mod q) / 2R;
--> ... * z comes from sliding window
- Then, compute m_0 = m'_0/R mod q.
- Then, combine m_0 and m_1 using CRT to get m.
- Then verify padding in m.
- Finally, use payload in some way (SSL, etc).
- Victim Apache HTTPS web server using OpenSSL, has private key in memory.
- Connected to Stanford's campus network.
- Adversary controls some client machine on campus network.
- Adversary sends specially-constructed ciphertext in msg to server.
- Server decrypts ciphertext, finds garbage padding, returns an error.
- Client measures response time to get error message.
- Uses the response time to guess bits of q.
- Overall response time is on the order of 5 msec.
- Time difference between requests can be around 10 usec.
- What causes time variations?
- Karatsuba vs. Naive
- extra reductions
- Once enough bits of q are guessed, can factor n = p*q, compute d from e.
- About 1M queries seem enough to obtain 512-bit p and q for a 1024-bit key.
  - Only need to guess the top 256 bits of p and q, then use another algorithm.
- See the Remote timing attacks are practical paper cited in the References section at the end for more details.
- Let q = q_0 q_1 .. q_N, where N = |q| (say, 512 bits for 1024-bit keys).
- Assume we know some number j of high-order bits of q (q_0 through q_j).
- Construct two approximations of q, guessing q_{j+1} is either 0 or 1:
    g    = q_0 q_1 .. q_j 0 0 0 .. 0
    g_hi = q_0 q_1 .. q_j 1 0 0 .. 0
- Get the server to perform modular exponentiation (g^d) for both guesses.
  - We know g is necessarily less than q.
  - If g and g_hi are both less than q, time taken shouldn't change much.
  - If g_hi is greater than q, time taken might change noticeably,
    because g_hi mod q is small:
    - Less time: fewer extra reductions in Montgomery.
    - More time: switch from Karatsuba to normal multiplication.
  - Knowing the time taken can tell us if 0 or 1 was the right guess.
- How to get the server to perform modular exponentiation on our guess?
  - Send our guess as if it were the encryption of randomness to the server.
  - One snag: server will convert our message to Montgomery form.
  - Since Montgomery's R is known, send (g/R mod n) as the message to
    the server.
- How do we know if the time difference should be positive or negative?
- Paper seems to suggest it doesn't matter: just look for large diff.
- Figure 3a shows the measured time differences for each bit's guess.
- Karatsuba vs. normal multiplication happens at 32-bit boundaries.
- First 32 bits: extra reductions dominate.
- Next bits: Karatsuba vs normal multiplication dominates.
- At some point, extra reductions start dominating again.
- What happens if the time difference from the two effects cancels out?
- Figure 3, key 3.
- Larger neighborhood changes the balance a bit, reveals a non-zero gap.
- How does the paper get accurate measurements?
  - Client machine uses the processor's timestamp counter (rdtsc on x86).
  - Measure several times, take the median value.
    - Not clear why median; min seems like it would be the true compute time.
  - One snag: relatively few multiplications by g, due to sliding windows.
  - Solution: get more multiplications by values close to g (+ same for g_hi).
    - Specifically, probe a "neighborhood" of g (g, g+1, .., g+400).
- Why probe a 400-value neighborhood of g instead of measuring g 400 times?
400 times?- Consider the kinds of noise we are trying to deal with.
- (1) Noise unrelated to computation (e.g. interrupts, network latency).
- This might go away when we measure the same thing many times.
- See Figure 2a in the paper.
- (2) "Noise" related to computation.
  - E.g., multiplying by g^3 and g_hi^3 in the sliding window takes
    different amounts of time.
  - Repeated measurements will return the same value.
  - Will not help determine whether mul by g or g_hi has more reductions.
  - See Figure 2b in the paper.
- Neighborhood values average out the 2nd kind of noise.
  - Since neighborhood values are nearby, each still has ~the same
    number of reductions.
- Defense against the timing attack on decryption: RSA blinding.
  - Choose a random r.
  - Multiply ciphertext by r^e mod n: c' = c*r^e mod n.
  - Due to the multiplicative property of RSA, c' is an encryption of m*r.
  - Decrypt ciphertext c' to get message m'.
  - Divide plaintext by r: m = m'/r mod n.
  - About a 10% CPU overhead for OpenSSL, according to Brumley's paper.
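The blinding steps can be sketched as follows (toy parameters; the decryption timing now depends on the random r rather than on the attacker-chosen ciphertext):

```python
# A sketch of RSA blinding with toy parameters.
import math
import secrets

p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))

def blinded_decrypt(c):
    while True:
        r = secrets.randbelow(n - 2) + 2      # random blinding factor
        if math.gcd(r, n) == 1:               # need r invertible mod n
            break
    c_blind = (c * pow(r, e, n)) % n          # c' = c * r^e mod n
    m_blind = pow(c_blind, d, n)              # m' = (c')^d = m * r mod n
    return (m_blind * pow(r, -1, n)) % n      # unblind: m = m'/r mod n

m = 42
assert blinded_decrypt(pow(m, e, n)) == m
```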
- Make all code paths predictable in terms of execution time.
- Hard, compilers will strive to remove unnecessary operations.
- Precludes efficient special-case algorithms.
- Difficult to predict execution time: instructions aren't fixed-time.
- Can we take away access to precise clocks?
- Yes for single-threaded attackers on a machine we control.
- Can add noise to legitimate computation, but attacker might average.
- Can quantize legitimate computations, at some performance cost.
- But with "sleeping" quantization, throughput can still leak info.
- Relatively tricky to develop an exploit (but that's a one-time problem).
- Possible to notice attack on server (many connection requests).
- Though maybe not so easy on a busy web server cluster?
- Adversary has to be close by, in terms of network.
- Not that big of a problem for adversary.
- Can average over more queries, co-locate nearby (Amazon EC2),
  run on a nearby bot or browser, etc.
- Adversary may need to know the version, optimization flags, etc of OpenSSL.
- Is it a good idea to rely on such a defense?
- How big of an impediment is this?
- If adversary mounts attack, effects are quite bad (key leaked).
- Page-fault timing for password guessing [Tenex system].
  - Suppose the kernel provides a system call to check the user's password.
    - Checks the password one byte at a time, returns an error at the
      first mismatch.
  - Adversary aligns the guess so that its first byte is at the end of
    a page, and the rest of the password is on the next page.
  - Somehow arrange for the second page to be swapped out to disk.
    - Or just unmap the next page entirely (using the equivalent of mmap).
  - Measure the time to return an error when guessing a password.
    - If it took a long time, the kernel had to read the second page
      from disk. (Or, if unmapped, the kernel crashed trying to read
      the second page.)
    - Means the first character was right!
  - Can guess an N-character password in 256*N tries, rather than 256^N.
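The byte-at-a-time check at the heart of this attack can be illustrated directly (a hypothetical check_password helper; the returned index is instrumentation standing in for the page-fault delay the attacker actually observes):

```python
# The loop bails out at the first wrong byte, so the amount of work done
# before returning reveals how long the correct prefix is.
def check_password(guess, real):
    for i in range(min(len(guess), len(real))):
        if guess[i] != real[i]:
            return False, i          # mismatch after i correct bytes
    return guess == real, min(len(guess), len(real))

_, work_bad = check_password("xaaaaa", "secret")
_, work_ok = check_password("seaaaa", "secret")
assert (work_bad, work_ok) == (0, 2)   # longer correct prefix => more work
```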
- Cache analysis attacks: processor's cache shared by all processes.
- E.g.: accessing one of the sliding-window multiples brings it in cache.
- Necessarily evicts something else in the cache.
- Malicious process could fill cache with large array, watch what's evicted.
- Guess parts of the exponent (d) based on the offsets being evicted.
- Cache attacks are potentially problematic with "mobile code".
- NaCl modules, Javascript, Flash, etc running on your desktop or phone.
- Network traffic timing / analysis attacks.
- Even when data is encrypted, its ciphertext size remains ~same as plaintext.
- Recent papers show can infer a lot about SSL/VPN traffic by sizes, timing.
- E.g., Fidelity lets customers manage stocks through an SSL web site.
- Web site displays some kind of pie chart image for each stock.
- User's browser requests images for all of the user's stocks.
- Adversary can enumerate all stock pie chart images, knows sizes.
- Can tell what stocks a user has, based on sizes of data transfers.
- Similar to CRIME attack mentioned in guest lecture earlier this term.
References:
- Remote timing attacks are practical
- Cache missing for fun and profit
- Efficient Cache Attacks on AES, and Countermeasures
- Get Your Hands Off My Laptop: Physical Side-Channel Key-Extraction Attacks on PCs
- Cross-VM Side Channels and Their Use to Extract Private Keys
- Ed25519: high-speed high-security signatures