<h1>Taint tracking</h1>
<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.858 <a href="http://css.csail.mit.edu/6.858/2014/schedule.html">course website</a> from 2014.</p>
<h2>Android security policies</h2>
<p>What problem does the paper try to solve?</p>
<ul>
<li>Applications can exfiltrate a user's private
data and send it to some server.</li>
<li>High-level approach: keep track of which
data is sensitive, and prevent it from
leaving the device!</li>
<li>Why aren't Android permissions enough?
<ul>
<li>Android permissions control whether
an application can read/write data, or
access devices or resources (e.g.,
the Internet).</li>
<li>Using Android permissions, it's hard
to specify a policy about <em>particular</em>
types of data (<em>Example:</em> "Even if the app
has access to the network, it should
never be able to send user data over
the network").</li>
<li><strong>Q:</strong> Aha! What if we never install apps
that both read data <em>and</em> have network
access?</li>
<li><strong>A:</strong> This would prevent some obvious leaks,
but it would also break many legitimate
apps! (<em>Example:</em> email app) Even with such a
policy, data could still get out:
<ul>
<li>Information can still leak via side
channels. (<em>Example:</em> browser cache leaks
whether an object has been fetched
in the past)</li>
<li>Apps can collude! (<em>Example:</em> An app without
network privileges can pass data to
an app that does have network
privileges.)</li>
<li>A malicious app might trick another
app into sending data. (<em>Example:</em> Sending
an intent to the Gmail app?)</li>
</ul></li>
</ul></li>
</ul>
<p>What does Android malware actually do?</p>
<ul>
<li>Use location or IMEI for advertisements.
(IMEI is a unique per-device identifier.)</li>
<li>Credential stealing: send your contact list,
IMEI, and phone number to a remote server.</li>
<li>Turn your phone into a bot, use your contact
list to send spam emails/SMS messages!
<a href="http://www.bbc.com/news/technology-30143283">'Sophisticated' Android malware hits phones</a> </li>
<li>Preventing data exfiltration is useful, but
taint tracking by itself is insufficient to
keep your device from getting hacked!</li>
</ul>
<h2>TaintDroid overview</h2>
<p><em>TaintDroid</em> tracks sensitive information as it
propagates through the system.</p>
<ul>
<li><em>TaintDroid</em> distinguishes between information
sources and information sinks
<ul>
<li>Sources generate sensitive data:
<em>Example:</em> Sensors, contacts, IMEI</li>
<li>Sinks expose sensitive data:
<em>Example:</em> network.</li>
</ul></li>
<li><em>TaintDroid</em> uses a 32-bit bitvector to
represent taint, so there can be at most
32 distinct taint sources.</li>
<li>Roughly speaking, taint flows from the right-hand side
to the left-hand side of assignments.</li>
</ul>
<p><em>Examples:</em></p>
<pre><code>int lat = gps.getLatitude();
                         // The lat variable is now
                         // tainted!
</code></pre>
<p>The Dalvik VM is a register-based machine,
so taint assignment happens during the
execution of Dalvik opcodes (see Table 1 in the paper).</p>
<pre><code>move_op dst src          // dst receives src's taint
binary_op dst src0 src1  // dst receives union of src0
                         // and src1's taint
</code></pre>
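<p>The propagation rules above can be sketched in a few lines of Python.
This is only an illustration, not TaintDroid's implementation: the
<code>Registers</code> class, the register names, and the tag bit
values are all made up for the example.</p>
<pre><code>
# Hypothetical tag bits; the real bit assignments are not given in these notes.
TAINT_CLEAR = 0x0
TAINT_LOCATION = 0x1
TAINT_IMEI = 0x2


class Registers:
    """A toy register file where every register carries a taint bitvector."""

    def __init__(self):
        self.value = {}
        self.taint = {}

    def load_const(self, dst, constant):
        # Constants carry no sensitive information.
        self.value[dst] = constant
        self.taint[dst] = TAINT_CLEAR

    def load_source(self, dst, data, tag):
        # Reading from a taint source (e.g., GPS) sets that source's bit.
        self.value[dst] = data
        self.taint[dst] = tag

    def move_op(self, dst, src):
        # dst receives src's taint.
        self.value[dst] = self.value[src]
        self.taint[dst] = self.taint[src]

    def add_op(self, dst, src0, src1):
        # One binary op: dst receives the union (bitwise OR) of both taints.
        self.value[dst] = self.value[src0] + self.value[src1]
        self.taint[dst] = self.taint[src0] | self.taint[src1]


regs = Registers()
regs.load_source("v0", 42, TAINT_LOCATION)
regs.load_source("v1", 7, TAINT_IMEI)
regs.load_const("v2", 100)
regs.move_op("v3", "v0")        # v3 is tainted with LOCATION
regs.add_op("v4", "v0", "v1")   # v4 is tainted with LOCATION and IMEI
regs.add_op("v5", "v2", "v2")   # v5 stays clear
</code></pre>
<p>Because a tag is a bitvector, "union" is just a bitwise OR, which is
why propagation can be cheap per instruction.</p>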
<p>Interesting special case, arrays:</p>
<pre><code>char c = /* ... get c somehow ... */;
char[] uppercase = {'A', 'B', 'C', /* ... */};
char upperC = uppercase[c];
                         // upperC's taint is the
                         // union of c's and uppercase's
                         // taint.
</code></pre>
<ul>
<li>To minimize storage overheads, an array
receives a single taint tag, and all of
its elements have the same taint tag.</li>
<li><strong>Q:</strong> Why is it safe to associate just one
label with arrays or IPC messages?</li>
<li><strong>A:</strong> It should be safe to <em>over</em>-estimate
taint. This may lead to false positives,
but not false negatives.</li>
<li>Another special case: native methods
(i.e., internal VM methods like
<code>System.arraycopy()</code>, and native code
exposed via JNI).
<ul>
<li><strong>Problem:</strong> Native code doesn't go
through the Dalvik interpreter, so
<em>TaintDroid</em> can't automatically
propagate taint!</li>
<li><strong>Solution:</strong> Manually analyze the
native code, provide a summary of
its taint behavior.
<ul>
<li>Effectively, need to specify
how to copy taints from args
to return values. </li>
<li><strong>Q:</strong> How well does this scale?</li>
<li><strong>A:</strong> Authors argue this works OK
for internal VM functions
(e.g., <code>arraycopy</code>). For "easy"
calls, the analysis can be
automated---if only integers
or strings are passed, assign
the union of the input taints
to the return value.</li>
</ul></li>
</ul></li>
<li>IPC messages are treated like
arrays: each message is associated
with a single taint that is the union
of the taints of the constituent
parts.
<ul>
<li>Data which is extracted from an
incoming message is assigned
the taint of that message.</li>
</ul></li>
<li>Each file is associated with a single
taint flag that is stored in the
file's metadata.
<ul>
<li>Like with arrays and IPC messages,
this is a conservative scheme that
may lead to false positives.</li>
</ul></li>
</ul>
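<p>A rough Python sketch of two of these special cases: one tag per
array, and the "union of the inputs" summary for easy native calls.
<code>TaintedArray</code>, <code>summarized_native_call</code>, and the
tag value are hypothetical names chosen for illustration.</p>
<pre><code>
TAINT_CONTACTS = 0x4  # hypothetical tag bit


class TaintedArray:
    """An array with ONE taint tag shared by all of its elements."""

    def __init__(self, items):
        self.items = list(items)
        self.tag = 0

    def store(self, index, value, value_tag):
        # Storing one tainted element taints the whole array (over-estimate).
        self.items[index] = value
        self.tag |= value_tag

    def load(self, index, index_tag):
        # The result's taint is the union of the index's and the array's taint.
        return self.items[index], self.tag | index_tag


def summarized_native_call(func, args_with_tags):
    """Summary for an 'easy' native call: return value gets the union of argument taints."""
    values = [value for value, _ in args_with_tags]
    result_tag = 0
    for _, tag in args_with_tags:
        result_tag |= tag
    return func(*values), result_tag


arr = TaintedArray(["a", "b", "c"])
arr.store(1, "alice@example.com", TAINT_CONTACTS)
value, tag = arr.load(0, 0)  # element 0 never held contact data...
# ...but tag == TAINT_CONTACTS anyway: a false positive, never a false negative.

biggest, biggest_tag = summarized_native_call(max, [(3, 0x1), (9, 0x2)])
</code></pre>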
<p>How are taint flags represented in memory?</p>
<ul>
<li>Five kinds of things need to have taint
tags:
<ol>
<li>Local variables in a method</li>
<li>Method arguments</li>
<li>Object instance fields</li>
<li>Static class fields</li>
<li>Arrays</li>
</ol></li>
<li>Basic idea: Store the flags for a variable
near the variable itself.
<ul>
<li><strong>Q:</strong> Why?</li>
<li><strong>A:</strong> Preserves spatial locality---this
hopefully improves caching behavior.</li>
<li>For method arguments and local variables
that live on the stack, allocate the
taint flags immediately next to the
variable.</li>
</ul></li>
</ul>
<p><em>Example:</em></p>
<pre><code> .
.
| . |
+------------------+
| local0 |
+------------------+
| local0 taint tag |
+------------------+
| local1 |
+------------------+
| local1 taint tag |
+------------------+
.
.
.
_TaintDroid_ uses a similar approach
for class fields, object fields,
and arrays -- put the taint tag
next to the associated data.
</code></pre>
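<p>A minimal sketch of the interleaved stack layout, modeling the frame
as a flat Python list. The <code>InterleavedFrame</code> class is a
made-up name, and a real VM would do this with raw memory offsets
rather than a list.</p>
<pre><code>
class InterleavedFrame:
    """Stack frame laid out as [local0, tag0, local1, tag1, ...]."""

    def __init__(self, num_locals):
        # Each local needs two slots: one for the value, one for its tag.
        self.slots = [0] * (2 * num_locals)

    def set_local(self, i, value, tag):
        self.slots[2 * i] = value
        self.slots[2 * i + 1] = tag   # tag sits right next to the value

    def get_local(self, i):
        return self.slots[2 * i], self.slots[2 * i + 1]


frame = InterleavedFrame(2)
frame.set_local(1, 5, 0x2)
</code></pre>
<p>The value and its tag are adjacent, so reading one likely pulls the
other into the same cache line. The price is that the frame takes twice
as many slots.</p>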
<p>So, given all of this, the basic idea in
<em>TaintDroid</em> is simple: taint sensitive data
as it flows through the system, and raise
an alarm if that data tries to leave via
the network!</p>
<p>The authors find various ways that
apps misbehave:</p>
<ul>
<li>Sending location data to advertisers</li>
<li>Sending a user's phone number to the app servers</li>
</ul>
<p><em>TaintDroid</em>'s rules for information flow might lead
to counterintuitive/interesting results. Imagine that an application
implements its own linked list class. </p>
<pre><code> class ListNode{
Object data;
ListNode next;
}
</code></pre>
<p>Suppose that the application assigns tainted
values to the "data" field. If we calculate
the length of the list, is the length value
tainted?</p>
<p>Adding to a linked list involves:</p>
<ol>
<li>Allocating a <code>ListNode</code></li>
<li>Assigning to the <code>data</code> field</li>
<li>Patching up <code>next</code> pointers</li>
</ol>
<p>Note that <strong>Step 3</strong> doesn't involve tainted
data! So, "next" pointers are <em>not</em> tainted,
meaning that counting the number of
elements in the list would not generate
a tainted value for length, even though the
length depends on how much tainted data was stored.</p>
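<p>The linked-list example can be played out in a small Python sketch.
The <code>Tainted</code> wrapper and the tag value are hypothetical;
they stand in for the tags that the VM would keep next to each
variable.</p>
<pre><code>
TAINT_IMEI = 0x2  # hypothetical tag bit


class Tainted:
    """A value paired with its taint tag."""

    def __init__(self, value, tag=0):
        self.value = value
        self.tag = tag


class ListNode:
    def __init__(self, data):
        self.data = data   # a Tainted payload
        self.next = None   # plain reference, never computed from tainted data


def append(head, data):
    node = ListNode(data)            # steps 1 and 2: allocate, assign data
    if head is None:
        return node
    cur = head
    while cur.next is not None:
        cur = cur.next
    cur.next = node                  # step 3: patch up the next pointer
    return head


def length(head):
    count = Tainted(0)               # a constant, so its taint is clear
    cur = head
    while cur is not None:
        # count = count + 1: the constant 1 is untainted, so the union stays clear.
        count = Tainted(count.value + 1, count.tag | 0)
        cur = cur.next
    return count


head = None
for chunk in ("490", "154", "203"):
    head = append(head, Tainted(chunk, TAINT_IMEI))
n = length(head)   # n.value is 3, n.tag is 0
</code></pre>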
<p>What are the performance overheads of <em>TaintDroid</em>?</p>
<ul>
<li>Additional memory to store taint tags.</li>
<li>Additional CPU cost to assign, propagate,
check taint tags.</li>
<li>Overheads seem to be moderate: ~3--5%
memory overhead, 3--29% CPU overhead
<ul>
<li>However, on phones, users are very
concerned about battery life: 29%
less CPU performance may be
tolerable, but 29% less battery
life is bad.</li>
</ul></li>
</ul>
<h2>Questions and answers</h2>
<p><strong>Q:</strong> Why not track taint at the level of
x86 instructions or ARM instructions?</p>
<p><strong>A:</strong> It's too expensive, and there are
too many false positives.</p>
<ul>
<li><em>Example:</em> If kernel data structures are
improperly assigned taint, then
the taint will improperly flow
to user-mode processes. This
results in taint explosion: it's
impossible to tell which state
has <em>truly</em> been affected by
sensitive data.</li>
<li>One way that this might happen is
if the stack pointer or the base
pointer is incorrectly tainted.
Once this happens, taint rapidly
explodes:
<ul>
<li>Local variable accesses are
specified as offsets from
the base pointer.</li>
<li>Stack instructions like <code>pop</code>
use the stack pointer.</li>
<li><a href="http://www.ssrg.nicta.com.au/publications/papers/Slowinska_Bos_09.pdf">Pointless Tainting? Evaluating the Practicality of Pointer Tainting</a></li>
</ul></li>
</ul>
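<p>A toy illustration of why a tainted base pointer is so damaging,
assuming a pointer-tainting rule in which a loaded value inherits the
taint of the address used to reach it. The function name, offsets, and
tag value are invented for the example.</p>
<pre><code>
TAINT_IMEI = 0x2  # hypothetical tag bit


def load_with_pointer_tainting(cell_tags, address_tag, offset):
    """Taint of a loaded value = taint of the memory cell OR taint of the address."""
    return cell_tags[offset] | address_tag


# Three untainted locals at hypothetical offsets from the base pointer.
local_tags = {-4: 0, -8: 0, -12: 0}

with_clean_bp = [load_with_pointer_tainting(local_tags, 0, off)
                 for off in local_tags]
with_tainted_bp = [load_with_pointer_tainting(local_tags, TAINT_IMEI, off)
                   for off in local_tags]
# with_clean_bp   is [0, 0, 0]
# with_tainted_bp is [2, 2, 2]: every local now looks sensitive
</code></pre>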
<p><strong>Q:</strong> Taint tracking seems expensive---can't we
just examine inputs and outputs to look
for values that are known to be sensitive?</p>
<p><strong>A:</strong> This might work as a heuristic, but it's
easy for an adversary to get around it.</p>
<ul>
<li>There are many ways to encode data,
e.g., URL-quoting, binary versus
text formats, etc.</li>
</ul>
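<p>To see how easily the heuristic is bypassed, here is a small Python
sketch. The scanner function is hypothetical and the IMEI is just an
example value.</p>
<pre><code>
import base64


def naive_leak_scanner(payload, secrets):
    """Flag the payload only if a known secret appears verbatim."""
    return any(secret in payload for secret in secrets)


IMEI = "490154203237518"  # example value only

plain_payload = "id=" + IMEI
encoded_payload = "id=" + base64.b64encode(IMEI.encode()).decode()

caught = naive_leak_scanner(plain_payload, [IMEI])      # True
missed = naive_leak_scanner(encoded_payload, [IMEI])    # False: same data, different encoding
</code></pre>
<p>The scanner would have to anticipate every encoding the adversary
might use. Taint tracking follows the data through the encoding step
instead.</p>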
<h2>Implicit flows</h2>
<p>As described, taint tracking cannot detect <em>implicit flows</em>.</p>
<p>Implicit flows happen when a tainted value affects another variable
without directly assigning to that variable.</p>
<pre><code> if (imei > 42) {
x = 0;
} else {
x = 1;
}
</code></pre>
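<p>Here <code>x</code> ends up revealing whether the IMEI is greater
than 42, even though <code>imei</code> is never assigned to <code>x</code>.
Instead of assigning to <code>x</code>, each branch could send a different
message over the network, leaking information about the IMEI!</p>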
<p>Implicit flows often arise because of tainted values affecting control flow.</p>
<p>Can try to catch implicit flows by assigning a taint tag to the
PC, updating it with taint of branch test, and assigning PC
taint to values inside if-else clauses, but this can lead to
a lot of false positives.</p>
<p><em>Example:</em></p>
<pre><code> if (imei > 42) {
x = 0;
} else {
x = 0;
}
// The taint tracker thinks that
// x should be tagged with imei's
// taint, but there is no information
// flow!
</code></pre>
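<p>A sketch of the PC-taint idea in Python, under the simplifying
assumption of a single if-else with constant assignments. The function
<code>run_branch</code> and the tag value are hypothetical.</p>
<pre><code>
TAINT_IMEI = 0x2  # hypothetical tag bit


def run_branch(imei, imei_tag, then_value, else_value):
    """Simulate `if (imei > 42) x = then_value; else x = else_value;` with PC taint."""
    pc_tag = 0
    # Branching on a tainted test taints the PC inside the if-else.
    pc_tag |= imei_tag
    if imei > 42:
        x, x_tag = then_value, pc_tag   # clear constant, plus the PC's taint
    else:
        x, x_tag = else_value, pc_tag
    # At the join point the PC taint would be restored to its old value.
    return x, x_tag


real_flow = run_branch(100, TAINT_IMEI, 0, 1)      # (0, 2): x really depends on imei
false_positive = run_branch(100, TAINT_IMEI, 0, 0) # (0, 2): tainted, yet x is always 0
</code></pre>
<p>Both calls produce the same tag, so the tracker cannot tell the
genuine implicit flow apart from the harmless one.</p>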
<h2>Applications</h2>
<p>Interesting application of taint tracking:
keeping track of data copies.</p>
<ul>
<li>Often want to make sure sensitive data
(keys, passwords) is erased promptly.</li>
<li>If we're not worried about performance,
we can use x86-level taint tracking to
see how sensitive information flows
through a machine.
<a href="http://www-cs-students.stanford.edu/~blp/taintbochs.pdf">Ref</a></li>
<li>Basic idea: Create an x86 simulator that
interprets each x86 instruction in a full
system (OS + applications).</li>
<li>You'll find that software often keeps data
for longer than necessary. For example,
keystroke data stays around in:
<ul>
<li>Keyboard device driver's buffers</li>
<li>Kernel's random number generator</li>
<li>X server's event queue</li>
<li>Kernel socket/pipe buffers used to
pass messages containing keystrokes</li>
<li><code>tty</code> buffers for terminal apps</li>
<li>etc...</li>
</ul></li>
</ul>
<h3>TightLip</h3>
<p><em>TaintDroid</em> detects leaks of sensitive data,
but requires language support for the Java
VM -- the VM must implement taint tags. Can
we track sensitive information leaks without
support from a managed runtime? What if we
want to detect leaks in legacy C or C++
applications?</p>
<ul>
<li>One approach: use doppelganger processes
as introduced by the <a href="https://www.usenix.org/legacy/event/nsdi07/tech/full_papers/yumerefendi/yumerefendi.pdf">TightLip system</a></li>
<li><strong>Step 1</strong>: Periodically, <em>Tightlip</em> runs a
daemon which scans a user's file system
and looks for sensitive information like
mail files, word processing documents,
etc.
<ul>
<li>For each of these files, <em>Tightlip</em>
generates a shadow version of the
file. The shadow version is
non-sensitive, and contains
scrubbed data.</li>
<li><em>Tightlip</em> associates each type of
sensitive file with a specialized
scrubber. <em>Example:</em> email scrubber
overwrites to: and from: fields
with an equivalent number of
dummy characters.</li>
</ul></li>
<li><strong>Step 2</strong>: At some point later, a process
starts executing. Initially, it touches
no sensitive data. If it touches sensitive
data, then <em>Tightlip</em> spawns a doppelganger
process.
<ul>
<li>The doppelganger is a sandboxed
version of the original process.
<ul>
<li>Inherits most state from the
original process...</li>
<li>...but reads the scrubbed
data instead of sensitive data</li>
</ul></li>
<li><em>Tightlip</em> lets the two processes
run in parallel, and observes
the system calls that the two
processes make.</li>
<li>If the doppelganger makes the
same system calls with the same
arguments as the original process,
then with high probability, the
outputs do not depend on sensitive
data.</li>
</ul></li>
<li><strong>Step 3</strong>: If the system calls diverge,
and the doppelganger tries to make a
network call, <em>Tightlip</em> flags a potential
leak of sensitive data.
<ul>
<li>At this point, <em>Tightlip</em> or the user
can terminate the process, fail the
network write, or do something else.</li>
</ul></li>
<li>Nice things about <em>Tightlip</em>:
<ul>
<li>Works with legacy applications</li>
<li>Requires minor changes to standard
OSes to compare order of system
calls and their arguments</li>
<li>Low overhead (basically, the overhead
of running an additional process)</li>
</ul></li>
<li>Limitations of <em>Tightlip</em>:
<ul>
<li>Scrubbers are in the trusted computing
base.
<ul>
<li>They have to catch all instances
of sensitive data.</li>
<li>They also have to generate
reasonable dummy data -- otherwise,
a doppelganger might crash on
ill-formed inputs!</li>
</ul></li>
<li>If a doppelganger reads sensitive data
from multiple sources, and a system
call divergence occurs, <em>Tightlip</em> can't
tell why.</li>
</ul></li>
</ul>
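<p>The doppelganger comparison can be mimicked in Python. This is a
sketch under heavy assumptions: programs are plain functions, "system
calls" are recorded through a callback, and the two runs happen one
after the other rather than in parallel. All names here
(<code>scrub</code>, <code>run_traced</code>,
<code>check_for_leak</code>, the call names) are invented.</p>
<pre><code>
from itertools import zip_longest


def scrub(text):
    """Hypothetical scrubber: overwrite non-whitespace with 'x', keeping the length."""
    return "".join(c if c.isspace() else "x" for c in text)


def run_traced(program, file_contents):
    """Run `program` and record the (name, argument) system calls it makes."""
    trace = []

    def syscall(name, arg):
        trace.append((name, arg))

    program(file_contents, syscall)
    return trace


def check_for_leak(program, sensitive_contents):
    original = run_traced(program, sensitive_contents)
    doppelganger = run_traced(program, scrub(sensitive_contents))
    for orig_call, dopp_call in zip_longest(original, doppelganger):
        if orig_call != dopp_call:
            # First divergence: flag it only if it is a network call.
            divergent = dopp_call if dopp_call is not None else orig_call
            return divergent[0] == "send"
    return False  # identical traces: output likely independent of the secret


def word_count(contents, syscall):
    syscall("write_stdout", str(len(contents.split())))


def exfiltrate(contents, syscall):
    syscall("write_stdout", "ok")
    syscall("send", contents)


mail = "to: bob@example.com from: alice@example.com"
benign = check_for_leak(word_count, mail)    # False
leaky = check_for_leak(exfiltrate, mail)     # True
</code></pre>
<p>Note how much rides on the scrubber: <code>word_count</code> only
passes because the dummy data keeps the same whitespace structure as
the original.</p>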
<h3>Decentralized information flow control</h3>
<p><em>TaintDroid</em> and <em>Tightlip</em> assume no assistance
from the developer... but what if developers
were willing to explicitly add taint labels to their
code?</p>
<pre><code> int {Alice --> Bob} x; // Means that x is controlled
// by the principal Alice, who
// allows that data to be seen
// by Bob.
</code></pre>
<p><em>Input channels:</em> The read values get the label
of the channel.</p>
<p><em>Output channels:</em> Labels on the channel must match
a label on the value being written.</p>
<ul>
<li>Static (i.e., compile-time) checking can catch
many bugs involving inappropriate data flows.
<ul>
<li>Loosely speaking, labels are like strong
types which the compiler can reason about.</li>
<li>Static checks are much better than dynamic
checks: runtime failures (or their absence)
can be a covert channel!</li>
</ul></li>
<li>For more details, see the <a href="http://pmg.csail.mit.edu/papers/iflow-sosp97.pdf">Jif paper</a></li>
</ul>
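<p>A runtime sketch of the label check, in Python. Jif does this
statically in the compiler, so the code below only illustrates the
rule, not how Jif works. It also assumes a particular reading of
"match": every principal who can observe the channel must be the
owner or one of the readers the owner allows. The <code>Label</code>
class and <code>can_write</code> function are hypothetical.</p>
<pre><code>
class Label:
    """A label of the form {owner --> readers}."""

    def __init__(self, owner, readers):
        self.owner = owner
        self.readers = frozenset(readers)


def can_write(value_label, channel_observers):
    """Allow the write only if every observer of the channel may see the value."""
    allowed = value_label.readers | {value_label.owner}
    return set(channel_observers).issubset(allowed)


x_label = Label("Alice", {"Bob"})               # int {Alice --> Bob} x;

ok = can_write(x_label, {"Bob"})                # True
blocked = can_write(x_label, {"Bob", "Carol"})  # False: Carol is not an allowed reader
</code></pre>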