
<h1>Taint tracking</h1>

<p><strong>Note:</strong> These lecture notes were slightly modified from the ones posted on the 6.858 <a href="http://css.csail.mit.edu/6.858/2014/schedule.html">course website</a> from 2014.</p>

<h2>Android security policies</h2>

<p>What problem does the paper try to solve?</p>

<ul>
<li>Applications can exfiltrate a user's private
data and send it to some server.</li>
<li>High-level approach: keep track of which
data is sensitive, and prevent it from
leaving the device!</li>
<li>Why aren't Android permissions enough?
<ul>
<li>Android permissions control whether
an application can read/write data, or
access devices or resources (e.g.,
the Internet).</li>
<li>Using Android permissions, it's hard
to specify a policy about <em>particular</em>
types of data (<em>Example:</em> "Even if the app
has access to the network, it should
never be able to send user data over
the network").</li>
<li><strong>Q:</strong> Aha! What if we never install apps
that both read data <em>and</em> have network
access?</li>
<li><strong>A:</strong> This would prevent some obvious leaks,
but it would also break many legitimate
apps! (<em>Example:</em> email app)
<ul>
<li>Information can still leak via side
channels. (<em>Example:</em> browser cache leaks
whether an object has been fetched
in the past)</li>
<li>Apps can collude! (<em>Example:</em> An app without
network privileges can pass data to
an app that does have network
privileges.)</li>
<li>A malicious app might trick another
app into sending data. (<em>Example:</em> Sending
an intent to the Gmail app?)</li>
</ul></li>
</ul></li>
</ul>

<p>What does Android malware actually do?</p>

<ul>
<li>Use location or IMEI for advertisements.
(IMEI is a unique per-device identifier.)</li>
<li>Credential stealing: send your contact list,
IMEI, phone number to remote server.</li>
<li>Turn your phone into a bot, use your contact
list to send spam emails/SMS messages!
<a href="http://www.bbc.com/news/technology-30143283">'Sophisticated' Android malware hits phones</a> </li>
<li>Preventing data exfiltration is useful, but
taint tracking by itself is insufficient to
keep your device from getting hacked!</li>
</ul>

<h2>TaintDroid overview</h2>

<p><em>TaintDroid</em> tracks sensitive information as it
propagates through the system.</p>

<ul>
<li><em>TaintDroid</em> distinguishes between information
sources and information sinks
<ul>
<li>Sources generate sensitive data:
<em>Example:</em> Sensors, contacts, IMEI</li>
<li>Sinks expose sensitive data:
<em>Example:</em> network.</li>
</ul></li>
<li><em>TaintDroid</em> uses a 32-bit bitvector to
represent taint, so there can be at most
32 distinct taint sources.</li>
<li>Roughly speaking, taint flows from the right-hand side
to the left-hand side of assignments.</li>
</ul>

<p><em>Examples:</em></p>

<pre><code>int lat = gps.getLatitude();
                // The lat variable is now
                // tainted!

// The Dalvik VM is a register-based machine,
// so taint assignment happens during the
// execution of Dalvik opcodes [see Table 1].

   move_op dst src          // dst receives src's taint
   binary_op dst src0 src1  // dst receives union of src0
                            // and src1's taint
</code></pre>
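<p>The propagation rules above can be sketched in plain Java. This is a minimal illustration, not TaintDroid's implementation: the real system applies these rules inside the Dalvik interpreter, whereas the <code>TaintedInt</code> wrapper and the bit assignments below are hypothetical names chosen for the example.</p>

```java
// Hypothetical model of TaintDroid-style propagation; the real system
// implements these rules inside the Dalvik interpreter, not in wrappers.
final class TaintedInt {
    // One bit per taint source; a 32-bit int holds up to 32 sources.
    static final int TAINT_CLEAR    = 0;
    static final int TAINT_LOCATION = 1 << 0;  // illustrative bit assignment
    static final int TAINT_IMEI     = 1 << 1;  // illustrative bit assignment

    final int value;
    final int tag;

    TaintedInt(int value, int tag) {
        this.value = value;
        this.tag = tag;
    }

    // move_op dst src: dst receives src's value and src's taint.
    static TaintedInt move(TaintedInt src) {
        return new TaintedInt(src.value, src.tag);
    }

    // binary_op dst src0 src1 (addition here): dst's taint is the
    // union (bitwise OR) of the two operands' taints.
    static TaintedInt add(TaintedInt src0, TaintedInt src1) {
        return new TaintedInt(src0.value + src1.value, src0.tag | src1.tag);
    }

    public static void main(String[] args) {
        TaintedInt lat = new TaintedInt(42, TAINT_LOCATION);
        TaintedInt imei = new TaintedInt(7, TAINT_IMEI);
        TaintedInt sum = add(move(lat), imei);
        System.out.println("tag bits: " + Integer.toBinaryString(sum.tag));
    }
}
```

<p>Representing the tag as a bitvector makes the union a single OR instruction, which is one reason the per-opcode cost stays small.</p>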

<p>Interesting special case, arrays:</p>

<pre><code>   char c = //. . . get c somehow.
   char uppercase[] = ['A', 'B', 'C', . . .];
   char upperC = uppercase[c];
                     // upperC's taint is the
                     // union of c and uppercase's
                     // taint.
</code></pre>
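<p>A rough sketch of the array rule, assuming a hypothetical <code>TaintedCharArray</code> class that keeps one tag for the whole array. A read returns the union of the index's taint and the array's taint; a write folds the stored value's taint into the array's single tag.</p>

```java
// Hypothetical sketch: one taint tag shared by every element of an array.
final class TaintedCharArray {
    private final char[] data;
    private int arrayTag = 0;  // single tag for the whole array

    TaintedCharArray(char[] data) {
        this.data = data.clone();
    }

    // Storing a tainted value taints the whole array (an over-estimate).
    void put(int index, char value, int valueTag) {
        data[index] = value;
        arrayTag |= valueTag;
    }

    // The value stored at the given index.
    char get(int index) {
        return data[index];
    }

    // Taint of a value read through an index that carries indexTag:
    // the union of the index's taint and the array's taint.
    int tagOfGet(int indexTag) {
        return indexTag | arrayTag;
    }

    public static void main(String[] args) {
        TaintedCharArray uppercase =
                new TaintedCharArray(new char[] {'A', 'B', 'C'});
        int cTag = 1 << 1;  // pretend the index c came from a tainted source
        System.out.println(uppercase.get(2) + " has tag " + uppercase.tagOfGet(cTag));
    }
}
```

<p>Including the index's taint matters for lookup tables: the table itself is untainted, but the element selected reveals the tainted index.</p>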

<ul>
<li>To minimize storage overheads, an array
receives a single taint tag, and all of
its elements have the same taint tag.</li>
<li><strong>Q:</strong> Why is it safe to associate just one
label with arrays or IPC messages?</li>
<li><strong>A:</strong> It should be safe to <em>over</em>-estimate
taint. This may lead to false positives,
but not false negatives.</li>
<li>Another special case: native methods
(i.e., internal VM methods like
<code>System.arraycopy()</code>, and native code
exposed via JNI).
<ul>
<li><strong>Problem:</strong> Native code doesn't go
through the Dalvik interpreter, so
<em>TaintDroid</em> can't automatically
propagate taint!</li>
<li><strong>Solution:</strong> Manually analyze the
native code, provide a summary of
its taint behavior.
<ul>
<li>Effectively, need to specify
how to copy taints from args
to return values.      </li>
<li><strong>Q:</strong> How well does this scale?</li>
<li><strong>A:</strong> The authors argue this works OK
for internal VM functions
(e.g., <code>arraycopy</code>). For "easy"
calls, the analysis can be
automated: if only integers
or strings are passed, assign
the union of the input taints
to the return value.</li>
</ul></li>
</ul></li>
<li>IPC messages are treated like
arrays: each message is associated
with a single taint that is the union
of the taints of the constituent
parts.
<ul>
<li>Data which is extracted from an
incoming message is assigned
the taint of that message.</li>
</ul></li>
<li>Each file is associated with a single
taint flag that is stored in the
file's metadata.
<ul>
<li>Like with arrays and IPC messages,
this is a conservative scheme that
may lead to false positives.</li>
</ul></li>
</ul>
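<p>The message-level rule can be illustrated with a small sketch. The <code>TaintedMessage</code> class is a hypothetical stand-in for an IPC message, not an Android API; it shows how a single tag per message leads to false positives for the untainted parts.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an IPC message with one tag for the whole message.
final class TaintedMessage {
    private final List<String> parts = new ArrayList<>();
    private int messageTag = 0;  // union of the taints of all parts

    // Writing a part folds its taint into the message's single tag.
    void write(String part, int partTag) {
        parts.add(part);
        messageTag |= partTag;
    }

    // The i-th part of the message.
    String read(int i) {
        return parts.get(i);
    }

    // Anything extracted from the message is assigned the message's tag,
    // regardless of which part it came from.
    int tagOfRead() {
        return messageTag;
    }

    public static void main(String[] args) {
        TaintedMessage msg = new TaintedMessage();
        msg.write("hello", 0);            // untainted greeting
        msg.write("123456789012345", 2);  // made-up IMEI with an illustrative tag
        System.out.println(msg.read(0) + " is read with tag " + msg.tagOfRead());
    }
}
```

<p>Here the untainted greeting comes out tagged with the IMEI's taint: a false positive, but never a false negative.</p>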

<p>How are taint flags represented in memory?</p>

<ul>
<li>Five kinds of things need to have taint
tags:
<ol>
<li>Local variables in a method</li>
<li>Method arguments</li>
<li>Object instance fields</li>
<li>Static class fields</li>
<li>Arrays</li>
</ol></li>
<li>Basic idea: Store the flags for a variable
near the variable itself.
<ul>
<li><strong>Q:</strong> Why?</li>
<li><strong>A:</strong> Preserves spatial locality, which
hopefully improves caching behavior.</li>
<li>For method arguments and local variables
that live on the stack, allocate the
taint flags immediately next to the
variable.</li>
</ul></li>
</ul>

<p><em>Example:</em></p>

<pre><code>                 .
                 .
        |        .         |
        +------------------+
        |     local0       |
        +------------------+
        | local0 taint tag |
        +------------------+
        |     local1       |
        +------------------+
        | local1 taint tag |
        +------------------+
                 .
                 .
                 .
</code></pre>

<p><em>TaintDroid</em> uses a similar approach
for class fields, object fields,
and arrays: put the taint tag
next to the associated data.</p>
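<p>As a rough model of the interleaved layout, the sketch below stores each local variable and its tag in adjacent slots of one array. The <code>InterleavedFrame</code> class is hypothetical; the actual layout lives in the Dalvik interpreter's stack frames.</p>

```java
// Hypothetical model of a frame laid out as [local0, tag0, local1, tag1, ...].
final class InterleavedFrame {
    private final int[] slots;

    InterleavedFrame(int numLocals) {
        // Two slots per local: the value, then its taint tag.
        slots = new int[2 * numLocals];
    }

    // Write a local and its tag into adjacent slots.
    void set(int local, int value, int tag) {
        slots[2 * local] = value;
        slots[2 * local + 1] = tag;
    }

    int value(int local) {
        return slots[2 * local];
    }

    int tag(int local) {
        return slots[2 * local + 1];
    }

    public static void main(String[] args) {
        InterleavedFrame frame = new InterleavedFrame(2);
        frame.set(0, 5, 1);  // local0 is tainted with bit 0
        frame.set(1, 9, 0);  // local1 is untainted
        System.out.println("local0 = " + frame.value(0) + ", tag = " + frame.tag(0));
    }
}
```

<p>Because a value and its tag are neighbors, reading both usually touches the same cache line.</p>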

<p>So, given all of this, the basic idea in
<em>TaintDroid</em> is simple: taint sensitive data
as it flows through the system, and raise
an alarm if that data tries to leave via
the network!</p>
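<p>A minimal sketch of the check at a network sink, assuming a hypothetical mapping from tag bits to source names. A non-empty result means the outgoing data carries taint and should be reported.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sink-side check: decode which sources taint an outgoing payload.
final class SinkCheck {
    // Illustrative mapping from bit position to source name.
    static final String[] SOURCE_NAMES = {"location", "imei", "contacts"};

    // Names of all sources whose bit is set in the tag.
    static List<String> taintedSources(int tag) {
        List<String> names = new ArrayList<>();
        for (int bit = 0; bit < SOURCE_NAMES.length; bit++) {
            if ((tag & (1 << bit)) != 0) {
                names.add(SOURCE_NAMES[bit]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        int payloadTag = (1 << 0) | (1 << 2);
        List<String> sources = taintedSources(payloadTag);
        if (!sources.isEmpty()) {
            System.out.println("ALERT: outgoing data tainted by " + sources);
        }
    }
}
```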

<p>The authors find various ways that
apps misbehave:</p>

<ul>
<li>Sending location data to advertisers</li>
<li>Sending a user's phone number to the app servers</li>
</ul>

<p><em>TaintDroid</em>'s rules for information flow might lead
to counterintuitive/interesting results. Imagine that an application
implements its own linked list class.</p>

<pre><code>    class ListNode{
        Object data;
        ListNode next;
    }
</code></pre>

<p>Suppose that the application assigns tainted
values to the "data" field. If we calculate
the length of the list, is the length value
tainted?</p>

<p>Adding to a linked list involves:</p>

<ol>
<li>Allocating a <code>ListNode</code></li>
<li>Assigning to the <code>data</code> field</li>
<li>Patching up <code>next</code> pointers</li>
</ol>

<p>Note that <strong>Step 3</strong> doesn't involve tainted
data! So, "next" pointers are <em>not</em> tainted,
meaning that counting the number of
elements in the list would not generate
a tainted value for length.</p>
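<p>The linked-list case can be made concrete with a sketch that tracks a separate tag for each field. The <code>TaintedList</code> class and its explicit tag fields are hypothetical; they only model how explicit-flow tracking would treat the three steps.</p>

```java
// Hypothetical model: each field of a node carries its own taint tag.
final class TaintedList {
    static final class Node {
        int data;
        int dataTag;  // taint of the data field
        Node next;
        int nextTag;  // taint of the next reference
    }

    private Node head;

    void add(int data, int dataTag) {
        Node n = new Node();  // step 1: allocate a node
        n.data = data;        // step 2: assign the (possibly tainted) data
        n.dataTag = dataTag;
        n.next = head;        // step 3: patch pointers; no tainted value involved
        n.nextTag = 0;
        head = n;
    }

    // Counts nodes by following next references only.
    int length() {
        int len = 0;
        for (Node n = head; n != null; n = n.next) {
            len++;
        }
        return len;
    }

    // Taint of the computed length: even if we conservatively union the taint
    // of every reference traversed, the data tags never enter the computation.
    int lengthTag() {
        int tag = 0;
        for (Node n = head; n != null; n = n.next) {
            tag |= n.nextTag;
        }
        return tag;
    }

    public static void main(String[] args) {
        TaintedList list = new TaintedList();
        list.add(10, 1 << 1);
        list.add(20, 1 << 1);
        System.out.println("length = " + list.length() + ", tag = " + list.lengthTag());
    }
}
```

<p>The length comes out untainted even though the number of tainted items stored is itself information about the sensitive data.</p>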

<p>What are the performance overheads of <em>TaintDroid</em>?</p>

<ul>
<li>Additional memory to store taint tags.</li>
<li>Additional CPU cost to assign, propagate,
check taint tags.</li>
<li>Overheads seem to be moderate: roughly 3 to 5%
memory overhead and 3 to 29% CPU overhead
<ul>
<li>However, on phones, users are very
concerned about battery life: 29%
less CPU performance may be
tolerable, but 29% less battery
life is bad.</li>
</ul></li>
</ul>

<h2>Questions and answers</h2>

<p><strong>Q:</strong> Why not track taint at the level of
x86 instructions or ARM instructions?</p>

<p><strong>A:</strong> It's too expensive, and there are
too many false positives.</p>

<ul>
<li><em>Example:</em> If kernel data structures are
improperly assigned taint, then
the taint will improperly flow
to user-mode processes. This
results in taint explosion: it's
impossible to tell which state
has <em>truly</em> been affected by
sensitive data.</li>
<li>One way that this might happen is
if the stack pointer or the base
pointer is incorrectly tainted.
Once this happens, taint rapidly
explodes:
<ul>
<li>Local variable accesses are
specified as offsets from
the base pointer.</li>
<li>Stack instructions like <code>pop</code>
use the stack pointer.</li>
<li><a href="http://www.ssrg.nicta.com.au/publications/papers/Slowinska_Bos_09.pdf">Pointless Tainting? Evaluating the Practicality of Pointer Tainting</a></li>
</ul></li>
</ul>

<p><strong>Q:</strong> Taint tracking seems expensive. Can't we
just examine inputs and outputs to look
for values that are known to be sensitive?</p>

<p><strong>A:</strong> This might work as a heuristic, but it's
easy for an adversary to get around it.</p>

<ul>
<li>There are many ways to encode data,
e.g., URL-quoting, binary versus
text formats, etc.</li>
</ul>
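<p>To see how easily output scanning is evaded, consider a naive filter that searches outgoing payloads for the raw secret. The <code>NaiveScanner</code> class and the IMEI value are made up for this sketch; <code>java.util.Base64</code> is the standard library encoder.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical output filter that looks for a known-sensitive value verbatim.
final class NaiveScanner {
    // Flags a payload only if the secret appears in it unmodified.
    static boolean flags(String payload, String secret) {
        return payload.contains(secret);
    }

    // An adversary can re-encode the secret before sending it.
    static String encode(String secret) {
        return Base64.getEncoder()
                .encodeToString(secret.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String imei = "123456789012345";  // made-up value for illustration
        System.out.println("plain flagged:   " + flags("id=" + imei, imei));
        System.out.println("encoded flagged: " + flags("id=" + encode(imei), imei));
    }
}
```

<p>The receiving server can trivially decode the payload, so the filter stops nothing. Taint tracking follows the data through the encoding step instead of matching on its bytes.</p>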

<h2>Implicit flows</h2>

<p>As described, taint tracking cannot detect <em>implicit flows</em>.</p>

<p>Implicit flows happen when a tainted value affects another variable 
without directly assigning to that variable.</p>

<pre><code>     if (imei &gt; 42) {
         x = 0;
     } else {
         x = 1;
     }
</code></pre>

<p>Instead of assigning to <code>x</code>, we could try to leak information 
about the IMEI over the network!</p>

<p>Implicit flows often arise because of tainted values affecting control flow.</p>

<p>Can try to catch implicit flows by assigning a taint tag to the
PC, updating it with taint of branch test, and assigning PC
taint to values inside if-else clauses, but this can lead to
a lot of false positives.</p>

<p><em>Example:</em></p>

<pre><code>     if (imei &gt; 42) {
         x = 0;
     } else {
         x = 0;
     }

     // The taint tracker thinks that
     // x should be tagged with imei's
     // taint, but there is no information
     // flow!
</code></pre>
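<p>The difference between explicit-only tracking and PC tainting can be sketched as follows. The <code>ImplicitFlow</code> class is a hypothetical model that runs the first if/else example by hand and reports both the value of <code>x</code> and the tag each policy would assign to it.</p>

```java
// Hypothetical model of the if/else example under two tracking policies.
final class ImplicitFlow {
    // Returns {x, xTag}. With trackPc set, the PC picks up the taint of the
    // branch test and passes it to every assignment inside the branches.
    static int[] run(int imei, int imeiTag, boolean trackPc) {
        int pcTag = trackPc ? imeiTag : 0;
        int x;
        int xTag;
        if (imei > 42) {
            x = 0;
            xTag = pcTag;  // the constant is untainted; only PC taint applies
        } else {
            x = 1;
            xTag = pcTag;
        }
        return new int[] {x, xTag};
    }

    public static void main(String[] args) {
        int imeiTag = 1 << 1;  // illustrative tag bit
        int[] explicitOnly = run(100, imeiTag, false);
        int[] withPc = run(100, imeiTag, true);
        System.out.println("explicit only: x=" + explicitOnly[0] + " tag=" + explicitOnly[1]);
        System.out.println("with PC taint: x=" + withPc[0] + " tag=" + withPc[1]);
    }
}
```

<p>Without PC taint, <code>x</code> reveals whether the IMEI exceeds 42 yet carries no tag. With PC taint, <code>x</code> is tagged in both branches, and it would be tagged just the same if both branches assigned the same constant.</p>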

<h2>Applications</h2>

<p>Interesting application of taint tracking:
keeping track of data copies.</p>

<ul>
<li>Often want to make sure sensitive data
(keys, passwords) is erased promptly.</li>
<li>If we're not worried about performance,
we can use x86-level taint tracking to
see how sensitive information flows
through a machine.
<a href="http://www-cs-students.stanford.edu/~blp/taintbochs.pdf">Ref</a></li>
<li>Basic idea: Create an x86 simulator that
interprets each x86 instruction in a full
system (OS + applications).</li>
<li>You'll find that software often keeps data
for longer than necessary. For example,
keystroke data stays around in:
<ul>
<li>Keyboard device driver's buffers</li>
<li>Kernel's random number generator</li>
<li>X server's event queue</li>
<li>Kernel socket/pipe buffers used to
pass messages containing keystrokes</li>
<li><code>tty</code> buffers for terminal apps</li>
<li>etc...</li>
</ul></li>
</ul>

<h3>TightLip</h3>

<p><em>TaintDroid</em> detects leaks of sensitive data,
but requires language support for the Java
VM -- the VM must implement taint tags. Can
we track sensitive information leaks without
support from a managed runtime? What if we
want to detect leaks in legacy C or C++
applications?</p>

<ul>
<li>One approach: use doppelganger processes
as introduced by the <a href="https://www.usenix.org/legacy/event/nsdi07/tech/full_papers/yumerefendi/yumerefendi.pdf">TightLip system</a></li>
<li><strong>Step 1</strong>: Periodically, <em>Tightlip</em> runs a
daemon which scans a user's file system
and looks for sensitive information like
mail files, word processing documents,
etc.
<ul>
<li>For each of these files, <em>Tightlip</em>
generates a shadow version of the
file. The shadow version is
non-sensitive, and contains
scrubbed data.</li>
<li><em>Tightlip</em> associates each type of
sensitive file with a specialized
scrubber. <em>Example:</em> email scrubber
overwrites to: and from: fields
with an equivalent number of
dummy characters.</li>
</ul></li>
<li><strong>Step 2</strong>: At some point later, a process
starts executing. Initially, it touches
no sensitive data. If it touches sensitive
data, then <em>Tightlip</em> spawns a doppelganger
process.
<ul>
<li>The doppelganger is a sandboxed
version of the original process.
<ul>
<li>Inherits most state from the
original process...</li>
<li>...but reads the scrubbed
data instead of sensitive data</li>
</ul></li>
<li><em>Tightlip</em> lets the two processes
run in parallel, and observes
the system calls that the two
processes make.</li>
<li>If the doppelganger makes the
same system calls with the same
arguments as the original process,
then with high probability, the
outputs do not depend on sensitive
data.</li>
</ul></li>
<li><strong>Step 3</strong>: If the system calls diverge,
and the doppelganger tries to make a
network call, <em>Tightlip</em> flags a potential
leak of sensitive data.
<ul>
<li>At this point, <em>Tightlip</em> or the user
can terminate the process, fail the
network write, or do something else.</li>
</ul></li>
<li>Nice things about <em>Tightlip</em>:
<ul>
<li>Works with legacy applications</li>
<li>Requires minor changes to standard
OSes to compare order of system
calls and their arguments</li>
<li>Low overhead (basically, the overhead
of running an additional process)</li>
</ul></li>
<li>Limitations of <em>Tightlip</em>
<ul>
<li>Scrubbers are in the trusted computing
base.
<ul>
<li>They have to catch all instances
of sensitive data.</li>
<li>They also have to generate
reasonable dummy data -- otherwise,
a doppelganger might crash on
ill-formed inputs!</li>
</ul></li>
<li>If a doppelganger reads sensitive data
from multiple sources, and a system
call divergence occurs, <em>Tightlip</em> can't
tell why.</li>
</ul></li>
</ul>
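<p>The comparison in Steps 2 and 3 can be sketched as a walk over two system call traces. This is a simplification under stated assumptions: system calls are rendered as strings, the prefixes that mark a network call are invented for the example, and the real system compares calls inside the OS as they happen rather than after the fact.</p>

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical after-the-fact comparison of two system call traces.
final class DoppelgangerCheck {
    enum Verdict { NO_DIVERGENCE, LOCAL_DIVERGENCE, POTENTIAL_LEAK }

    // Assumed convention: network calls are rendered with these prefixes.
    static boolean isNetworkCall(String syscall) {
        return syscall.startsWith("send(") || syscall.startsWith("connect(");
    }

    // Finds the first point where the traces differ. A divergence on a
    // network call by the doppelganger is reported as a potential leak.
    static Verdict compare(List<String> original, List<String> doppelganger) {
        int n = Math.min(original.size(), doppelganger.size());
        for (int i = 0; i < n; i++) {
            if (!original.get(i).equals(doppelganger.get(i))) {
                return isNetworkCall(doppelganger.get(i))
                        ? Verdict.POTENTIAL_LEAK
                        : Verdict.LOCAL_DIVERGENCE;
            }
        }
        return original.size() == doppelganger.size()
                ? Verdict.NO_DIVERGENCE
                : Verdict.LOCAL_DIVERGENCE;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList(
                "open(mail.txt)", "read(3)", "send(4, alice@example.com)");
        List<String> doppelganger = Arrays.asList(
                "open(mail.txt)", "read(3)", "send(4, xxxxx@xxxxxxx.xxx)");
        System.out.println(compare(original, doppelganger));
    }
}
```

<p>The doppelganger sends scrubbed bytes where the original sends the real address, so the arguments differ at a network call and the run is flagged.</p>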

<h3>Decentralized information flow control</h3>

<p><em>TaintDroid</em> and <em>Tightlip</em> assume no assistance
from the developer... but what if developers
were willing to explicitly add taint labels to their
code?</p>

<pre><code>  int {Alice --&gt; Bob} x;  // Means that x is controlled
                          // by the principal Alice, who
                          // allows that data to be seen
                          // by Bob.
</code></pre>

<p><em>Input channels:</em> The read values get the label
of the channel.</p>

<p><em>Output channels:</em> Labels on the channel must match
a label on the value being written.</p>
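<p>A rough runtime sketch of the output-channel rule, assuming a simplified label with a single owner and a set of permitted readers. The <code>FlowLabel</code> class is hypothetical, and a language with label support would perform this check at compile time rather than at runtime.</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical single-owner label: the owner allows only `readers` to see the data.
final class FlowLabel {
    final String owner;
    final Set<String> readers;

    FlowLabel(String owner, String... readers) {
        this.owner = owner;
        this.readers = new HashSet<>(Arrays.asList(readers));
    }

    // Data may flow to a destination only if the destination is at least as
    // restrictive: same owner, and no reader beyond those already permitted.
    boolean canFlowTo(FlowLabel dest) {
        return owner.equals(dest.owner) && readers.containsAll(dest.readers);
    }

    public static void main(String[] args) {
        FlowLabel x = new FlowLabel("Alice", "Bob");  // like {Alice --> Bob}
        FlowLabel channelToBob = new FlowLabel("Alice", "Bob");
        FlowLabel channelToBobAndCarol = new FlowLabel("Alice", "Bob", "Carol");
        System.out.println("write to Bob allowed:       " + x.canFlowTo(channelToBob));
        System.out.println("write to Bob+Carol allowed: " + x.canFlowTo(channelToBobAndCarol));
    }
}
```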

<ul>
<li>Static (i.e., compile-time) checking can catch
many bugs involving inappropriate data flows.
<ul>
<li>Loosely speaking, labels are like strong
types which the compiler can reason about.</li>
<li>Static checks are much better than dynamic
checks: runtime failures (or their absence)
can be a covert channel!</li>
</ul></li>
<li>For more details, see the <a href="http://pmg.csail.mit.edu/papers/iflow-sosp97.pdf">Jif paper</a></li>
</ul>