Skip to content

Architecture: Nanoscope ROM

Leland Takamine edited this page Jan 16, 2019 · 7 revisions

Fork of AOSP that enables Nanoscope's extremely low-overhead method tracing.

Overview

GitHub Repo

Goal

Trace every Java method with minimal instrumentation overhead.

Strategy

  • Upon entering a method, write a method pointer and timestamp to our trace data buffer to represent a push to the callstack.
  • On method exit, write a nullptr and timestamp to represent a pop.
  • When we stop tracing, process and flush the trace data to disk.

API

System Property API

  • nanoscope-propertywatcher.h

The System Property API is useful for controlling tracing from the command line. This is the recommended approach for integrating with external tools.

Start Tracing
# Data will be flushed to /data/data/com.example/files/data.txt
adb shell setprop dev.nanoscope com.example:data.txt
Stop Tracing
adb shell setprop dev.nanoscope \'\'

Java API

  • thread.cc
    • Thread::SetThreadName(const char*)

The Java API is useful for controlling tracing programmatically and/or if you want to trace a Thread that is not the Main Thread. If you need reproducible trace start and end points, use this method instead of the System Property API. Note that with this method, you do not need to run nanoscope start/stop.

Start Tracing
Thread.currentThread().setName("START_TRACING");
Stop Tracing
// Data will be flushed to /data/data/com.example/files/data.txt
Thread.currentThread().setName("STOP_TRACING:/data/data/com.example/files/data.txt");
Viewing the Trace File

When using the Java API, you do not need to run nanoscope start/stop. Once the data is flushed, pull the trace fill off the device and open it with nanoscope open <tracefile>.

Why use Thread.setName()?

It's true that it seems more natural to expose a new API called Thread.startTracing(), for example. The problem with this approach is that this new method wouldn't be visible to any apps as they'd be compiled against the standard Android SDK. Hijacking the Thread.setName method allows us to provide an API that's accessible from all apps. And don't worry, the calls above won't actually change the Thread name.

Listening for flush completion

The data is first flushed to a temporary file. Once the the file is fully written, it's moved over to the final location. Watch the final file path - when it exists, the flush is complete.

Trace File Format

After stopping a trace, the trace data is processed and flushed to the specified location on disk. This section defines the trace file format.

The file format is a simple human-readable push/pop format. Each line conforms to this pattern:

[timestamp]:[methodSignature | "POP"]

If the name after the colon is POP, then it represents a pop from the callstack. Otherwise, it's a push. Here's an example of a trace output file - it represents a parent method that calls two sibling methods:

12009599181900:void com.example.Test.methodA(java.lang.String)
12009599181946:void com.example.Test.methodB(java.lang.String)
12009599181982:POP
12009599181996:void com.example.Test.methodC()
12009599182019:POP
12009599182019:POP

The Test class may have looked something like this:

class Test {
    
    void methodA(String str) {
        methodB(str);
        methodC();
    }
    
    void methodB(String str) {}
    void methodC() {}
}

Why not Google's Trace Event Format?

Nanoscope can easily generate millions of trace events. We found that Google's Trace Event Format was too verbose at this scale and slowed down the generation and parsing of the trace file.

Trace Data Buffer

  • thread.h
    • Thread.tlsPtr_.trace_data

Our guiding principle here is to do as little work as possible during method execution. Since memory access is likely to be a instrumentation performance bottleneck, we want to reduce the amount of data written at the start and end of every method. With that in mind, we have a minimal in-memory representation of the trace data that conceptually looks like this:

struct Event {
  ArtMethod* method;
  uint64_t timestamp;
};

Event[] trace_data;

The above Event struct doesn't exist in code, but it does accurately represent how we store the trace data in memory. In practice, we store the trace data as an int64_t array (Thread.tlsPtr_.trace_data) for simplicity and to allow for easy extensions to the base format.

Writing to the Buffer

We've described the in-memory representation of our trace data above. This section describes how we write trace data to the buffer at the beginning and end of every Java method.

General Strategy

Method Entry
if (tlsPtr_.trace_data_ptr != nullptr) {                         // If our buffer exists then trace this Thread.
  *tlsPtr_.trace_data_ptr++ = reinterpret_cast<int64_t>(method); // Write ArtMethod pointer to the buffer.
  *tlsPtr_.trace_data_ptr++ = generic_timer_count();             // Write timer count to the buffer.
}
Method Exit
if (tlsPtr_.trace_data_ptr != nullptr) {
  *tlsPtr_.trace_data_ptr++ = 0;                     // Only difference from Method Entry is we write a nullptr to indicate a pop from the callstack.
  *tlsPtr_.trace_data_ptr++ = generic_timer_count();
}

Since Java code can be executed either as interpreted Dalvik bytecode or as compiled native instructions we cover the specifics of each scenario in the subsections below.

Interpreter

  • interpreter.cc
    • interpreter::Execute(Thread*, DexFile::CodeItem*, ShadowFrame&, JValue, bool)
  • thread.cc
    • Thread::TraceStart(ArtMethod*)
    • Thread::TraceEnd()

The interpreter logic is written exactly as shown in General Strategy. You can find the actual code for method entry and method exit in Thread::TraceStart and Thread::TraceEnd, respectively. All interpreted method execution flows through interpreter::Execute so it's straightforward to instrument all of these calls by executing Thread::TraceStart and Thread::TraceEnd at appropriate times in this method.

Compiler

  • code_generator_arm[64].cc
    • InstructionCodeGeneratorARM[64]::VisitTraceStart(HTraceStart*)
    • InstructionCodeGeneratorARM[64]::VisitTraceEnd(HTraceEnd*)
  • instruction_builder.cc
    • HInstructionBuilder::Build()
    • HInstructionBuilder::BuildReturn(Instruction&, Primitive::Type, uint32_t)
  • nodes.h
    • HGraph
    • HTraceStart
    • HTraceEnd

To improve runtime performance, Java code is sometimes compiled to native instructions (ART documentation). In these cases, calls don't flow through any single method like they do in the interpreter. The lack of a single instrumentable method complicates the instrumentation as we're forced to inject the appropriate logic at the beginning and end of every compiled method. We've added the following logic to accomplish this:

  1. Define two new pseudo-instructions (HTraceStart / HTraceEnd) to be used in the intermediate representation (IR).
  2. When we build the IR graph (HGraph), we add code to insert the HTraceStart and HTraceEnd at appropriate locations. This is done in HInstructionBuilder::Build and HInstructionBuilder::BuildReturn.
  3. Now that we've marked where the trace logic shall be inserted, we can override the methods InstructionCodeGeneratorARM[64]::VisitTraceStart and InstructionCodeGeneratorARM[64]::VisitTraceEnd where we'll actually generate the assembly for the method entry and method exit logic.

Below is the ARMv8-A assembly we need to generate at the beginning of each method. This logic lives in InstructionCodeGeneratorARM64::VisitTraceStart. Similar logic is required at the end of each method:

; TR = thread register
; MR = method register
; data_ptr = temp register
; timestamp = temp register

VisitTraceStart:
    LDR data_ptr, [TR, data_ptr_offset]  ; data_ptr = thread.data_ptr
    CBZ data_ptr, done                   ; if (data_ptr != null)
    MRS timestamp, SYS_CNTVCT_EL0        ;     timestamp = <Generic Timer count>
    STR MR, [data_ptr], +8               ;     *data_ptr++ = method
    STR timestamp, [data_ptr], +8        ;     *data_ptr++ = timestamp
    STR data_ptr, [TR, data_ptr_offset]  ;     *thread.data_ptr = data_ptr
  done:

Flushing the Buffer

  • thread.cc
    • flush_trace_data

Upon stopping tracing, we kick off a background thread to process and flush the trace data buffer. You can find this logic in the flush_trace_data method. Here's a pseudocode representation of the flushing process:

// TIMER_FREQUENCY: Timer ticks per nanosecond
// ptr: Pointer to beginning of trace data buffer
// end: Pointer representing the end of data written to trace data buffer

while (ptr < end) {
  methodName = convertToName(ptr++);
  timestamp = convertToNanoseconds(ptr++);
  writeToFile(methodName, timestamp);
}

convertToName(ptr) {
  return ptr == 0 ? "POP" : resolveArtMethodName(ptr);
}

convertToNanoseconds(ptr) {
  return ptr / TIMER_FREQUENCY;
}

Generic Timer Register

  • thread.cc
    • generic_timer_count()

Reporting the current time is a potential bottleneck for instrumentation performance. To reduce the overhead of calculating the current timestamp and to simplify the generated machine code, we read directly from ARM's Generic Timer Register.

Interpreter

  • generic_timer_count()

In the interpreter, we use the following code to read the timestamp for 64-bit processes:

asm volatile("mrs %0, cntvct_el0" : "=r"(t));

The volatile here is necessary to prevent the compiler from optimizing away multiple calls to generic_timer_count which would result in stale timestamps.

Compiler

  • code_generator_arm64.cc
    • SYS_CNTVCT_EL0
  • assembler_thumb2.cc
    • Thumb2Assembler::mrrc(Register, Register, int, int, int)

Ultimately, we generate the same assembly used in the interpreter to retrieve the timer value. The 64-bit and 32-bit code generators, however, each have limitations that make the implementation non-trivial.

CodeGeneratorARM64

The proper instruction to read from a system register is MRS as we saw in the interpreter implementation. Unfortunately, the code generation library that CodeGeneratorARM64 leverages (VIXL) doesn't include a definition for the cntvct_el0 Generic Timer count register. To work around this, we define this register ourselves based on the cntvct_el0 register documentation:

uint16_t SYS_CNTVCT_EL0 = ((1 << SysO0_offset) |
                            (3 << SysOp1_offset) |
                            (14 << CRn_offset) |
                            (0 << CRm_offset) |
                            (2 << SysOp2_offset)) >> ImmSystemRegister_offset;

Once we have this register properly defined, generating the proper instruction is simple:

__ Mrs(timestamp, (SystemRegister) SYS_CNTVCT_EL0);
CodeGeneratorARM

Similarly to the 64-bit code generator, the 32-bit version, CodeGeneratorARM, also lacks a definition of the cntvct_el0 system register. To make matters worse, its assembler doesn't even support the the instruction we need (mrrc). So the first step is to implement the code generation logic for mrrc in assembler_thumb2.cc:

void Thumb2Assembler::mrrc(Register r1, Register r2, int opc1, int coproc, int crm) {
  // ARM Architecture Reference Manual: Thumb-2 Supplement 4.6.81
  uint32_t hw1 = (0b111011000101 << 4) | (static_cast<uint32_t>(r2) << 0);
  uint32_t hw2 = (static_cast<uint32_t>(r1) << 12) | (coproc << 8) | (opc1 << 4) | (crm << 0);
  // Note that Emit32 swaps hw1 and hw2 parts for us so we order them in the natural order here.
  Emit32(hw1 << 16 | hw2);
}

The code above writes the raw bits required to generate the mrrc instruction with the given parameters. With the new instruction defined, we can generate the register read:

__ mrrc(temp1, temp2, 0b0001, 0b1111, 0b1110);