explore CAPE sandbox report file format #1535

williballenthin · 2023-06-12T11:10:24Z

use this issue to describe the interesting parts of the CAPE sandbox report file format. describe how we could extract data into capa-level features.

williballenthin · 2023-06-12T11:22:15Z

using 0000a65749f5902c4d82ffa701198038f0b4870b00a27cfca109f8f933476d82.json from the avast repo

general layout like this:

behavior.processes[].calls[] has the API trace:

from this we can extract:

PID
TID
return address
API trace:
- API name
- return value
- arguments
  - value
  - name
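A minimal sketch of walking the trace and pulling out the fields listed above. The key names used here (`process_id`, `thread_id`, `caller`, `api`, `return`, `arguments`) are assumptions about the report layout and should be checked against a real report; the sample data is made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass(frozen=True)
class Call:
    pid: int
    tid: int
    return_address: int
    api: str
    return_value: str
    arguments: tuple  # tuple of (name, value) pairs


def iter_calls(report: dict[str, Any]) -> Iterator[Call]:
    """Yield one Call per entry in behavior.processes[].calls[]."""
    for process in report.get("behavior", {}).get("processes", []):
        pid = int(process["process_id"])  # assumed key name
        for call in process.get("calls", []):
            yield Call(
                pid=pid,
                tid=int(call["thread_id"]),  # assumed key name
                return_address=int(call["caller"], 16),  # assumed hex string
                api=call["api"],
                return_value=call["return"],
                arguments=tuple(
                    (arg["name"], arg["value"]) for arg in call.get("arguments", [])
                ),
            )


# Hypothetical report fragment, not taken from the sample above.
sample_report = {
    "behavior": {
        "processes": [
            {
                "process_id": 2852,
                "calls": [
                    {
                        "thread_id": "2744",
                        "caller": "0x00401a2b",
                        "api": "RegOpenKeyExA",
                        "return": "0x00000000",
                        "arguments": [
                            {"name": "Registry", "value": "0x80000002"},
                            {"name": "SubKey", "value": "SOFTWARE\\Example"},
                        ],
                    }
                ],
            }
        ]
    }
}

calls = list(iter_calls(sample_report))
```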

the return address feature would potentially enable a "call stack scope", like "all events found with the same return address".
however, I'm not sure how to interpret the addresses listed there, because the memory map for the process isn't available, so I'm also not sure how to restrict these values.
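As a rough illustration of the "call stack scope" idea, the sketch below groups API names by (PID, return address). It treats the addresses as opaque keys, so it does not solve the interpretation problem mentioned above. The record layout (`pid`, `caller`, `api`) is hypothetical.

```python
from collections import defaultdict


def group_by_return_address(calls):
    """Group API names by (pid, return address), preserving trace order."""
    groups = defaultdict(list)
    for call in calls:
        groups[(call["pid"], call["caller"])].append(call["api"])
    return dict(groups)


# Hypothetical, already-flattened call records.
sample_calls = [
    {"pid": 100, "caller": "0x00401000", "api": "CreateFileW"},
    {"pid": 100, "caller": "0x00402000", "api": "RegOpenKeyExW"},
    {"pid": 100, "caller": "0x00401000", "api": "WriteFile"},
]

groups = group_by_return_address(sample_calls)
```

Including the PID in the key keeps identical addresses in different processes apart, since the same address can mean different things in different address spaces.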

when the argument is a string, it is parsed as a string, not as a pointer to some memory region:

some enums are also parsed into human readable strings:

handles are not consistently tracked, such as the hKey referenced here:

williballenthin · 2023-06-12T12:06:00Z

0000a65749f5902c4d82ffa701198038f0b4870b00a27cfca109f8f933476d82.zip

yelhamer · 2023-06-13T11:54:40Z

Categorizing the report sections by level of utility (redundant, future-use, to-be-used):

to-be-used:

static: this report section contains useful information for extracting the file features of a sample, such as: imports and exports, sections, format, as well as other information that can be used in the section scope.
strings: this report section contains the strings extracted from the sample as well as files dropped by the sample. this will be useful for extracting string features.
network: this section gives a limited overview of the extracted network traffic: protocol, src ip and port, dst ip and port, as well as non-useful information such as the packets' offset and timestamp. For more in-depth network analysis (packet content, for example) we'd need to use the extracted pcap files.
commands and mutexes: these should be determinable from the call traces, so maybe we'd want to extract them only as string features and not as separate features? If we choose to include a commands feature, then we'd probably want to extract that from the process tree section as well, since the environment variables (including the CommandLine variable) are specified for each process, which means that we could extract commands at a process scope:
files, registry keys, and services: I think these should be included in case they are manipulated by means of an obfuscated powershell command (which is common), which rules based on api trace matching wouldn't be able to detect. If we choose to include them, I think we should add a member to each feature specifying whether the file/key was created, read, or deleted, or whether the service was created or started.
procdump and payloads: these can be used to extract strings, albeit not many. Another use would be to look at the matched rules for each dumped payload/process image and try to extract string/bytes features from them:
api calls: this section should yield the api features, as well as number and string features from the arguments.
CAPE.config: this section includes the extracted configuration for known malware families. we should return strings from this when available.
signatures: can contain several features such as: commands, urls, etc.
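The "api calls" item above mentions deriving number and string features from arguments. A minimal sketch of one way to split them, assuming numeric arguments show up either as integers or as strings such as `0x00000004`. This is a heuristic for illustration, not how capa or CAPE necessarily does it; for example, a string argument that happens to look like a number would be classified as a number.

```python
def classify_argument(value):
    """Return ("number", int) or ("string", str) for one argument value."""
    if isinstance(value, int) and not isinstance(value, bool):
        return ("number", value)
    if isinstance(value, str):
        try:
            # base 0 accepts prefixed literals such as "0x10" as well as "16"
            return ("number", int(value.strip(), 0))
        except ValueError:
            return ("string", value)
    # anything else (None, lists, ...) is kept as its string form
    return ("string", str(value))


# Hypothetical argument values.
features = [
    classify_argument(v)
    for v in ("0x00000004", "KEY_READ", "C:\\Windows\\notepad.exe", 7)
]
```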

future use:

CAPE.payloads: it might be interesting to give users the option in the future to download these payloads and pass them to a static extractor (viv, ida, etc.), which would give capa the ability to analyze unpacked/deobfuscated executables.
detection2pid: this maps each pid to the malware family that CAPE thinks the process belongs to.

redundant:

behavior.enhanced: this report section contains detected events such as "loads file" or "creates a registry key", all of which should be detectable using capa rules.
dropped files: this information can be deduced from the files section as well as api calls.
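For the "commands and mutexes" item above, extracting commands at process scope from the process tree could look roughly like this. The `environ`, `CommandLine`, `pid`, and `children` names are assumptions about the process tree layout, and the sample tree is invented.

```python
def iter_command_lines(nodes):
    """Yield (pid, command line) for every process in a nested process tree."""
    for node in nodes:
        command_line = node.get("environ", {}).get("CommandLine")
        if command_line:
            yield node.get("pid"), command_line
        # recurse into child processes, if any
        yield from iter_command_lines(node.get("children", []))


# Hypothetical process tree fragment.
sample_tree = [
    {
        "pid": 100,
        "environ": {"CommandLine": '"C:\\sample.exe"'},
        "children": [
            {
                "pid": 200,
                "environ": {"CommandLine": "cmd.exe /c whoami"},
                "children": [],
            }
        ],
    }
]

commands = list(iter_command_lines(sample_tree))
```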

yelhamer · 2023-06-13T12:13:41Z

extracted features and the associated report locations:

api: the call trace for each process
strings: strings report section, api arguments, CAPE.config, yara matches (if they include strings), environ section of the process tree, signatures section.
numbers: api arguments.
bytes: yara matches (if they include bytes).
network: network section and pcap files parsing.
imports/exports: static section.
section names: static section.
commands: commands section, the environ field of the process tree section.
files: the {created, read, deleted} files section.
registry keys: the {created, read, deleted} registry keys section.
services: the {created, started} services section.
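One possible shape for the file, registry key, and service features, carrying the operation as a member as suggested earlier in the thread. The report key names in `SECTION_MAP` are placeholders and are not confirmed CAPE field names; they would need to be adjusted to the actual report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator


class Operation(Enum):
    CREATED = "created"
    READ = "read"
    DELETED = "deleted"
    STARTED = "started"


@dataclass(frozen=True)
class Artifact:
    kind: str  # "file", "registry key", or "service"
    value: str
    operation: Operation


# Placeholder report keys (assumptions), mapped to (kind, operation).
SECTION_MAP = {
    "created_files": ("file", Operation.CREATED),
    "read_files": ("file", Operation.READ),
    "deleted_files": ("file", Operation.DELETED),
    "created_keys": ("registry key", Operation.CREATED),
    "read_keys": ("registry key", Operation.READ),
    "deleted_keys": ("registry key", Operation.DELETED),
    "created_services": ("service", Operation.CREATED),
    "started_services": ("service", Operation.STARTED),
}


def iter_artifacts(summary: dict) -> Iterator[Artifact]:
    """Yield one Artifact per entry in each known summary section."""
    for key, (kind, operation) in SECTION_MAP.items():
        for value in summary.get(key, []):
            yield Artifact(kind, value, operation)


# Hypothetical summary fragment.
sample_summary = {
    "read_files": ["C:\\Windows\\win.ini"],
    "deleted_keys": ["HKEY_CURRENT_USER\\Software\\Example"],
    "started_services": ["ExampleSvc"],
}

artifacts = list(iter_artifacts(sample_summary))
```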

mr-tz · 2023-07-06T09:41:43Z

The info.version field lists the CAPE version, e.g. 2.2-CAPE, which is what the reports we currently use from the AVAST database contain.
We should check that the version is one we expect, as I've noticed small differences between versions, e.g. in 2.4-CAPE (in this case regarding how imports are organized).
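A minimal sketch of such a check, assuming the version string lives at `info.version` as described above. Which versions to accept is a project decision; the set below only contains the version named in this comment.

```python
# Versions the extractor has been checked against (assumption for illustration).
SUPPORTED_VERSIONS = {"2.2-CAPE"}


def is_supported_version(report: dict) -> bool:
    """Return True if the report's info.version is in SUPPORTED_VERSIONS."""
    version = report.get("info", {}).get("version")
    return version in SUPPORTED_VERSIONS


ok = is_supported_version({"info": {"version": "2.2-CAPE"}})
other = is_supported_version({"info": {"version": "2.4-CAPE"}})
missing = is_supported_version({})
```

A caller could log a warning rather than fail when the check returns False, since the differences noted so far are small.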

yelhamer · 2023-07-06T10:41:07Z

especially once we've added the call scope. once that has been added we should make sure the CAPE version being used has the MSDN names (not the legacy ones).

doomedraven · 2023-07-06T13:49:54Z

yes, there are no big changes between 2.2 and 2.4. https://github.com/kevoreilly/CAPEv2/blob/master/changelog.md#2422023-cape-24--edition

mr-tz · 2023-07-06T14:23:53Z

I think the change may have been introduced when you improved the parser reusability (kevoreilly/CAPEv2#763) or before. Maybe I've also made it up when trying to fabricate the data locally 😮

williballenthin added the documentation and question labels Jun 12, 2023

williballenthin added this to @yelhamer GSoC 2023 Jun 12, 2023

williballenthin moved this to in progress in @yelhamer GSoC 2023 Jun 12, 2023

williballenthin added the dynamic label Jun 14, 2023

yelhamer mentioned this issue Jun 19, 2023

add the CAPE feature extractor #1546

Merged

6 tasks

yelhamer linked a pull request Jun 19, 2023 that will close this issue

add the CAPE feature extractor #1546

Merged

6 tasks

williballenthin mentioned this issue Aug 16, 2023

add Pydantic models for CAPE sandbox #1729

Merged

3 tasks

williballenthin closed this as completed Aug 22, 2023

github-project-automation bot moved this from in progress to done in @yelhamer GSoC 2023 Aug 22, 2023
