Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM: add support for arm architecture #1803

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gauthier-wiemann
Copy link

The suggested enhancement involves adding support for ARM architecture in the CAPA tool and improving its capabilities on ELF files.

Here are the steps involved:

  1. Changes to capa/features/common.py and capa/main.py: Simple changes have been made to add ARM architecture to the list of supported architectures in CAPA.

  2. Changes to capa/features/extractors/viv/: More complex changes have been made to this directory to enable disassembly of ARM binaries.
    Specifically, a new file, capa/features/extractors/viv/insn_arm.py, has been added.
    This file replicates the logic of capa/features/extractors/viv/insn.py, but adapts its functions to ARM mnemonics and patterns specific to ARM binaries.

  3. Testing: Each feature has been tested using binaries written in assembler or compiled with GCC (version 9.4.0), and the final result of the modifications has been evaluated over Linux Arm Malware samples, mostly Mirai.

  4. Improving capabilities on ELF files: The modifications also aim to improve CAPA's capabilities on ELF files. Specifically, the following changes have been made:

    • ELF Stripper: CAPA's current version fails to detect the underlying operating system in ELF files, leading to premature termination of the analysis. To address this, Linux is assumed to be the default operating system if the analysis couldn't determine the underlying OS. This change is made in capa/features/extractors/elf.py.
    • Statically linked ELF files: The extractor was unable to yield the API features in statically linked ELF files, resulting in poor detection capabilities. To solve this problem, upstream detection of statically linked ELF files is performed, and certain basic blocks are marked as library function using a hard-coded table to translate the syscall number into their function name. This change is made in capa/features/extractors/viv/syscall.py.
Note:
  • While adding support for ARM architecture in CAPA, it was found that some ARM samples produce a lot of warnings in the provenance of the vivesct disassembler. However, these warnings do not seem to interfere with the final result of CAPA's analysis.

  • As for x86/x84 binaries, it was observed that the statement of some features is dependent on specific patterns, which may be related to the type and version of the compiler used. This dependency could potentially limit CAPA's abilities when analyzing these binaries.

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

Sorry for the new PR, we had to do it with another organization.

This PR is related to this issue #1774 and this PR #1796

@google-cla
Copy link

google-cla bot commented Oct 6, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@williballenthin
Copy link
Collaborator

🥳 so happy to see this reopened!

@gauthier-wiemann
Copy link
Author

We registered on Google CLA, what is the next step now ?

@mr-tz
Copy link
Collaborator

mr-tz commented Oct 11, 2023

The CLA check succeeds now, thanks! We'll review this shortly.

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a first set of comments. My main concern is around mixing the architectures and how to handle this to keep the code maintainable (little code duplication) and readable.

Great work overall and I look forward to other reviews and your feedback.

@@ -5,8 +5,11 @@
### New Features
- ghidra: add Ghidra feature extractor and supporting code #1770 @colton-gabertan
- ghidra: add entry script helping users run capa against a loaded Ghidra database #1767 @mike-hunhoff
- binja: add support for forwarded exports #1646 @xusheng6
binja: add support for forwarded exports #1646 @xusheng6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
binja: add support for forwarded exports #1646 @xusheng6
- binja: add support for forwarded exports #1646 @xusheng6

Comment on lines +106 to +110
if isinstance(dst, (envi.archs.i386.disasm.i386SibOper, envi.archs.i386.disasm.i386RegMemOper)):
rname = dst._dis_regctx.getRegisterName(dst.reg)
else:
rname = dst.reg
if rname not in ["ebp", "rbp", "esp", "rsp", envi.archs.arm.disasm.REG_SP, envi.archs.arm.disasm.REG_BP]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be simplified to just use the constant or the name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we import envi.archs.arm.disasm.REG_SP as REG_SP and just use REG_SP for example ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mr-tz i'm not sure i agree, because then it won't be obvious to a reader if the SP/BP are related to arm vs x86 vs ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to use:

  • envi.archs.i386.disasm.REG_EBP (or similar [untested])
  • envi.archs.amd64.disasm.REG_RBP (or similar [untested])
  • etc.

Instead of "ebp", "rbp", etc.

Comment on lines +97 to +98
not isinstance(dst, envi.archs.i386.disasm.i386SibOper)
and not isinstance(dst, envi.archs.i386.disasm.i386RegMemOper)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
not isinstance(dst, envi.archs.i386.disasm.i386SibOper)
and not isinstance(dst, envi.archs.i386.disasm.i386RegMemOper)
not isinstance(dst, (envi.archs.i386.disasm.i386SibOper, envi.archs.i386.disasm.i386RegMemOper))

@@ -390,10 +392,13 @@ def is_supported_format(sample: Path) -> bool:
return len(list(capa.features.extractors.common.extract_format(taste))) == 1


def is_supported_arch(sample: Path) -> bool:
def is_supported_arch(sample: Path) -> Union[str, int, float, bytes]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should still return bool. get_arch can return the string/handle the cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change that to use get_arch instead

super().__init__()
self.vw = vw
self.path = path
self.arm = arm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to use arch here instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I will change it

add_library_function(extractor.vw, bb, get_library_function_name_arm)


Sys_x64 = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to do this for the other backends as well, we can move the mappings to a file one directory up

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want us to create this file now ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you don't mind, please... and thanks!

Comment on lines +47 to +63
def interface_extract_instruction_XXX(
fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle
) -> Iterator[Tuple[Feature, Address]]:
"""
parse features from the given instruction.

args:
fh: the function handle to process.
bbh: the basic block handle to process.
ih: the instruction handle to process.

yields:
(Feature, Address): the feature and the address at which its found.
"""
raise NotImplementedError


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def interface_extract_instruction_XXX(
fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle
) -> Iterator[Tuple[Feature, Address]]:
"""
parse features from the given instruction.
args:
fh: the function handle to process.
bbh: the basic block handle to process.
ih: the instruction handle to process.
yields:
(Feature, Address): the feature and the address at which its found.
"""
raise NotImplementedError

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the separation, on the other hand seems like we have quite some duplicated code?
Do you have a feel for if only using insn.py would be more maintainable/less duplication?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We first began with a repository viv_arm/ and then merged it with viv/. insn.py was the most difficult one to merge. Many functions are very different but we import the common functions from insn.py. We can maybe merge other functions but I"m not sure if it is possible for all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be reasonable to:

  • move most of insn.py today to insn_x86.py, then
  • add the new insn_arm.py, and then
  • add new insn.py that dispatches to the appropriate insn_* file

?

once we have two supported arches, i would generally prefer that they are treated the same, not "x86 is primary and ARM is second class".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think having some duplication is ok, especially when we don't expect every contributor to be an expert in both x86 and arm (i am certainly not!). if we keep the code together, then few people can maintain it. if we have duplicates, each set of experts can work on their own code. of course, we'll want to keep things in sync, but it shouldn't prevent us from doing a reasonably good job of fixing bugs and adding features.

Comment on lines +92 to +93
if read_memory(f.vw, target, 4) != b"\x00\xc6\x8f\xe2":
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, I did not review most of this file

@gauthier-wiemann
Copy link
Author

Thank you for this first review. I tried to give a feedback for as many comments as possible but I need to consult my team for the other points. I will commit the changes when all the comments are answered.

from capa.features.extractors.base_extractor import BBHandle, FeatureExtractor


def getSyscallName(num: int, arch: bool) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def getSyscallName(num: int, arch: bool) -> str:
def get_syscall_name(num: int, arch: bool) -> str:

and maybe document this can raise KeyError when not found ?

@mr-tz
Copy link
Collaborator

mr-tz commented Jan 31, 2024

Hey @gauthier-wiemann, I've created a new feature branch for ARM support: https://github.com/mandiant/capa/tree/arm-support
Can you/we change to merge your suggestions in there for us to pick up future work?

@gauthier-wiemann
Copy link
Author

Hi, we have no problem with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants