add support for analysis of source code/scripted languages #1080

Draft
wants to merge 51 commits into
base: master

Conversation

@adamstorek adamstorek commented Jul 1, 2022

This enhancement extends capa's functionality to the analysis of potentially malicious scripts and source code. A tree-sitter backend was added to parse the source files into a lightweight AST. Features akin to those of the PE/vivisect-based capa are then extracted:

File-level:

  • trivial: language, file format
  • global string literals
  • global integer literals
  • namespaces
  • globally-instantiated imported classes
  • globally-called imported functions

Function-level:

  • string literals
  • integer literals
  • imported classes
  • imported functions

To install Tree-sitter:

  1. Pip-install Tree-sitter:
    pip3 install tree-sitter
  2. Install bindings:
    mkdir vendor build
    cd vendor
    git clone git@github.com:tree-sitter/tree-sitter-c-sharp.git
    git clone git@github.com:tree-sitter/tree-sitter-embedded-template.git
    git clone git@github.com:tree-sitter/tree-sitter-html.git
    git clone git@github.com:tree-sitter/tree-sitter-javascript.git
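
The cloned grammars then need to be compiled into a shared library that py-tree-sitter can load. A minimal sketch, assuming a py-tree-sitter release that still provides `Language.build_library` (available when this PR was opened; as far as I know, later releases dropped it) and the `vendor`/`build` layout from step 2. The helper names `grammar_paths` and `build_languages` are hypothetical; `build/my-languages.so` is the output path that appears later in this PR:

```python
import os
from typing import List

# Grammar repositories cloned into vendor/ in step 2.
GRAMMARS = [
    "tree-sitter-c-sharp",
    "tree-sitter-embedded-template",
    "tree-sitter-html",
    "tree-sitter-javascript",
]


def grammar_paths(vendor_dir: str = "vendor") -> List[str]:
    """Return the path of each cloned grammar repository."""
    return [os.path.join(vendor_dir, name) for name in GRAMMARS]


def build_languages(output: str = os.path.join("build", "my-languages.so")) -> None:
    """Compile all grammars into one shared library (needs a C compiler)."""
    # Imported lazily so this module can be imported without tree-sitter installed.
    from tree_sitter import Language

    Language.build_library(output, grammar_paths())
```

Calling `build_languages()` from the directory that contains `vendor` and `build` would produce the shared library.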

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@github-actions github-actions bot left a comment

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Collaborator

@williballenthin williballenthin left a comment

This is a great start towards adding scripting language support to capa! Thanks @adamstorek!

The code is already quite good and I don't anticipate any major issues to getting it merged; however, I have added a number of comments on regions I think should be tweaked.

One thing you should know about me is that I prefer to over-communicate review feedback with the understanding that everything is up for discussion. So, if anything feels weird or wrong, don't hesitate to ask for more details or deeper discussion.

General points:

  1. I really like the file range address type, I think that will work well.
  2. I think we can simplify the code a bit by merging the "Script" feature into "Format". thoughts?
  3. Some of the embedded data and configuration can be restructured into python globals.
  4. I'd like to hear a bit about what it takes to embed/depend on the TS languages so we can ensure it's easy for people to download/use.
  5. Please add tests showing the features extracted on various example files.

Comment on lines +408 to +411
class ScriptLanguage(Feature):
    def __init__(self, value: str, description=None):
        super().__init__(value, description=description)
        self.name = "script language"
Collaborator

could we use format for this? e.g. format: C#.

pro:

  • fewer features to memorize
  • less duplication
  • less code

con:

  • maybe slightly less precise

Author

The problem with overloading the file format feature is that file to language is a one-to-many mapping, e.g. there can be embedded templates that contain multiple different script languages such as C# for server-side scripts and JavaScript for client-side.

Comment on lines 26 to 37
def get_language_from_ext(path: str):
    _, ext = os.path.splitext(path)
    if ext == ".cs":
        return LANG_CS
Collaborator

we should also think about maybe trying to guess the language based on the file contents if there's no extension.

also supporting things like .cs_ which some may use to prevent the file from accidentally getting executed.

Author

@adamstorek adamstorek Jul 7, 2022

Auto-identifying the script language is definitely a worthwhile feature, but I have deprioritized it for the minimal implementation. Instead, I map the extensions manually (now including the defanged extensions suggested above). Auto-detection is also not always straightforward: one file might include multiple scripts, and sometimes context is necessary.
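
The extension lookup, including defanged extensions such as `.cs_`, can be sketched as follows. Only the function name, `LANG_CS`, and the `.cs` mapping come from the diff above; the constant values, the other constants, the extension table, and the rule of stripping trailing underscores are assumptions made for illustration:

```python
import os

LANG_CS = "c_sharp"  # name from the diff; the value is an assumption
LANG_JS = "javascript"  # hypothetical
LANG_TEM = "embedded_template"  # hypothetical

# Hypothetical extension table.
EXT_TO_LANG = {
    ".cs": LANG_CS,
    ".js": LANG_JS,
    ".aspx": LANG_TEM,
}


def get_language_from_ext(path: str) -> str:
    """Map a file extension to a language, tolerating defanged names."""
    _, ext = os.path.splitext(path)
    # Defanged samples often carry a trailing underscore, e.g. "sample.cs_".
    ext = ext.lower().rstrip("_")
    try:
        return EXT_TO_LANG[ext]
    except KeyError:
        raise ValueError(f"unsupported file extension: {ext!r}") from None
```

Files without any extension fall through to the `ValueError`, which is where content-based detection could take over.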

@adamstorek adamstorek force-pushed the capa-scripts branch 3 times, most recently from 2fdcac2 to 0a61d86 Compare July 5, 2022 16:28
@williballenthin
Collaborator

i think it would be worthwhile to get the tests running (and passing) in CI. this means:

  • add the example files to capa-testfiles and get those merged, and
  • update the github actions workflows to install the TS bindings (temporarily, until we have a better solution)

@adamstorek
Author

adamstorek commented Jul 7, 2022

  • add the example files to capa-testfiles and get those merged, and

Just submitted the pull request.

  • update the github actions workflows to install the TS bindings (temporarily, until we have a better solution)

On it.

adamstorek and others added 22 commits July 19, 2022 10:36
…rves as an interface to the language-specific tree-sitter queries.
…and function definition parsing for a pure C# sample.
…and not introduce unspecified rule-exceptions.
Comment on lines +141 to +142
if self.is_aspx_import_directive(node):
    namespace = self.get_aspx_namespace(node)
Collaborator

Can we document why we are only/specially handling ASPX here?

Author

@adamstorek adamstorek Aug 3, 2022

This is related to the issue discussed here: #1080 (comment).

Collaborator

@mike-hunhoff mike-hunhoff left a comment

nice updates! I've left a few comments, questions, and suggestions for your review 🚀

        tree = _parse(ts_language, buf)
    except ValueError:
        continue
    if not _contains_errors(ts_language, tree.root_node):
Collaborator

Can we add a comment on what our assumptions are here? I'm not overly familiar with tree-sitter, but it appears that we assume it will only throw errors when encountering a language mismatch, e.g. when we attempt to parse Python using tree-sitter C# tooling?

Author

This might be more readable, what do you think?

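from typing import Optional

from tree_sitter import Language, Node, Parser, Tree


def _parse(ts_language: Language, buf: bytes) -> Optional[Tree]:
    try:
        parser = Parser()
        parser.set_language(ts_language)
        return parser.parse(buf)
    except ValueError:
        return None


def _contains_errors(ts_language: Language, node: Node) -> bool:
    # captures() returns a collection of matches; coerce it to match the bool annotation
    return bool(ts_language.query("(ERROR) @error").captures(node))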


def get_language_ts(buf: bytes) -> str:
    for language, ts_language in TS_LANGUAGES.items():
        tree = _parse(ts_language, buf)
        if tree and not _contains_errors(ts_language, tree.root_node):
            return language
    raise ValueError("failed to parse the language")
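
The assumption behind `get_language_ts` is that a buffer parsed with the wrong grammar either fails outright or yields error nodes, so the first grammar that parses cleanly is taken as the language. The same pattern can be sketched with standard-library parsers standing in for tree-sitter grammars. `json.loads` and `ast.parse` are swapped in purely for illustration, and `CANDIDATES` and `detect_language` are hypothetical names:

```python
import ast
import json
from typing import Callable, Dict

# Stand-ins for TS_LANGUAGES: each parser raises on input it cannot accept.
# Order matters, because some inputs are valid in more than one language.
CANDIDATES: Dict[str, Callable[[str], object]] = {
    "json": json.loads,
    "python": ast.parse,
}


def detect_language(text: str) -> str:
    """Return the first language whose parser accepts the text."""
    for language, parse in CANDIDATES.items():
        try:
            parse(text)
        except (ValueError, SyntaxError):
            # Treated as a language mismatch; try the next candidate.
            continue
        return language
    raise ValueError("failed to parse the language")
```

Note that an input such as `{"a": 1}` is both valid JSON and valid Python, so it is reported as JSON only because of the iteration order. The same ambiguity can arise between real grammars, which is one reason the assumption deserves a comment.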

@@ -0,0 +1,15 @@
from tree_sitter import Language

build_dir = "build/my-languages.so"
Collaborator

does this mean we only support Linux?

Author

Tree-sitter needs to compile its (C) language bindings. Although I have a limited knowledge of package management, I've suggested to Moritz that we should precompile and package the supported tree-sitter bindings for each platform we support. The current state is a temporary measure.

def parse(self) -> Tree:
    parser = Parser()
    parser.set_language(self.query.language)
    return parser.parse(self.buf)
Collaborator

can this call generate any exceptions?

Author

@adamstorek adamstorek Aug 5, 2022

It can raise a TypeError (which I believe we prevent with mypy) and a ValueError if parsing fails completely. The ValueError would then propagate from the parse method up through the engine, so I can handle it at the Extractor level in the following way:

try:
    self.language = capa.features.extractors.ts.autodetect.get_language(path)
    self.template_engine = self.get_template_engine(buf)
    self.engines = self.get_engines(buf)
except ValueError as e:
    raise UnsupportedFormatError(e)
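
This pattern translates a low-level `ValueError` into a format-level error once, at the extractor boundary. A self-contained sketch, in which `UnsupportedFormatError` is redefined locally and `detect_language` and `ScriptExtractor` are hypothetical stand-ins rather than capa's actual classes:

```python
class UnsupportedFormatError(Exception):
    """Local stand-in for the exception used in the snippet above."""


def detect_language(path: str) -> str:
    # Hypothetical detector that only knows one extension.
    if path.endswith(".cs"):
        return "c_sharp"
    raise ValueError(f"cannot determine language of {path!r}")


class ScriptExtractor:
    """Hypothetical extractor that translates detection failures."""

    def __init__(self, path: str):
        try:
            self.language = detect_language(path)
        except ValueError as e:
            # Chain the original error so the root cause stays in the traceback.
            raise UnsupportedFormatError(str(e)) from e
```

Using `raise ... from e` keeps the original `ValueError` reachable as `__cause__`, which helps when diagnosing why a sample was rejected.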

Comment on lines 38 to 39
def get_range(self, node: Node) -> str:
    return self.get_byte_range(node).decode("utf-8")
Collaborator

The intended use of this function appears to be decoding a string found in a specific byte range? If so, consider changing the name to something more descriptive like get_str_from_range. Also, do we expect encoding exceptions to be thrown by the decode?

Author

Good question, but I doubt tree-sitter would be able to parse something that we can't decode.

Author

Also changed get_range to get_str.
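
If a byte range ever did contain invalid UTF-8, a strict decode would raise `UnicodeDecodeError`. A defensive variant is sketched below as a free function that replaces undecodable bytes instead of raising. The signature is hypothetical, since the method in this PR takes a node rather than offsets:

```python
def get_str(buf: bytes, start_byte: int, end_byte: int) -> str:
    """Decode buf[start_byte:end_byte] as UTF-8 without raising."""
    # errors="replace" substitutes U+FFFD for any undecodable byte sequence.
    return buf[start_byte:end_byte].decode("utf-8", errors="replace")
```

Whether silent replacement is preferable to surfacing the error depends on how the extracted strings are matched against rules.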

def _extract_imported_constants(fn_node: Node, engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Feature, Address]]:
    for ic_node, ic_name in engine.get_processed_imported_constants(fn_node):
        for name in get_imports(ic_name, engine.namespaces, engine):
            yield API(engine.language_toolkit.format_imported_constant(name)), engine.get_address(ic_node)
Collaborator

need more discussion on #1125

    signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
    return {category: set(namespaces) for category, namespaces in signatures.items()}

def _is_import(self, name: str) -> bool:
Collaborator

typo w/ _?

Author

No, this is merely a private method to handle import table lookups that the public is_import method uses.
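
The split between the public and the private method can be sketched as follows. The JSON layout (a category mapped to a list of names) follows the snippet above; the class name `ImportSignatures`, the sample data, and the exact lookup logic are assumptions:

```python
import json
from typing import Dict, Set

# Hypothetical signature data; the real file ships with the package.
SIGNATURES_JSON = """
{
    "namespaces": ["System", "System.IO"],
    "functions": ["System.IO.File.ReadAllText"]
}
"""


class ImportSignatures:
    def __init__(self, text: str = SIGNATURES_JSON):
        signatures = json.loads(text)
        # Sets give constant-time membership checks.
        self.table: Dict[str, Set[str]] = {
            category: set(names) for category, names in signatures.items()
        }

    def _is_import(self, name: str) -> bool:
        """Private helper: exact lookup across all categories."""
        return any(name in names for names in self.table.values())

    def is_import(self, name: str) -> bool:
        """Public entry point: normalize the name, then look it up."""
        return self._is_import(name.strip())
```

Keeping the table lookup private leaves room for the public method to add normalization or namespace resolution later without changing callers.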

Comment on lines +118 to +119
return int(integer, base)
return int(integer)
Collaborator

do we expect exceptions to occur here?

Author

@adamstorek adamstorek Aug 5, 2022

This can raise ValueError (and does when TS labels something as an int which is not an int), but the ValueError is handled by the caller (see extract_integers).
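
The caller-side handling can be sketched as follows. `parse_integer` and this version of `extract_integers` are hypothetical simplifications: they operate on literal strings instead of tree-sitter nodes, and the prefix table is an assumption:

```python
from typing import Iterable, Iterator

# Hypothetical prefix table for common integer literal notations.
PREFIX_TO_BASE = {"0x": 16, "0b": 2, "0o": 8}


def parse_integer(integer: str) -> int:
    """Convert a literal to int; raises ValueError on malformed input."""
    integer = integer.lower().replace("_", "")
    base = PREFIX_TO_BASE.get(integer[:2])
    if base is not None:
        return int(integer, base)
    return int(integer)


def extract_integers(literals: Iterable[str]) -> Iterator[int]:
    """Yield parsed integers, skipping literals that fail to parse."""
    for literal in literals:
        try:
            yield parse_integer(literal)
        except ValueError:
            # The parser labeled something as an integer that int() rejects.
            continue
```

For example, a suffixed literal such as `10L` is rejected by `int()` and silently skipped, so one malformed literal does not abort extraction for the rest of the file.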

…es in order to make rules clearer; refactored the codebase to address the latest PR comments/suggestions.
@adamstorek adamstorek mentioned this pull request Aug 8, 2022
…(passes all test cases but by no means perfect); further clean up, especially of the signatures; synced with new Python test cases.
@williballenthin williballenthin marked this pull request as draft July 13, 2023 08:46
Labels
dont merge Indicate a PR that is still being worked on

4 participants