add support for analysis of source code/scripted languages #1080

Draft
wants to merge 51 commits into
base: master
Changes from 38 commits
bbd3f70
Added initial capa control flow for scripts in C#.
adamstorek Jun 27, 2022
8173397
Implemented some further basic TreeSitter Extractor-related concepts …
adamstorek Jun 27, 2022
428f6bc
Modified mypy config file to ignore tree-sitter's missing exports.
adamstorek Jun 28, 2022
a6d7ba2
Implemented core tree sitter engine component with C# queries that se…
adamstorek Jun 28, 2022
80bf78b
Implemented script global extraction handlers (mostly wrapping existi…
adamstorek Jun 28, 2022
cf3dc7e
Reworked format parsing to align better with the rest of capa logic.
adamstorek Jun 28, 2022
9d7f575
Implemented a large part of the C# functionality; refactored the Tree…
adamstorek Jun 29, 2022
3d4b4ec
Added function-level feature extraction.
adamstorek Jun 30, 2022
eca7ead
Bug fixes and code refactoring of the Tree Sitter extractor.
adamstorek Jun 30, 2022
5fd953f
Added tree_sitter to requirements in setup.py.
adamstorek Jun 30, 2022
1f79db9
Added tests for TreeSitterExtractorEngine initialization, new object …
adamstorek Jul 1, 2022
a58bc0b
Added more TreeSitterExtractorEngine tests for pure C#.
adamstorek Jul 1, 2022
5ddb8ba
Added last remaining tests for the TreeSitterExtractorEngine class an…
adamstorek Jul 1, 2022
31e2fb9
Reverted yielding only non-empty strings in order to stay consistent …
adamstorek Jul 5, 2022
5bf3f18
Removing functions that should not be used in tree-sitter extractor (…
adamstorek Jul 5, 2022
a4529fc
Modifying extraction of global statements to omit local function decl…
adamstorek Jul 5, 2022
d5de9a1
Added script language feature to freeze.
adamstorek Jul 5, 2022
6c10458
Added test cases for TS Extractor.
adamstorek Jul 5, 2022
9bd9824
Refactored query bindings.
adamstorek Jul 6, 2022
2594849
Added support for template parsing.
adamstorek Jul 6, 2022
619ed94
Added support for HTML parsing.
adamstorek Jul 6, 2022
5e23802
Implemented the necessary modifications to support embedded templates…
adamstorek Jul 7, 2022
5d83e8d
Added more buildings to build; minor style improvement.
adamstorek Jul 7, 2022
9570523
Further refactored the Tree-sitter queries and fixed minor template e…
adamstorek Jul 7, 2022
7c5e6e3
Refactored extractor engine tests and began adding new template tests.
adamstorek Jul 7, 2022
1e0326a
Added new tests for embedded template testing and refactored a few al…
adamstorek Jul 8, 2022
ca1939f
Bug fixes in extractor and HTML Tree-sitter engine.
adamstorek Jul 8, 2022
d7ab2db
Fixed important namespace-parsing bugs.
adamstorek Jul 11, 2022
5cfbecc
Further improvement to namespace parsing, including default namespace…
adamstorek Jul 11, 2022
26cc1bc
Added more tests and a few minor bug fixes.
adamstorek Jul 11, 2022
2a9e76f
Added language-specific integer parsing.
adamstorek Jul 12, 2022
672ca71
Fixed an important bug in FileOffsetRangeAddress comparison method.
adamstorek Jul 12, 2022
ca426ca
Added more ASPX tests.
adamstorek Jul 12, 2022
fd80277
Fixed the capa control flow to fully support capa scripts.
adamstorek Jul 12, 2022
d0c4acb
Major changes: switching imports and function names to properties, st…
adamstorek Jul 18, 2022
ad31d83
Fixed property-extraction bugs.
adamstorek Jul 19, 2022
e52a9b3
Added few more test cases.
adamstorek Jul 19, 2022
b27713b
Minor style improvements.
adamstorek Jul 19, 2022
b2df2b0
Removed deprecated parse_integer.
adamstorek Jul 19, 2022
a0379a6
Added more tests; fixed integer parsing related bugs.
adamstorek Jul 19, 2022
eeecb63
Fixing address range bug; refactoring and cleanup.
adamstorek Jul 20, 2022
cebc5e1
Incorporated more tests.
adamstorek Jul 20, 2022
d7dcc94
Added support for Python.
adamstorek Jul 26, 2022
32dc5ff
Added more python test cases; fixed a number of python bugs; further …
adamstorek Jul 29, 2022
5e85a6e
Implemented namespace aliasing; further refactored the codebase.
adamstorek Aug 2, 2022
614900f
Refactored/simplified parts of the codebase to improve readability; a…
adamstorek Aug 3, 2022
bb08181
Implemented script language auto-detection.
adamstorek Aug 3, 2022
1fd9d4a
Removed a spurious import.
adamstorek Aug 3, 2022
7ba978f
Added more test cases; moved script language feature to global featur…
adamstorek Aug 5, 2022
25cf09b
Introduced auto-detection to template-script parsing, builtins namesp…
adamstorek Aug 10, 2022
e693573
Attempted to implement the class extraction as specified last Friday …
adamstorek Aug 12, 2022
3 changes: 3 additions & 0 deletions .github/mypy/mypy.ini
@@ -76,4 +76,7 @@ ignore_missing_imports = True
ignore_missing_imports = True

[mypy-dncil.*]
ignore_missing_imports = True

[mypy-tree_sitter.*]
ignore_missing_imports = True
20 changes: 20 additions & 0 deletions capa/features/address.py
@@ -53,6 +53,26 @@ def __repr__(self):
return f"file(0x{self:x})"


class FileOffsetRangeAddress(Address):
"""an address range relative to the start of a file"""

def __init__(self, start_byte, end_byte):
self.start_byte = start_byte
self.end_byte = end_byte

def __eq__(self, other):
return (self.start_byte, self.end_byte) == (other.start_byte, other.end_byte)

def __lt__(self, other):
return (self.start_byte, self.end_byte) < (other.start_byte, other.end_byte)

def __hash__(self):
return hash((self.start_byte, self.end_byte))

def __repr__(self):
return f"file(0x{self.start_byte:x}, 0x{self.end_byte:x})"
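
For comparison, the same value semantics (equality, ordering, and hashing over the pair of offsets) can be sketched with a dataclass, which generates the comparison methods and so avoids hand-written operand mix-ups. `RangeAddress` is a hypothetical stand-in for illustration only; it is not capa's class and does not subclass `Address`.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class RangeAddress:
    """Hypothetical stand-in for FileOffsetRangeAddress (not capa's class)."""

    start_byte: int
    end_byte: int

    def __repr__(self) -> str:
        # Same rendering as the __repr__ in the diff.
        return f"file(0x{self.start_byte:x}, 0x{self.end_byte:x})"


a = RangeAddress(0x10, 0x20)
b = RangeAddress(0x10, 0x30)
same_as_a = RangeAddress(0x10, 0x20)

assert a != b                    # same start, different end
assert a < b                     # ordered by (start_byte, end_byte)
assert len({a, same_as_a}) == 1  # equal values hash alike
```

The generated `__eq__` compares both fields of both operands, which is the behavior the hand-written method is meant to have.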


class DNTokenAddress(Address):
"""a .NET token"""

7 changes: 7 additions & 0 deletions capa/features/common.py
@@ -405,6 +405,12 @@ def __init__(self, value: str, description=None):
self.name = "os"


class ScriptLanguage(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.name = "script language"
Comment on lines +408 to +411
Collaborator

could we use format for this? e.g. format: C#.

pro:

  • fewer features to memorize
  • less duplication
  • less code

con:

  • maybe slightly less precise

Author

The problem with overloading the file format feature is that file to language is a one-to-many mapping, e.g. there can be embedded templates that contain multiple different script languages such as C# for server-side scripts and JavaScript for client-side.
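
A minimal sketch of the one-to-many point: a single template file can yield several language features, each tied to its own byte range. The `Section` type, the feature strings, and the offsets are hypothetical simplifications, not capa's `ScriptLanguage` or `FileOffsetRangeAddress`.

```python
from typing import Iterator, List, Tuple

# Hypothetical, simplified types: (language, (start_byte, end_byte)).
ByteRange = Tuple[int, int]
Section = Tuple[str, ByteRange]


def extract_languages(sections: List[Section]) -> Iterator[Tuple[str, ByteRange]]:
    """Yield one (feature, address) pair per embedded code section."""
    for language, byte_range in sections:
        yield f"script language: {language}", byte_range


# One ASPX-like file: a server-side C# block and a client-side JavaScript block.
sections = [("c_sharp", (0, 120)), ("javascript", (200, 340))]
features = list(extract_languages(sections))
```

A single file-level `format` value could not carry both entries, which is the argument made above for a separate feature.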



FORMAT_PE = "pe"
FORMAT_ELF = "elf"
FORMAT_DOTNET = "dotnet"
@@ -414,6 +420,7 @@ def __init__(self, value: str, description=None):
FORMAT_SC32 = "sc32"
FORMAT_SC64 = "sc64"
FORMAT_FREEZE = "freeze"
FORMAT_SCRIPT = "script"
FORMAT_UNKNOWN = "unknown"


40 changes: 40 additions & 0 deletions capa/features/extractors/script.py
@@ -0,0 +1,40 @@
import os
from typing import Tuple, Iterator

from capa.features.common import OS, OS_ANY, ARCH_ANY, FORMAT_SCRIPT, Arch, Format, Feature, ScriptLanguage
from capa.features.address import NO_ADDRESS, Address, FileOffsetRangeAddress

LANG_CS = "c_sharp"
LANG_HTML = "html"
LANG_JS = "javascript"
LANG_TEM = "embedded_template"

EXT_ASPX = ("aspx", "aspx_")
EXT_CS = ("cs", "cs_")
EXT_HTML = ("html", "html_")


def extract_arch() -> Iterator[Tuple[Feature, Address]]:
yield Arch(ARCH_ANY), NO_ADDRESS


def extract_language(language: str, addr: FileOffsetRangeAddress) -> Iterator[Tuple[Feature, Address]]:
yield ScriptLanguage(language), addr


def extract_os() -> Iterator[Tuple[Feature, Address]]:
yield OS(OS_ANY), NO_ADDRESS


def extract_format() -> Iterator[Tuple[Feature, Address]]:
yield Format(FORMAT_SCRIPT), NO_ADDRESS


def get_language_from_ext(path: str) -> str:
if path.endswith(EXT_ASPX):
return LANG_TEM
if path.endswith(EXT_CS):
return LANG_CS
if path.endswith(EXT_HTML):
return LANG_HTML
raise ValueError(f"{path} has an unrecognized or an unsupported extension.")
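
One thing worth noting about `get_language_from_ext`: `str.endswith` with a tuple matches any suffix, so a path such as `notes.docs` also ends with `"cs"`. A possible alternative, sketched under the assumption that only the final extension should decide, uses `os.path.splitext`. The `LANGUAGE_BY_EXT` table and the case-insensitive lookup are assumptions, not part of this PR.

```python
import os

# Hypothetical mapping; the language names mirror the constants in script.py.
LANGUAGE_BY_EXT = {
    ".aspx": "embedded_template",
    ".aspx_": "embedded_template",
    ".cs": "c_sharp",
    ".cs_": "c_sharp",
    ".html": "html",
    ".html_": "html",
}


def language_from_ext(path: str) -> str:
    """Map a path to a language name using only its final extension."""
    _, ext = os.path.splitext(path)
    try:
        # Lowercasing is an assumption; the code in the diff is case-sensitive.
        return LANGUAGE_BY_EXT[ext.lower()]
    except KeyError:
        raise ValueError(f"{path} has an unrecognized or unsupported extension.") from None
```

With this lookup, `sample.cs_` still resolves to `c_sharp`, while `notes.docs` raises `ValueError` instead of being treated as C#.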
Empty file.
13 changes: 13 additions & 0 deletions capa/features/extractors/ts/build.py
@@ -0,0 +1,13 @@
from tree_sitter import Language

build_dir = "build/my-languages.so"
Collaborator

does this mean we only support Linux?

Author

Tree-sitter needs to compile its (C) language bindings. Although my knowledge of package management is limited, I've suggested to Moritz that we precompile and package the supported tree-sitter bindings for each platform we support. The current state is a temporary measure.
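
If precompiled bindings were shipped per platform, the loader could pick the file by operating system. The following is only a sketch of that idea: `library_path`, the suffix table, and the `build/my-languages` layout are hypothetical and are not part of capa or tree-sitter.

```python
import platform

# Conventional shared-library suffixes per operating system.
_SUFFIX_BY_SYSTEM = {"Windows": ".dll", "Darwin": ".dylib", "Linux": ".so"}


def library_path(build_dir: str = "build", system: str = "") -> str:
    """Return the (hypothetical) path of the prebuilt bindings for a platform."""
    system = system or platform.system()
    suffix = _SUFFIX_BY_SYSTEM.get(system)
    if suffix is None:
        raise RuntimeError(f"no prebuilt tree-sitter library for {system}")
    return f"{build_dir}/my-languages{suffix}"
```

For example, `library_path(system="Linux")` gives `build/my-languages.so`, matching the hard-coded `build_dir` in the diff.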

languages = [
"vendor/tree-sitter-c-sharp",
"vendor/tree-sitter-embedded-template",
"vendor/tree-sitter-html",
"vendor/tree-sitter-javascript",
]


def ts_build():
Language.build_library(build_dir, languages)
214 changes: 214 additions & 0 deletions capa/features/extractors/ts/engine.py
@@ -0,0 +1,214 @@
import re
from typing import List, Tuple, Iterator, Optional

from tree_sitter import Node, Tree, Parser

import capa.features.extractors.ts.build
from capa.features.address import FileOffsetRangeAddress
from capa.features.extractors.script import LANG_CS, LANG_JS, LANG_TEM, LANG_HTML
from capa.features.extractors.ts.query import (
BINDINGS,
QueryBinding,
HTMLQueryBinding,
ScriptQueryBinding,
TemplateQueryBinding,
)
from capa.features.extractors.ts.tools import LANGUAGE_TOOLKITS, LanguageToolkit


class TreeSitterBaseEngine:
buf: bytes
language: str
query: QueryBinding
tree: Tree

def __init__(self, language: str, buf: bytes):
capa.features.extractors.ts.build.ts_build()
Collaborator

hm, let's find a better place for this initialization

Collaborator

global in this file is a good place to start
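
A variation on the module-level idea, sketched with a cached accessor so the build runs once on first use rather than in every engine constructor. `ts_build_stub`, `get_library_path`, and `Engine` are hypothetical stand-ins; the real `ts_build` calls `Language.build_library`.

```python
from functools import lru_cache

build_calls = 0


def ts_build_stub() -> str:
    """Stand-in for the expensive build step; counts how often it runs."""
    global build_calls
    build_calls += 1
    return "build/my-languages.so"


@lru_cache(maxsize=None)
def get_library_path() -> str:
    """Run the build on first use only and cache the result."""
    return ts_build_stub()


class Engine:
    def __init__(self, buf: bytes):
        self.library = get_library_path()  # no rebuild per instance
        self.buf = buf


engines = [Engine(b"a"), Engine(b"b"), Engine(b"c")]
assert build_calls == 1  # three engines, one build
```

Compared with a plain module-level call, the cached accessor also avoids running the build at import time; whether that matters here is a design choice for the maintainers.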

self.language = language
self.query = BINDINGS[language]
self.buf = buf
self.tree = self.parse()

def parse(self) -> Tree:
parser = Parser()
parser.set_language(self.query.language)
return parser.parse(self.buf)
Collaborator

can this call generate any exceptions?

Author (adamstorek, Aug 5, 2022)

It can raise type errors (which I believe we prevent with mypy) and a ValueError if parsing fails completely. In that case the parse method raises a ValueError, which then propagates through the engine. I can handle it at the Extractor level as follows:

try:
    self.language = capa.features.extractors.ts.autodetect.get_language(path)
    self.template_engine = self.get_template_engine(buf)
    self.engines = self.get_engines(buf)
except ValueError as e:
    raise UnsupportedFormatError(e)


def get_byte_range(self, node: Node) -> bytes:
return self.buf[node.start_byte : node.end_byte]

def get_range(self, node: Node) -> str:
return self.get_byte_range(node).decode()

def get_address(self, node: Node) -> FileOffsetRangeAddress:
return FileOffsetRangeAddress(node.start_byte, node.end_byte)

def get_default_address(self) -> FileOffsetRangeAddress:
return self.get_address(self.tree.root_node)


class TreeSitterExtractorEngine(TreeSitterBaseEngine):
query: ScriptQueryBinding
language_toolkit: LanguageToolkit
buf_offset: int
namespaces: set[str]

def __init__(
self,
language: str,
buf: bytes,
buf_offset: int = 0,
additional_namespaces: set[str] = None,
):
super().__init__(language, buf)
self.buf_offset = buf_offset
self.language_toolkit = LANGUAGE_TOOLKITS[language]
self.namespaces = set(self.get_range(ns_node) for ns_node, _ in self.get_namespaces())
if additional_namespaces:
self.namespaces = self.namespaces.union(additional_namespaces)

def get_address(self, node: Node) -> FileOffsetRangeAddress:
return FileOffsetRangeAddress(self.buf_offset + node.start_byte, self.buf_offset + node.end_byte)

def get_new_object_names(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.new_object_name.captures(node)

def get_assigned_property_names(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.assigned_property_name.captures(node)

def get_function_definitions(self, node: Node = None) -> List[Tuple[Node, str]]:
return self.query.function_definition.captures(node if node is not None else self.tree.root_node)

def get_function_definition_name(self, node: Node) -> Node:
return node.child_by_field_name(self.query.function_definition_field_name)

def get_function_definition_names(self, node: Node) -> Iterator[Node]:
for fn_node, _ in self.get_function_definitions(node):
yield self.get_function_definition_name(fn_node)

def get_function_call_names(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.function_call_name.captures(node)

def get_string_literals(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.string_literal.captures(node)

def get_integer_literals(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.integer_literal.captures(node)

def get_namespaces(self, node: Node = None) -> List[Tuple[Node, str]]:
Collaborator

Suggested change
- def get_namespaces(self, node: Node = None) -> List[Tuple[Node, str]]:
+ def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:

return self.query.namespace.captures(node if node is not None else self.tree.root_node)

def get_global_statements(self) -> List[Tuple[Node, str]]:
return self.query.global_statement.captures(self.tree.root_node)


class TreeSitterTemplateEngine(TreeSitterBaseEngine):
query: TemplateQueryBinding
language_toolkit: LanguageToolkit
embedded_language: str

def __init__(self, buf: bytes):
super().__init__(LANG_TEM, buf)
self.embedded_language = self.identify_language()
self.language_toolkit = LANGUAGE_TOOLKITS[self.embedded_language]
self.template_namespaces = set(name for _, name in self.get_template_namespaces())

def get_code_sections(self) -> List[Tuple[Node, str]]:
return self.query.code.captures(self.tree.root_node)

def get_parsed_code_sections(self) -> Iterator[TreeSitterExtractorEngine]:
for node, _ in self.get_code_sections():
# TODO: support JS
if self.embedded_language == LANG_CS:
yield TreeSitterExtractorEngine(
self.embedded_language,
self.get_byte_range(node),
node.start_byte,
self.template_namespaces,
)

def get_content_sections(self) -> List[Tuple[Node, str]]:
return self.query.content.captures(self.tree.root_node)

def identify_language(self) -> str:
for node, _ in self.get_code_sections():
if self.is_c_sharp(node):
return LANG_CS
return LANG_JS
Collaborator

what if it is neither?

Author (adamstorek, Aug 3, 2022)

I believe there is no easy way to remedy this. As I understand templates in general, the syntax is determined by the templating engine. In other words, there is no easy way to detect from an unknown template which templating engine is being used (asp.net (and if so, which language), razor, ejs, erb, mako, jinja2, django, cheetah, go's html/template etc.), not to mention that each has its own syntax (some use regular programming languages like C# to embed server logic, while others contain only very rudimentary placeholders or logic).

Here I am assuming that, as embedded templates, we only support EJS and C# in ASPX at the moment. This is because the Tree-sitter embedded-template parser can only parse EJS and ERB (and we are not interested in embedded Ruby at the moment as far as I'm concerned). What's more, the default language for ASPX is VB, so anyone who wants to use C# needs to include a @ Page directive with a Language attribute (see: https://docs.microsoft.com/en-us/previous-versions/aspnet/k33801s3(v=vs.100), https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ydy4x04a(v=vs.100)?redirectedfrom=MSDN, https://docs.microsoft.com/en-us/previous-versions/aspnet/fbdt8kk7(v=vs.100)).

Collaborator

Should we still assume that it's JS whenever it's not CS?
Could we raise an exception instead, or are there other safeguards in place before we get here?
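
One way the stricter behavior asked about here might look, as a sketch only: keep the JS fallback for non-ASPX templates, but raise when an ASPX template does not declare C#, since the reply above notes that ASPX defaults to VB. The `is_aspx` parameter and this standalone function are hypothetical; the C# pattern mirrors the one in `is_c_sharp`.

```python
import re
from typing import Iterable

LANG_CS = "c_sharp"
LANG_JS = "javascript"

# Same pattern as is_c_sharp in the diff, written as a bytes literal.
CS_DIRECTIVE = re.compile(rb'@ .*Page Language\s*=\s*"C#".*', re.IGNORECASE)


def identify_language(code_sections: Iterable[bytes], is_aspx: bool) -> str:
    """Return the embedded language, or raise if it cannot be determined."""
    for section in code_sections:
        if CS_DIRECTIVE.match(section):
            return LANG_CS
    if is_aspx:
        # Hypothetical policy: no C# directive in ASPX means VB, which is unsupported.
        raise ValueError("ASPX template does not declare C#; embedded language is unsupported")
    return LANG_JS
```

The resulting `ValueError` would fit the handling discussed earlier in this review, where the extractor converts it into an `UnsupportedFormatError`.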


def get_imported_namespaces(self) -> Iterator[Tuple[Node, str]]:
for node, _ in self.get_code_sections():
if self.is_aspx_import_directive(node):
namespace = self.get_aspx_namespace(node)
Comment on lines +187 to +188
Collaborator

Can we document why we are only/specially handling ASPX here?

Author (adamstorek, Aug 3, 2022)

This is related to the issue discussed here: #1080 (comment).

if namespace is not None:
yield node, namespace

def get_template_namespaces(self) -> Iterator[Tuple[Optional[Node], str]]:
for namespace in self.language_toolkit.get_default_namespaces(True):
yield None, namespace
for node, namespace in self.get_imported_namespaces():
yield node, namespace

def is_c_sharp(self, node: Node) -> bool:
return bool(
re.match(
r'@ .*Page Language\s*=\s*"C#".*'.encode(),
Collaborator

I'd prefer to add an explicit encoding for these calls.

self.get_byte_range(node),
re.IGNORECASE,
)
)
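
Two ways to address the explicit-encoding request, sketched outside the class: a bytes literal removes the encoding step entirely, while `.encode("utf-8")` names the codec that `str.encode()` already uses by default. The constant names and the standalone `is_c_sharp` function taking `bytes` are hypothetical.

```python
import re

# Bytes literal: the pattern is ASCII-only, so no encoding step is needed.
CS_PAGE_DIRECTIVE = re.compile(rb'@ .*Page Language\s*=\s*"C#".*', re.IGNORECASE)

# Equivalent spelling with the codec named explicitly.
CS_PAGE_DIRECTIVE_ENCODED = re.compile(
    r'@ .*Page Language\s*=\s*"C#".*'.encode("utf-8"), re.IGNORECASE
)


def is_c_sharp(section: bytes) -> bool:
    """Check whether a template code section is a C# @ Page directive."""
    return CS_PAGE_DIRECTIVE.match(section) is not None


# Both spellings compile to the same pattern bytes.
assert CS_PAGE_DIRECTIVE.pattern == CS_PAGE_DIRECTIVE_ENCODED.pattern
```

Compiling the pattern once at module level also avoids re-encoding it on every call.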

def is_aspx_import_directive(self, node: Node) -> bool:
return bool(
re.match(
r"@\s*Import Namespace=".encode(),
self.get_byte_range(node),
re.IGNORECASE,
)
)

def get_aspx_namespace(self, node: Node) -> Optional[str]:
match = re.search(
r'@\s*Import namespace="(.*?)"'.encode(),
self.get_byte_range(node),
re.IGNORECASE,
)
return match.group(1).decode() if match is not None else None


class TreeSitterHTMLEngine(TreeSitterBaseEngine):
query: HTMLQueryBinding
namespaces: set[str]

def __init__(self, buf: bytes, additional_namespaces: set[str] = None):
super().__init__(LANG_HTML, buf)
self.namespaces = additional_namespaces if additional_namespaces is not None else set()

def get_scripts(self) -> List[Tuple[Node, str]]:
return self.query.script_element.captures(self.tree.root_node)

def get_attributes(self, node: Node) -> List[Tuple[Node, str]]:
return self.query.attribute.captures(node)

def get_identified_scripts(self) -> Iterator[Tuple[Node, str]]:
for node, _ in self.get_scripts():
for content_node, _ in self.get_script_contents(node):
yield content_node, self.identify_language(node)

def get_script_contents(self, node: Node) -> Iterator[Tuple[Node, str]]:
return self.query.script_content.captures(node)

def get_parsed_code_sections(self) -> Iterator[TreeSitterExtractorEngine]:
for node, language in self.get_identified_scripts():
# TODO: support JS
if language == LANG_CS:
yield TreeSitterExtractorEngine(language, self.get_byte_range(node), node.start_byte, self.namespaces)

def identify_language(self, node: Node) -> str:
for attribute_node, _ in self.get_attributes(node):
if self.is_server_side_c_sharp(attribute_node):
return LANG_CS
return LANG_JS

def is_server_side_c_sharp(self, node: Node) -> bool:
return len(re.findall(r'runat\s*=\s*"server"'.encode(), self.get_byte_range(node))) > 0