Fix string length >= 4 and remove bytes/string overlaps #1298

Merged 2 commits on Feb 1, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -98,6 +98,8 @@
 - handle vivisect bug around strings at instruction level, use min length 4 #1271 @williballenthin @mr-tz
 - extractor: guard against invalid "calls from" features #1177 @mr-tz
 - extractor: add format to global features #1258 @mr-tz
+- extractor: discover all strings with length >= 4 #1280 @mr-tz
+- extractor: don't extract byte features for strings #1293 @mr-tz

 ### capa explorer IDA Pro plugin
 - fix: display instruction items #1154 @mr-tz
3 changes: 2 additions & 1 deletion capa/features/extractors/dnfile/insn.py
@@ -191,7 +191,8 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter
     if user_string is None:
         return

-    yield String(user_string), ih.address
+    if len(user_string) >= 4:
Collaborator

Should we make this configurable and consistent across backends?

Collaborator Author

The goal of this PR was to make this consistent for viv, IDA, and dotnet. Did I miss anything?

Making it configurable would be a neat enhancement.

Collaborator

I meant that we might have a single place where "4" is defined and then passed into the string extractors (consistently). I don't think it's important to address in this PR, but it would be nice to do sometime.

Collaborator Author

yes, sounds good!

+        yield String(user_string), ih.address


 def extract_unmanaged_call_characteristic_features(
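
The review thread on this file suggests defining the minimum length in a single place and passing it into the string extractors. A minimal sketch of that idea follows. The names `MIN_STRING_LENGTH`, `is_reportable_string`, and `extract_strings` are hypothetical and do not exist in capa; the PR itself keeps the literal `4` in each backend.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical shared constant; the PR hard-codes 4 in each backend instead.
MIN_STRING_LENGTH = 4


def is_reportable_string(s: Optional[str], min_length: int = MIN_STRING_LENGTH) -> bool:
    """Return True if `s` exists and has at least `min_length` characters."""
    return s is not None and len(s) >= min_length


def extract_strings(
    candidates: Iterable[Tuple[int, Optional[str]]],
    min_length: int = MIN_STRING_LENGTH,
) -> Iterator[Tuple[str, int]]:
    """Yield (string, address) pairs for candidates that meet the minimum length."""
    for address, s in candidates:
        if is_reportable_string(s, min_length):
            yield s, address


if __name__ == "__main__":
    found = [(0x1000, "abc"), (0x1004, "abcd"), (0x1008, None), (0x100C, "hello")]
    for s, address in extract_strings(found):
        print(hex(address), s)
```

With a helper like this, each backend would call the same predicate, so changing the threshold (or making it configurable) would touch one definition rather than three.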
2 changes: 1 addition & 1 deletion capa/features/extractors/ida/helpers.py
@@ -197,7 +197,7 @@ def read_bytes_at(ea: int, count: int) -> bytes:
 def find_string_at(ea: int, min_: int = 4) -> str:
     """check if ASCII string exists at a given virtual address"""
     found = idaapi.get_strlit_contents(ea, -1, idaapi.STRTYPE_C)
-    if found and len(found) > min_:
+    if found and len(found) >= min_:
         try:
             found = found.decode("ascii")
             # hacky check for IDA bug; get_strlit_contents also reads Unicode as
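
The one-character change above fixes an off-by-one: with `>`, a string of exactly four characters was rejected even though the default minimum is 4. The following standalone sketch illustrates the difference. The function names are hypothetical stand-ins and do not call IDA.

```python
def passes_old_check(s: str, min_: int = 4) -> bool:
    """Length check before the change: strictly greater than the minimum."""
    return len(s) > min_


def passes_new_check(s: str, min_: int = 4) -> bool:
    """Length check after the change: at least the minimum."""
    return len(s) >= min_


samples = ["abc", "capa", "hello"]
old = [s for s in samples if passes_old_check(s)]
new = [s for s in samples if passes_new_check(s)]

print(old)  # ['hello']
print(new)  # ['capa', 'hello']
```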
4 changes: 3 additions & 1 deletion capa/features/extractors/ida/insn.py
@@ -172,7 +172,9 @@ def extract_insn_bytes_features(fh: FunctionHandle, bbh: BBHandle, ih: InsnHandl
     if ref != insn.ea:
         extracted_bytes = capa.features.extractors.ida.helpers.read_bytes_at(ref, MAX_BYTES_FEATURE_SIZE)
         if extracted_bytes and not capa.features.extractors.helpers.all_zeros(extracted_bytes):
-            yield Bytes(extracted_bytes), ih.address
+            if not capa.features.extractors.ida.helpers.find_string_at(insn.ea):
+                # don't extract byte features for obvious strings
+                yield Bytes(extracted_bytes), ih.address
Comment on lines -175 to +177
Collaborator

I'm not necessarily convinced by this change in logic.

For example, consider a file format (like capa) that has a magic header/signature of ASCII text, like "capa", followed by binary data. With this change, we can only match against the "capa" and not the subsequent data.

Is this really a problem? Maybe not, but I do want to raise the issue.

Perhaps we merge this change, but be prepared to revert if we come up with a real example.

Collaborator Author

That would be an example to leave as is. My argument is that we currently extract many superfluous byte features for all referenced strings.

If we find an example/use case, we can also fine-tune the extraction logic, e.g. by checking whether it is a reference to a null-terminated string that's followed by another string in the .data section. If you think it's worthwhile, we can add this here.

Collaborator

I don't think it's worthwhile to make any further changes here (or revert) unless we can come up with some specific examples. Just explaining a possibility.

Collaborator Author

gotcha, happy to revisit this if the need arises



def extract_insn_string_features(
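
The concern raised in the thread above (an ASCII magic header followed by binary data) can be illustrated with a small standalone sketch. The helper `looks_like_string` is a hypothetical stand-in for a backend's string check; it is not the logic used by `find_string_at` or vivisect's `isProbablyString`, and `bytes_feature` only mimics the shape of the new guard.

```python
import string
from typing import Optional

# Byte values of printable ASCII characters.
PRINTABLE = frozenset(string.printable.encode("ascii"))


def leading_ascii_run(buf: bytes) -> bytes:
    """Return the printable-ASCII prefix of `buf`, stopping at the first other byte."""
    end = 0
    while end < len(buf) and buf[end] in PRINTABLE:
        end += 1
    return buf[:end]


def looks_like_string(buf: bytes, min_length: int = 4) -> bool:
    """Hypothetical string check: does `buf` start with >= min_length printable bytes?"""
    return len(leading_ascii_run(buf)) >= min_length


def bytes_feature(buf: bytes) -> Optional[bytes]:
    """Mimic the new guard: emit no bytes feature for all-zero or string-like data."""
    if not buf or all(b == 0 for b in buf):
        return None
    if looks_like_string(buf):
        # don't extract byte features for obvious strings
        return None
    return buf


header = b"capa\x00\x01\x02\x03\xff\xfe"    # ASCII magic followed by binary data
binary = b"\x4d\x5a\x90\x00\x03\x00\x00\x00"  # only two printable leading bytes

print(bytes_feature(header))  # None
print(bytes_feature(binary) == binary)  # True
```

Under these assumptions, the data behind the `capa` magic is no longer available as a bytes feature, which is the trade-off the reviewers agreed to revisit if a concrete case appears.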
6 changes: 5 additions & 1 deletion capa/features/extractors/viv/insn.py
@@ -271,6 +271,10 @@ def extract_insn_bytes_features(fh: FunctionHandle, bb, ih: InsnHandle) -> Itera
         if capa.features.extractors.helpers.all_zeros(buf):
             continue

+        if f.vw.isProbablyString(v):
+            # don't extract byte features for obvious strings
+            continue
+
         yield Bytes(buf), ih.address


@@ -676,7 +680,7 @@ def extract_op_string_features(
         except ValueError:
             continue
         else:
-            if len(s) > 4:
+            if len(s) >= 4:
                 yield String(s), ih.address

