Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix string length >= 4 and remove bytes/string overlaps #1298

Merged
merged 2 commits into from
Feb 1, 2023

Conversation

mr-tz
Copy link
Collaborator

@mr-tz mr-tz commented Jan 30, 2023

noticed via test fails in weekly thorough lint: https://github.com/mandiant/capa-rules/actions/runs/4035045319/jobs/6936699063

Added wrongly in ca9c940 and discovered similar bugs in related code.

closes #1293

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@mr-tz
Copy link
Collaborator Author

mr-tz commented Jan 30, 2023

example diff for features from 5a2f620f29ca2f44fc22df67b674198f.exe_

4 character strings are now extracted and bytes which also are strings are not

left before, right after

> diff -y --suppress-common-lines 5af2.feats.txt 5af2.feats2.txt
  insn: 0x40186D: bytes(50 4B 44 46 49 4C 45 30 59 55 49      <
  insn: 0x401AB6: bytes(43 52 59 50 54 45 44 30 59 55 49      <
  insn: 0x401F11: bytes(53 6F 66 74 77 61 72 65 5C 4D 69      <
  insn: 0x40211C: bytes(53 4F 46 54 57 41 52 45 5C 4D 69      <
  insn: 0x402178: bytes(53 4F 46 54 57 41 52 45 5C 4D 69      <
  insn: 0x4021AE: bytes(55 6E 69 6E 73 74 61 6C 6C 53 74      <
  insn: 0x4021EE: bytes(44 69 73 70 6C 61 79 4E 61 6D 65      <
  insn: 0x402458: bytes(2E 65 78 65 00 53 6F 66 74 77 61      <
                                                              >   insn: 0x402458: string(.exe)
  insn: 0x402535: bytes(6F 6C 65 33 32 2E 64 6C 6C 00 53      <
  insn: 0x402544: bytes(63 72 79 70 74 33 32 2E 64 6C 6C      <
  insn: 0x402553: bytes(61 64 76 61 70 69 33 32 2E 64 6C      <
  insn: 0x402562: bytes(73 68 65 6C 6C 33 32 2E 64 6C 6C      <
  insn: 0x402571: bytes(6E 65 74 61 70 69 33 32 2E 64 6C      <
  insn: 0x402580: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C      <
  insn: 0x40258F: bytes(6D 73 69 2E 64 6C 6C 00 4D 73 69      <
  insn: 0x40259E: bytes(70 73 74 6F 72 65 63 2E 64 6C 6C      <
  insn: 0x4026ED: bytes(53 6F 66 74 77 61 72 65 5C 57 69      <
  insn: 0x402807: bytes(53 6F 66 74 77 61 72 65 5C 57 69      <
  insn: 0x402BDF: bytes(53 65 49 6D 70 65 72 73 6F 6E 61      <
  insn: 0x402CB8: bytes(53 2D 31 2D 35 2D 31 38 00 00 00      <
  insn: 0x402D6F: bytes(65 78 70 6C 6F 72 65 72 2E 65 78      <
  insn: 0x403AF8: bytes(43 6F 6E 74 65 6E 74 2D 4C 65 6E      <
  insn: 0x403B0C: bytes(43 6F 6E 74 65 6E 74 2D 4C 65 6E      <
  insn: 0x403B41: bytes(4C 6F 63 61 74 69 6F 6E 3A 00 47      <
  insn: 0x403B55: bytes(4C 6F 63 61 74 69 6F 6E 3A 00 47      <
  insn: 0x403CF3: bytes(47 45 54 20 25 73 20 48 54 54 50      <
  insn: 0x403EE6: bytes(50 4F 53 54 20 25 73 20 48 54 54      <
  insn: 0x4041D6: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57      <
                                                              >   insn: 0x4041D6: string(\\*.*)
  insn: 0x404473: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C      <
  insn: 0x40448A: bytes(47 65 74 4E 61 74 69 76 65 53 79      <
  insn: 0x40449B: bytes(49 73 57 6F 77 36 34 50 72 6F 63      <
  insn: 0x404576: bytes(48 57 49 44 00 7B 25 30 38 58 2D      <
                                                              >   insn: 0x404576: string(HWID)
  insn: 0x4045CD: bytes(7B 25 30 38 58 2D 25 30 34 58 2D      <
  insn: 0x4045EC: bytes(48 57 49 44 00 7B 25 30 38 58 2D      <
                                                              >   insn: 0x4045EC: string(HWID)
  insn: 0x404722: bytes(48 57 49 44 00 7B 25 30 38 58 2D      <
                                                              >   insn: 0x404722: string(HWID)
  insn: 0x4047A4: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C      <
  insn: 0x4047BD: bytes(47 65 74 4E 61 74 69 76 65 53 79      <
  insn: 0x404881: bytes(4C 69 6E 65 00 77 63 78 5F 66 74      <
                                                              >   insn: 0x404881: string(Line)
  insn: 0x40499D: bytes(50 61 73 73 77 6F 72 64 00 48 6F      <
  insn: 0x4049BB: bytes(48 6F 73 74 4E 61 6D 65 00 55 73      <
  insn: 0x4049D9: bytes(55 73 65 72 00 4C 69 6E 65 00 77      <
                                                              >   insn: 0x4049D9: string(User)
  insn: 0x404AAA: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404AB7: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404AC4: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404AD1: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404ADE: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404AEB: bytes(53 6F 66 74 77 61 72 65 5C 46 61      <
  insn: 0x404B49: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404B4E: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404B82: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404B87: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404BBA: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404BBF: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404BF3: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404BF8: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404C29: bytes(77 63 78 5F 66 74 70 2E 69 6E 69      <
  insn: 0x404CBB: bytes(5C 47 48 49 53 4C 45 52 00 49 6E      <
  insn: 0x404CDA: bytes(5C 47 48 49 53 4C 45 52 00 49 6E      <
  insn: 0x404CF9: bytes(5C 47 48 49 53 4C 45 52 00 49 6E      <
  insn: 0x404D0F: bytes(49 6E 73 74 61 6C 6C 44 69 72 00      <
  insn: 0x404D14: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404D2F: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404D34: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404D60: bytes(49 6E 73 74 61 6C 6C 44 69 72 00      <
  insn: 0x404D65: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404D80: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404D85: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404DB1: bytes(49 6E 73 74 61 6C 6C 44 69 72 00      <
  insn: 0x404DB6: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404DD0: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404DD5: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404E00: bytes(49 6E 73 74 61 6C 6C 44 69 72 00      <
  insn: 0x404E05: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404E1F: bytes(46 74 70 49 6E 69 4E 61 6D 65 00      <
  insn: 0x404E24: bytes(53 6F 66 74 77 61 72 65 5C 47 68      <
  insn: 0x404EA3: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57      <
                                                              >   insn: 0x404EA3: string(\\*.*)
  insn: 0x404F71: bytes(2E 69 6E 69 00 57 53 5F 46 54 50      <
                                                              >   insn: 0x404F71: string(.ini)
  insn: 0x404FA6: bytes(53 69 74 65 73 5C 00 5C 49 70 73      <
  insn: 0x405087: bytes(5C 77 69 6E 2E 69 6E 69 00 2E 69      <
  insn: 0x4050BA: bytes(57 53 5F 46 54 50 00 44 49 52 00      <
  insn: 0x4050F0: bytes(44 45 46 44 49 52 00 43 55 54 45      <
  insn: 0x4050F5: bytes(57 53 5F 46 54 50 00 44 49 52 00      <
  insn: 0x40512A: bytes(5C 49 70 73 77 69 74 63 68 5C 57      <
  insn: 0x405146: bytes(5C 49 70 73 77 69 74 63 68 00 53      <
  insn: 0x405155: bytes(5C 49 70 73 77 69 74 63 68 00 53      <
  insn: 0x405164: bytes(5C 49 70 73 77 69 74 63 68 00 53      <
  insn: 0x4051AA: bytes(51 43 48 69 73 74 6F 72 79 00 53      <
  insn: 0x405221: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57      <
                                                              >   insn: 0x405221: string(\\*.*)
  insn: 0x405333: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50      <
  insn: 0x405341: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66      <
  insn: 0x405354: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50      <
  insn: 0x405362: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66      <
  insn: 0x405375: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50      <
  insn: 0x405383: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66      <
  insn: 0x405396: bytes(5C 43 75 74 65 46 54 50 00 5C 73      <
  insn: 0x4053A4: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66      <
  insn: 0x4053E4: bytes(43 55 54 45 46 54 50 00 51 43 48      <
  insn: 0x4053FE: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66      <
  insn: 0x405447: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x405454: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x405461: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x40546E: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x40547B: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x405488: bytes(53 6F 66 74 77 61 72 65 5C 47 6C      <
  insn: 0x4054E5: bytes(5C 53 69 74 65 73 2E 64 61 74 00      <
  insn: 0x405500: bytes(5C 51 75 69 63 6B 2E 64 61 74 00      <
  insn: 0x40551B: bytes(5C 48 69 73 74 6F 72 79 2E 64 61      <
  insn: 0x405545: bytes(5C 46 6C 61 73 68 46 58 50 5C 33      <
  insn: 0x405567: bytes(5C 46 6C 61 73 68 46 58 50 5C 34      <
  insn: 0x4055A7: bytes(5C 53 69 74 65 73 2E 64 61 74 00      <
  insn: 0x4055BC: bytes(5C 51 75 69 63 6B 2E 64 61 74 00      <
  insn: 0x4055D1: bytes(5C 48 69 73 74 6F 72 79 2E 64 61      <
  insn: 0x405602: bytes(49 6E 73 74 61 6C 6C 65 72 44 61      <
  insn: 0x405607: bytes(53 6F 66 74 77 61 72 65 5C 46 6C      <
  insn: 0x40561A: bytes(70 61 74 68 00 49 6E 73 74 61 6C      <
  insn: 0x40561F: bytes(53 6F 66 74 77 61 72 65 5C 46 6C      |   insn: 0x40561A: string(path)

@mr-tz mr-tz changed the title Fix/string length regression bu Fix string length >= 4 and remove bytes/string overlaps Jan 30, 2023
@@ -191,7 +191,8 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter
if user_string is None:
return

yield String(user_string), ih.address
if len(user_string) >= 4:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make this configurable and consistent across backends?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR was to make this consistent for viv, IDA, and dotnet. Did I miss anything?

Making it configurable, would be a neat enhancement.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant that we might have a single place where "4" is defined and then passed into the string extractors (consistently). i dont think its important to hit in this PR, but something nice to do sometime.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sounds good!

Comment on lines -175 to +177
yield Bytes(extracted_bytes), ih.address
if not capa.features.extractors.ida.helpers.find_string_at(insn.ea):
# don't extract byte features for obvious strings
yield Bytes(extracted_bytes), ih.address
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not necessarily convinced by this change in logic.

for example, consider a file format (like capa) that has a magic header/signature of ASCII text, like "capa" and subsequent binary data. with this change, we can only match against the "capa" and not subsequent data.

is this really a problem? maybe not, but i do want to raise the issue.

perhaps we do merge this change, but be prepared to revert if we come up with a real exampe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be an example to leave as is. My argument here is that we extract many superfluous features here for all references strings.

If we find an example/use case we can also fine tune the extraction logic. E.g. is it a reference to a null-terminated string that's followed by another string in the .data section. If you think it's worthwhile we can add this here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think its worthwhile to make any further changes here (or revert) unless we can come up with some specific examples. just explaining a possibility.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, happy to revisit this if the need arises

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

overlap string and bytes features
2 participants