-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix string length >= 4 and remove bytes/string overlaps #1298
Conversation
example diff for features from 5a2f620f29ca2f44fc22df67b674198f.exe_ 4 character strings are now extracted and bytes which also are strings are not left before, right after > diff -y --suppress-common-lines 5af2.feats.txt 5af2.feats2.txt
insn: 0x40186D: bytes(50 4B 44 46 49 4C 45 30 59 55 49 <
insn: 0x401AB6: bytes(43 52 59 50 54 45 44 30 59 55 49 <
insn: 0x401F11: bytes(53 6F 66 74 77 61 72 65 5C 4D 69 <
insn: 0x40211C: bytes(53 4F 46 54 57 41 52 45 5C 4D 69 <
insn: 0x402178: bytes(53 4F 46 54 57 41 52 45 5C 4D 69 <
insn: 0x4021AE: bytes(55 6E 69 6E 73 74 61 6C 6C 53 74 <
insn: 0x4021EE: bytes(44 69 73 70 6C 61 79 4E 61 6D 65 <
insn: 0x402458: bytes(2E 65 78 65 00 53 6F 66 74 77 61 <
> insn: 0x402458: string(.exe)
insn: 0x402535: bytes(6F 6C 65 33 32 2E 64 6C 6C 00 53 <
insn: 0x402544: bytes(63 72 79 70 74 33 32 2E 64 6C 6C <
insn: 0x402553: bytes(61 64 76 61 70 69 33 32 2E 64 6C <
insn: 0x402562: bytes(73 68 65 6C 6C 33 32 2E 64 6C 6C <
insn: 0x402571: bytes(6E 65 74 61 70 69 33 32 2E 64 6C <
insn: 0x402580: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C <
insn: 0x40258F: bytes(6D 73 69 2E 64 6C 6C 00 4D 73 69 <
insn: 0x40259E: bytes(70 73 74 6F 72 65 63 2E 64 6C 6C <
insn: 0x4026ED: bytes(53 6F 66 74 77 61 72 65 5C 57 69 <
insn: 0x402807: bytes(53 6F 66 74 77 61 72 65 5C 57 69 <
insn: 0x402BDF: bytes(53 65 49 6D 70 65 72 73 6F 6E 61 <
insn: 0x402CB8: bytes(53 2D 31 2D 35 2D 31 38 00 00 00 <
insn: 0x402D6F: bytes(65 78 70 6C 6F 72 65 72 2E 65 78 <
insn: 0x403AF8: bytes(43 6F 6E 74 65 6E 74 2D 4C 65 6E <
insn: 0x403B0C: bytes(43 6F 6E 74 65 6E 74 2D 4C 65 6E <
insn: 0x403B41: bytes(4C 6F 63 61 74 69 6F 6E 3A 00 47 <
insn: 0x403B55: bytes(4C 6F 63 61 74 69 6F 6E 3A 00 47 <
insn: 0x403CF3: bytes(47 45 54 20 25 73 20 48 54 54 50 <
insn: 0x403EE6: bytes(50 4F 53 54 20 25 73 20 48 54 54 <
insn: 0x4041D6: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57 <
> insn: 0x4041D6: string(\\*.*)
insn: 0x404473: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C <
insn: 0x40448A: bytes(47 65 74 4E 61 74 69 76 65 53 79 <
insn: 0x40449B: bytes(49 73 57 6F 77 36 34 50 72 6F 63 <
insn: 0x404576: bytes(48 57 49 44 00 7B 25 30 38 58 2D <
> insn: 0x404576: string(HWID)
insn: 0x4045CD: bytes(7B 25 30 38 58 2D 25 30 34 58 2D <
insn: 0x4045EC: bytes(48 57 49 44 00 7B 25 30 38 58 2D <
> insn: 0x4045EC: string(HWID)
insn: 0x404722: bytes(48 57 49 44 00 7B 25 30 38 58 2D <
> insn: 0x404722: string(HWID)
insn: 0x4047A4: bytes(6B 65 72 6E 65 6C 33 32 2E 64 6C <
insn: 0x4047BD: bytes(47 65 74 4E 61 74 69 76 65 53 79 <
insn: 0x404881: bytes(4C 69 6E 65 00 77 63 78 5F 66 74 <
> insn: 0x404881: string(Line)
insn: 0x40499D: bytes(50 61 73 73 77 6F 72 64 00 48 6F <
insn: 0x4049BB: bytes(48 6F 73 74 4E 61 6D 65 00 55 73 <
insn: 0x4049D9: bytes(55 73 65 72 00 4C 69 6E 65 00 77 <
> insn: 0x4049D9: string(User)
insn: 0x404AAA: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404AB7: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404AC4: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404AD1: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404ADE: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404AEB: bytes(53 6F 66 74 77 61 72 65 5C 46 61 <
insn: 0x404B49: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404B4E: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404B82: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404B87: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404BBA: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404BBF: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404BF3: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404BF8: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404C29: bytes(77 63 78 5F 66 74 70 2E 69 6E 69 <
insn: 0x404CBB: bytes(5C 47 48 49 53 4C 45 52 00 49 6E <
insn: 0x404CDA: bytes(5C 47 48 49 53 4C 45 52 00 49 6E <
insn: 0x404CF9: bytes(5C 47 48 49 53 4C 45 52 00 49 6E <
insn: 0x404D0F: bytes(49 6E 73 74 61 6C 6C 44 69 72 00 <
insn: 0x404D14: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404D2F: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404D34: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404D60: bytes(49 6E 73 74 61 6C 6C 44 69 72 00 <
insn: 0x404D65: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404D80: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404D85: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404DB1: bytes(49 6E 73 74 61 6C 6C 44 69 72 00 <
insn: 0x404DB6: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404DD0: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404DD5: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404E00: bytes(49 6E 73 74 61 6C 6C 44 69 72 00 <
insn: 0x404E05: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404E1F: bytes(46 74 70 49 6E 69 4E 61 6D 65 00 <
insn: 0x404E24: bytes(53 6F 66 74 77 61 72 65 5C 47 68 <
insn: 0x404EA3: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57 <
> insn: 0x404EA3: string(\\*.*)
insn: 0x404F71: bytes(2E 69 6E 69 00 57 53 5F 46 54 50 <
> insn: 0x404F71: string(.ini)
insn: 0x404FA6: bytes(53 69 74 65 73 5C 00 5C 49 70 73 <
insn: 0x405087: bytes(5C 77 69 6E 2E 69 6E 69 00 2E 69 <
insn: 0x4050BA: bytes(57 53 5F 46 54 50 00 44 49 52 00 <
insn: 0x4050F0: bytes(44 45 46 44 49 52 00 43 55 54 45 <
insn: 0x4050F5: bytes(57 53 5F 46 54 50 00 44 49 52 00 <
insn: 0x40512A: bytes(5C 49 70 73 77 69 74 63 68 5C 57 <
insn: 0x405146: bytes(5C 49 70 73 77 69 74 63 68 00 53 <
insn: 0x405155: bytes(5C 49 70 73 77 69 74 63 68 00 53 <
insn: 0x405164: bytes(5C 49 70 73 77 69 74 63 68 00 53 <
insn: 0x4051AA: bytes(51 43 48 69 73 74 6F 72 79 00 53 <
insn: 0x405221: bytes(5C 2A 2E 2A 00 2A 2E 2A 00 48 57 <
> insn: 0x405221: string(\\*.*)
insn: 0x405333: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50 <
insn: 0x405341: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66 <
insn: 0x405354: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50 <
insn: 0x405362: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66 <
insn: 0x405375: bytes(5C 47 6C 6F 62 61 6C 53 43 41 50 <
insn: 0x405383: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66 <
insn: 0x405396: bytes(5C 43 75 74 65 46 54 50 00 5C 73 <
insn: 0x4053A4: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66 <
insn: 0x4053E4: bytes(43 55 54 45 46 54 50 00 51 43 48 <
insn: 0x4053FE: bytes(5C 73 6D 2E 64 61 74 00 53 6F 66 <
insn: 0x405447: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x405454: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x405461: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x40546E: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x40547B: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x405488: bytes(53 6F 66 74 77 61 72 65 5C 47 6C <
insn: 0x4054E5: bytes(5C 53 69 74 65 73 2E 64 61 74 00 <
insn: 0x405500: bytes(5C 51 75 69 63 6B 2E 64 61 74 00 <
insn: 0x40551B: bytes(5C 48 69 73 74 6F 72 79 2E 64 61 <
insn: 0x405545: bytes(5C 46 6C 61 73 68 46 58 50 5C 33 <
insn: 0x405567: bytes(5C 46 6C 61 73 68 46 58 50 5C 34 <
insn: 0x4055A7: bytes(5C 53 69 74 65 73 2E 64 61 74 00 <
insn: 0x4055BC: bytes(5C 51 75 69 63 6B 2E 64 61 74 00 <
insn: 0x4055D1: bytes(5C 48 69 73 74 6F 72 79 2E 64 61 <
insn: 0x405602: bytes(49 6E 73 74 61 6C 6C 65 72 44 61 <
insn: 0x405607: bytes(53 6F 66 74 77 61 72 65 5C 46 6C <
insn: 0x40561A: bytes(70 61 74 68 00 49 6E 73 74 61 6C <
insn: 0x40561F: bytes(53 6F 66 74 77 61 72 65 5C 46 6C | insn: 0x40561A: string(path) |
@@ -191,7 +191,8 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter | |||
if user_string is None: | |||
return | |||
|
|||
yield String(user_string), ih.address | |||
if len(user_string) >= 4: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make this configurable and consistent across backends?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this PR was to make this consistent for viv, IDA, and dotnet. Did I miss anything?
Making it configurable, would be a neat enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i meant that we might have a single place where "4" is defined and then passed into the string extractors (consistently). i dont think its important to hit in this PR, but something nice to do sometime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sounds good!
yield Bytes(extracted_bytes), ih.address | ||
if not capa.features.extractors.ida.helpers.find_string_at(insn.ea): | ||
# don't extract byte features for obvious strings | ||
yield Bytes(extracted_bytes), ih.address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not necessarily convinced by this change in logic.
for example, consider a file format (like capa) that has a magic header/signature of ASCII text, like "capa"
and subsequent binary data. with this change, we can only match against the "capa" and not subsequent data.
is this really a problem? maybe not, but i do want to raise the issue.
perhaps we do merge this change, but be prepared to revert if we come up with a real exampe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be an example to leave as is. My argument here is that we extract many superfluous features here for all references strings.
If we find an example/use case we can also fine tune the extraction logic. E.g. is it a reference to a null-terminated string that's followed by another string in the .data
section. If you think it's worthwhile we can add this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think its worthwhile to make any further changes here (or revert) unless we can come up with some specific examples. just explaining a possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, happy to revisit this if the need arises
noticed via test fails in weekly thorough lint: https://github.com/mandiant/capa-rules/actions/runs/4035045319/jobs/6936699063
Added wrongly in ca9c940 and discovered similar bugs in related code.
closes #1293
Checklist