Long signatures #20
Comments
The DCCS file format hard-wires it to 23 bytes. Sprinkled in the DCC code you find:
Goodness knows why the original authors chose that size; it's likely to avoid excessive file sizes on a project that may have been developed on an x86 MS-DOS environment. You could try changing PATLEN to a larger number and then running |
What I want: to regenerate new libs that contain all functions, with almost no collisions. Then I want to apply them in IDA. |
You'll need access to the original LIB files from the various compilers to accomplish this, since only the first 23 bytes are available in the existing DCCS files. |
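To illustrate why a fixed 23-byte prefix leads to collisions, here is a minimal Python sketch. It is not DCC code: the function names and byte values are made up, and zero-padding of short functions is an assumption.

```python
from collections import defaultdict

PATLEN = 23  # signature length the DCCS format is said to hard-wire

def truncate(code, patlen=PATLEN):
    """Keep the first patlen bytes; zero-pad shorter functions (an assumption)."""
    return code[:patlen].ljust(patlen, b"\x00")

def find_collisions(functions, patlen=PATLEN):
    """Group function names whose truncated signatures are identical."""
    buckets = defaultdict(list)
    for name, code in functions.items():
        buckets[truncate(code, patlen)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]

# Hypothetical functions: _foo and _bar only differ in their 24th byte.
shared = bytes(range(23))
functions = {
    "_foo": shared + b"\xC3",
    "_bar": shared + b"\xCB",
    "_baz": b"\x55\x8B\xEC\xC3",
}

print(find_collisions(functions))             # collisions at 23 bytes
print(find_collisions(functions, patlen=24))  # collisions at 24 bytes
```

With 23 bytes `_foo` and `_bar` are indistinguishable; one more byte separates them, which is the motivation for a larger PATLEN.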
I understand. Another question about signature models: what is the difference between "l" and "c"? |
The naming convention used here is:
|
OK. Suppose I have some compiler lib file. Which model will be selected, and what criteria are used when naming it? Is the memory model stored inside the lib? |
When generating the signatures using makedsig, the user herself has to know what vendor, version and model the LIB file was compiled with. |
But makedsig only takes the lib name as a parameter. |
If you look at
So it's the user's responsibility to provide a correct file name for the |
Ah, I see. dcc selects the correct sig file, and I have to provide the correct file name. |
Exactly so. AFAIK there is no identification information contained inside lib/tpl files. |
Reko will probably use a variant of this scheme, but the mapping of signature files may be happening in the configuration file to avoid dependencies on the naming of the signature files themselves. |
Yup, the format of signature files could be made a bit more robust:
{
  "Vendor": "Borland",
  "CompilerName": "TurboC 3.0",
  "Language": "C",
  "Version": "3.0",
  "SignatureBlocks": [{
    "Model": "Large",
    "SigLength": 29,
    "Signatures": []
  }, {
    "Model": "Small",
    "SigLength": 23,
    "Signatures": []
  }]
}
and |
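A consumer of such a JSON file could sanity-check it before use. The sketch below is only an illustration: the thread leaves the shape of a `Signatures` entry open, so the `Name` and `Pattern` (hex string) keys are assumptions.

```python
import json

REQUIRED_KEYS = ("Vendor", "CompilerName", "Language", "Version", "SignatureBlocks")

def check_signature_file(text):
    """Return a list of problems found in a JSON signature document."""
    doc = json.loads(text)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    for block in doc.get("SignatureBlocks", []):
        expected = block["SigLength"]
        for sig in block["Signatures"]:
            # "Name" and "Pattern" are hypothetical field names.
            actual = len(bytes.fromhex(sig["Pattern"]))
            if actual != expected:
                problems.append(
                    f"{block['Model']}/{sig['Name']}: {actual} bytes, expected {expected}"
                )
    return problems

sample = json.dumps({
    "Vendor": "Borland", "CompilerName": "TurboC 3.0", "Language": "C", "Version": "3.0",
    "SignatureBlocks": [{
        "Model": "Small", "SigLength": 2,
        "Signatures": [{"Name": "_ok", "Pattern": "558B"},
                       {"Name": "_short", "Pattern": "55"}],
    }],
})
print(check_signature_file(sample))
```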
Makedsig asks me for "Seed:". What is it? |
And a second question: how do I merge signatures from different lib files? |
Consider using a schema as well, so a JSON parser can identify what kind of data this is:
Merging signatures from different lib files should be done by the relevant decompilers when they "ingest" the JSON described above. I.e. there should be a function
This work is underway on the Reko project: there are at least three signature file formats that Reko is aware of, and I'm making it so that they all get unified internally. It would be cool if dcc and Reko could interoperate on this level. |
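A merge step at ingest time could look roughly like the following sketch. This is not Reko or dcc code: `merge_signatures` is a hypothetical name, and the `Name`/`Pattern` fields of a signature entry are assumptions layered on the JSON layout proposed above.

```python
def merge_signatures(docs):
    """Merge parsed JSON signature documents, keyed by compiler and memory model.

    Returns (merged, conflicts): merged maps a compiler/model key to a
    {pattern: name} table; conflicts lists patterns claimed by two names.
    """
    merged = {}
    conflicts = []
    for doc in docs:
        for block in doc["SignatureBlocks"]:
            key = (doc["Vendor"], doc["CompilerName"], doc["Version"], block["Model"])
            table = merged.setdefault(key, {})
            for sig in block["Signatures"]:
                pattern, name = sig["Pattern"].upper(), sig["Name"]
                previous = table.setdefault(pattern, name)
                if previous != name:
                    conflicts.append((key, pattern, previous, name))
    return merged, conflicts

def make_doc(signatures):
    """Build a tiny one-block document for the demo (values are made up)."""
    return {"Vendor": "Borland", "CompilerName": "TurboC 3.0", "Version": "3.0",
            "SignatureBlocks": [{"Model": "Small", "SigLength": 2,
                                 "Signatures": signatures}]}

lib_a = make_doc([{"Name": "_printf", "Pattern": "558b"}])
lib_b = make_doc([{"Name": "_puts", "Pattern": "5589"},
                  {"Name": "_other", "Pattern": "558B"}])
merged, conflicts = merge_signatures([lib_a, lib_b])
print(merged)
print(conflicts)
```

Keeping the first name and reporting the clash separately is one possible policy; a real tool might prefer to drop ambiguous patterns entirely.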
The DCC signature file format creates a perfect hash. The algorithm it uses requires a random number generator (RNG). The Seed: prompt is asking you for a seed for the RNG. I'm not sure why this is requested explicitly; perhaps to make sure, during development, that the hash table is created correctly and reproducibly. Just enter some number < 32637 and you should be OK. |
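To show why a seed makes the result reproducible, here is a small Python sketch of a randomized perfect-hash search. It is not the algorithm makedsig uses; it simply tries random salts until no two keys share a slot, and the key names are made up.

```python
import hashlib
import random

def slot(key, salt, size):
    """Hash key with a salt into a table slot (blake2b keeps the demo deterministic)."""
    digest = hashlib.blake2b(key, digest_size=8, salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % size

def find_perfect_salt(keys, seed, size, max_tries=10000):
    """Try salts drawn from a seeded RNG until every key gets its own slot."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        salt = rng.randrange(1, 1 << 32)
        if len({slot(k, salt, size) for k in keys}) == len(keys):
            return salt
    raise RuntimeError("no collision-free salt found")

keys = [b"_printf", b"_puts", b"_malloc", b"_free", b"_exit"]
size = 2 * len(keys) + 1  # some slack makes a collision-free salt easy to find
salt = find_perfect_salt(keys, seed=1234, size=size)
print(salt, sorted(slot(k, salt, size) for k in keys))
```

Running the search twice with the same seed yields the same salt and therefore the same table, which is convenient when debugging the table builder.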
.lib files (and .obj files) are OMF files. Sadly, they have no magic number at the beginning, so you have to depend on file extensions to figure out what's inside. This is why I'm suggesting the |
Common signature format: agreed - will try to flesh it out and post it here. |
Also, consider looking at the Yara format. It's not JSON, but we could consider making a JSON compatible version. |
John, should we consider other pattern schemes? Once upon a time I had some fun with an Xbox emulator that used pattern matching to identify SDK functions, and rewrote it to use a pre-generated per-SDK trie (strings with wildcards). |
As for YARA, I think their pattern matching language is not a very good match for our purposes. What we might consider is pattern disambiguation by symbol names. Given two patterns with the same signature:
we would be unable to correctly locate those patterns in the binary, but if previously we managed to locate either FuncX, or FuncY then we could use those to augment the pattern matcher? |
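The disambiguation idea could be sketched like this in Python. All names and data shapes here are hypothetical: each candidate signature lists which symbol it references at which offset, and the matcher compares that with call targets already decoded from the binary.

```python
def disambiguate(candidates, observed_refs, known_addresses):
    """Pick the candidates whose symbol references agree with the binary.

    candidates:      {function name: {offset in function: referenced symbol}}
    observed_refs:   {offset in function: target address decoded from the code}
    known_addresses: {symbol: address} for functions identified earlier
    """
    matches = []
    for name, refs in candidates.items():
        if all(
            offset in observed_refs
            and known_addresses.get(symbol) == observed_refs[offset]
            for offset, symbol in refs.items()
        ):
            matches.append(name)
    return matches

# Two hypothetical wrappers with identical bytes apart from the call target.
candidates = {
    "WrapperA": {1: "FuncX"},
    "WrapperB": {1: "FuncY"},
}
known_addresses = {"FuncX": 0x1000}  # FuncX was located by an earlier pass
observed_refs = {1: 0x1000}          # the call at offset 1 goes to 0x1000

print(disambiguate(candidates, observed_refs, known_addresses))
```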
Reko uses another signature format, provided by @halsten, for identifying packers and unpackers. It is different again:
<SIGNATURES>
<ENTRY>
<NAME>Microsoft Visual C++ 7</NAME>
<COMMENTS />
<ENTRYPOINT>????4100000000000000630000000000??00??????????00??00??????????????????????????????????00??00??00??????????????????????????????00????20????00??00??????????????00??????????????????????00??00??????00??????????????00??00??00??00??00??00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????????????????????????????00??????????????????????00??00??00??????00??00??00??00??00??00</ENTRYPOINT>
<ENTIREPE />
</ENTRY>
<ENTRY>
<NAME>Microsoft Visual C++ 8.0</NAME>
<COMMENTS>
</COMMENTS>
<ENTRYPOINT>4883EC28E8????00004883C428E9????FFFFCCCCCCCCCCCCCCCCCCCCCCCCCCCC</ENTRYPOINT>
<ENTIREPE>
</ENTIREPE>
</ENTRY>
As you can see, it is just yet another variant with its own benefits and flaws. It's easy enough to add parsers for these simple formats. The hard part is building an efficient automaton from the patterns in order to scan the decompiled image fast enough. My recent commits in Reko have introduced a suffix array implementation that lets me locate a pattern in O(log n) time, where n is the size of the binary file. Once that work is complete, I should be able to rip through any signature file format (like the one above, the one you're proposing, or the Amiga index hunks, or the DCC signature files) and in O(p * log n) time find all matching signatures located in the file, where p is the number of patterns. Any way to decrease p -- say by partitioning signature files based on detected compiler manufacturer and version -- is of course highly beneficial. My intent with Reko is to be able to handle as many formats as possible, but drawing the line when it gets too complex and distracts me from actual decompilation :-) |
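As a rough illustration of the suffix array approach (not Reko's implementation), the Python sketch below binary-searches a naively built suffix array for the literal prefix of an ENTRYPOINT-style pattern and then verifies the wildcard bytes at each candidate. The sample image bytes are made up.

```python
def parse_hex_pattern(text):
    """'E8????00' -> [0xE8, None, None, 0x00]; None marks a wildcard byte."""
    pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
    return [None if pair == "??" else int(pair, 16) for pair in pairs]

def build_suffix_array(data):
    """Naive construction; fine for a demo, far too slow for real images."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def find_literal(data, sa, needle):
    """Binary-search the suffix array for every offset where needle occurs."""
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = data[sa[mid]:sa[mid] + len(needle)]
            if prefix < needle or (not strict and prefix == needle):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sorted(sa[bound(True):bound(False)])

def find_pattern(data, sa, pattern):
    """Use the leading literal bytes for lookup, then check the wildcard tail."""
    literal = []
    for byte in pattern:
        if byte is None:
            break
        literal.append(byte)
    hits = []
    for offset in find_literal(data, sa, bytes(literal)):
        window = data[offset:offset + len(pattern)]
        if len(window) == len(pattern) and all(
            p is None or p == w for p, w in zip(pattern, window)
        ):
            hits.append(offset)
    return hits

image = bytes.fromhex("90904883EC28E8123400004883C428E9ABCDFFFFCC")
sa = build_suffix_array(image)
pattern = parse_hex_pattern("4883EC28E8????00004883C428E9????FFFF")
print(find_pattern(image, sa, pattern))
```

A pattern that begins with wildcards has an empty literal prefix, so this sketch degrades to checking every offset; a real matcher would pick the longest literal run instead.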
IDA understands BCC's lib format (via the plb utility). |
@lab313ru I don't believe we can use any of their tooling in our open source projects though? |
No, we can't. But the idea behind FLIRT signatures is a good one to borrow. And my goal was to use BCC signatures in IDA, so...) |
@lab313ru: are FLIRT signatures stored as text, or as a binary format described somewhere? I have no access to IDA so I can't go check myself. |
Hmm. I think the pat description is only available inside the IDA SDK. But it would not be a problem to rewrite signmake to use the maximum length for symbol names and for pattern length. |
Maybe for the moment it would be better to let makedsig read a list of lib files? I mean combining symbols from many lib files. |
Started writing the spec for the pattern files: https://github.com/nemerle/dcc/wiki/Cross-decompiler-signature-specification |
Patterns need to be specified:
|
Updated with:
PATTERN : ("Offset" Number (MATCH_BYTES | SYM_REF_NAME))+
MATCH_BYTES : (HEX_BYTE | WILDCARD)+
HEX_BYTE : "0x" HEX_DIGIT HEX_DIGIT
WILDCARD : "." | "?"
SYM_REF_NAME : Ident |
Although a more compact representation of MATCH_BYTES might be in order? |
If it's OK to assume hexadecimal representation and 8-bit bytes, you could get rid Here's my take on a pattern file format, generalizing a little because not all emitters
Here is a pattern that could be used to identify a binary as Msdos EXE or ELF
It would be cool if "Offset" could be specified to not only be a fixed number of bytes
|
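For the EXE/ELF case, identification boils down to fixed bytes at offset 0. A minimal Python sketch, using the well-known `MZ` and `\x7fELF` magic numbers; the `(offset, bytes)` table layout is just an assumption for illustration:

```python
# Each entry: format name -> list of (offset, expected bytes) checks.
FORMAT_PATTERNS = {
    "MS-DOS EXE": [(0, b"MZ")],
    "ELF": [(0, b"\x7fELF")],
}

def identify_format(image):
    """Return the first format whose byte checks all match, or None."""
    for name, checks in FORMAT_PATTERNS.items():
        if all(image[off:off + len(expected)] == expected for off, expected in checks):
            return name
    return None

print(identify_format(b"MZ\x90\x00\x03\x00"))
print(identify_format(b"\x7fELF\x02\x01\x01"))
```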
We might want/need to add a |
OK, I've extended/updated the EBNF for the PATTERN and DATA parts to incorporate your suggestions:
PATTERN:
PATTERN : PATTERN_ID? ("Offset" OFFSET_SPEC (MATCH_BYTES | SYM_REF_NAME))+ | "@" PATTERN_REF;
PATTERN_ID : Ident;
OFFSET_SPEC : Number | "$EntryPoint";
PATTERN_REF : Ident;
MATCH_BYTES : "[" (HEX_BYTE | WILDCARD)+ "]";
HEX_BYTE : HEX_DIGIT HEX_DIGIT;
WILDCARD : "." | "?";
SYM_REF_NAME : Ident;
DATA:
DATA: (SYMBOL_DEF META_DEF?) | META_DEF;
META_DEF: "Meta" FREEFORM_DATA;
SYMBOL_DEF: "Symbol" SYMBOL_NAME ("Typedef" C_TYPEDEF)?;
SYMBOL_NAME: "Name" Ident; // Ident is a raw symbol name - no demangling should be done here
C_TYPEDEF: QuotedString; // C typedef extended with custom calling convention attributes
FREEFORM_DATA: (Ident "=" QuotedString)+; // comments, links to documentation, etc.
As for FREEFORM_DATA - it could be extended into:
META_ENTRY: PACKER_SPEC | LOADER_SPEC | FREEFORM_DATA;
PACKER_SPEC: "Packer" QuotedString;
LOADER_SPEC: "Loader" QuotedString;
FREEFORM_DATA: (Ident "=" QuotedString)+; |
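To check that the PATTERN grammar is workable, here is a small hand-written parser sketch in Python. It covers only the PATTERN rule, and it assumes details the EBNF does not pin down: items inside `[...]` are whitespace-separated, Number is decimal or `0x`-prefixed hex, and a wildcard stands for one whole byte.

```python
import re

TOKEN = re.compile(r"\[[^\]]*\]|@|\$?\w+")
HEX_BYTE = re.compile(r"[0-9A-Fa-f]{2}")

def parse_match_bytes(token):
    """'[55 8B ? EC]' -> [0x55, 0x8B, None, 0xEC]; None marks a wildcard."""
    items = token[1:-1].split()
    if not items:
        raise ValueError("empty MATCH_BYTES")
    result = []
    for item in items:
        if item in (".", "?"):
            result.append(None)
        elif HEX_BYTE.fullmatch(item):
            result.append(int(item, 16))
        else:
            raise ValueError(f"bad byte: {item!r}")
    return result

def parse_offset(token):
    if token == "$EntryPoint":
        return token
    return int(token, 16) if token.lower().startswith("0x") else int(token)

def parse_pattern_rule(text):
    """Parse one PATTERN into {'ref': name} or {'id': name, 'parts': [...]}."""
    tokens = TOKEN.findall(text)
    if tokens and tokens[0] == "@":
        if len(tokens) != 2:
            raise ValueError("expected '@' PATTERN_REF")
        return {"ref": tokens[1]}
    result = {"id": None, "parts": []}
    pos = 0
    if tokens and tokens[0] != "Offset":
        result["id"] = tokens[0]  # optional PATTERN_ID
        pos = 1
    while pos < len(tokens):
        if tokens[pos] != "Offset" or pos + 2 >= len(tokens):
            raise ValueError("expected 'Offset' OFFSET_SPEC body")
        body = tokens[pos + 2]
        part = parse_match_bytes(body) if body.startswith("[") else body
        result["parts"].append((parse_offset(tokens[pos + 1]), part))
        pos += 3
    if not result["parts"]:
        raise ValueError("pattern needs at least one Offset part")
    return result

print(parse_pattern_rule("msvc8_start Offset $EntryPoint [48 83 EC 28 E8 ? ? 00 00]"))
print(parse_pattern_rule("Offset 0 [55 8B EC] Offset 3 _printf"))
print(parse_pattern_rule("@msvc8_start"))
```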
I have read readsig.txt and found that signatures are currently 23 bytes long. Is that true?
If so, is it possible to create longer signatures? And why 23? How many signatures are missed in DCC because of this (and because of collisions)?