Long signatures #20

Open
lab313ru opened this issue May 18, 2016 · 36 comments

Comments

@lab313ru
Contributor

I have read readsig.txt and found that signatures are currently 23 bytes long. Is that true?
If so, is it possible to create longer signatures? And why 23? How many signatures are missed in DCC because of this (and because of collisions)?

@uxmal

uxmal commented May 18, 2016

The DCCS file format hard-wires it to 23 bytes. Sprinkled in the DCC code you find:

#define  PATLEN 23

Goodness knows why the original authors chose that size; it's likely to avoid excessive file sizes on a project that may have been developed on an x86 MS-DOS environment.

You could try changing PATLEN to a larger number and then running makedsig on a *.LIB file, but the resulting signature file will be incompatible with older DCCS files.

@lab313ru
Contributor Author

What I want: to regenerate the signature libs so that they contain all functions, with almost no collisions. Then I want to apply them in IDA.

@uxmal

uxmal commented May 18, 2016

You'll need access to the original LIB files from the various compilers to accomplish this, since only the first 23 bytes are available in the existing DCCS files.
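
To illustrate why a longer prefix reduces collisions, here is a minimal Python sketch. It is not DCC code: `find_collisions`, the zero padding of short functions, and the sample function bodies are all assumptions made for the example. Only the value 23 comes from the thread.

```python
from collections import defaultdict

PATLEN = 23  # prefix length hard-wired in the DCCS format, per the thread


def find_collisions(functions, pat_len=PATLEN):
    """Group function names whose first pat_len bytes are identical."""
    groups = defaultdict(list)
    for name, code in functions.items():
        # Pad short functions with zeros so every key has the same length.
        # (Zero padding is an assumption made for this sketch.)
        key = code[:pat_len].ljust(pat_len, b"\x00")
        groups[key].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical function bodies: a shared 23-byte prologue, different tails.
prologue = bytes(range(23))
functions = {
    "func_a": prologue + b"\x01\xC3",
    "func_b": prologue + b"\x02\xC3",
    "func_c": b"\x55\x8B\xEC\xC3",
}

print(find_collisions(functions))              # func_a and func_b collide
print(find_collisions(functions, pat_len=24))  # one more byte separates them
```

Running the same check over the signatures extracted from a real LIB file would show how many collisions a given pattern length leaves.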

@lab313ru
Contributor Author

I understand.

Another question, about the signature models:
dccb3c.sig
dccb3l.sig

What is the difference between "l" and "c"?

@uxmal

uxmal commented May 19, 2016

The naming convention used here is dcc<v><n><m>, where

  • v is the vendor of the compiler library (b = Borland, m = Microsoft)
  • n is the version of the library
  • m is the x86 memory/pointer model, where c is "compact", s is "small", m is "medium" and l is "large". For more about the x86 memory models, see https://en.wikipedia.org/wiki/Intel_Memory_Model
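
As an illustration of the convention, a small Python sketch that decodes such a file name. `parse_sig_name` is a hypothetical helper, not part of dcc. Only the letters listed above are mapped, and anything else is reported as unknown rather than guessed.

```python
import re

# Only the letters spelled out in the thread are mapped.
VENDORS = {"b": "Borland", "m": "Microsoft"}
MODELS = {"c": "compact", "s": "small", "m": "medium", "l": "large"}

_SIG_NAME = re.compile(r"^dcc(?P<v>[a-z])(?P<n>\d+)(?P<m>[a-z])\.sig$", re.IGNORECASE)


def parse_sig_name(filename):
    """Split a dcc<v><n><m>.sig file name into vendor, version and model."""
    match = _SIG_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a dcc signature file name: {filename!r}")
    v, n, m = (match.group(k).lower() for k in ("v", "n", "m"))
    return {
        "vendor": VENDORS.get(v, f"unknown ({v})"),
        "version": int(n),
        "model": MODELS.get(m, f"unknown ({m})"),
    }


print(parse_sig_name("dccb3l.sig"))
print(parse_sig_name("dccb3c.sig"))
```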

@lab313ru
Contributor Author

OK. Suppose I have some compiler lib file. Which model will be selected, and what criteria are used when naming it? Is the memory model recorded somewhere inside the lib?

@uxmal

uxmal commented May 19, 2016

When generating the signatures using makedsig, the user herself has to know what vendor, version and model the LIB file was compiled with.

@lab313ru
Contributor Author

But makedsig only asks for the lib name as a parameter.

@uxmal

uxmal commented May 19, 2016

If you look at makedsig.cpp, you'll find the usage:

"This program is to make 'signatures' of known c and tpl library calls for the dcc program.\n"
"It needs as the first arg the name of a library file, and as the second arg, the name "
"of the signature file to be generated.\n"
"Example: makedsig CL.LIB dccb3l.sig\n"
"      or makedsig turbo.tpl dcct4p.sig\n"

So it's the user's responsibility to provide a correct file name for the .sig file.

@lab313ru
Contributor Author

Ah, I see. dcc selects the correct sig file, and I should provide the correct file name.

@nemerle
Owner

nemerle commented May 19, 2016

Exactly so.

AFAIK there is no identification information contained inside lib/tpl files.

@uxmal

uxmal commented May 19, 2016

Reko will probably use a variant of this scheme, but the mapping of signature files may be done in the configuration file, to avoid depending on the names of the signature files themselves.

@nemerle
Owner

nemerle commented May 19, 2016

Yup, the format of signature files could be made a bit more robust:

{
    "Vendor": "Borland",
    "CompilerName": "TurboC 3.0",
    "Language": "C",
    "Version": "3.0",
    "SignatureBlocks": [{
        "Model": "Large",
        "SigLength": 29,
        "Signatures": []
    }, {
        "Model": "Small",
        "SigLength": 23,
        "Signatures": []
    }]
}

and makedsig could be made to work with this to 'add'/'update' signatures inside these files.
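
A rough Python sketch of what such an 'add'/'update' step might look like. The layout of each entry in "Signatures" (a "Name" plus a hex "Pattern") is an assumption, since the proposal leaves it open, and `find_block` and `add_or_update` are hypothetical helpers rather than makedsig functions.

```python
import json

doc = json.loads("""
{
    "Vendor": "Borland",
    "CompilerName": "TurboC 3.0",
    "Language": "C",
    "Version": "3.0",
    "SignatureBlocks": [
        {"Model": "Large", "SigLength": 29, "Signatures": []},
        {"Model": "Small", "SigLength": 23, "Signatures": []}
    ]
}
""")


def find_block(document, model):
    """Return the signature block for a memory model, or None."""
    for block in document.get("SignatureBlocks", []):
        if block.get("Model", "").lower() == model.lower():
            return block
    return None


def add_or_update(document, model, name, pattern_hex):
    """Insert a signature, or replace the pattern of an existing one."""
    block = find_block(document, model)
    if block is None:
        block = {"Model": model, "SigLength": len(pattern_hex) // 2, "Signatures": []}
        document.setdefault("SignatureBlocks", []).append(block)
    if len(pattern_hex) // 2 != block["SigLength"]:
        raise ValueError("pattern length does not match the block's SigLength")
    for entry in block["Signatures"]:
        if entry["Name"] == name:
            entry["Pattern"] = pattern_hex  # update in place
            return block
    block["Signatures"].append({"Name": name, "Pattern": pattern_hex})
    return block


add_or_update(doc, "Small", "strlen", "55" * 23)
print(json.dumps(find_block(doc, "small"), indent=4))
```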

@lab313ru
Contributor Author

Makedsig asks me for "Seed:". What is it?

@lab313ru
Contributor Author

And a second question: how do I merge signatures from different lib files?

@uxmal

uxmal commented May 19, 2016

Consider using a schema as well, so a JSON parser can identify what kind of data this is:

{
    "$schema":  "urn:executable:signature",
    "Vendor": ....
} 

Merging signatures from different lib files should be done by the relevant decompilers when they "ingest" the JSON described above. I.e. there should be a function
LoadSignaturesFromFiles: list<filename> => internal-signature-representation that collects all relevant metadata and "cooks" it as appropriate.
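
One possible shape for such a function, sketched in Python. The internal representation (a dict keyed by vendor, version and model), the "Name"/"Pattern" entry layout, and the rule that later files win on duplicate names are all assumptions made for the example; neither dcc nor Reko is claimed to work this way.

```python
import json
import os
import tempfile

SCHEMA = "urn:executable:signature"  # the $schema value suggested in the thread


def load_signatures_from_files(filenames):
    """Merge signature documents into {(vendor, version, model): {name: pattern}}."""
    merged = {}
    for filename in filenames:
        with open(filename, encoding="utf-8") as handle:
            doc = json.load(handle)
        if doc.get("$schema") != SCHEMA:
            continue  # not a signature document; skip it
        for block in doc.get("SignatureBlocks", []):
            key = (doc.get("Vendor"), doc.get("Version"), block.get("Model"))
            bucket = merged.setdefault(key, {})
            for entry in block.get("Signatures", []):
                # Later files win on duplicate names (a policy chosen for this sketch).
                bucket[entry["Name"]] = entry["Pattern"]
    return merged


def _write(directory, name, doc):
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(doc, handle)
    return path


def demo():
    """Write two signature files plus one unrelated file, then merge them."""
    base = {"$schema": SCHEMA, "Vendor": "Borland", "Version": "3.0"}
    with tempfile.TemporaryDirectory() as tmp:
        a = _write(tmp, "a.json", {**base, "SignatureBlocks": [
            {"Model": "Large", "Signatures": [{"Name": "strlen", "Pattern": "558BEC"}]}]})
        b = _write(tmp, "b.json", {**base, "SignatureBlocks": [
            {"Model": "Large", "Signatures": [{"Name": "strcpy", "Pattern": "558BEC57"}]}]})
        c = _write(tmp, "c.json", {"unrelated": True})
        return load_signatures_from_files([a, b, c])


print(demo())
```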

This work is underway on the Reko project: there are at least three signature file formats that Reko is aware of, and I'm making it so that they all get unified internally. It would be cool if dcc and Reko could interoperate on this level.

@uxmal

uxmal commented May 19, 2016

The DCC signature file format uses a perfect hash. The algorithm that builds it requires a random number generator (RNG), and the Seed: prompt is asking you for a seed for that RNG. I'm not sure why this is asked for explicitly; perhaps it was to make sure, during development, that the hash table was being created correctly and reproducibly. Just enter some number < 32637 and you should be OK.
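
To show why a seed makes the table reproducible, here is a deliberately simple Python sketch. It is not the algorithm DCC uses: it just draws random salts from a seeded RNG until every key lands in its own slot. `build_perfect_hash`, `lookup`, the SHA-256 based slot function and the table size of twice the key count are all assumptions of the example.

```python
import hashlib
import random


def _slot(key, salt, size):
    """Deterministic salted hash of a byte-string key, reduced to a slot."""
    digest = hashlib.sha256(salt.to_bytes(4, "little") + key).digest()
    return int.from_bytes(digest[:8], "little") % size


def build_perfect_hash(keys, seed, table_size=None, max_tries=100_000):
    """Search for a salt that maps every key to its own slot.

    The search is driven by random.Random(seed), so the same seed and keys
    always produce the same (salt, table) pair.
    """
    keys = list(keys)
    size = table_size or 2 * len(keys)
    rng = random.Random(seed)
    for _ in range(max_tries):
        salt = rng.getrandbits(32)
        table = [None] * size
        for key in keys:
            slot = _slot(key, salt, size)
            if table[slot] is not None:
                break  # collision: try another salt
            table[slot] = key
        else:
            return salt, table
    raise RuntimeError("no collision-free salt found")


def lookup(table, salt, key):
    """Return the slot holding key, or None if the key is not in the table."""
    slot = _slot(key, salt, len(table))
    return slot if table[slot] == key else None


keys = [b"strlen", b"strcpy", b"malloc", b"free"]
salt, table = build_perfect_hash(keys, seed=1234)
print(salt, table)
```

Calling `build_perfect_hash` twice with the same seed gives the same salt and table, which is the reproducibility a fixed seed buys.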

@uxmal

uxmal commented May 19, 2016

.lib files (and .obj files) are OMF files. Sadly, they have no magic number at the beginning, so you have to depend on file extensions to figure out what's inside. This is why I'm suggesting the $schema above -- so that both humans and computers can figure out the contents of the file.

@nemerle
Owner

nemerle commented May 19, 2016

Common signature format: agreed - will try to flesh it out and post it here.

@uxmal

uxmal commented May 19, 2016

Also, consider looking at the Yara format. It's not JSON, but we could consider making a JSON compatible version.

@nemerle
Owner

nemerle commented May 19, 2016

John, should we consider other pattern schemes?

Once upon a time I had some fun with an Xbox emulator that used pattern matching to identify SDK functions, and rewrote it to use a pre-generated per-SDK trie (strings with wildcards).

@nemerle
Owner

nemerle commented May 19, 2016

As for YARA, I think their pattern matching language is not a very good match for our purposes.

What we might consider is pattern disambiguation by symbol names.

given two patterns with the same signature:

FuncA:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncX
FuncB:  12 43 65 [xx xx xx xx] 44 55 66 ...    where [xx xx xx xx] is reference to symbol FuncY

we would be unable to tell those patterns apart in the binary, but if we had previously managed to locate either FuncX or FuncY, we could use that to augment the pattern matcher?
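
A hedged Python sketch of that idea. It assumes the wildcarded field holds a 4-byte little-endian absolute address of the referenced symbol; real code would also have to handle relative calls and segment:offset references. `disambiguate` and the sample image are made up for illustration.

```python
def disambiguate(candidates, image, match_offset, known_symbols):
    """Pick the candidates whose symbol reference resolves to a known address.

    candidates: list of (name, ref_offset, ref_symbol) sharing one byte pattern.
    known_symbols: {symbol_name: address} found by earlier matching.
    The reference is read as a 4-byte little-endian absolute address,
    which is an assumption of this sketch.
    """
    survivors = []
    for name, ref_offset, ref_symbol in candidates:
        start = match_offset + ref_offset
        target = int.from_bytes(image[start:start + 4], "little")
        if known_symbols.get(ref_symbol) == target:
            survivors.append(name)
    return survivors


# 12 43 65 [00 20 00 00] 44 55 66: the bracketed field points at 0x2000.
image = bytes.fromhex("124365" "00200000" "445566")
candidates = [("FuncA", 3, "FuncX"), ("FuncB", 3, "FuncY")]
known = {"FuncX": 0x1000, "FuncY": 0x2000}
print(disambiguate(candidates, image, 0, known))
```

With FuncY already located at 0x2000, only FuncB survives. With no known symbols the list comes back empty and the match stays ambiguous.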

@uxmal

uxmal commented May 19, 2016

Reko uses another signature format, provided by @halsten, for identifying packers and unpackers. It is again different:

<SIGNATURES>
  <ENTRY>
    <NAME>Microsoft Visual C++ 7</NAME>
    <COMMENTS />
    <ENTRYPOINT>????4100000000000000630000000000??00??????????00??00??????????????????????????????????00??00??00??????????????????????????????00????20????00??00??????????????00??????????????????????00??00??????00??????????????00??00??00??00??00??00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????00??00??00??00??00??00??00??????????????????????????????00??????????????????????00??00??00??????00??00??00??00??00??00</ENTRYPOINT>
    <ENTIREPE />
  </ENTRY>
  <ENTRY>
    <NAME>Microsoft Visual C++ 8.0</NAME>
    <COMMENTS>
    </COMMENTS>
    <ENTRYPOINT>4883EC28E8????00004883C428E9????FFFFCCCCCCCCCCCCCCCCCCCCCCCCCCCC</ENTRYPOINT>
    <ENTIREPE>
    </ENTIREPE>
  </ENTRY>

As you can see, it is just yet another variant with its own benefits and flaws. It's easy enough to add parsers for these simple formats. The hard part is building an efficient automaton from the patterns in order to scan the decompiled image fast enough. My recent commits in Reko have introduced a suffix array implementation that lets me locate a pattern in O(log n) time, where n is the size of the binary file. Once that work is complete, I should be able to rip through any signature file format (like the one above, the one you're proposing, the Amiga index hunks, or the DCC signature files) and find all matching signatures located in the file in O(p * log n) time, where p is the number of patterns. Any way to decrease p -- say by partitioning signature files based on detected compiler manufacturer and version -- is of course highly beneficial.
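
For readers unfamiliar with the technique, a minimal Python sketch of suffix-array lookup. This is not Reko's implementation: the construction is the naive sort-all-suffixes approach, and it handles exact byte strings only, so a wildcarded pattern would first have to be split into its fixed runs.

```python
def build_suffix_array(data):
    """Naive construction: sort suffix start offsets by suffix content."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data, suffix_array, pattern):
    """Return sorted offsets where pattern occurs, via two binary searches."""
    m = len(pattern)

    def first_index(strictly_greater):
        # First position whose m-byte prefix is >= pattern (or > pattern).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = data[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = first_index(False), first_index(True)
    return sorted(suffix_array[start:end])


data = bytes.fromhex("4883EC28E800004883EC28C3")
sa = build_suffix_array(data)
print(find_all(data, sa, bytes.fromhex("4883EC28")))
```

Each lookup costs a logarithmic number of prefix comparisons once the array is built, which is what makes scanning for many patterns cheap.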

My intent with Reko is to be able to handle as many formats as possible, but drawing the line when it gets too complex and distracts me from actual decompilation :-)

@lab313ru
Contributor Author

IDA understands BCC's lib format (via the plb utility).

@nemerle
Owner

nemerle commented May 19, 2016

@lab313ru I don't believe we can use any of their tooling in our open source projects, though?

@lab313ru
Contributor Author

True, we can't. But the idea behind FLIRT signatures is a good one to borrow.
https://www.hex-rays.com/products/ida/tech/flirt/in_depth.shtml

And my goal was to use bcc signatures in IDA, so...)

@uxmal

uxmal commented May 19, 2016

@lab313ru: are FLIRT signatures stored as text, or as a binary format described somewhere? I have no access to IDA so I can't go check myself.

@lab313ru
Contributor Author

Hmm, I think the pat format is described only inside the IDA SDK.

But it is not a problem to rewrite signmake to use a maximum length for symbol names and for pattern length.

@lab313ru
Contributor Author

lab313ru commented May 19, 2016

Maybe, for the time being, it would be better to let makedsig read a list of lib files?
It could then add the signatures from all of them to a map and process them as before.

I mean combining symbols from many lib files.

@nemerle
Owner

nemerle commented May 23, 2016

Started writing the spec for the pattern files:

https://github.com/nemerle/dcc/wiki/Cross-decompiler-signature-specification

@uxmal

uxmal commented May 23, 2016

Patterns need to be specified:

  • Should characters other than hex digits be allowed? For instance, it's convenient to allow spaces in the pattern strings since they may be coming from other tools.
  • Should wildcard patterns be allowed? What character should be used for wildcards? I've seen '?' and '.', and don't see any reason for not allowing both.
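
As one possible answer to both questions, a Python sketch of a matcher that ignores whitespace and accepts either wildcard character. Treating each wildcard as a single nibble (half a byte) is an assumption of the example, and `compile_pattern` and `matches_at` are hypothetical names.

```python
WILDCARDS = "?."
HEX_DIGITS = "0123456789abcdefABCDEF"


def compile_pattern(text):
    """Turn 'AD33 51??' into a list of (mask, value) pairs, one per byte."""
    nibbles = [ch for ch in text if not ch.isspace()]
    if len(nibbles) % 2:
        raise ValueError("pattern must contain an even number of nibbles")
    compiled = []
    for hi, lo in zip(nibbles[0::2], nibbles[1::2]):
        mask = value = 0
        for ch in (hi, lo):
            mask <<= 4
            value <<= 4
            if ch in WILDCARDS:
                continue  # wildcard nibble: its mask bits stay 0
            if ch not in HEX_DIGITS:
                raise ValueError(f"unexpected character {ch!r}")
            mask |= 0xF
            value |= int(ch, 16)
        compiled.append((mask, value))
    return compiled


def matches_at(data, offset, compiled):
    """True if the compiled pattern matches data starting at offset."""
    if offset < 0 or offset + len(compiled) > len(data):
        return False
    return all((data[offset + i] & mask) == value
               for i, (mask, value) in enumerate(compiled))


pattern = compile_pattern("48 83 EC ?? E8 .. ..")
data = bytes.fromhex("4883EC28E8123400")
print(matches_at(data, 0, pattern))
```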

@nemerle
Owner

nemerle commented May 23, 2016

Updated with:
EBNF-like definition for PATTERN definition

PATTERN :  ("Offset" Number (MATCH_BYTES | SYM_REF_NAME))+
MATCH_BYTES : (HEX_BYTE | WILDCARD)+
HEX_BYTE : "0x" HEX_DIGIT HEX_DIGIT
WILDCARD : "." | "?"
SYM_REF_NAME :  Ident

@nemerle
Owner

nemerle commented May 23, 2016

Although a more compact representation of MATCH_BYTES might be in order?

@uxmal

uxmal commented May 23, 2016

If it's OK to assume hexadecimal representation and 8-bit bytes, you could get rid
of the "0x", which adds nothing but padding in that case. Reko has a couple of megabytes
of signature files donated by @halsten, which all look like this: AD3351?????AEB1A2?????. This style appears
to be widely used in the community, and it would be nice to provide support for it.

Here's my take on a pattern file format, generalizing a little because not all emitters
of machine code are compilers (think obfuscators and packers)

{
    // The defaults if nothing else has been specified
    "Tags": {
        "Vendor": "Borland",
        "Product": "Turbo C",
        "Version": "2.0",
        "Target_machine": "x86-16",
        "Endianness": "little",
        "SourceLanguage": "C"
    },
    "Patterns": [
        {
            "Tags": {
                "Version": "3.0"
            },
            //  4-byte reference to a symbol
            "Match": [ "AAbbCC??D1e2", { "symref": "foo", "size": 4 }, "Fa" ],

            "Result": { "symbol": "malloc" }
        }
    ]
}
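
A short Python sketch of how a consumer might read this layout, assuming pattern-level "Tags" simply override the file-level defaults and that a "Match" list is a sequence of hex strings and symref placeholders. `effective_tags` and `match_length` are hypothetical helpers, and the comments are dropped from the data because strict JSON parsers reject them.

```python
doc = {
    "Tags": {
        "Vendor": "Borland",
        "Product": "Turbo C",
        "Version": "2.0",
        "Target_machine": "x86-16",
        "Endianness": "little",
        "SourceLanguage": "C",
    },
    "Patterns": [
        {
            "Tags": {"Version": "3.0"},
            "Match": ["AAbbCC??D1e2", {"symref": "foo", "size": 4}, "Fa"],
            "Result": {"symbol": "malloc"},
        }
    ],
}


def effective_tags(document, pattern):
    """Pattern-level tags override the file-level defaults."""
    return {**document.get("Tags", {}), **pattern.get("Tags", {})}


def match_length(match):
    """Byte length of a "Match" list: hex strings plus symref placeholders."""
    total = 0
    for element in match:
        if isinstance(element, str):
            total += len(element) // 2  # two hex characters per byte
        else:
            total += element["size"]    # a symbol reference of known size
    return total


pattern = doc["Patterns"][0]
print(effective_tags(doc, pattern)["Version"])
print(match_length(pattern["Match"]))
```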

Here is a pattern file that could be used to identify a binary as an MS-DOS EXE or an ELF file:

{
    "Patterns": [
        {
            // must be at start of file. Not specifying offset means "anywhere"
            "Offset": 0,
            "Match": ["4D5A"],
            "Result": { "imagefile": "MzExecutable" }
        },
        {
            "Offset": 0,
            "Match": ["7F454C46"],
            "Result": { "imagefile": "ElfExecutable" }
        }
    ]
}
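
Evaluating such a file could look roughly like the Python sketch below. It assumes "Match" holds only plain hex strings (no wildcards or symrefs) and that a missing "Offset" means "anywhere", as the comment in the JSON says. `identify` is a hypothetical helper.

```python
PATTERNS = [
    {"Offset": 0, "Match": ["4D5A"], "Result": {"imagefile": "MzExecutable"}},
    {"Offset": 0, "Match": ["7F454C46"], "Result": {"imagefile": "ElfExecutable"}},
]


def identify(data, patterns=PATTERNS):
    """Return the "Result" of the first pattern that matches, or None."""
    for pattern in patterns:
        # The parts of "Match" are consecutive, so join them into one needle.
        needle = b"".join(bytes.fromhex(part) for part in pattern["Match"])
        offset = pattern.get("Offset")
        if offset is None:
            found = needle in data  # no offset: match anywhere
        else:
            found = data[offset:offset + len(needle)] == needle
        if found:
            return pattern["Result"]
    return None


print(identify(b"MZ\x90\x00\x03"))
print(identify(b"\x7fELF\x02\x01\x01"))
```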

It would be cool if "Offset" could be specified to not only be a fixed number of bytes
from the start of file, but a special symbol "$EntryPoint" which would be the starting point
of the program as defined by the image format (PE, ELF etc)

{
   "Offset": "$EntryPoint",
   "Match":  ["7F3A39....A3B8"],
   "Result": { "Packer": "FileCrusher", "Vendor": "Packers'R'us", "Version": "0.3" }
}

@nemerle
Owner

nemerle commented May 23, 2016

We might want/need to add a Compiler_Flags as a required tag, since patterns for Debug/Release Small/Medium/Large builds will differ

@nemerle
Owner

nemerle commented May 23, 2016

OK, I've extended/updated the EBNF for the PATTERN and DATA parts to incorporate your suggestions:

PATTERN:

PATTERN :  PATTERN_ID? ("Offset" OFFSET_SPEC (MATCH_BYTES | SYM_REF_NAME))+ | "@" PATTERN_REF;
PATTERN_ID : Ident;
OFFSET_SPEC : Number | "$EntryPoint";
PATTERN_REF : Ident;
MATCH_BYTES : "[" (HEX_BYTE | WILDCARD)+ "]";
HEX_BYTE : HEX_DIGIT HEX_DIGIT;
WILDCARD : "." | "?";
SYM_REF_NAME : Ident;

DATA:

DATA:          (SYMBOL_DEF META_DEF?) | META_DEF;
META_DEF:      "Meta" FREEFORM_DATA;
SYMBOL_DEF:    "Symbol" SYMBOL_NAME ("Typedef" C_TYPEDEF)?;
SYMBOL_NAME:   "Name" Ident; // Ident is a raw symbol name - no demangling should be done here
C_TYPEDEF:     QuotedString; // C typedef extended with custom calling convention attributes
FREEFORM_DATA: (Ident "=" QuotedString)+; // comments, links to documentation, etc.

As for FREEFORM_DATA - it could be extended into:

META_ENTRY: PACKER_SPEC | LOADER_SPEC | FREEFORM_DATA;
PACKER_SPEC: "Packer" QuotedString;
LOADER_SPEC: "Loader" QuotedString;
FREEFORM_DATA: (Ident "=" QuotedString)+;
