Earlier this year a post was published on examining a data format without using the program that reads the format. That post discusses patterns to look for in order to identify certain constructs. This post focuses on static methods of examining code, which can be the complete code section of a file, a memory dump, or just a fragment. It also describes selected ideas about what patterns to look for when examining a given piece of code.
One reason to look for patterns in code is to locate certain functionality or to get a high-level understanding of what the code does. Another is to find a particular construct that may be a key part of the program from a security point of view.
Pattern searching can be expected to be faster than other methods such as line-by-line instruction analysis.
Still, it's always good to read the documentation, if any is available, to get an overview of what to expect.
Some methods are more effective when performed on a small region. It is therefore worth narrowing the region in which to look for patterns at the beginning of the analysis, although this cannot always be done with enough certainty. In any case, the search region can always be widened at a later stage if required.
Compilers tend to produce executable files with a particular layout. Some place the library code at the beginning of the code section, while others place it at the end.
If there is no information about the compiler or about the layout, there are other ways to locate the library in the code.
You may look for library function calls, which may be visible in the disassembler. Library code may be shown in a distinct color in the disassembler.
Library/runtime code often has several implementations of the same function to take advantage of the latest hardware; the MSVC runtime is an example. SSE instructions may therefore indicate the presence of library/runtime code.
Library code can be spotted by looking for strings that can be associated with particular libraries.
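A minimal sketch of such a string search, assuming the code or dump is available as raw bytes. The function names, the minimum length of 5, the marker names and the sample bytes are illustrative assumptions, not taken from any particular tool or library.

```python
import re

def ascii_strings(blob: bytes, min_len: int = 5):
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for m in pattern.finditer(blob):
        yield m.start(), m.group().decode("ascii")

# Hypothetical marker names; check what the libraries you care about really embed.
LIBRARY_MARKERS = ("zlib", "libpng", "openssl")

def library_hits(blob: bytes):
    """Return the strings that mention one of the marker names."""
    return [(off, s) for off, s in ascii_strings(blob)
            if any(marker in s.lower() for marker in LIBRARY_MARKERS)]

# Made-up sample: a few code bytes, an embedded string, a few more code bytes.
sample = b"\x55\x8b\xec\x00inflate 1.2.11 Copyright zlib\x00\xc3\x90"
print(library_hits(sample))
```

A hit only says that the string is present; checking where it is referenced from in the disassembler is what ties it to a code region.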
Library/runtime code can also be spotted by looking for constant values that can be associated with particular libraries. Cryptographic libraries, for example, tend to contain many constants.
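The constant search can be sketched as below. The table holds a handful of well-known 32-bit values from MD5, SHA-1 and SHA-256; the function name and the sample bytes are assumptions made for illustration, and a real search would use a much larger table.

```python
import struct

# A few well-known 32-bit constants; the labels are for illustration only.
KNOWN_CONSTANTS = {
    0x67452301: "MD5/SHA-1 initial state (A)",
    0xEFCDAB89: "MD5/SHA-1 initial state (B)",
    0xC3D2E1F0: "SHA-1 initial state (E)",
    0x6A09E667: "SHA-256 initial state (H0)",
    0xD76AA478: "MD5 round constant T[1]",
}

def find_constants(blob: bytes):
    """Return sorted (offset, label) pairs for little-endian occurrences."""
    hits = []
    for value, label in KNOWN_CONSTANTS.items():
        needle = struct.pack("<I", value)
        start = blob.find(needle)
        while start != -1:
            hits.append((start, label))
            start = blob.find(needle, start + 1)
    return sorted(hits)

# Sample bytes, assuming 32-bit x86:
#   mov dword ptr [ecx], 0x67452301
#   mov dword ptr [ecx+4], 0xEFCDAB89
sample = bytes.fromhex("c70101234567" "c7410489abcdef")
for offset, label in find_constants(sample):
    print(hex(offset), label)
```

Because the search works on raw bytes, it finds constants both in data tables and inside instruction immediates, at the cost of occasional accidental matches.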
It is possible to guess which compiler generated the code by analyzing the library/runtime code.
If the code is just a fragment of user code, you may consider examining how the instructions are encoded. Intel encodings are redundant: one instruction can have multiple encodings. The encoding chosen can be used to guess which compiler was used.
If multiple encodings of the same instruction are found in a binary, the code could have been generated with a polymorphic encoder.
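To illustrate the redundancy, here is a small sketch that decodes only the two-byte, register-to-register form of the 32-bit MOV instruction. It assumes 32-bit code with no prefixes, and the function name is made up for this example. Opcode 0x89 is MOV r/m32, r32 and opcode 0x8B is MOV r32, r/m32, so the same register move can be written with either.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_reg_reg(code: bytes):
    """Decode a 2-byte 32-bit register-to-register MOV, or return None."""
    if len(code) != 2 or code[0] not in (0x89, 0x8B):
        return None
    modrm = code[1]
    if modrm >> 6 != 0b11:  # only the register-direct form is handled
        return None
    reg, rm = (modrm >> 3) & 7, modrm & 7
    # 0x89: rm is the destination; 0x8B: reg is the destination.
    dst, src = (rm, reg) if code[0] == 0x89 else (reg, rm)
    return f"mov {REG32[dst]}, {REG32[src]}", hex(code[0])

a = decode_mov_reg_reg(bytes.fromhex("89d8"))
b = decode_mov_reg_reg(bytes.fromhex("8bc3"))
print(a)  # same instruction text as b, different opcode
print(b)
```

Counting which of the two opcodes a fragment uses for register moves gives one data point; a fragment that mixes both for no apparent reason is the case described above.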
Code also has other characteristics that may differ between compilers, such as padding and stack allocation.
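A rough way to profile padding is to count runs of typical filler bytes. The sketch below looks for runs of 0xCC (int3), 0x90 (nop) and 0x00; the function name, the minimum run length and the sample bytes are assumptions. Which toolchain prefers which filler is not claimed here and should be checked against reference binaries built with known compilers.

```python
import re

PADDING_BYTES = {0xCC: "int3", 0x90: "nop", 0x00: "zero"}

def padding_profile(blob: bytes, min_run: int = 3):
    """Count runs of at least min_run identical filler bytes, per filler type."""
    profile = {}
    for value, name in PADDING_BYTES.items():
        pattern = re.compile(b"\\x%02x{%d,}" % (value, min_run))
        profile[name] = len(pattern.findall(blob))
    return profile

# Made-up sample: ret, five int3 bytes, a short prologue, ret, three nops.
sample = bytes.fromhex("c3" "cccccccccc" "558bec" "c3" "909090")
print(padding_profile(sample))
```

Runs of zero bytes are common in data as well, so this profile is more meaningful when restricted to a region already known to be code.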
Imports and exports as well as strings can tell a lot. You may check where they are referenced in the code.
Debugging symbols can help an awful lot if the disassembler can handle them. Sometimes they are available, sometimes they are not.
No matter what code you're looking at, it most likely deals with input data. It may get the data from a file or from the network via standard API calls. These are valuable areas to audit for security problems, and it's possible to follow how the data returned by these APIs is used. This may require analyzing the caller functions, as these APIs are usually wrapped in several layers of calls before the input is used.
Just as it reads data, the code may write or send data via standard API calls. These areas may be security-sensitive too.
Programs have centralized, well-established functions. These functions, for example, read dword values, read data into structures, and populate other internal storage. Discovering these functions is not considered hard: they are normally small and consist mostly of memory read and write instructions. By looking at where they are referenced from, we can find good attack surfaces.
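One way to shortlist such heavily referenced functions without a disassembler is to count the targets of near calls. The sketch below assumes x86 code, treats every 0xE8 byte as the start of a call with a 32-bit relative displacement, and keeps only targets that fall inside the buffer. The function name and sample bytes are made up, and since this is a linear byte scan rather than real disassembly, false positives are to be expected.

```python
import struct
from collections import Counter

def call_target_counts(code: bytes, base: int = 0):
    """Count targets of E8 rel32 calls that land inside the buffer."""
    counts = Counter()
    for off in range(len(code) - 4):
        if code[off] != 0xE8:
            continue
        (rel,) = struct.unpack_from("<i", code, off + 1)
        target = off + 5 + rel  # displacement is relative to the next instruction
        if 0 <= target < len(code):
            counts[base + target] += 1
    return counts

# Made-up sample: two calls to offset 0x10, where a tiny reader function lives.
sample = bytes.fromhex(
    "e80b000000"   # 0x00: call 0x10
    "e806000000"   # 0x05: call 0x10
    "c3"           # 0x0a: ret
    "cccccccccc"   # 0x0b: padding
    "8b01c3"       # 0x10: mov eax, [ecx]; ret
)
print(call_target_counts(sample).most_common(3))
```

The most frequently called targets are candidates to inspect first; small ones that mainly read or write memory match the description above.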
Keep in mind that code sections can contain data besides code, although data is normally stored in the data section. In the disassembler it's convenient to see how the data is referenced, and from that you may decide whether there is an attack surface nearby.
CRC and hash constants may indicate that some data is being CRC'd or hashed. You may figure out where that data comes from and how you can perform security testing around it.
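As a sketch of the CRC case, the code below rebuilds the lookup table for the common reflected CRC-32 polynomial 0xEDB88320 and searches a buffer for the first few table entries stored as little-endian dwords. The function names, the choice of four entries and the sample buffer are assumptions. Code that builds the table at run time will not contain the table, in which case the polynomial itself may appear as an immediate instead.

```python
import struct

def crc32_table(poly: int = 0xEDB88320):
    """Build the 256-entry table for the reflected CRC-32 polynomial."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table

def find_crc32_table(blob: bytes, entries: int = 4):
    """Return offsets where the first few little-endian table entries appear."""
    needle = struct.pack("<%dI" % entries, *crc32_table()[:entries])
    hits, start = [], blob.find(needle)
    while start != -1:
        hits.append(start)
        start = blob.find(needle, start + 1)
    return hits

# Made-up sample: eight filler bytes followed by the start of the table.
sample = b"\x90" * 8 + struct.pack("<4I", *crc32_table()[:4])
print(find_crc32_table(sample))
```

Once the table is located, the functions that reference it lead to the data being checksummed.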
When a library uses a hardcoded parameter, it's often encoded as part of the instruction rather than stored in the data section of the executable. Example encodings look like mov eax, <param> or mov al, <param>.
When a data format is parsed, a magic value is often tested. Looking for instructions like cmp reg, <magic> or cmp dword ptr [addr], <magic> can help to locate attack surfaces.
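A minimal sketch of this search on raw bytes, assuming x86 encodings where cmp eax, imm32 is the byte 0x3D followed by the immediate, and cmp dword ptr [addr], imm32 is 0x81 0x3D followed by a 4-byte address and the immediate. The function name and the sample are made up. Note that the immediate is stored little-endian, so the magic appears in the code in the same byte order as in the file.

```python
def find_magic_compares(code: bytes, magic: bytes):
    """Return (offset, form) for cmp instructions whose imm32 equals the 4-byte magic."""
    if len(magic) != 4:
        raise ValueError("this sketch only handles 4-byte magic values")
    hits = []
    start = code.find(magic)
    while start != -1:
        if start >= 1 and code[start - 1] == 0x3D:
            hits.append((start - 1, "cmp eax, imm32"))
        elif start >= 6 and code[start - 6:start - 4] == b"\x81\x3d":
            hits.append((start - 6, "cmp dword ptr [addr], imm32"))
        start = code.find(magic, start + 1)
    return hits

# Made-up sample testing for the ZIP local file header magic "PK\x03\x04":
#   cmp eax, 0x04034B50
#   jne +5
#   cmp dword ptr [0x00403000], 0x04034B50
sample = bytes.fromhex("3d504b0304" "7505" "813d00304000504b0304")
print(find_magic_compares(sample, b"PK\x03\x04"))
```

Only two encodings are covered here; comparisons against other registers or through other addressing modes use different opcode bytes and would need extra cases.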
Longer strings may be broken into immediate values and compared with multiple cmp instructions.
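To know which immediates to search for, the string can be split the same way a compiler might split it. The sketch below cuts a byte string into little-endian chunks of 4, 2 and 1 bytes; the function name and the chunking order are assumptions, and a compiler may split a given string differently.

```python
def string_to_immediates(text: bytes):
    """Split a byte string into little-endian chunks of 4, 2 and 1 bytes."""
    chunks, pos = [], 0
    for size in (4, 2, 1):
        while len(text) - pos >= size:
            chunks.append((size, int.from_bytes(text[pos:pos + size], "little")))
            pos += size
    return chunks

for size, value in string_to_immediates(b"GIF89a"):
    print(f"{size}-byte immediate: 0x{value:0{size * 2}X}")
```

For "GIF89a" this gives the dword 0x38464947 ("GIF8") and the word 0x6139 ("9a"), which are the values to look for as cmp operands.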
Looking for strcmp function calls is a good idea if you want to find code that tests for a data format, as strcmp functions are often used for this purpose.
If the code is optimized for speed, there are many ways to confirm it. Normally the readability of the code is bad, for example where the code performs division or uses the same memory address for multiple variables. If the EBP register is used in arithmetic, or for anything other than storing the stack base address, that could indicate the code is optimized.
Perhaps there are circumstances when looking at the frequency of instructions, or looking for undocumented instructions, rare instructions, or instructions that are absent, can give valuable clues that help the examination.
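A frequency count is easy to sketch once a disassembly listing exists. The code below assumes a plain-text listing with one instruction per line and the mnemonic as the first word; real disassembler output usually has address and byte columns that would need stripping first. The function names and the sample listing are made up.

```python
from collections import Counter

def mnemonic_histogram(listing: str):
    """Count mnemonics in a listing with one 'mnemonic operands' instruction per line."""
    counts = Counter()
    for line in listing.splitlines():
        parts = line.split()
        if parts:
            counts[parts[0].lower()] += 1
    return counts

def absent_mnemonics(counts, expected):
    """Return the expected mnemonics that never occur, in sorted order."""
    return sorted(m for m in expected if counts[m] == 0)

# Made-up sample listing.
sample = """\
push ebp
mov ebp, esp
rdtsc
mov ecx, eax
pop ebp
ret
"""
counts = mnemonic_histogram(sample)
print(counts.most_common())
print(absent_mnemonics(counts, {"call", "ret"}))
```

In this sample the single rdtsc stands out as rare, and the absence of any call suggests a leaf function.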
Intuitively going through the code and looking for undefined patterns can be a good idea once the more scientific methods have been exhausted.