April 28, 2014

Order of Memory Reads of Intel's String Instructions

Neither the Intel Manual nor Kip R. Irvine's assembly book discusses the behavior I'm describing about x86 string instructions in this post.

Given the following instruction that compares the byte at ESI to the byte at EDI.

cmps    byte ptr [esi],byte ptr es:[edi]

To perform comparison the instruction must read the bytes first. The question is whether byte at ESI or byte at EDI is read first?

Intel Manual says:
Compares the byte, word, doubleword, or quadword specified with the first source operand with the byte, word, doubleword, or quadword specified with the second source operand
Kip R. Irvine's book titled Assembly Language for Intel-Based Computers (5th edition) says:
The CMPSB, CMPSW, and CMPSD instructions each compare a memory operand pointed to by ESI to a memory operand pointed to by EDI
Both of the descriptions explain what the instructions do but none of them says how. So I needed to do some experiments in Windbg to find the answer to the question.

The first experiment was not a good one. Initially, I thought I'd put processor breakpoint (aka memory breakpoint) at ESI and another one at EDI. I also thought to execute CMPS and let the debugger to break-in on either of the processor breakpoints. And here it goes why it was a bad idea. The execution of CMPS has to complete for debugger break-in. And, by the time the CMPS completes it hits both of the breakpoints.

The other experiment I came up with is like this. Set both ESI and EDI to point two distinct memory addresses that are unmapped. The assumption is when CMPS is executed it raises an exception when trying to read memory. By looking at the exception record we can tell the address the instruction tries to read from. Given that, we can tell if that value was assigned to ESI, or to EDI, and so we can tell whether byte at ESI or byte at EDI is read first.

Here is how I did the experiment in Windbg.

I opened an executable test file in Windbg. I assembled CMPS to be placed in the memory at EIP.

0:000> a
011113be cmpsb
cmpsb

I changed ESI to point to invalid memory. And I did the same with EDI.

0:000> resi=51515151
0:000> redi=d1d1d1d1

Here is the disassembly of CMPS. Also, you know from the highlighted text that both ESI and EDI point to unmapped memory addresses.

0:000> r
eax=cccccccc ebx=7efde000 ecx=00000000 edx=00000001 esi=51515151 edi=d1d1d1d1
eip=011113be esp=0022fb70 ebp=0022fc3c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
test!wmain+0x1e:
011113be a6              cmps    byte ptr [esi],byte ptr es:[edi] ds:002b:51515151=?? es:002b:d1d1d1d1=??

I executed the process leading to an expected access violation triggered by CMPS instruction.

0:000> g
(5468.1eec8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=cccccccc ebx=7efde000 ecx=00000000 edx=00000001 esi=51515151 edi=d1d1d1d1
eip=011113be esp=0022fb70 ebp=0022fc3c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
test!wmain+0x1e:
011113be a6              cmps    byte ptr [esi],byte ptr es:[edi] ds:002b:51515151=?? es:002b:d1d1d1d1=?? 

I got the details of the exception like below.

0:000> .exr -1
ExceptionAddress: 011113be (test!wmain+0x0000001e)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: d1d1d1d1
Attempt to read from address d1d1d1d1

As you can see the access violation was occurred due to an attempt to read from d1d1d1d1 that is the value of EDI. Therefore, to answer the question at the beginning of the article, the byte at EDI is read first.

To give more to the fun, you may try how emulators handle this - that is the order of memory reads of CMPS instructions.

If you like this you may click to read more debugging posts.

UPDATE 28/April/2014 Of course I don't encourage people to rely on undocumented behavior when developing software. Stephen Canon says Intel may optimize the microcode for operands from time to time, and so we don't have a good reason to believe this behavior to be stable.

UPDATE 29/April/2014 Here is a Windows C++ program to test this behavior on your architecture: cmps-probe.cpp

April 23, 2014

Inspection of Division & Multiplication

Division and multiplication calculations can lead to trigger bugs, and potentially pose as security risks. Here are few things that I believe to be helpful for those who do binary inspection.

Division

Production quality binaries are normally built with optimization enabled which makes the binary to run fast. One of the optimizations technique for the compiler is to emit a series of fast instructions instead of a single slow instruction.

DIV and IDIV instructions are known to be slow. As a part of optimization the compiler emits a series of fast instructions that are functionally equivalent to DIVs. The fast instructions are shift, multiplication, and addition instructions that take magic (constant) values depending on the divisor. Therefore the divisor has to be known at compile-time to apply optimization.

If the optimized binary has any DIVs, that means, the divisor was not known at compile time. Thus it's known at run-time, and so it could be a user-controlled value or a user input taken as it is.

Division can cause exception if the divisor is 0, or if the result is to large to store.

Division by Zero in CLR's Native Code

As an interesting experiment I looked at what happens when an integer is divided by zero in C#.

CLR generates native code with division instruction in it. When the instruction of division by zero is executed, an exception is raised that is handled by CLR's exception handler.

So the generated code with division in it doesn't have a test for the divisor. It's left for the exception handler to handle division by zero situations.

Multiplication

Like division, multiplication can be optimized, too, by using a sequence of fast instructions (sometimes one instruction). Whether or not it's worth optimizing depends on the value of multiplier (or multiplicand).

The multiplication you can see in binary might not be seen on source-code level. And some multiplication cannot be easily spotted in binary code due to optimization. And, multiplications can lead to trigger bugs.

Overflow in Multiplication

Multiplication can lead to integer overflow. Multiplication of two values are more likely to lead to integer overflow than addition of the two values. Multiplication of two word length integers can overflow on 32-bit but addition can't.

Few instances of the IMUL instruction can take immediate value, that is the multiplier. It's easily possible to calculate what multiplicand overflows the multiplication. The challenging part is to determine how the value could be assigned to the multiplicand to trigger overflow.

It's worth searching for MUL and IMUL instructions in the binary using the # (hash mark) Windbg command.

Overflow in Multiplication by Scale Factor

Scale factor is a constant value by which the register in the instruction is multiplied. The scale factor is either 1, 2, 4 or 8.

Use of scale factor could be the result of optimization of multiplication. The example below demonstrates to multiply a value by 4.

lea eax,[edi*4]

Other common use case involves to dereference an element of the array. In the below example the array consists of elements of size 8. On source-level there is no multiplication but on low-level there is.

mov [edi+edx*8],eax

If the value of the register by which the scale factor is multiplied is large enough an integer overflow can occur.

Look at the below instruction. Even the multiplication might not overflow the result can, due to base (esp) and displacement (8000).

mov [esp+ebx*4+8000],eax

Method of Inspection

Generally, it's not feasible to review all the occurrences of certain instructions but on critical areas it might be reasonable to do. Instruction tracing, and tracing like this can be a good start to narrow the area that can be inspected closer.

April 16, 2014

You May Not Need to Debug SSE Instructions

There are binaries that contain implementation of an algorithm in two ways. The first one is optimized to run on all architectures and so it consists of i386 instructions only. The second one is optimized to run fast and therefore it has SSE instructions. When the application runs it checks the architecture to decide which implementation of the algorithm to be executed.

It is common thing that binaries can contain various implementations of the same algorithm. One example is the Microsoft Visual C++ runtime.

You may not need to debug SSE instructions though. What you need to do is to tell your application that SSE support is not available - which is most likely a lie in 2014.

Recently, when I debugged a Windows application I noticed it executes SSE instructions. Here is how I got my application to believe that there is no SSE support available.

I knew about CPUID instruction. It can come back with plenty information about the processor. If CPUID is used with input EAX set to 1 feature information is returned in ECX and EDX.

We only need the SSE-related bits of the feature information. Here are they (source: Intel Developer Manual).

In ECX:
    Bit 0  SSE3 Extensions
    Bit 9  SSSE3 Extensions
    Bit 19 SSE4.1
    Bit 20 SSE4.2

In EDX:
    Bit 25 SSE Extensions
    Bit 26 SSE2 Extensions

The idea is when CPUID is executed with EAX set to 1 we need to clear SSE bits in ECX and EDX. To clear SSE bits we have to mask the registers like below.

ECX<-ECX&FFE7FDFE
EDX<-EDX&F9FFFFFF


I used the following Windbg command to search for CPUID instructions in the code section of the virtual image.

# cpuid <address> L?<size>

I saw CPUID at few places. I checked all of them to find the ones that have EAX set to 1 input. I found few fragments like these.

xor eax,eax
inc eax
cpuid

I put breakpoints just after each of the right CPUID instructions. When the breakpoint hit the SSE flags are cleared and the execution resumes.

bp <address> "reip; recx=ecx&0ffe7fdfe; redx=edx&0f9ffffff; gc"

And it worked as expected in my experiment. The application took the alternate, but slower, code path of i386 instructions.

A final note, this technique may be used to avoid debugging SSE instructions but it can also be useful to increase code coverage during security testing.

April 8, 2014

Examining Unknown Binary Formats

This post is about to discuss the methods for examining unknown binary formats that can be either a file, file fragment, or memory dump.

Before discussing the methods I'm describing few scenarios when examination of an unknown format is appropriate.

Imagine you deal with an application that handles a certain file type that you want to fuzz. You think to carry out some dumb fuzzing with a bit of improvement. Before doing so, you may be examining the format to create an approximate map of the layout. So you'll get an idea what parts of the files are worth fuzzing, and what fuzzing method is reasonable to apply for each part.

In other scenario you might have the binary file but don't have the program that parses it. You want to know as much as possible of the format of the binary file to understand it's layout.

If the application that reads the file format is available you can use debugger to watch how the data is parsed. This scenario is not discussed here.

If the application that writes the format is available you can try the following idea. You may produce output file using the application. This can be done by save, export, convert options available in the application. Next time when producing output you change something minor in the settings that may produce a similar output file. Comparing the two output files you may see what changed.

Entropy analysis is very useful to locate compressed, encrypted, and other way encoded data. Higher entropy can indicate encoding of some kind. Lower entropy is likely anything else including text, code, header, data structures. Redundancy analysis is analogue to entropy analysis; the lower the redundancy the most likely the data is encoded.

Encoded data could be anything, even multimedia data. The compressed streams can have headers and/or magic bytes identify the compression type.

Character distribution of the file can tell us a lot. Creating a byte frequency map is very straightforward by using modern programming languages. That can tell us what are the most and less frequent bytes. We can easily know what are the bytes that are not present at all.

Strings can be discovered even with popular tools like a hex-editor. Most common encodings are ASCII and Unicode. If there is no terminating zero the length of the string is likely stored somewhere in the binary. It's often the preceding byte(s) of the first letter of the string.

Consecutive patterns, byte sequences are seen to be used for padding, for alignment, or to fill slack space.

Random-looking printable characters can indicate some kind of encoding of any data in plain text.

Scattered bytes, scattered zeros, scattered 0FFh bytes can indicate sequence of encoded integers. Integers can be offsets and lengths. Scattered zeros might indicate text in Unicode format.

It could be useful to analyze the density of zeros, printable characters, or of other patterns. This could be applied on the whole file or on a particular region of the file.

Consecutive values, integers might indicate an array of pointers. It might be useful to know if the values increasing, decreasing, or random values.

Also, good to know in what endianness the integers stored.

x86 code can be be detected by running disassembler on the binary. If you see a sequence of meaningful instructions that might be code-area.

There is a simpler way to look for x86 code though. You write a small program in some high level language that searchers for E8 (CALL) / E9 (JMP) patterns and calculates the absolute offset where the instruction jumps. If there is an absolute offset referenced from different places that might be an entry point of a real function. The more functions are identified the better the chance you have found code.

If you know what native code to look for you can search for a sequence of common instructions, like bytes at function entry point.

Meaningful text fragment in high-entropy area might indicate run-length encoding which is also known as RLE compression.

There is data format that looks like this. It consists of a sequence of structures, or chunks. The size of each structure is encoded sometimes as a first value in the structure. It's commonly seen that a sequence of compressed data is stored like that.

If it's known the binary is associated with certain time stamp or version number those constants might worth searching for.

Some methods described here can be combined with xor-search, and with other simple decoding techniques to discover the structure of the file.

April 5, 2014

Thoughts About Finding Race Condition Bugs

Race condition bugs can exist in multi-threaded applications. Improper synchronization can be the root cause of race condition bugs.

Executing stress testing is a good start to find bugs. It might not be an ideal black-box testing method though as it is mostly for developers to test their proprietary software. Injecting delays at various points into the target could help finding bugs but we need to know the right locations to inject the delays. Cuzz is a Microsoft tool for finding concurrency bugs by injecting random delays - it looks promising.

Using DBI (Dynamic Binary Instrumentation) it's possible to tell if an EIP is executed, and if so by what thread(s). Therefore it's possible to tell what code is executed by what thread(s).

Using DBI it is also possible to tell where (value of EIP) the thread context switch happens.

By having the above information we can make educated guesses where to inject the delays.

If a bug is found it might not be reachable from outside. That's always a possibility. However it's good to see if you can provide input that makes the application to run longer near the location of the intended delay. There might be a ReadFile that can take longer to complete if the file is large enough. Or there might be a loop where the iteration count can be controlled by user...

April 3, 2014

Change of Execution Flow in Debugger

When debugging sometimes we need to force the execution to either take or not take the conditional jump.

There are several ways to achieve this. One possibility is to overwrite the conditional jump with either JMP or NOP instruction to force the execution into the desired path.

The next trick is to simply change the instruction pointer. The below example demonstrates to increment the instruction pointer by 2 in Windbg.

reip=eip+2

Another idea involves to see what are the conditions of taking or not taking the conditional jump. Knowing the conditions you can change the register or data at the right memory location to influence the execution flow.

My favorite is to change the x86 flags when the instruction pointer points to the conditional jump. Below is how to set the zero flag in Windbg.

rzf=1

To see more info about flags check out msdn or Windbg's help.

April 1, 2014

Tracking Down by Pin

Recently, there was a challenging situation I had faced. At first sight it looked like a common debugging problem that can be solved with some experiment but the more time I spent on it the more difficult the situation looked like.

The situation was the following. The below instruction reads memory.

00400000+006026de mov     eax,dword ptr [ebp+4]

What is the EIP of the instruction that writes [ebp+4]? This is all I wanted to know that stage.

Note, while looking at the instruction it looks like ebp+4 reads a stack address -- it reads actually a heap address.

First I was looking at the function if I can find the instruction in it that writes [ebp+4]. It wasn't there so I investigated the caller functions, and their callers, and so on. Again, it wasn't there but noticed something. The functions passed a pointer to a context as a parameter containing many variables including [ebp+4].

At this point I had a good reason to believe the situation looked difficult because the context is likely to set by an initializer that may be on a completely different code path to the one I was investigating.

You may ask why I didn't use processor breakpoint too see what instruction writes [ebp+4]. It was a heap address kept changing on every execution and the address was not known to put breakpoint on.

I could have gone back to the point when the structure is allocated, and I could have set a breakpoint relative to the base address and see what code writes [ebp+4]. That sounded good and I would have gone to that direction if I hadn't had a better idea.

I thought I could write a PinTool that tracks write and read memory accesses. It adds all instructions writing memory to the list. When the instruction that reads memory is reached the program searches the list for instructions wrote that address. Of course this has to be thread safe.

It took me a day to develop the PinTool and find the EIP that writes [ebp+4].

This is how I executed the PinTool from command line.

pin.exe -t TrackDownWrite.dll -ip 0x6026de -s 4 -- c:\work\<redacted>.exe

-ip is the instruction pointer where [ebp+4] is read
-s is the size of read/write to track

The result looked like this.

0bab05dc is read by 006026de
0bab05dc was written by 005ee358 before read


The wanted instruction is below.

00400000+005ee358 mov     dword ptr [eax+4],edx

The prototype is available for download. After finished my debugging task I also tested it a bit on Windows x86 and I think it looks useful for similar problems might arise in the future.
  This blog is written and maintained by Attila Suszter. Read in Feed Reader.