December 18, 2011

Analysis of Data Formats by Guessing - Part I

Usually, the most straightforward way to analyze a data format is to observe how the application parses the data. However, there might be cases when the application that parses the data is not available. In that case, the analysis might be done by relying on the data itself and on general tools only.

In this post, I discuss three random ideas that might be helpful for exploring unknown data formats without the application that parses the data.
  • To spot the presence of run-length encoding (RLE), you might look for string chunks in a relatively non-redundant (high-entropy) data dump. String chunks might indicate that the original data contains strings; also, the string chunks in an RLE dump are likely to occur multiple times in the original data, probably as parts of longer strings.
  • To spot the presence of IA-32 code, you might look for instruction patterns such as CALL or JMP in a relatively redundant data dump. The near relative forms of these instructions have the opcodes E8 and E9, each followed by a four-byte relative displacement, which could also be included in the validation for more precise discovery.
  • Repetitive bytes could be padding, for example, to align a structure. Scattered bytes, usually 00s or FFs, could be the high-order bytes of small positive or negative numbers, so they could indicate pointers, offsets, a table of numbers, etc.
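The second idea can be sketched in a few lines of Python. This is only a rough heuristic, not a disassembler: it assumes the dump is a flat image, so a branch target can be checked against the buffer bounds, and the function names (find_rel32_branches, repeated_call_targets) are hypothetical names chosen for illustration. A target is computed from the end of the five-byte instruction, and a CALL target referenced from several places is somewhat stronger evidence than a single hit.

```python
import struct
from collections import Counter

def find_rel32_branches(data: bytes, opcodes=(0xE8, 0xE9)):
    """Return (offset, opcode, target) for each candidate CALL/JMP rel32.

    The target is relative to the end of the 5-byte instruction and is
    kept only if it falls inside the buffer (flat-image assumption).
    """
    hits = []
    for off in range(len(data) - 4):
        op = data[off]
        if op not in opcodes:
            continue
        # Signed little-endian 32-bit displacement after the opcode byte.
        (rel,) = struct.unpack_from("<i", data, off + 1)
        target = off + 5 + rel
        if 0 <= target < len(data):
            hits.append((off, op, target))
    return hits

def repeated_call_targets(hits, min_refs=2):
    """Map each CALL (E8) target seen at least min_refs times to its count."""
    counts = Counter(target for _, op, target in hits if op == 0xE8)
    return {t: n for t, n in counts.items() if n >= min_refs}
```

Random data will also produce some hits, since E8 and E9 are ordinary byte values, so the number of surviving candidates is a hint to be weighed rather than a proof.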
I call this process abductive reverse engineering. We don't have deductive evidence about the meaning of the structures in the data but we are able to form an explanatory hypothesis about their meaning.
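Hypotheses like these are easier to form with a few quick measurements of the dump. The following is a minimal sketch, assuming fixed-size windows and a minimum run length that are arbitrary choices of mine; the helper names (shannon_entropy, window_entropy, find_runs) are hypothetical. It reports the byte entropy per window, to tell redundant regions from non-redundant ones, and lists long runs of one repeated byte, which are padding candidates.

```python
import math
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    """Byte entropy in bits per byte: 0.0 (constant) up to 8.0 (uniform)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())

def window_entropy(data: bytes, size: int = 256):
    """Return (offset, entropy) for consecutive windows of `size` bytes."""
    return [(i, shannon_entropy(data[i:i + size]))
            for i in range(0, len(data), size)]

def find_runs(data: bytes, min_len: int = 8):
    """Return (offset, length, byte_value) for runs of one repeated byte."""
    runs = []
    i = 0
    while i < len(data):
        j = i + 1
        while j < len(data) and data[j] == data[i]:
            j += 1
        if j - i >= min_len:
            runs.append((i, j - i, data[i]))
        i = j
    return runs
```

Windows with entropy close to 8 bits per byte suggest compressed or encrypted data, while noticeably lower values are more typical of code, text, or tables.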
  This blog is written and maintained by Attila Suszter.