In this post, I discuss three random ideas that might help when exploring unknown data formats without using the application that parses the data.
- To spot run-length encoding (RLE), you might look for string chunks in a data dump that is otherwise relatively non-redundant (high entropy). String chunks suggest that the original data contains strings; a chunk seen in the RLE dump probably occurs several times in the original data, most likely as part of longer strings.
- To spot IA-32 code, you might look for instruction patterns such as CALL or JMP in a relatively redundant data dump. The near relative forms of these instructions have the opcodes E8 and E9, each followed by a four-byte relative offset, which can also be validated for more precise detection.
- Repeated bytes could be padding, for example to align a structure. Scattered bytes, usually 00s or FFs, could be the high-order bytes of small positive or negative numbers. Such values might be pointers, offsets, a table of numbers, and so on.
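
The ideas above can be sketched as a few small Python helpers. This is only a rough illustration, not a finished tool: the function names, the minimum lengths, and the assumption that the dump is loaded at offset 0 (so a branch target must fall inside the dump itself) are all my own choices for the example.

```python
import math
import re
import struct
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in Counter(data).values())


def printable_runs(data: bytes, min_len: int = 4):
    """Yield (offset, text) for runs of printable ASCII of at least min_len bytes."""
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield m.start(), m.group().decode("ascii")


def rel32_branches(data: bytes):
    """Yield (offset, mnemonic, target) for E8/E9 candidates.

    Assumption: the dump starts at offset 0, so a candidate is kept only
    when its computed target lands inside the dump.
    """
    names = {0xE8: "CALL", 0xE9: "JMP"}
    for i in range(len(data) - 4):
        if data[i] in names:
            # The four bytes after the opcode are a signed little-endian offset.
            (rel,) = struct.unpack_from("<i", data, i + 1)
            # The offset is relative to the end of the 5-byte instruction.
            target = i + 5 + rel
            if 0 <= target < len(data):
                yield i, names[data[i]], target


def byte_runs(data: bytes, min_len: int = 8):
    """Yield (offset, byte_value, length) for runs of a single repeated byte."""
    pattern = rb"(.)\1{%d,}" % (min_len - 1)
    for m in re.finditer(pattern, data, re.DOTALL):
        yield m.start(), m.group()[0], len(m.group())


# Tiny hand-made sample: three NOPs, a CALL, zero padding, a string, a RET.
sample = b"\x90" * 3 + b"\xE8\x05\x00\x00\x00" + b"\x00" * 8 + b"hello" + b"\xC3"

for offset, name, target in rel32_branches(sample):
    print(f"{offset:#06x}: {name} -> {target:#06x}")
```

On real data, E8 and E9 bytes also show up by chance, so the target check only reduces false positives. Computing the entropy over a sliding window rather than the whole dump would probably be more useful for telling code, strings, and compressed regions apart.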