Machine code decompilers can provide the key to software evolution: source code. Before their considerable potential can be realised, however, several problems need to be solved.
In essence, a decompiler is the inverse of a compiler, as shown below:
Despite the complete reversal of direction, compilers and decompilers often employ similar techniques in the analysis phases, such as data flow analysis.
Source code is so important to software development that at times it becomes worthwhile to derive it from the executable form of computer programs.
Decompilation is a form of reverse engineering, starting with an executable file, and ending where traditional reverse engineering starts (i.e. with source code that does not fully expose the design of a program).
The following picture shows the relationship between various engineering and reverse engineering tools and processes.
Decompilers have many useful applications, broadly divided into those based on browsing parts of a program, those providing the foundation for an automated tool, and those requiring the ability to compile the decompiler's output.
Applications of decompilation depend on how the output is to be used, and on whether the user modifies the automatically generated output.
The state of the art as of 2002 needed improvement in many areas, including the recovery of parameters and returns, type analysis, and the handling of indirect jumps and calls.
Comment: This assessment may no longer hold in 2017, since the thesis was written in 2007.
Various reverse engineering tools are compared in terms of the basic problems that they need to solve; the machine code decompiler has the largest number of such problems.
However, there are some advantages to reverse engineering over forward engineering:
The table above shows the problems to be solved by various reverse engineering tools.
The following table shows the information and separations lost at various stages in the compilation of a machine code program.
There are enough important and legal uses for decompilation to warrant this research, and decompilation may facilitate the transfer of facts and functional concepts to the public domain.
The main goals are better identification of parameters and returns, reconstruction of types, and correct translation of indirect jumps and calls; all are facilitated by the Static Single Assignment form.
The goals of this thesis can be summarised as follows:
The following chapters review existing machine code decompilers, and show how many of their limitations in the areas of data flow analysis, type analysis, and the translation of indirect jumps and calls are solved with the Static Single Assignment form.
Existing decompilers have evaded many of the issues faced by machine code decompilers, or have been deficient in the areas detailed in Chapter 1. Related work has also not addressed these issues.
Static Single Assignment form assists with most data flow components of decompilers, including such fundamental tasks as expression propagation, identifying parameters and return values, deciding if locations are preserved, and eliminating dead code.
Data flow analysis and control flow analysis are two of the main classes of analyses for machine code decompilers, as shown below.
The intermediate representation (IR) for a decompiler resembles that of a compiler. It consists of statements containing expressions.
Data flow analysis concerns the definitions and uses of locations: a definition is an assignment to a location, and a use is a read of a location's value.
Expression propagation is the most common transformation used by decompilers; there are two rules for when it can be applied that are simple to state, yet difficult to check.
First, here is an example piece of assembly code:
All examples will come from the assembly code above; the IR for its first seven instructions is shown below:
A propagation of x from a source statement s of the form x := exp to a destination statement u can be performed if the following conditions hold: s must be the only definition of x that reaches u, and there are no assignments to any location used by exp along any path from s to u.
With traditional data flow analysis, two separate analyses are required: reaching definitions to verify the first condition, and available definitions to verify the second.
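A minimal sketch of checking these conditions, under an assumed toy IR (Stmt and the helper arguments are illustrative names, not the thesis's code):

```python
from dataclasses import dataclass, field

@dataclass
class Stmt:
    lhs: str                                    # location defined
    rhs_uses: set = field(default_factory=set)  # locations read by exp

def can_propagate(s, defs_reaching_u, stmts_between):
    """Return True if s (x := exp) may be propagated into the use at u."""
    # Condition 1: s is the only definition of x that reaches u.
    if any(d is not s and d.lhs == s.lhs for d in defs_reaching_u):
        return False
    # Condition 2: no location used by exp is assigned to on any path
    # from s to u (approximated here by the statements between them).
    return all(t.lhs not in s.rhs_uses for t in stmts_between)
```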
After expression propagation, the example IR becomes:
Expression propagation, in which the quantity being propagated is an expression, differs from the constant propagation and copy propagation performed in compilers. More specifically, compilers propagate statements of the form x := y, where x is a variable and y is either a constant or another variable, but not a more complex expression.
Performing expression propagation tends to make the result more complex, copy propagation leaves the complexity the same, and constant propagation usually makes the result simpler.
Expression propagation, when applied repeatedly, tends to result in expressions that are in terms of the input values of the procedure. For example, propagating r1 := r1_in + 1 into r2 := r1 * 2 yields r2 := (r1_in + 1) * 2, and so on transitively.
Although expression propagation is usually beneficial in a decompiler, in some circumstances limiting propagation produces more readable output.
Limiting propagation is needed because of situations such as the following, in which too much propagation has been performed:
Common Subexpression Elimination (CSE) is a compiler optimisation that would appear to solve this problem. However, implementing CSE everywhere possible effectively reverses all propagations.
Excessive propagation can be limited by heuristics, as shown. In some cases, it may be desirable to undo certain propagations, e.g. where specified manually. For those cases, CSE could be used to undo specific propagations.
Simple measures of expression complexity will probably suffice, e.g. the number of operators is easy to calculate and is effective.
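For instance, a toy operator-counting measure, assuming expressions are nested tuples of the form (operator, operand, ...):

```python
def operator_count(exp):
    """Count the operators in an expression; leaves count for nothing."""
    if not isinstance(exp, tuple):
        return 0          # a leaf: constant or location
    return 1 + sum(operator_count(arg) for arg in exp[1:])

# e.g. r1 * 4 + 1 has two operators:
assert operator_count(("+", ("*", "r1", 4), 1)) == 2
```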
Dead code elimination is facilitated by storing all uses for each definition (definition-use information).
Dead code consists of assignments for which the definition is no longer used; its elimination is called Dead Code Elimination (DCE).
Dead code contrasts with unreachable code, which can be any kind of statement, to which there is no valid path from the program's entry point(s).
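A minimal sketch of DCE driven by definition-use counts (the statement representation and helper names are assumptions, not the thesis's IR):

```python
# use_counts[s] is the number of uses of s's definition; operand_defs[s]
# is the set of statements defining the locations that s uses.

def eliminate_dead_code(assignments, use_counts, operand_defs):
    """Remove assignments whose definitions are never used; removing one
    may make others dead, so iterate to a fixed point."""
    changed = True
    while changed:
        changed = False
        for s in list(assignments):
            if use_counts.get(s, 0) == 0:       # definition never used
                assignments.remove(s)
                for d in operand_defs.get(s, ()):
                    use_counts[d] -= 1          # operands lose one use
                changed = True
    return assignments
```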
Propagation often leads to dead code. The IR of the example assembly code after DCE is shown below:
Expression propagation can be used to implement machine independent combining of condition code definitions and uses.
Another machine specific detail which must usually be eliminated is the condition code register, also called the flags register or status register.
The effect in these situations is the result of combining the semantics of the instruction which sets the condition codes, and the instruction which uses them. Often, these instructions are adjacent or near each other, but it is quite possible that they are separated by many instructions.
IR including condition codes can be handled as shown below:
In summary, expression propagation makes it easy to combine condition code setting and use. Special propagation or simplification rules extract the semantics of the condition code manipulations. Finally, dead code elimination removes machine dependent condition codes and the %flags abstract location.
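A small sketch of this combination, with invented operator names (SUBFLAGS, SIGNED_LE) standing in for the semantics of a cmp/jle pair; the representation is an assumption for illustration:

```python
# Suppose the instruction semantics produce:
#   s: %flags := SUBFLAGS(r1, r2)        (set by a cmp instruction)
#   u: if SIGNED_LE(%flags) goto L1      (used by a jle instruction)
# Propagating s into u and simplifying yields the machine independent
#   u: if r1 <= r2 goto L1
# after which the %flags assignment is dead and can be eliminated.

def simplify(cond):
    """Rewrite a flag use applied to a flag definition as a comparison."""
    if cond[0] == "SIGNED_LE" and cond[1][0] == "SUBFLAGS":
        _, a, b = cond[1]
        return ("<=", a, b)
    return cond

assert simplify(("SIGNED_LE", ("SUBFLAGS", "r1", "r2"))) == ("<=", "r1", "r2")
```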
Following is another example:
Expression propagation can also be used to transform away machine details such as those revealed by older x86 floating point compare instruction sequences.
In particular, floating point branches were performed with the aid of the fnstsw (floating point store status word) instruction, which copied the floating point status word to the ax register of the CPU.
Machine code for the floating point compare is shown below:
Some basic knowledge about floating point can be found here.
The effects of calls are best summarised by the locations modified by the callee, and the locations used before definition in the callee.
If data flow analysis is performed on the whole program at once, the effects of called procedures are automatically incorporated into their callers; Section 3.5 will consider this possibility. Assuming that the whole program is not analysed at once, a summary of some kind is required of the effects of calls.
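A toy illustration of such a summary, assuming each statement is reduced to a (defines, uses) pair of location sets; "used before defined" approximates potential parameters, and "modified" approximates defines/results:

```python
def call_summary(stmts):
    """Summarise a callee as (locations modified, locations used before
    being defined), given an ordered list of (defs, uses) set pairs."""
    modified, use_before_def, defined = set(), set(), set()
    for defs, uses in stmts:
        use_before_def |= uses - defined   # reads not yet defined locally
        defined |= defs
        modified |= defs
    return modified, use_before_def

mods, ubd = call_summary([({"r1"}, {"r2"}), ({"r2"}, {"r1", "r3"})])
assert mods == {"r1", "r2"} and ubd == {"r2", "r3"}
```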
The semantics of call statements and their side effects necessitates terminology that extends terms used by the compiler community.
The following picture shows this terminology in the IR:
The locations used to store parameters and return locations are a matter of convention, called the Application Binary Interface (ABI); there are many possible conventions.
Care must be taken to separate those locations used before being defined which are preserved, parameters, or both. This can be achieved by a combination of expression propagation and dead code elimination.
Memory and register expressions are frequently communicated between callers and callee(s); the difference in context requires some substitutions to obtain expressions which are valid in the other context.
Without expression propagation to canonicalize the memory expressions in the callee, the situation is even worse, since parameters could be accessed at varying offsets from an ever changing stack pointer register, or sometimes at offsets from the stack pointer register and at other times at offsets from the frame pointer register (if used).
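A minimal sketch of one such substitution, assuming the callee's stack accesses have been canonicalised to offsets from sp0, the stack pointer value at procedure entry; the nested-tuple expression shape and names are assumptions for illustration:

```python
def callee_to_caller(exp, sp_at_call):
    """Rewrite the callee expression m[sp0 + k] into the caller's context,
    where sp_at_call is the caller's expression for the callee's sp0."""
    if (isinstance(exp, tuple) and exp[0] == "m"
            and isinstance(exp[1], tuple) and exp[1][:2] == ("+", "sp0")):
        k = exp[1][2]
        return ("m", ("+", sp_at_call, k))
    return exp

# e.g. the callee's stack parameter m[sp0 + 4], seen from a caller where
# sp0 corresponds to sp_in - 12, becomes m[(sp_in - 12) + 4]:
print(callee_to_caller(("m", ("+", "sp0", 4)), ("-", "sp_in", 12)))
```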
Three propositions determine how registers and global variables that are assigned to along only some paths should be treated.
Comment: I do not understand this part.
The calculation of call-related data flow elements such as parameters, defines, and results can be concisely expressed in a series of data flow equations.
Comment: I do not understand this part.
The stack pointer, and occasionally other special pointers, can appear to be a parameter to every procedure, and is handled as a special case.
Comment: I do not understand this part.
Static Single Assignment form assists with most data flow components of decompilers, including such fundamental tasks as expression propagation, identifying parameters and return values, deciding if locations are preserved, and eliminating dead code.
SSA form vastly simplifies expression propagation, provides economical data flow information, and is strong enough to solve problems that most other analysis techniques cannot solve.
The Static Single Assignment (SSA) form is a form of intermediate representation which maintains the property that each variable or location is defined only once in the program.
The algorithms for transforming ordinary code into and out of SSA form are somewhat complex, but well known and efficient [BCHS98, App02, Mor98, Wol96].
Assignments with φ-functions are inserted at the beginning of basic blocks where control flow merges, i.e. where a basic block has more than one in-edge. In general, a φ-function at the top of a basic block with n in-edges will have n parameters.
The following is an example piece of code and its SSA form:
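Since the original figure is not reproduced here, a generic before/after illustration (my own sketch, not the thesis's example):

```python
# Each definition of x gets a new subscript, and a phi-function merges
# the versions where the two control flow paths join:
#
#   original code            SSA form
#   -------------            --------
#   x = 1                    x0 = 1
#   if p:                    if p0:
#       x = x + 1                x1 = x0 + 1
#   y = x * 2                x2 = phi(x1, x0)   # merge of both paths
#                            y0 = x2 * 2
```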
A relation called the dominance frontier efficiently determines where φ-functions are necessary: for every node n defining a location a, every basic block in the dominance frontier of n needs a φ-function for a. The dominance frontier for a program can be calculated in practically linear time. If required, φ-functions can be implemented (made executable) by inserting copy statements. For example, st1 := φ(st0, st3) could be implemented by inserting st1 := st0 just before the loop (i.e. at the end of the basic block which is the first predecessor to the current basic block), and inserting st1 := st3 at the end of the loop (which is the end of the basic block which is the second predecessor to the current basic block). In many instances the copy statements will not be necessary, as will be shown in a later section.
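As a sketch, the standard Cooper, Harvey and Kennedy algorithm computes dominance frontiers from immediate dominators; the graph representation below is an assumption for illustration:

```python
# preds maps each node to its list of predecessors (the entry maps to []),
# and idom maps each node to its immediate dominator (idom[entry] == entry).

def dominance_frontiers(preds, idom):
    """DF(n) collects the join points where n's dominance ends, i.e.
    the blocks where phi-functions may be needed."""
    df = {n: set() for n in preds}
    for n, ps in preds.items():
        if len(ps) >= 2:                  # only join points contribute
            for p in ps:
                runner = p
                while runner != idom[n]:  # walk up the dominator tree
                    df[runner].add(n)
                    runner = idom[runner]
    return df

# A phi-function for a location a is then placed in every block of the
# iterated dominance frontier of the blocks that define a.
```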
The SSA form makes propagation very easy; initial parameters are readily identified, preservations are facilitated, and SSA's implicit use-definition information requires no maintenance.