What follows is based primarily on information found in Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
part 1
Why does the program grow so much when you are looking at the IR, even though the assembly was tiny?
There are at least two reasons for this:
While the machine code produced by GCC for the main() routine may be small, all of the code in the binary is transformed by Valgrind, including code in dynamically linked libraries.
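For concreteness: per the original post, main() does nothing, so the test program is presumably something along the lines of the following sketch (a reconstruction, not the original post's exact code; the -m32 build flag is an assumption based on the ELF32 binary mentioned later):

/* reconstructed stand-in for the original post's example program;
   built with something like: gcc -m32 -o test test.c              */
int main(void)
{
    return 0;   /* does nothing observable */
}

Even for a program this small, the dynamic loader and the C runtime's startup and shutdown code run in the same process, and Valgrind translates all of that code as well, which is a large part of why the instruction counts reported below are in the hundreds of thousands rather than a handful.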
Valgrind is a dynamic binary instrumentation (DBI) framework, similar in some ways to DynamoRIO and Pin. It is implemented as a process virtual machine (PVM) that loads a binary into its own virtual memory space and executes a transformed and instrumented version of its code:
Valgrind uses dynamic binary re-compilation, similar to many other
DBI frameworks. A Valgrind tool is invoked by adding valgrind --tool=<toolname>
(plus any Valgrind or tool options) before a
command. The named tool starts up, loads the client program into
the same process, and then (re)compiles the client’s machine code,
one small code block at a time, in a just-in-time, execution-driven
fashion. The core disassembles the code block into an intermediate
representation (IR) which is instrumented with analysis code by
the tool plug-in, and then converted by the core back into machine
code. The resulting translation is stored in a code cache to be
rerun as necessary. Valgrind’s core spends most of its time making,
finding, and running translations. None of the client’s original code
is run.
Code handled correctly includes: normal executable code, dynamically
linked libraries, shared libraries, and dynamically generated
code.[1]
In the intermediate representation (IR) used by Valgrind, every effect of a machine code instruction is represented explicitly by an IR operation. This means that a CISC instruction with side effects will be represented by multiple IR operations.
Valgrind uses disassemble-and-resynthesise (D&R): machine
code is converted to an IR in which each instruction becomes
one or more IR operations. This IR is instrumented (by adding
more IR) and then converted back to machine code. All of the
original code’s effects on guest state (e.g. condition code setting)
must be explicitly represented in the IR because the original client
instructions are discarded and the final code is generated purely
from the IR.
The IR has some RISC-like features: it is load/store, each primitive
operation only does one thing (many CISC instructions are broken
up into multiple operations), and when flattened, all operations
operate only on temporaries and literals. Nonetheless, supporting
all the standard integer, FP and SIMD operations of different sizes
requires more than 200 primitive arithmetic/logical operations.
The instruction set architecture of the test binary is not explicitly stated in the original post and is assumed here to be x86. x86 is a CISC ISA, which means the number of IR operations greatly exceeds the number of original machine code instructions: a single x86 instruction can perform several operations and have side effects (such as condition-code updates), each of which must be expressed explicitly in the IR.
When I executed valgrind --tool=lackey --trace-mem=yes test, where test was an ELF32 binary created from the example C code in the original post using GCC, these were the results (truncated, with arrows pointing to the lines discussed below):
.
.
.
I 04cec101,3
I 04cec104,3
I 04cec107,2
==11736==
==11736== Counted 0 calls to main() <---------- (1)
==11736==
==11736== Jccs:
==11736== total: 44,420
==11736== taken: 21,288 ( 47%)
==11736==
==11736== Executed:
==11736== SBs entered: 44,083
==11736== SBs completed: 30,750
==11736== guest instrs: 211,953
==11736== IRStmts: 1,304,900
==11736==
==11736== Ratios:
==11736== guest instrs : SB entered = 48 : 10
==11736== IRStmts : SB entered = 296 : 10
==11736== IRStmts : guest instr = 61 : 10 <---------- (2)
==11736==
==11736== Exit code: 0
As we can see at (2), there are significantly more IR statements than guest instructions: 1,304,900 IR statements for 211,953 guest instructions, i.e. just over 6 IR statements per guest instruction on average. This is in line with what is expected when CISC instructions are translated into Valgrind's IR.
(1) has to do with the second part of the question and is addressed in part 2 below.
Here is an example of a single x86 instruction producing multiple IR statements; the one addl below becomes ten IR statements (4-13), largely because its effects on eflags must be represented explicitly:
0x24F27C: addl %ebx,%eax <---------- x86 instruction + operands
4: ------ IMark(0x24F27C, 2) ------
5: PUT(60) = 0x24F27C:I32 # put %eip
6: t3 = GET:I32(0) # get %eax
7: t2 = GET:I32(12) # get %ebx
8: t1 = Add32(t3,t2) # addl
9: PUT(32) = 0x3:I32 # put eflags val1
10: PUT(36) = t3 # put eflags val2
11: PUT(40) = t2 # put eflags val3
12: PUT(44) = 0x0:I32 # put eflags val4
13: PUT(0) = t1 # put %eax
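To relate this listing back to the "flattened" IR described above (and to Phase 2 of the translation process quoted in part 2): flat IR only ever operates on temporaries and literals, much like three-address code. Here is a rough C analogy of tree form versus flat form for the addl above (purely illustrative, not actual VEX IR; the condition-code updates are omitted):

#include <stdio.h>

int main(void)
{
    int eax = 5, ebx = 7;

    /* "tree" form: one nested expression computes the result directly */
    int tree_result = eax + ebx;

    /* "flat" form: each step reads or writes only temporaries,
       mirroring t3/t2/t1 in the IR listing above                      */
    int t3 = eax;           /* t3 = GET:I32(0)     # get %eax */
    int t2 = ebx;           /* t2 = GET:I32(12)    # get %ebx */
    int t1 = t3 + t2;       /* t1 = Add32(t3,t2)              */
    int flat_result = t1;   /* PUT(0) = t1         # put %eax */

    printf("%d %d\n", tree_result, flat_result);   /* both print 12 */
    return 0;
}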
part 2
How can I isolate the IR that represents instructions in the code I wrote?
This does not seem possible due to how Valgrind transforms machine code (disassembly + resynthesis).
We observe at (1) that 0 calls to main() were counted when Valgrind instrumented the test binary. Since main() does nothing, it is possible that it is optimized away during the machine code -> IR -> instrumented IR -> machine code translation process.
The translation process actually consists of eight phases:
All phases are performed by the core, except
instrumentation, which is performed by the tool. Phases marked
with a ‘*’ are architecture-specific.
Phase 1. Disassembly*: machine code → tree IR. The disassembler
converts machine code into (unoptimised) tree IR. Each
instruction is disassembled independently into one or more statements.
These statements fully update the affected guest registers in
memory: guest registers are pulled from the ThreadState into temporaries,
operated on, and then written back.
Phase 2. Optimisation 1: tree IR → flat IR. The first optimisation phase flattens the IR and does several optimisations: redundant get and put elimination (to remove unnecessary copying of guest
registers to/from the ThreadState), copy and constant propagation,
constant folding, dead code removal, common sub-expression elimination,
and even simple loop unrolling for intra-block loops.
Phase 3. Instrumentation: flat IR → flat IR. The code block is
then passed to the tool, which can transform it arbitrarily. It is important that the IR is flattened at this point as it makes instrumentation easier, particularly for shadow value tools.
Phase 4. Optimisation 2: flat IR → flat IR. A second, simpler optimisation pass performs constant folding and dead code removal.
Phase 5. Tree building: flat IR → tree IR. The tree builder converts flat IR back to tree IR in preparation for instruction selection.
Expressions assigned to temporaries which are used only once are
usually substituted into the temporary’s use point, and the assignment
is deleted. The resulting code may perform loads in a different
order to the original code, but loads are never moved past stores.
Phase 6. Instruction selection*: tree IR → instruction list. The
instruction selector converts the tree IR into a list of instructions
which use virtual registers (except for those instructions that are
hard-wired to use particular registers; these are common on x86
and AMD64). The instruction selector uses a simple, greedy, top-down
tree-matching algorithm.
Phase 7. Register allocation: instruction list → instruction list. The linear-scan register allocator [26] replaces virtual registers with host registers, inserting spills as necessary. One general-purpose
host register is always reserved to point to the ThreadState.
Phase 8. Assembly*: instruction list → machine code. The final
assembly phase simply encodes the selected instructions appropriately
and writes them to a block of memory.
After optimization and potentially arbitrary transformation, it is an open question whether any of the IR reported by lackey bears any discernible resemblance to the machine code generated by GCC for main().
Supplementary resources:
https://fosdem.org/2017/schedule/event/valgrind_vex_future/
https://fosdem.org/2017/schedule/event/valgrind_vex_future/attachments/slides/1842/export/events/attachments/valgrind_vex_future/slides/1842/valgrind_vex_future.pdf
https://github.com/trailofbits/libvex/blob/master/VEX/pub/libvex_ir.h
https://arxiv.org/pdf/0810.0372.pdf
http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter3.pdf
https://docs.angr.io/docs/ir.html
1. Nicholas Nethercote and Julian Seward. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proc. of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI), June 2007.