What follows is based primarily on information found in Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
part 1
Why does the program grow so much when you are looking at the IR, even though the assembly was tiny?
There are at least two reasons for this:
While the machine code produced by GCC for the main() routine may be small, all of the code in the binary is transformed by Valgrind, including code in dynamically linked libraries.
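For concreteness: per the original post, main() does nothing, so the test program is presumably something along the lines of the following sketch (a reconstruction, not the original post's exact code; the -m32 build flag is an assumption based on the ELF32 binary mentioned later):

/* reconstructed stand-in for the original post's example program;
   built with something like: gcc -m32 -o test test.c              */
int main(void)
{
    return 0;   /* does nothing observable */
}

Even for a program this small, the dynamic loader and the C runtime's startup and shutdown code run in the same process, and Valgrind translates all of that code as well, which is a large part of why the instruction counts reported below are in the hundreds of thousands rather than a handful.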
Valgrind is a dynamic binary instrumentation (DBI) framework, similar in some ways to DynamoRIO and Pin. It is implemented as a process virtual machine (PVM) that loads a binary into its own virtual memory space and executes a transformed and instrumented version of its code:
Valgrind uses dynamic binary re-compilation, similar to many other
DBI frameworks. A Valgrind tool is invoked by adding valgrind --tool=<toolname>
(plus any Valgrind or tool options) before a
command. The named tool starts up, loads the client program into
the same process, and then (re)compiles the client’s machine code,
one small code block at a time, in a just-in-time, execution-driven
fashion. The core disassembles the code block into an intermediate
representation (IR) which is instrumented with analysis code by
the tool plug-in, and then converted by the core back into machine
code. The resulting translation is stored in a code cache to be
rerun as necessary. Valgrind’s core spends most of its time making,
finding, and running translations. None of the client’s original code
is run.
Code handled correctly includes: normal executable code, dynamically
linked libraries, shared libraries, and dynamically generated
code.[1]
In the intermediate representation (IR) used by Valgrind, every effect of a machine code instruction is represented explicitly by an IR operation. This means that a CISC instruction with side effects will be represented by multiple IR operations.
Valgrind uses disassemble-and-resynthesise (D&R): machine
code is converted to an IR in which each instruction becomes
one or more IR operations. This IR is instrumented (by adding
more IR) and then converted back to machine code. All of the
original code’s effects on guest state (e.g. condition code setting)
must be explicitly represented in the IR because the original client
instructions are discarded and the final code is generated purely
from the IR.
The IR has some RISC-like features: it is load/store, each primitive
operation only does one thing (many CISC instructions are broken
up into multiple operations), and when flattened, all operations
operate only on temporaries and literals. Nonetheless, supporting
all the standard integer, FP and SIMD operations of different sizes
requires more than 200 primitive arithmetic/logical operations.
The instruction set architecture of the test binary is not explicitly stated in the original post and is assumed here to be x86. x86 is a CISC ISA, which means the number of IR operations greatly exceeds the number of original machine code instructions: a single x86 instruction can perform several operations and have side effects (such as condition-code updates), each of which must be expressed explicitly in the IR.
When I executed valgrind --tool=lackey --trace-mem=yes test, where test was an ELF32 binary created from the example C code in the original post using GCC, these were the results (truncated, with arrows pointing to the lines discussed below):
.
.
.
I 04cec101,3
I 04cec104,3
I 04cec107,2
==11736==
==11736== Counted 0 calls to main() <---------- (1)
==11736==
==11736== Jccs:
==11736== total: 44,420
==11736== taken: 21,288 ( 47%)
==11736==
==11736== Executed:
==11736== SBs entered: 44,083
==11736== SBs completed: 30,750
==11736== guest instrs: 211,953
==11736== IRStmts: 1,304,900
==11736==
==11736== Ratios:
==11736== guest instrs : SB entered = 48 : 10
==11736== IRStmts : SB entered = 296 : 10
==11736== IRStmts : guest instr = 61 : 10 <---------- (2)
==11736==
==11736== Exit code: 0
As we can see at (2), there are significantly more IR statements than guest instructions: 1,304,900 IR statements for 211,953 guest instructions, i.e. just over 6 IR statements per guest instruction on average. This is in line with what is expected when CISC instructions are translated into Valgrind's IR.
(1) has to do with the second part of the question and is addressed in part 2 below.
Here is an example of a single x86 instruction producing multiple IR statements; the one addl below becomes ten IR statements (4-13), largely because its effects on eflags must be represented explicitly:
0x24F27C: addl %ebx,%eax <---------- x86 instruction + operands
4: ------ IMark(0x24F27C, 2) ------
5: PUT(60) = 0x24F27C:I32 # put %eip
6: t3 = GET:I32(0) # get %eax
7: t2 = GET:I32(12) # get %ebx
8: t1 = Add32(t3,t2) # addl
9: PUT(32) = 0x3:I32 # put eflags val1
10: PUT(36) = t3 # put eflags val2
11: PUT(40) = t2 # put eflags val3
12: PUT(44) = 0x0:I32 # put eflags val4
13: PUT(0) = t1 # put %eax
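To relate this listing back to the "flattened" IR described above (and to Phase 2 of the translation process quoted in part 2): flat IR only ever operates on temporaries and literals, much like three-address code. Here is a rough C analogy of tree form versus flat form for the addl above (purely illustrative, not actual VEX IR; the condition-code updates are omitted):

#include <stdio.h>

int main(void)
{
    int eax = 5, ebx = 7;

    /* "tree" form: one nested expression computes the result directly */
    int tree_result = eax + ebx;

    /* "flat" form: each step reads or writes only temporaries,
       mirroring t3/t2/t1 in the IR listing above                      */
    int t3 = eax;           /* t3 = GET:I32(0)     # get %eax */
    int t2 = ebx;           /* t2 = GET:I32(12)    # get %ebx */
    int t1 = t3 + t2;       /* t1 = Add32(t3,t2)              */
    int flat_result = t1;   /* PUT(0) = t1         # put %eax */

    printf("%d %d\n", tree_result, flat_result);   /* both print 12 */
    return 0;
}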
part 2
How can I isolate the IR that represents instructions in the code I wrote?
This does not seem possible due to how Valgrind transforms machine code (disassembly + resynthesis).
We observe at (1) that 0 calls to main() were counted when Valgrind instrumented the test binary. Since main() does nothing, it is possible that it is optimized away during the machine code -> IR -> instrumented IR -> machine code translation process.
The translation process actually consists of eight phases:
All phases are performed by the core, except
instrumentation, which is performed by the tool. Phases marked
with a ‘*’ are architecture-specific.
Phase 1. Disassembly*: machine code → tree IR. The disassembler
converts machine code into (unoptimised) tree IR. Each
instruction is disassembled independently into one or more statements.
These statements fully update the affected guest registers in
memory: guest registers are pulled from the ThreadState into temporaries,
operated on, and then written back.
Phase 2. Optimisation 1: tree IR → flat IR. The first optimisation phase flattens the IR and does several optimisations: redundant get and put elimination (to remove unnecessary copying of guest
registers to/from the ThreadState), copy and constant propagation,
constant folding, dead code removal, common sub-expression elimination,
and even simple loop unrolling for intra-block loops.
Phase 3. Instrumentation: flat IR → flat IR. The code block is
then passed to the tool, which can transform it arbitrarily. It is important that the IR is flattened at this point as it makes instrumentation easier, particularly for shadow value tools.
Phase 4. Optimisation 2: flat IR → flat IR. A second, simpler optimisation pass performs constant folding and dead code removal.
Phase 5. Tree building: flat IR → tree IR. The tree builder converts flat IR back to tree IR in preparation for instruction selection.
Expressions assigned to temporaries which are used only once are
usually substituted into the temporary’s use point, and the assignment
is deleted. The resulting code may perform loads in a different
order to the original code, but loads are never moved past stores.
Phase 6. Instruction selection*: tree IR → instruction list. The
instruction selector converts the tree IR into a list of instructions
which use virtual registers (except for those instructions that are
hard-wired to use particular registers; these are common on x86
and AMD64). The instruction selector uses a simple, greedy, top-down
tree-matching algorithm.
Phase 7. Register allocation: instruction list → instruction list. The linear-scan register allocator [26] replaces virtual registers with host registers, inserting spills as necessary. One general-purpose
host register is always reserved to point to the ThreadState.
Phase 8. Assembly*: instruction list → machine code. The final
assembly phase simply encodes the selected instructions appropriately
and writes them to a block of memory.
After optimization and potentially arbitrary transformation, it is an open question whether any of the IR reported by lackey bears any discernible resemblance to the machine code generated by GCC for main().
Supplementary resources:
https://fosdem.org/2017/schedule/event/valgrind_vex_future/
https://fosdem.org/2017/schedule/event/valgrind_vex_future/attachments/slides/1842/export/events/attachments/valgrind_vex_future/slides/1842/valgrind_vex_future.pdf
https://github.com/trailofbits/libvex/blob/master/VEX/pub/libvex_ir.h
https://arxiv.org/pdf/0810.0372.pdf
http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter3.pdf
https://docs.angr.io/docs/ir.html
1. Nicholas Nethercote and Julian Seward. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proc. of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI), June 2007.