Introduction |
|
xiii | |
|
Part 1 Prolog: Optimizing Compilation |
|
|
1 | (22) |
|
Chapter 1 On the Decidability of Phase Ordering in Optimizing Compilation |
|
|
3 | (20) |
|
1.1 Introduction to the phase ordering problem |
|
|
3 | (2) |
|
1.2 Background on phase ordering |
|
|
5 | (2) |
|
1.2.1 Performance modeling and prediction |
|
|
5 | (1) |
|
1.2.2 Some attempts in phase ordering |
|
|
6 | (1) |
|
1.3 Toward a theoretical model for the phase ordering problem |
|
|
7 | (5) |
|
1.3.1 Decidability results |
|
|
8 | (3) |
|
1.3.2 Another formulation of the phase ordering problem |
|
|
11 | (1) |
|
1.4 Examples of decidable simplified cases |
|
|
12 | (4) |
|
1.4.1 Models with compilation costs |
|
|
12 | (1) |
|
1.4.2 One-pass generative compilers |
|
|
13 | (3) |
|
1.5 Compiler optimization parameter" space exploration |
|
|
16 | (4) |
|
1.5.1 Toward a theoretical model |
|
|
17 | (2) |
|
1.5.2 Examples of simplified decidable cases |
|
|
19 | (1) |
|
1.6 Conclusion on phase ordering in optimizing compilation |
|
|
20 | (3) |
|
Part 2 Instruction Scheduling |
|
|
23 | (96) |
|
Chapter 2 Instruction Scheduling Problems and Overview |
|
|
25 | (14) |
|
2.1 VLIW instruction scheduling problems |
|
|
25 | (4) |
|
2.1.1 Instruction scheduling and register allocation in a code generator |
|
|
25 | (2) |
|
2.1.2 The block and pipeline VLIW instruction scheduling problems |
|
|
27 | (2) |
|
|
29 | (6) |
|
2.2.1 Cyclic, periodic and pipeline scheduling problems |
|
|
29 | (3) |
|
2.2.2 Modulo instruction scheduling problems and techniques |
|
|
32 | (3) |
|
2.3 Instruction scheduling and register allocation |
|
|
35 | (4) |
|
2.3.1 Register instruction scheduling problem solving approaches |
|
|
35 | (4) |
|
Chapter 3 Applications of Machine Scheduling to Instruction Scheduling |
|
|
39 | (12) |
|
3.1 Advances in machine scheduling |
|
|
39 | (4) |
|
3.1.1 Parallel machine scheduling problems |
|
|
39 | (2) |
|
3.1.2 Parallel machine scheduling extensions and relaxations |
|
|
41 | (2) |
|
3.2 List scheduling algorithms |
|
|
43 | (4) |
|
3.2.1 List scheduling algorithms and list scheduling priorities |
|
|
43 | (2) |
|
3.2.2 The scheduling algorithm of Leung, Palem and Pnueli |
|
|
45 | (2) |
|
3.3 Time-indexed scheduling problem formulations |
|
|
47 | (4) |
|
3.3.1 The non-preemptive time-indexed RCPSP formulation |
|
|
47 | (1) |
|
3.3.2 Time-indexed formulation for the modulo RPISP |
|
|
48 | (3) |
|
Chapter 4 Instruction Scheduling Before Register Allocation |
|
|
51 | (26) |
|
4.1 Instruction scheduling for an ILP processor: case of a VLIW architecture |
|
|
51 | (16) |
|
4.1.1 Minimum cumulative register lifetime modulo scheduling |
|
|
51 | (3) |
|
4.1.2 Resource modeling in instruction scheduling problems |
|
|
54 | (2) |
|
4.1.3 The modulo insertion scheduling theorems |
|
|
56 | (2) |
|
4.1.4 Insertion scheduling in a backend compiler |
|
|
58 | (2) |
|
4.1.5 Example of an industrial production compiler from STMicroelectronics |
|
|
60 | (4) |
|
4.1.6 Time-indexed formulation of the modulo RCISP |
|
|
64 | (3) |
|
4.2 Large neighborhood search for the resource-constrained modulo scheduling problem |
|
|
67 | (1) |
|
4.3 Resource-constrained modulo scheduling problem |
|
|
68 | (3) |
|
4.3.1 Resource-constrained cyclic scheduling problems |
|
|
68 | (1) |
|
4.3.2 Resource-constrained modulo scheduling problem statement |
|
|
69 | (1) |
|
4.3.3 Solving resource-constrained modulo scheduling problems |
|
|
70 | (1) |
|
4.4 Time-indexed integer programming formulations |
|
|
71 | (3) |
|
4.4.1 The non-preemptive time-indexed RCPSP formulation |
|
|
71 | (1) |
|
4.4.2 The classic modulo scheduling integer programming formulation |
|
|
72 | (1) |
|
4.4.3 A new time-indexed formulation for modulo scheduling |
|
|
73 | (1) |
|
4.5 Large neighborhood search heuristic |
|
|
74 | (2) |
|
4.5.1 Variables and constraints in time-indexed formulations |
|
|
74 | (1) |
|
4.5.2 A large neighborhood search heuristic for modulo scheduling |
|
|
74 | (1) |
|
4.5.3 Experimental results with a production compiler |
|
|
75 | (1) |
|
4.6 Summary and conclusions |
|
|
76 | (1) |
|
Chapter 5 Instruction Scheduling After Register Allocation |
|
|
77 | (14) |
|
|
77 | (2) |
|
5.2 Local instruction scheduling |
|
|
79 | (5) |
|
5.2.1 Acyclic instruction scheduling |
|
|
79 | (1) |
|
5.2.2 Scoreboard Scheduling principles |
|
|
80 | (2) |
|
5.2.3 Scoreboard Scheduling implementation |
|
|
82 | (2) |
|
5.3 Global instruction scheduling |
|
|
84 | (3) |
|
5.3.1 Postpass inter-region scheduling |
|
|
84 | (2) |
|
5.3.2 Inter-block Scoreboard Scheduling |
|
|
86 | (1) |
|
5.3.3 Characterization of fixed points |
|
|
87 | (1) |
|
|
87 | (2) |
|
|
89 | (2) |
|
Chapter 6 Dealing in Practice with Memory Hierarchy Effects and Instruction Level Parallelism |
|
|
91 | (28) |
|
6.1 The problem of hardware memory disambiguation at runtime |
|
|
92 | (12) |
|
|
92 | (1) |
|
|
93 | (1) |
|
6.1.3 Experimental environment |
|
|
94 | (1) |
|
6.1.4 Experimentation methodology |
|
|
95 | (1) |
|
6.1.5 Precise experimental study of memory hierarchy performance |
|
|
95 | (5) |
|
6.1.6 The effectiveness of load/store vectorization |
|
|
100 | (3) |
|
6.1.7 Conclusion on hardware memory disambiguation mechanisms |
|
|
103 | (1) |
|
6.2 Data preloading and prefetching |
|
|
104 | (15) |
|
|
104 | (1) |
|
|
105 | (2) |
|
6.2.3 Problems of optimizing cache effects at the instruction level |
|
|
107 | (2) |
|
6.2.4 Target processor description |
|
|
109 | (1) |
|
6.2.5 Our method of instruction-level code optimization |
|
|
110 | (6) |
|
6.2.6 Experimental results |
|
|
116 | (1) |
|
6.2.7 Conclusion on prefetching and preloading at instruction level |
|
|
117 | (2) |
|
Part 3 Register Optimization |
|
|
119 | (112) |
|
Chapter 7 The Register Need of a Fixed Instruction Schedule |
|
|
121 | (20) |
|
7.1 Data dependence graph and processor model for register optimization |
|
|
122 | (1) |
|
7.1.1 NUAL and UAL semantics |
|
|
122 | (1) |
|
7.2 The acyclic register need |
|
|
123 | (2) |
|
7.3 The periodic register need |
|
|
125 | (4) |
|
7.3.1 Software pipelining, periodic scheduling and cyclic scheduling |
|
|
125 | (2) |
|
7.3.2 The circular lifetime intervals |
|
|
127 | (2) |
|
7.4 Computing the periodic register need |
|
|
129 | (3) |
|
7.5 Some theoretical results on the periodic register need |
|
|
132 | (7) |
|
7.5.1 Minimal periodic register need versus initiation interval |
|
|
133 | (1) |
|
7.5.2 Computing the periodic register sufficiency |
|
|
133 | (1) |
|
7.5.3 Stage scheduling under register constraints |
|
|
134 | (5) |
|
7.6 Conclusion on the register requirement |
|
|
139 | (2) |
|
Chapter 8 The Register Saturation |
|
|
141 | (18) |
|
8.1 Motivations on the register saturation concept |
|
|
141 | (3) |
|
8.2 Computing the acyclic register saturation |
|
|
144 | (9) |
|
8.2.1 Characterizing the register saturation |
|
|
146 | (3) |
|
8.2.2 Efficient algorithmic heuristic for register saturation computation |
|
|
149 | (2) |
|
8.2.3 Experimental efficiency of Greedy-k |
|
|
151 | (2) |
|
8.3 Computing the periodic register saturation |
|
|
153 | (4) |
|
8.3.1 Basic integer linear variables |
|
|
154 | (1) |
|
8.3.2 Integer linear constraints |
|
|
154 | (2) |
|
8.3.3 Linear objective function |
|
|
156 | (1) |
|
8.4 Conclusion on the register saturation |
|
|
157 | (2) |
|
Chapter 9 Spill Code Reduction |
|
|
159 | (18) |
|
9.1 Introduction to register constraints in software pipelining |
|
|
159 | (1) |
|
9.2 Related work in periodic register allocation |
|
|
160 | (2) |
|
9.3 SIRA: schedule independant register allocation |
|
|
162 | (7) |
|
|
162 | (2) |
|
9.3.2 DDG associated with reuse graph |
|
|
164 | (2) |
|
9.3.3 Exact SIRA with integer linear programming |
|
|
166 | (2) |
|
9.3.4 SIRA-with fixed reuse edges |
|
|
168 | (1) |
|
9.4 SIRALINA: an efficient polynomial heuristic for SIRA |
|
|
169 | (4) |
|
9.4.1 Integer variables for the linear problem |
|
|
170 | (1) |
|
9.4.2 Step 1: the scheduling problem |
|
|
170 | (2) |
|
9.4.3 Step 2: the linear assignment problem |
|
|
172 | (1) |
|
9.5 Experimental results with SIRA |
|
|
173 | (2) |
|
9.6 Conclusion on spill code reduction |
|
|
175 | (2) |
|
Chapter 10 Exploiting the Register Access Delays Before Instruction Scheduling |
|
|
177 | (14) |
|
|
177 | (2) |
|
10.2 Problem description of DDG circuits with non-positive distances |
|
|
179 | (1) |
|
10.3 Necessary and sufficient condition to avoid non-positive circuits |
|
|
180 | (2) |
|
10.4 Application to the SIRA framework |
|
|
182 | (5) |
|
10.4.1 Recall on SIRALINA heuristic |
|
|
183 | (1) |
|
10.4.2 Step 1: the scheduling problem for a fixed II |
|
|
183 | (1) |
|
10.4.3 Step 2: the linear assignment problem |
|
|
184 | (1) |
|
10.4.4 Eliminating non-positive circuits in SIRALINA |
|
|
184 | (2) |
|
10.4.5 Updating reuse distances |
|
|
186 | (1) |
|
10.5 Experimental results on eliminating non-positive circuits |
|
|
187 | (1) |
|
10.6 Conclusion on non-positive circuit elimination |
|
|
188 | (3) |
|
Chapter 11 Loop Unrolling Degree Minimization for Periodic Register Allocation |
|
|
191 | (40) |
|
|
191 | (4) |
|
|
195 | (9) |
|
11.2.1 Loop unrolling after SWP with modulo variable expansion |
|
|
196 | (1) |
|
11.2.2 Meeting graphs (MG) |
|
|
197 | (3) |
|
11.2.3 SIRA, reuse graphs and loop unrolling |
|
|
200 | (4) |
|
11.3 Problem description of unroll factor minimization for unscheduled loops |
|
|
204 | (1) |
|
11.4 Algorithmic solution for unroll factor minimization: single register type |
|
|
205 | (8) |
|
11.4.1 Fixed loop unrolling problem |
|
|
206 | (1) |
|
11.4.2 Solution for the fixed loop unrolling problem |
|
|
207 | (2) |
|
11.4.3 Solution for LCM-MIN problem |
|
|
209 | (4) |
|
11.5 Unroll factor minimization in the presence of multiple register types |
|
|
213 | (8) |
|
11.5.1 Search space for minimal kernel loop unrolling |
|
|
217 | (1) |
|
11.5.2 Generalization of the fixed loop unrolling problem in the presence of multiple register types |
|
|
218 | (1) |
|
11.5.3 Algorithmic solution for the loop unrolling minimization (LUM, problem 11.1) |
|
|
219 | (2) |
|
11.6 Unroll factor reduction for already scheduled loops |
|
|
221 | (3) |
|
11.6.1 Improving algorithm 11.4 (LCM-MIN) for the meeting graph framework |
|
|
224 | (1) |
|
11.7 Experimental results |
|
|
224 | (2) |
|
|
226 | (2) |
|
11.8.1 Rotating register files |
|
|
226 | (1) |
|
11.8.2 Inserting move operations |
|
|
227 | (1) |
|
11.8.3 Loop unrolling after software pipelining |
|
|
228 | (1) |
|
11.8.4 Code generation for multidimensional loops |
|
|
228 | (1) |
|
11.9 Conclusion on loop unroll degree minimization |
|
|
228 | (3) |
|
Part 4 Epilog: Performance, Open Problems |
|
|
231 | (26) |
|
Chapter 12 Statistical Performance Analysis: The Speedup-Test Protocol |
|
|
233 | (24) |
|
12.1 Code performance variation |
|
|
233 | (3) |
|
12.2 Background and notations |
|
|
236 | (3) |
|
12.3 Analyzing the statistical significance of the observed speedups |
|
|
239 | (5) |
|
12.3.1 The speedup of the observed average execution time |
|
|
239 | (2) |
|
12.3.2 The speedup of the observed median execution time, as well as individual runs |
|
|
241 | (3) |
|
12.4 The Speedup-Test software |
|
|
244 | (2) |
|
12.5 Evaluating the proportion of accelerated benchmarks by a confidence interval |
|
|
246 | (2) |
|
12.6 Experiments and applications |
|
|
248 | (5) |
|
12.6.1 Comparing the performances of compiler optimization levels |
|
|
249 | (1) |
|
12.6.2 Testing the performances of parallel executions of OpenMP applications |
|
|
250 | (1) |
|
12.6.3 Comparing the efficiency of two compilers |
|
|
251 | (2) |
|
12.6.4 The impact of the Speedup-Test protocol over some observed speedups |
|
|
253 | (1) |
|
|
253 | (2) |
|
12.7.1 Observing execution times variability |
|
|
253 | (1) |
|
12.7.2 Program performance evaluation in presence of variability |
|
|
254 | (1) |
|
12.8 Discussion and conclusion on the Speedup-Test protocol |
|
|
255 | (2) |
Conclusion |
|
257 | (6) |
Appendix 1 Presentation of the Benchmarks used in our Experiments |
|
263 | (8) |
Appendix 2 Register Saturation Computation on Stand-alone DDG |
|
271 | (8) |
Appendix 3 Efficiency of SIRA on the Benchmarks |
|
279 | (14) |
Appendix 4 Efficiency of Non-Positive Circuit Elimination in the SIRA Framework |
|
293 | (10) |
Appendix 5 Loop Unroll Degree Minimization: Experimental Results |
|
303 | (12) |
Appendix 6 Experimental Efficiency of Software Data Preloading and Prefetching for Embedded VLIW |
|
315 | (4) |
Appendix 7 Appendix of the Speedup-Test Protocol |
|
319 | (2) |
Bibliography |
|
321 | (24) |
Lists of Figures, Tables and Algorithms |
|
345 | (8) |
Index |
|
353 | |