October 17, 2005
Sheraton Parkway Toronto North Hotel and Convention Centre,
Richmond Hill, Ontario, Canada
Associated with CASCON 2005
(http://www.cas.ibm.com/cascon)
Final Program
Mini Section 1 - Chair: José Nelson Amaral - University of Alberta
Mini Section 2 - Chair: Chen Ding - University of Rochester
Mini Section 3 - Chair: Arie Tal - IBM Toronto Lab
Mini Section 4 - Chair: Calin Cascaval - IBM T. J. Watson Research Center
A Probabilistic Pointer Analysis for Speculative Optimizations
Jeff Da Silva and Greg Steffan - EECG - University of Toronto
Presentation Slides: [PPT] [PDF]
Pointer analysis is a critical compiler analysis used to disambiguate
the indirect memory references that result from the use of pointers
and pointer-based data structures in C and other similar programming
languages. Since memory references often remain ambiguous--even after
applying an aggressive pointer analysis--optimizations called "data
speculative optimizations" have recently been proposed which can be
applied despite ambiguous memory references. Currently, all data
speculative optimizations require extensive profile information to
drive the compiler's decision of when to speculate. Hence, we are
motivated to take a fresh look at pointer analysis with speculative
optimizations in mind. A conventional pointer analysis deduces for
every pair of pointers, at any program point, whether a points-to
relation between them (i) definitely exists, (ii) definitely does not
exist, or (iii) maybe exists. Typically, a large majority of points-to
relations are categorized as maybe, especially if a
fast-but-inaccurate approach is used. Unfortunately, most analyses
must treat the maybe case the same as the definitely case to ensure
correctness. However, speculative optimizations can capitalize on the
maybe case, especially if we can quantify the likelihood that two
pointers alias.
We propose a Probabilistic Pointer Analysis (PPA) that statically
predicts the probability of each points-to relation at every program
point. We have developed an innovative PPA algorithm which is both
scalable and accurate, the key to which is the use of linear transfer
functions to collect points-to probability information. Building on
simple control flow edge profiling and other simplifying heuristics,
our analysis is one-level context-sensitive, flow-sensitive, and
field-sensitive, yet can
still scale to large programs, including the entire SPEC integer
benchmark suite. Our preliminary results show that the algorithm
(even without the probability information) matches the accuracy of
the most precise existing pointer analyses; we are currently
evaluating the accuracy of the resulting probabilities.
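To make the flavor of the analysis concrete, here is a minimal C
sketch (our own illustration, not the authors' algorithm) of how
edge-profile frequencies might combine points-to probabilities at a
control-flow join; because the result is linear in the incoming
probabilities, such transfer functions compose cheaply:

    /* Hypothetical sketch: merging points-to probabilities at a
     * control-flow join using profiled edge frequencies.
     * p_then, p_else: probability that pointer p points to x on each
     * incoming edge; f_then: profiled frequency of the then-edge. */
    double merge_points_to_prob(double p_then, double p_else, double f_then)
    {
        return f_then * p_then + (1.0 - f_then) * p_else;
    }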
Back to CDP05 Program
Architectural Cloning for PowerPC Processors
Edwin Chan - IBM Toronto Lab
Presentation Slides: [PDF]
Currently the IBM XL compiler generates generic instructions to
produce an executable that is compatible on multiple PowerPC
platforms. One distinct drawback with the generic instructions is that
they do not exploit the underlying hardware features, and therefore
the executable generated will have suboptimal performance on all
PowerPC platforms. This paper describes a novel method for the
compiler to build a single executable that is targeted for multiple
PowerPC architectures. During compilation, the compiler analyzes all
the procedures to find the profitable candidates. Then it generates
both architectural specific instructions and generic instructions for
each candidate, and also inserts additional instructions in the
program to dispatch the appropriate instructions at runtime. Thus the
resulting executable can take advantage of the hardware features
available on the latest PowerPC processors while remaining backward
compatible with older ones. Experimental
results with SPEC CPU2000 benchmarks have demonstrated that the
executable generated by this feature delivers high performance on
different PowerPC processors, without sacrificing backward
compatibility.
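As a rough illustration of the runtime dispatch idea (a hand-written
C sketch, not the XL compiler's generated code), a resolver can
select between a generic routine and an architecture-specific clone
on the first call; cpu_has_feature() is a hypothetical stand-in for
the actual PowerPC feature test:

    #include <stdbool.h>

    /* Hypothetical stand-in for the runtime's PowerPC feature probe. */
    static bool cpu_has_feature(void) { return false; }

    static void kernel_generic(double *a, int n)   /* portable version */
    { for (int i = 0; i < n; i++) a[i] *= 2.0; }

    static void kernel_tuned(double *a, int n)     /* tuned clone */
    { for (int i = 0; i < n; i++) a[i] *= 2.0; }

    static void kernel_resolve(double *a, int n);
    static void (*kernel)(double *, int) = kernel_resolve;

    /* The first call resolves the clone once; later calls through the
       pointer go straight to the chosen version. */
    static void kernel_resolve(double *a, int n)
    {
        kernel = cpu_has_feature() ? kernel_tuned : kernel_generic;
        kernel(a, n);
    }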
Back to CDP05 Program
Does Training Input Selection Matter for Feedback-Directed Optimizations?
Paul Berube - Computing Science - University of Alberta
Presentation Slides: [PPT] [PDF]
Feedback-directed optimization (FDO) uses a training run that provides
a compiler with a profile that summarizes the run-time behavior of the
program. Most studies that use FDO techniques use either a single
input for both training and performance evaluation, or a single input
for training and a single input for evaluation. However, the run-time
behavior of a program is influenced by the data it is processing.
This exploratory study addresses an important open question: How
important is the selection of training data for FDO? Likely, the
answer to this question is not constant across all optimizations that
use profile information. How sensitive are individual compiler
transformations to the selection of training data used with FDO? Does
training on different inputs result in different optimization
decisions at compile time? Furthermore, do these different decisions
result in changes in program performance? We develop a tool called
Aestimo to quantify the differences between FDO logs for inlining and
if-conversion from the Open Research Compiler (ORC) for SPEC CINT2000
benchmark programs trained on a large number of inputs. Aestimo also
compares the performance of programs trained on different inputs, and
the performance of programs compiled with and without FDO. Training on
different inputs results in as much as a 6% difference in performance
on a workload of inputs. Also, evaluating FDO performance on different
inputs can lead to substantially different performance results.
Aestimo finds differences in best-case FDO performance on different
inputs for the same program larger than 13% for if-conversion, and
larger than 20% for inlining. Finally, Aestimo reveals that the
current if-conversion heuristics in the ORC always result in
performance degradation on the Itanium 2 processor when FDO is used.
Back to CDP05 Program
Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, and Angela Demke Brown
Presenter: Mathew Zaleski - Computing Science - University of Toronto
Presentation Slides: [PDF]
Direct-threaded interpreters use indirect branches to dispatch
bytecodes, but deeply-pipelined architectures rely on branch
prediction for performance. Due to the poor correlation between the
virtual program's control flow and the hardware program counter, which
we call the "context problem", direct threading's indirect branches
are poorly predicted by the hardware, limiting performance. Our
dispatch technique, context threading, improves branch prediction and
performance by aligning hardware and virtual machine state.
Linear virtual instructions are dispatched with native calls and
returns, aligning the hardware and virtual PC. Thus, sequential
control flow is predicted by the hardware return stack. We convert
virtual branching instructions to native branches, mobilizing the
hardware's branch prediction resources. We evaluate the impact of
context threading on both branch prediction and performance using
interpreters for Java and OCaml on the Pentium and PowerPC
architectures. On the Pentium 4, our technique reduces mean
mispredicted branches by 95%. On the PowerPC, it reduces mean branch
stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch
hazards, context threading reduces mean execution time by 25% for Java
and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We
also combine context threading with a conservative inlining technique
and find its performance comparable to that of selective inlining.
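The contrast can be sketched in C (using the GNU computed-goto
extension for the direct-threaded case); this is schematic only,
since context threading actually emits native call instructions per
virtual instruction rather than relying on a C-level loop:

    /* Direct threading: one shared indirect branch dispatches every
       bytecode, which the branch predictor handles poorly (the
       context problem).  GNU C computed goto. */
    void interp_direct(const unsigned char *code)
    {
        static void *table[] = { &&op_inc, &&op_dec, &&op_halt };
        const unsigned char *vpc = code;
        int acc = 0;
        goto *table[*vpc++];
    op_inc:  acc++; goto *table[*vpc++];
    op_dec:  acc--; goto *table[*vpc++];
    op_halt: (void)acc; return;
    }

    /* Context threading, schematically: the bytecode sequence is
       translated into a native sequence of direct calls, one per
       virtual instruction.  Matched calls and returns keep the
       hardware return-address stack aligned with the virtual PC, so
       sequential dispatch predicts well. */
    static int acc;
    static void body_inc(void) { acc++; }
    static void body_dec(void) { acc--; }

    void fragment(void)          /* for the bytecode "inc inc dec" */
    {
        body_inc();
        body_inc();
        body_dec();
    }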
Back to CDP05 Program
Inlining Java Native Calls At Runtime
Levon Stepanian, Angela Demke Brown, Allan Kielstra, Gita Koblents, and Kevin Stoodley
Presenter: Levon Stepanian - Computing Science - University of Toronto
Presentation Slides: [PPT] [PDF]
We introduce a strategy for inlining native functions into Java
applications using a JIT compiler. We perform further optimizations to
transform inlined "callbacks" into semantically equivalent lightweight
operations. We show that this strategy can substantially reduce the
overhead of performing JNI calls, while preserving the key safety and
portability properties of the JNI. Our work leverages the ability to
store statically-generated IL alongside native binaries, to facilitate
native inlining at Java callsites at JIT compilation time.
Preliminary results with our prototype implementation show speedups of
up to 93X when inlining and callback transformation are combined.
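For context, a typical JNI native of the kind this strategy targets
looks like the following (class and method names are our own
illustration). Each (*env)-> callback is an opaque call that, once
the native function's IL has been inlined at the Java callsite, the
JIT can recognize and rewrite as a direct operation on the Java
object:

    #include <jni.h>

    /* Illustrative native method (hypothetical class "Demo").  After
       inlining, the GetIntArrayElements/Release callbacks can be
       transformed into direct accesses to the Java array. */
    JNIEXPORT jint JNICALL
    Java_Demo_sum(JNIEnv *env, jobject self, jintArray arr)
    {
        jint n = (*env)->GetArrayLength(env, arr);
        jint *a = (*env)->GetIntArrayElements(env, arr, NULL);
        jint s = 0;
        for (jint i = 0; i < n; i++)
            s += a[i];
        (*env)->ReleaseIntArrayElements(env, arr, a, JNI_ABORT);
        return s;
    }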
Back to CDP05 Program
Toward Deterministic Performance in JIT Compilers
Mark Stoodley - IBM Toronto
Presentation Slides: [PPT] [PDF]
Real-time systems require a higher level of predictable performance
than other types of systems. A Java execution environment designed
for real-time systems must deliver high performance without
sacrificing deterministic behaviour. Although Java is a dynamic
language whose Just-In-Time compilers employ a variety of highly
speculative optimizations, many of the non-deterministic aspects of
high-performance Java execution can be mitigated, albeit at reduced
levels of sustained performance. For example, rather than compiling
code as the program executes, ahead-of-time (AOT) compilation
strategies can be used to achieve compiled-code speed without a JIT
compiler interrupting the program. The Real Time Specification for
Java also defines several new features that programmers can use as
tools to write programs that meet real-time deadlines. In this talk,
I will describe several areas of the JIT compiler and JVM that must be
modified to provide more deterministic performance, as well as talk
about the various AOT compiler changes mandated by the RTSJ.
Back to CDP05 Program
ISPRE = Isothermal Speculative Partial Redundancy Elimination
Nigel Horspool - University of Victoria
Presentation Slides: [PDF]
Speculative partial redundancy elimination (SPRE) attempts to minimize
the expected number of computations for each expression in a program.
Unfortunately, existing algorithms which solve the SPRE problem are
computationally expensive, so much so that they could consume a
significant fraction of the total compilation time. The new ISPRE
approach solves the problem for all expressions simultaneously,
trading optimality of the results for speed. Preliminary results
indicate, however, that its solutions are close to optimal in
practice.
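The effect of speculative PRE in general (not of the ISPRE algorithm
itself) can be seen in a small C example: given an edge profile
showing the guarded path is hot, an expression is hoisted above the
guard even though the cold path never needs it, reducing the expected
number of evaluations:

    /* Before: with a profile showing the guard is ~90% taken, x*y is
       expected to execute about 0.9*n times. */
    int before(const int *a, int n, int x, int y, int cold)
    {
        int s = 0;
        for (int i = 0; i < n; i++) {
            if (!cold)
                s += x * y + a[i];   /* hot path */
            else
                s += a[i];           /* cold path */
        }
        return s;
    }

    /* After speculative hoisting: x*y executes exactly once, even
       though it is useless work whenever the cold path runs. */
    int after(const int *a, int n, int x, int y, int cold)
    {
        int s = 0;
        int t = x * y;               /* speculative evaluation */
        for (int i = 0; i < n; i++) {
            if (!cold)
                s += t + a[i];
            else
                s += a[i];
        }
        return s;
    }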
Back to CDP05 Program
Safe Automatic Data Reorganization
Shimin Cui - IBM Toronto
Presentation Slides: [PPT] [PDF]
Improving cache efficiency is critical to achieve high performance on
modern machines. This presentation will describe a practical safe
automatic optimization technique for data reorganization to increase
the data cache utilization of pointer-based programs. The fully
automatic transformation will be discussed, including (1) data
splitting and data outlining, (2) memory allocation merging, and (3)
data grouping.
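A schematic C example of the kind of transformation involved (our
illustration, not the exact TPO implementation): data splitting
separates hot and cold fields so traversals touch densely packed
cache lines, with the cold data outlined into a parallel structure:

    /* Before: each node mixes hot traversal fields with cold payload,
       so a list walk drags cold bytes through the cache. */
    struct node {
        struct node *next;      /* hot: followed on every step   */
        int          key;       /* hot: compared on every step   */
        char         name[52];  /* cold: touched only on a match */
    };

    /* After (schematic) data splitting: traversal data is packed
       densely; cold payload is outlined and reached by index. */
    struct node_hot  { struct node_hot *next; int key; int cold; };
    struct node_cold { char name[52]; };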
Back to CDP05 Program
Dynamic Shape and Data Structure Analysis in Java
Clark Verbrugge - McGill University
Presentation Slides: [PPT] [PDF]
Analysis of dynamic data structure usage is important for both program
understanding and for improving the accuracy of other program
analyses. Static shape analysis techniques, however, suffer from
reduced accuracy in complex situations. We have designed and
implemented a dynamic shape analysis system that allows one to examine
and analyze how Java programs build and modify data structures. Using
a complete execution trace from a profiled run of the program, we
build an internal representation that mirrors the evolving runtime
data structures. The resulting series of representations can then be
analyzed and visualized, and we show how to use our approach to help
understand how programs use data structures, to measure the precise
effect of garbage collection, and to establish limits on static
techniques for shape analysis. A deep understanding of dynamic data
structures is
particularly important for modern, object-oriented languages that make
extensive use of heap-based data structures.
Back to CDP05 Program
Detecting Variable-Length Phases in Utility Programs
Xipeng Shen, Chen Ding, Sandhya Dwarkadas, and Michael Scott
Presenter: Xipeng Shen - University of Rochester
Presentation Slides: [PDF]
The behavior of utility programs depends strongly on the input. A
compiler, for example, behaves differently when compiling different
functions. Similar input dependences can be seen in interpreters,
compression and encoding tools, databases, and document parsers.
Because their behavior is hard to predict, these programs pose a
special challenge for such dynamic adaptation mechanisms as
phase-based memory management or hardware reconfiguration.
We present a two-step technique to detect phases in utility programs.
Active profiling constructs a regular input to induce repeating
behavior. Phase-boundary analysis then examines a basic-block trace
to identify patterns that occur periodically with regular inputs and
consistently (but at irregular scale) with real inputs. Once phases
have been identified, we use binary rewriting to insert phase markers
that will announce phase transitions at run time, with arbitrary
input. The technique can handle unmodified binary code. Experiments
with utility programs from the SPEC95 and SPEC2K benchmark suites
indicate that program behavior within phases is surprisingly
predictable in many (though not all) cases. This in turn suggests
that dynamic adaptation, either in hardware or in software, may be
applicable to a wider class of programs than previously believed.
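As a hypothetical sketch of the final step, an inserted marker need
only announce the transition; everything else (phase-based memory
management, hardware reconfiguration) hangs off this hook:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical marker inserted by binary rewriting at a detected
       phase boundary: it announces the transition at run time so that
       adaptation mechanisms can react. */
    void phase_marker(int phase_id)
    {
        fprintf(stderr, "entering phase %d (t=%ld)\n",
                phase_id, (long)clock());
        /* ...adaptation hooks would be invoked here... */
    }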
Back to CDP05 Program
A Technique for Generic Iteration and Its Optimization
Stephen M. Watt
Presentation Slides: [PDF]
Software libraries rely increasingly on iterator objects to provide
generic traversal of data structures. These iterators can be
represented either as structures that save state or as programs that
suspend and resume execution. This paper addresses two problems that
remain in the use of iterators today: The first problem is that
iterators represented as state-saving objects in languages such as C++
typically have logic that is much more complicated than the
suspend/resume iterators as pioneered by CLU. This paper
presents a program structuring technique that allows state-saving
iterators to be implemented with the same clarity as suspending
iterators. The second problem is that the usual implementation of
suspend/resume iterators is not suitable for inner loops in
high-performance applications. This paper presents a code
optimization technique that can be applied to suspend/resume iteration
to produce efficient natural loops at the machine code
level. Combined, these two results allow iterators for complex data
structures to be easily understood and implemented efficiently.
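The first problem is easy to see in C, where a state-saving in-order
tree iterator must reify the traversal's control state as an explicit
stack. This example merely illustrates the problem; the paper's
structuring technique is what removes this complexity:

    /* A state-saving in-order tree iterator: the control state of the
       recursive walk becomes an explicit stack (depth capped for the
       sketch). */
    struct tree { int val; struct tree *left, *right; };

    struct iter { struct tree *stack[64]; int top; };

    void iter_init(struct iter *it, struct tree *root)
    {
        it->top = 0;
        for (; root; root = root->left)
            it->stack[it->top++] = root;   /* descend to leftmost */
    }

    /* Returns 0 when exhausted, else stores the next value in *out. */
    int iter_next(struct iter *it, int *out)
    {
        if (it->top == 0) return 0;
        struct tree *t = it->stack[--it->top];
        *out = t->val;
        for (t = t->right; t; t = t->left)
            it->stack[it->top++] = t;
        return 1;
    }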
Back to CDP05 Program
Partial Copying for First and Last Privatization of Arrays
Guangsong Zhang, Eik Charlebois, and Roch Archambault - IBM Toronto
Presentation Slides: [PPT] [PDF]
Array data flow analysis is a known method for privatizing arrays
discussed in [1], [2], and [3]. The core idea is to examine array
sections to determine whether they are local to a loop iteration and
can therefore be privatized. To prove that it is safe to privatize an
array, every use of an array element must have a dominating definition
in the same loop iteration. That is, every path from the top of the
loop to the use of the array element must pass the definition site of
the same array element. [1] supplies the basic algorithm for this
analysis and [2] further refines it. However, the approaches taken
for array copy-in and copy-out are inefficient. In particular,
supporting first privatization is discouraged because of this
drawback. With the well-defined semantics of OpenMP [4], we find it
easier to partition the different kinds of array uses, such as
copy-in and copy-out, as firstprivate and lastprivate variables. Our
approach focuses on calculating those sets efficiently.
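A small OpenMP C example of the semantics in question [4]; note that
the clauses imply whole-array copies on entry and exit even though
each iteration touches only a[i], which is exactly the cost that
computing partial copy sets aims to remove:

    #include <omp.h>
    #define N 1024

    void example(void)
    {
        int a[N];
        for (int i = 0; i < N; i++) a[i] = i;   /* copy-in source */

        /* firstprivate: each thread starts from a full copy of a.
           lastprivate: the copy from the thread that ran iteration
           N-1 is copied back out on exit. */
    #pragma omp parallel for firstprivate(a) lastprivate(a)
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 1;
    }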
References
[1] Peng Tu. Automatic Array Privatization and Demand-Driven Symbolic
Analysis. PhD thesis, University of Illinois at Urbana-Champaign,
1995.
[2] Junjie Gu and Zhiyuan Li. Efficient interprocedural array
data-flow analysis for automatic program parallelization. IEEE
Transactions on Software Engineering, 26(3):244-261, 2000.
[3] Manish Gupta, Sayak Mukhopadhyay, and Navin Sinha. Automatic
parallelization of recursive procedures. In Proceedings of the
International Conference on Parallel Architectures and Compilation
Techniques (PACT), 1999.
[4] OpenMP Architecture Review Board. OpenMP Application Program
Interface, Version 2.5, 2005. http://www.openmp.org.
Back to CDP05 Program
JIT Compilation Strategy Overview in J9 JVM
Marius Pirvu, Derek Inglis, and Vijay Sundaresan
Presentation Slides: [PPT] [PDF]
Modern Java Virtual Machines (JVMs) employ Just-In-Time (JIT)
compilers that optimize and generate native code at runtime in order
to improve the performance of Java programs. Because compilation is
inherently part of the application's running time, minimizing
compilation overhead is a major concern in JIT compilation. Short
running programs and the start-up phase in large server or GUI-based
applications are examples of scenarios where compilation time is a
significant proportion of the overall execution time. To address these
problems, JIT developers have a few alternatives: (1) Develop cheaper
optimization algorithms; (2) Restrict compilation activity to a subset
of methods that are deemed to be "frequently executed"; (3)
Attenuate the negative impact of compilation on the overall execution
of the program. In this paper we will elaborate on solutions for the
latter two options to reduce compilation overhead, solutions
implemented in the TR JIT compiler developed for the J9 virtual
machine at IBM.
Adaptive JIT compilers, like Sun's HotSpot or Jikes RVM, start from
the assumption that most of an application's time is spent in a small
portion of the code. They dynamically identify the set of performance
critical routines and optimize them heavily, while the rest of the
code is either interpreted or compiled with minimal
optimizations. Both J9 and the previous JVM offered by IBM (codename
Sovereign) follow this principle. However, from the point of view of
compilation strategy, J9 is different from Sovereign in the following
respects: (1) it provides a higher optimization granularity by
offering multiple optimization levels; (2) it monitors program
execution continuously and may recompile methods at higher
optimization levels. We will describe the compilation infrastructure
implemented in the TR JIT compiler and present a comparison among
different optimization strategies regarding compilation time and
application performance.
Some applications have a very flat execution profile with no clear
hot-spots. A case in point is WebSphere Application Server, where
thousands of methods are more or less equally important. In these
situations the JIT compiler cannot make discriminatory decisions and
will have to either spend a significant amount of time compiling these
methods or lower the optimization level at the expense of runtime
performance. To cope with this scenario we have developed a mechanism
that selectively lowers the optimization level for some parts of the
execution. We will describe this mechanism and show how it enabled us
to significantly reduce the start-up time of WebSphere (up to 33%)
without degrading the performance of J2EE applications running on top
of it.
To attenuate the negative effects of compilation we have recently
implemented an asynchronous compilation mechanism: All compilations
are carried out by a separate compilation thread that runs
asynchronously to the Java threads. This means that a Java thread will
not have to wait for the compiled version of the method, but can
instead go ahead and make progress. The major benefit of the
asynchronous compilation is that it increases the application's
parallelism when running on multiprocessors. In addition to detailing
our implementation of asynchronous compilation, we will describe other
work done to minimize the impact of a long-running compilation on
overall performance. We will also present experimental data showing
substantial improvements in the start-up time of WebSphere (up to
25%) and speedups of short-running applications (up to 24%).
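A minimal pthreads sketch of the asynchronous-compilation idea (our
simplification, not the TR JIT's implementation): application threads
enqueue a method and continue in the interpreter while a dedicated
thread compiles in the background; compile() and method_t are
stand-ins for the real JIT entry points:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct method method_t;
    extern void compile(method_t *m);      /* stand-in for the JIT */

    #define QCAP 128
    static method_t *queue[QCAP];
    static int head, tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* Called from a Java thread: request compilation, do not wait. */
    void request_compile(method_t *m)
    {
        pthread_mutex_lock(&lock);
        if ((tail + 1) % QCAP != head) {   /* drop request if full */
            queue[tail] = m;
            tail = (tail + 1) % QCAP;
            pthread_cond_signal(&nonempty);
        }
        pthread_mutex_unlock(&lock);
        /* caller keeps running the interpreted version meanwhile */
    }

    void *compilation_thread(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == tail)
                pthread_cond_wait(&nonempty, &lock);
            method_t *m = queue[head];
            head = (head + 1) % QCAP;
            pthread_mutex_unlock(&lock);
            compile(m);                    /* runs asynchronously */
        }
        return NULL;
    }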
Back to CDP05 Program
Speeding Up Floating-Point Division with In-Lined Iterative Algorithms
Robert Enenkel and Allan Martin - IBM Toronto Lab
Presentation Slides: [PRZ] (Lotus Notes) [PPT]
[PDF]
In many modern pipelined processors, hardware floating-point division
is one of the most expensive instructions and may completely occupy
the floating-point unit for its entire duration. Alternatively, a
software algorithm for floating-point division could be used. Although a
software algorithm would have longer latency than the hardware instruction,
if multiple independent computations can be interleaved so as to
effectively utilize the cycles during pipeline delays, the software
algorithm can result in higher throughput. This approach can fill the gap
between hardware division and vector subroutine libraries such as IBM MASS
(Mathematical Acceleration Subsystem).
This talk describes our implementation of software floating-point division
for the PowerPC family of processors in the IBM Fortran and C/C++
compilers. The algorithms are available both as user-callable built-in
functions and automatically through the compiler. We first describe the
division algorithms used, which are based on hardware table look-up
estimates and Newton iteration, with variations for different accuracy and
exception-handling requirements. We then explain how the compiler decides
whether the software algorithm is likely to be profitable (and hence
whether to automatically invoke it), based on a measurement of the amount
of independent computation available to interleave with the software
division.
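The reciprocal-refinement scheme at the core of such algorithms can
be sketched in C; recip_estimate below is a crude portable stand-in
for the PowerPC reciprocal-estimate instruction, and the exact number
of Newton steps and the final correction vary with the accuracy and
exception-handling requirements:

    /* Software division by reciprocal refinement (illustrative only).
       Each Newton step roughly doubles the number of correct bits in
       r ~= 1/d. */
    static double recip_estimate(double d)
    {
        return (double)(1.0f / (float)d);  /* crude portable seed */
    }

    double soft_divide(double n, double d)
    {
        double r = recip_estimate(d);
        r = r * (2.0 - d * r);             /* Newton step 1 */
        r = r * (2.0 - d * r);             /* Newton step 2 */
        double q = n * r;                  /* quotient estimate */
        return q + r * (n - d * q);        /* residual correction */
    }

Throughput comes from interleaving several independent instances of
this multiply-add sequence so their latencies overlap, which is why
the compiler first checks that enough independent work is available
before substituting the software form.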
Back to CDP05 Program
Auto-SIMDization Challenges
Amy Wang - IBM Toronto Lab
Presentation Slides: [PPT] [PDF]
Automatic SIMDization is an optimization in TPO that exploits SIMD
parallelism for VMX hardware on PPC970 by vectorizing loops. This
feature was first released in XL C/C++ V7 and XL Fortran V9 for Linux
on SLES9 last year. In this talk, we will present some of the
challenges we encountered and the solutions being
implemented. Specifically, we will address the memory
protection/multi-threading issues that arose from the memory
truncation effect in the VMX load and store instructions. We will also
discuss the strengths and weaknesses of our current design by looking
into the effect of loop distribution on SIMDization. We will then
present our attempts to determine the profitability of SIMDization
and our novel solution of mix-mode SIMDization. We will also outline
some of the ongoing tuning efforts to better integrate the SIMD
technology with existing optimizations. Selected performance data
will be shown to illustrate these efforts. The talk will conclude
with our future plans for the technology and for future hardware.
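To make the discussion concrete, here is a hand-written mixed
scalar/vector loop using VMX/AltiVec intrinsics, roughly the shape of
code a SIMDizer produces (our sketch, not TPO's output). Note that
vec_ld and vec_st silently truncate the low four address bits, the
truncation effect mentioned above, so the vector body is only safe
for 16-byte-aligned accesses and a scalar loop handles the rest:

    #include <altivec.h>

    void vadd(float *a, float *b, float *c, int n)
    {
        int i = 0;
        /* Vector body only when all operands are 16-byte aligned,
           since vec_ld/vec_st ignore the low 4 address bits. */
        if ((((unsigned long)a | (unsigned long)b |
              (unsigned long)c) & 15) == 0)
            for (; i + 4 <= n; i += 4) {
                vector float va = vec_ld(0, &a[i]);
                vector float vb = vec_ld(0, &b[i]);
                vec_st(vec_add(va, vb), 0, &c[i]);
            }
        for (; i < n; i++)        /* scalar remainder */
            c[i] = a[i] + b[i];
    }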
Back to CDP05 Program