MSP 2004, Memory Systems Performance June 8, 2004
Program

Accepted Papers

Paper Submission

Call for Papers

Important Dates


Committees




Accepted Papers

The organizers are delighted to announce that the following high quality papers have been accepted for publication at MSP 2004:

Reuse-distance-based Miss-rate Prediction on a Per Instruction Basis
Changpeng Fang, Michigan Technological University
Steve Carr, Michigan Technological University
Soner Onder, Michigan Technological University
Zhenlin Wang, Michigan Technological University

Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers. Recently, reuse-distance analysis has shown much promise in predicting the memory behavior of programs over a wide range of input sizes. Reuse-distance analysis predicts program locality by experimentally determining locality properties as a function of the data size of a program, allowing accurate locality analysis when the program's data size changes.Prior work has established the effectiveness of reuse distance analysis in predicting whole-program locality and miss rates. In this paper, we show that reuse distance can also effectively predict locality and miss rates on a per instruction basis. Rather than predict locality by analyzing reuse distances for memory addresses alone, we relate those addresses to particular static memory operations and predict the locality of each instruction. Our experiments show that using reuse distance without cache simulation to predict miss rates of instructions is superior to using cache simulations on a single representative data set to predict miss rates on various data sizes. Our analysis allows us to specifically identify memory operations that are likely to produce a significant number of cache misses either at the L1 or L2 level for a given data size.

Programmer Specified Pointer Independence
David Koes, Carnegie Mellon University
Mihai Budiu, Carnegie Mellon University
Girish Venkataramani, Carnegie Mellon University
Seth Goldstein, Carnegie Mellon University

Good alias analysis is essential in order to achieve high performance on modern processors, yet precise interprocedural analysis does not scale well. We present a source code annotation, #pragma independent, which provides precise pointer aliasing information to the compiler, and describe a tool which highlights the most important and most likely correct locations at which a programmer should insert these annotations. Using this tool we perform a limit study on the effectiveness of pointer independence in improving program performance through improved compilation.

Improving Trace Cache Hit Rates Using the Sliding Window Fill Mechanism and Fill Select Table
Muhammad Shaaban, Rochester Institute of Technology
Edward Mulrane, Rochester Institute of Technology

As superscalar machines become increasingly wide, it is inevitable that the large set of instructions to be fetched every cycle will span multiple noncontiguous basic blocks. The mechanism to fetch, align, and pass this set of instructions down the pipeline must do so as efficiently as possible. The concept of trace cache has emerged as the most promising technique to meet this high-bandwidth, low-latency fetch requirement. A new fill unit scheme, the Sliding Window Fill Mechanism, is proposed as a method to efficiently populate the trace cache. This method exploits trace continuity and identifies probable start regions to improve trace cache hit rate. Simulation yields a 7% average hit rate increase over the Rotenberg fill mechanism. When combined with branch promotion, trace cache hit rates experienced a 19% average increase along with a 17% average improvement in fetch bandwidth.

Automatic Blocking Of QR and LU Factorizations for Locality

Qing Yi, Lawrence Livermore National Laboratory
Ken Kennedy, Rice University
Haihang You, University of Tennessee
Keith Symour, University of Tennessee
Jack Dongarra, University of Tennessee

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blocking for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.

Metrics and Models for Reordering Transformations
Michelle Strout, Argonne National Laboratory and University of Chicago
Paul Hovland, Argonne National Laboratory

Irregular applications exhibit poor performance on current computer architectures partially due to their inefficient use of the memory hierarchy. Run-time data and iteration reordering transformations using inspector/executor strategies have been shown to improve the locality and therefore the performance of irregular benchmarks. This paper describes a model for determining which combination of run-time data and iteration reordering heuristics will result in the best executor performance for a given dataset. We propose that the data and iteration reordering transformations should be viewed as approximating minimal linear arrangments on two separate hypergraphs: a spatial locality hypergraph and a spatio-temporal locality hypergraph. Our results show that metrics on the spatial locality hypergraph can guide the selection of a data reordering heuristic and that metrics on the spatio-temporal locality hypergraph can guide selection of an iteration reordering heuristic. We also introduce new iteration reordering heuristics based on the spatio-temporal locality metric that result in better performance than previous heuristics.
 
Instruction Combining For Coalescing Memory Accesses Using Global Code Motion
Motohiro Kawahito, IBM Tokyo Research Laboratory
Hideaki Komatsu, IBM Tokyo Research Laboratory
Toshio Nakatani, IBM Tokyo Research Laboratory

Instruction combining is an optimization to replace a sequence of instructions with a more efficient instruction yielding the same result in a fewer machine cycles. When we apply it for coalescing memory accesses, we can reduce the memory traffic by combining narrow memory references with contiguous addresses into a wider one for taking advantage of a wide-bus architecture. In this paper, we propose a new algorithm for instruction combining by applying global code motion to wider regions of the given program in search of more potential candidates. We implemented two optimizations for coalescing memory accesses, one combining two 32-bit integer loads and the other combining two single-precision floating-point loads, using our algorithm in our Java JIT compiler for IA-64, and evaluated them by measuring the SPECjvm98 benchmark suite. In our experiment, we can improve the maximum performance by 5.5% with little additional compilation time overhead. For the MolDyn benchmark in the JavaGrande benchmark suite, we can improve the performance gain by 7.3%, when we replace every declaration of double of instance variable with that of float. Our approach can be applied to a variety of architectures and programming languages besides Java.

An Empirical Performance Analysis of Commodity Memories in Commodity Servers
Darren Kerbyson, Los Alamos National Laboratory
Mike Lang, Los Alamos National Laboratory
Gene Patino, Smart Modular Technologies Inc
Hossein Amidi, Smart Modular Technologies Inc

This work details a performance study of six different commodity memories in two commodity server nodes on a number of micro-benchmarks, that measure low-level performance characteristics, as well as on two applications representative of the ASCI workload. The memories vary both in terms of performance, including latency and bandwidths, and also in terms of their physical properties and manufacturer. Two server nodes were used; one Itanium-II Madison based system, and one Xeon based system All the memories examined can be used within both processing nodes. This allows the performance of the memories to be directly examined while keeping all other factors within a processing node the same (processor, motherboard, operating system etc.). The results of this study show that there can be a significant difference in application performance from the different memories – by as much as 20%. Thus, by choosing the most appropriate memory for a processing node at a minimal cost differential, significant improved performance may be achievable.