Talk: 
Abstract
The contribution of this paper is three-fold: 1) we draw together common threads in a diverse literature, 2) we identify a framework for extending high-level languages for low-level programming, and 3) we show the power of this approach through concrete case studies. Our framework leverages just three core ideas: extending semantics via intrinsic methods, extending types via unboxing and architectural-width primitives, and controlling semantics via scoped semantic regimes. We develop these ideas through the context of a rich literature and substantial practical experience. We show that they provide the power necessary to implement substantial artifacts such as a high-performance virtual machine, while preserving the software engineering benefits of the host language.
The time has come for high-level low-level programming to be taken more seriously: 1) more projects now use high-level languages for systems programming, 2) increasing architectural heterogeneity and parallelism heighten the need for abstraction, and 3) a new generation of high-level languages are under development and ripe to be influenced.
Talk: 
Abstract
Talk: 
Data (csv):
Complete results:
Abstract

Abstract
The Moxie project set out to explore the question: "How would we design a JVM from scratch knowing what we know today?' Amid the mass of design questions we faced, the tension between performance and flexibility was pervasive, persistent and problematic. In this experience paper we describe the Moxie project and its lessons, a process which began with consulting experts from industry and academia, and ended with a fully working prototype.

Abstract
We perform a thorough analysis in the context of MMTk and Jikes RVM on a wide range of benchmarks and four different architectures. Our baseline system (which includes a number of our improvements) is very competitive with highly tuned alternatives. We show a simple marking mechanism which offers modest but consistent improvements over conventional choices. Finally, we show that enqueuing the edges (pointers) of the object graph rather than the nodes (objects) significantly increases opportunities for software prefetch, despite increasing the total number of queue operations. Combining edge ordered enqueuing with software prefetching yields average performance improvements over a large suite of benchmarks of 20-30% in garbage collection time and 4-6% of total application performance in moderate heaps, across four architectures.
Abstract
Abstract
Note that this is an
extend version of our OOPSLA paper, which includes
extensive data for SPECjvm98 in addition to the DaCapo benchmarks.
Abstract
Since benchmarks drive computer science research and industry product
development, which ones we use and how we evaluate them are key
questions for the community. Despite complex runtime tradeoffs due to
dynamic compilation and garbage collection required for Java programs,
many evaluations still use methodologies developed for C, C++, and
Fortran. SPEC, the dominant purveyor of benchmarks, compounded this
problem by institutionalizing these methodologies for their Java
benchmark suite. This paper recommends benchmarking selection and
evaluation methodologies, and introduces the DaCapo benchmarks, a set
of open source, client-side Java benchmarks. We demonstrate that the
complex interactions of (1) architecture, (2) compiler, (3) virtual
machine, (4) memory management, and (5) application require more
extensive evaluation than C, C++, and Fortran which stress (4) much
less, and do not require (3). We use and introduce new value,
time-series, and statistical metrics for static and dynamic properties
such as code complexity, code size, heap composition, and pointer
mutations. No benchmark suite is definitive, but these metrics show
that DaCapo improves over SPEC Java in a variety of ways, including
more complex code, richer object behaviors, and more demanding memory
system requirements. This paper takes a step towards improving
methodologies for choosing and evaluating benchmarks to foster
innovation in system design and implementation for Java and other
managed languages.
Abstract
Good design points include a parallel L0 (PL0) which improves Java programs by 3% on average, and C by 2% in cycle-accurate simulation over a two-cycle 32KB, 128B line, 2-way L1. A transient qualifying cache (TQ) improves further by a) minimizing pollution in the parent by filtering short-lived lines without temporal reuse, and b) using a write no-fetch policy with per-byte valid bits to eliminate wasted fetch bandwidth. TQs at L1 and L2 improve Java programs by 5% on average and up to 15%. The TQ even achieves improvements when the parent has half the capacity or associativity compared to the original larger L1. The one-cycle access time, a write no-fetch policy, and filtering bestow these benefits. Java motivates this approach, but it also improves for C programs.
Abstract

Abstract

Abstract
Abstract
Abstract

Abstract
Abstract
Abstract
Experiments on SPEC JVM Benchmarks reveal key algorithmic features and how they match program characteristics to explain the direct cost of GC as a function of heap size, and the indirect impact of GC on application performance. Results include the finding that the choice of GC algorithms can improve mutator locality, disputing the myth that "no GC is good GC." We show how the trade-offs in space utilization versus allocation and tracing costs in copying and mark-sweep collectors motivates a copying nursery for newly allocated objects, even without high nursery mortality. We find that object locality and pointer mutations demographics in the mature space are much less sensitive to GC algorithm, but that copying and mark-sweep for the mature space occasionally excel in different programs. This study is unique in its breadth of GC algorithms and its depth of analysis.
Abstract
Abstract
Abstract
The increasing reliance on garbage collected languages such as Java requires that the collector perform well. We show that the generality of Beltway enables us to design and implement new collectors that are robust to variations in heap size and improve total execution time over the best generational copying collectors of which we are aware by up to 40%, and on average by 5 to 10%, for small to moderate heap sizes. New garbage collection algorithms are rare, and yet we define not just one, but a new family of collectors that subsumes previous work. This generality enables us to explore a larger design space and build better collectors.
Abstract
We also study the compilation cost of write-barrier code placement. We find that partial inlining reduces the compilation cost by 20 to 25% compared to full inlining. In the context of just-in-time compilation, the application is exposed to compiler activity. Regardless of the level of compiler activity, partial inlining consistently gives a total running time performance advantage over full inlining on the SPEC JVM98 benchmarks. When the compiler optimizes all application methods on demand and compiler load is highest, partial inlining improves total performance on average by 10.2%, and up to 18.5%.
Abstract
Abstract
We extend the state of the art for simulating GC algorithms in two ways. First, we present a systematic methodology and results on the effects of trace granularity for a variety of copying GC algorithms. We show that trace granularity often distorts GC performance results compared with perfect traces, and that some GC algorithms are more sensitive to this effect than others. Second, we introduce and measure the performance of a new precise algorithm for generating GC traces which is over 800 times faster than the brute force method. Our algorithm, called Merlin, frequently timestamps objects and later uses the timestamps of dead objects to reconstruct precisely when they died. It performs only periodic garbage collections and achieves high accuracy at low cost, eliminating any reason to use granulated traces.
Abstract

Abstract
Abstract
Abstract
In this paper we make two key contributions. First, we present a generic approach to the design of transactional collectors that promotes clarity, generality, and understandability, and then using this approach, we present a new transactional garbage collection algorithm, TMOS. Our design approach brings together three independent components---a mutator, a transactional store, and a GC algorithm. TMOS represents the application of the Mature Object Space family of GC algorithms to the transactional context through our approach to transactional GC design.
Abstract
Abstract
Abstract

Abstract
The Australian Bureau of Statistics' Business Register (BR) is an OODB application that forms a register of all businesses in Australia. The register is fundamental to much of the ABS' economic survey activity and is used by many branches of the large organization. The basis of the register is an object model that reflects the business structure of all Australian businesses, from large multinationals through to corner stores. There are over 100,000,000 objects in the register and it is constantly updated, both through operator-driven data entry and the batch processing of data from other government agencies. Specific requirements are acceptable response time for interactive access for more that 100 users, long and short transaction support, efficient storage and online access to historic information, flexible object version views and schema evolution support.
The challenges raised by the BR are dominated by the problems of complexity and scale. Our response to these challenges has been guided by our view that {\em abstraction} has a basic role in addressing complexity and scale, and by our use of Java as an implementation context.
In this paper we report not only our approach, but also novel technological outcomes which stem from our exposure to the challenges highlighted by the ABS-BR. These outcomes include: portable orthogonally persistent Java, a system for object versioning, a proposal for schema versioning, and a system architecture for scalable object servers based on a separation of programming language and storage system concerns. These outcomes are examined in the context of an ABS-BR technology demonstrator---a small-scale version of the ABS-BR built to test and showcase new technologies for high performance object servers.
Abstract
In this paper we present a framework for semantic extensions in Java. This framework consists of a number of simple but powerful transformations that, among other things, allow us to semantically extend Java to provide orthogonal persistence.
The use of semi-dynamic program transformations lends our orthogonally persistent Java a number of important qualities, including simplicity, portability and a clean model of persistence. Significantly, our implementations are efficient and can outperform in some cases PJama, a well-known orthogonally persistent Java, which is based on a modified virtual machine.
In addition to describing the application of these transformations to orthogonally persistent Java, we foreshadow their use in a number of other contexts, including dynamic instance versioning and instrumentation.
Abstract
Abstract
The demands of orthogonality appear to place orthogonal persistence at odds with more general approaches to concurrency control. Given its stated objective of providing a `complete' computational environment, the future of the orthogonal persistence vision in general, and persistent Java in particular, will depend, to some extent, on the release of this tension through the realization of new approaches to the integration of concurrency and persistence.
This paper addresses both strict transactional and more `open' approaches to managing concurrency in face of persistence, and examines difficulties with each approach. A simple transaction model that relieves some of these difficulties is presented and its limitations briefly examined.
Abstract
Abstract
Abstract
A convergence of computing and communications technologies has lead to the `information explosion'. Demand for information has grown at an unprecedented pace, placing pressure on the scalability of information servers. At the same time there has been a revolution in switching technology which has seen the once exotic tightly coupled distributed memory computer become widely available through commodity components. These two developments in computing and communications technologies come together, the former bringing with it the challenge of constructing systems capable of scaling to address the spiraling demand for information, and the latter offering a scalable hardware architecture upon which such systems can be based.
The most elegant approach to persistent data management lies in the use of orthogonal persistence. Orthogonally persistent programming languages bring together the disparate concerns of programming languages and databases by making the persistence of data a property orthogonal to all others. However, while users are presented with a unified view of the concerns, implementers of orthogonally persistent programming languages must confront the disparity head-on.
The quest to design efficient, scalable, orthogonally persistent systems represents a bringing together of the challenge of constructing systems capable of scaling with the challenge of constructing orthogonally persistent systems. This raises the deeper question as to whether there exists a generalized framework for scalable persistent system design. The framework will bring together the fundamental concerns of concurrency, replication, coherency, latency, and stability.
The major result in this thesis is the description of such a framework. The fundamental concerns are met through a reference architecture based on caching, atomicity, and a layered software architecture.
The framework has been realized in the form of the Persistent Store Interface (PSI). The effectiveness of PSI for scalable persistentsystem construction is demonstrated through a number of experiments with PSI prototypes, both stand-alone and distributed. In addition,two supporting experiments are described, one examining the issues ofdesigning scalable stores that present users with a single store imageand the other examining mechanisms for scalable coherency and recovery.
The development of a generalized
and practical framework for scalable persistent
system design in this thesis gives cause for optimism thatorthogonal persistence
will play an important role in the future of scalable
information management.
Abstract
The paper argues for a store layer which respects the need for caching and replication, and to do so at an "object" level granularity of memory use. These facits are interrelated through atomic processes, leading to an interface for the store which is strongly transactional in character. The paper describes the experimental performance of such a layer on a scalable multi-computer architecture. The behaviour of the store supports the view that a scalable cached "transactional" store architecture is a practical objective for high performance based on parallel computation across distributed memories.
Abstract
We identify the challenges associated with the construction of such a storage layer and report on progress in each of these areas. Detailed results of recent experiments on scalable transactional cache coherency are presented.
Abstract
In this paper we present PSI---a practical storage abstraction that separates database and programming language concerns and facilitates the adoption of mainstream transactional storage technology within orthogonally persistent systems. We argue for PSI as the basis for persistent Java system construction with particular reference to how it might be applied to PJama0 [Atkinson96].
Abstract
The algorithms have been successfully implemented and tested on a 128 node Fujitsu AP1000 distributed memory multicomputer. The paper presents performance results which indicate good performance and scalability for these algorithms under a range of situations. The work is seen as a step in the continuing development of high performance multicomputer object stores.
Abstract