Sun UltraSPARC T2 (Niagara 2) Processor Summary
COMP3320 - March 2010

The following table briefly summarises the Sun UltraSPARC T2 processor, which will be used for Lab 2 (Hardware Architecture and Code Performance). Note that our T2 system (wallaman.anu.edu.au) can be accessed via SSH.

The T2 is a processor consisting of 8 cores, with each core supporting 8 hardware threads. Released in 2007, it is an unusual example of a Post-RISC multi-core architecture, aimed more at server and cryptographic applications than at floating point performance. Unlike contemporary Intel processors, which emphasise high clock speeds, branch prediction, deep pipelines and out-of-order/speculative execution, the T2 adopts a highly threaded approach using a set of relatively simple processing cores. While the performance of each core may not be spectacular, the design places heavy emphasis on the memory subsystem, with features such as a high bandwidth crossbar switch between the L2 cache and the cores, and on-chip memory controllers and network interfaces. As a result, the T2 has broken several single/dual CPU SPEC benchmark records, including SPECweb2005, SPECint_rate2006 and SPECfp_rate2006.


Diagram from David Kanter's Article

The T2 also supports virtualization: a single T2 system can be partitioned into 64 virtualized domains, potentially with a different operating system running on each domain. For Lab 2, however, we will only be concerned with the memory subsystem, specifically the L1 and L2 cache performance. Since we are not using threads, the performance results obtained in Lab 2 should be reasonably stable, apart from possible contention when twenty students run their programs on the T2 at once.
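
Since Lab 2 looks at L1 and L2 cache behaviour, it helps to know which level a given array can fit in. The sketch below uses the cache capacities listed in the table that follows. The function name and the capacity-only model are assumptions made for illustration: it ignores conflict misses, other users' data in the shared L2, and TLB effects.

```python
# Cache capacities taken from the summary table (in bytes).
L1D_BYTES = 8 * 1024          # L1 data cache, per core
L2_BYTES = 4 * 1024 * 1024    # L2 cache, shared by all cores


def smallest_level_that_fits(working_set_bytes):
    """Return the smallest level of the hierarchy that can hold the working set.

    Capacity-only model (an assumption of this sketch, not a measurement).
    """
    if working_set_bytes <= L1D_BYTES:
        return "L1"
    if working_set_bytes <= L2_BYTES:
        return "L2"
    return "memory"


if __name__ == "__main__":
    # Arrays of 8-byte doubles of increasing length.
    for n_doubles in (512, 1024, 2048, 524288, 1048576):
        size = n_doubles * 8
        print(n_doubles, size, smallest_level_that_fits(size))
```

Under this model an array of 1024 doubles (8KB) just fits in the L1 data cache, while anything larger than 524288 doubles (4MB) no longer fits in the L2 cache.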

Architecture

SPARC v9

Technology

- 65 nm
- 503 million transistors

Year

2007

Basic Descriptions

- Post-RISC
- Aggressively Multi-Core, Each Core Relatively Simple
- Highly Threaded
- Power Efficient
- System-on-a-Chip (e.g. PCI Express Unit, SMP, Network Card on Chip)
- On-chip, Shared Memory, Uniform Memory Access (UMA)
- Designed for use in Low-End to Mid-Range Servers

Execution Cores

Cores

8 x [1.2-1.6] GHz SPARC 64-bit Cores
Thread level parallelism

Per Core:
- 8 x Hardware Threads
- 8 x Register Windows per Thread (Total of 32 x Global, 64 x Local, 64 x Parameter Registers)
- 1 x FPU, 1 x LSU, 2 x ALU, 1 x SPU (For Cryptography)
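
The per-core figures above can be multiplied out to chip-wide totals. The following is a minimal sketch of that arithmetic; the names are chosen here for illustration only.

```python
CORES = 8
THREADS_PER_CORE = 8
# Functional units per core, as listed above.
UNITS_PER_CORE = {"FPU": 1, "LSU": 1, "ALU": 2, "SPU": 1}


def chip_totals():
    """Multiply the per-core resources out to the whole chip."""
    totals = {unit: count * CORES for unit, count in UNITS_PER_CORE.items()}
    totals["hardware_threads"] = CORES * THREADS_PER_CORE
    return totals


if __name__ == "__main__":
    for name, count in chip_totals().items():
        print(name, count)
```

This gives 64 hardware threads sharing 8 FPUs (8 threads per FPU), which is consistent with the chip being aimed at server workloads rather than floating point performance.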

Pipelines (Per Core)

- Relatively short
- In-order execution
- Very basic branch prediction
- 2 x Pipelines per core (i.e. can issue 2 instructions per cycle)
- Requires scheduling of threads with some speculation
- Division and Sqrt not pipelined

Pipeline Length
- 8 stages for Integer Operations
- 12 stages for Floating Point Operations (6 cycle stall if followed by dependent instr)
- 5 cycle branch misprediction penalty
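
To get a feel for these numbers, the sketch below works out two quantities: the peak instruction issue rate of the whole chip, and the rough cost of a chain of dependent floating point operations on a single thread. The chain model (one issue cycle per operation plus the stated 6 cycle stall between each dependent pair) is a simplifying assumption made here, not a description of the actual pipeline timing.

```python
CORES = 8
ISSUE_PER_CORE = 2        # 2 pipelines per core
FP_DEPENDENT_STALL = 6    # cycles, from the pipeline length entry above


def peak_issue_rate_ginstr(clock_ghz):
    """Upper bound on instructions issued per nanosecond across the chip."""
    return CORES * ISSUE_PER_CORE * clock_ghz


def dependent_fp_chain_cycles(n_ops):
    """Rough cycle count for n_ops FP operations, each needing the previous result.

    Assumed model: 1 issue cycle per operation, plus a 6 cycle stall
    between each consecutive dependent pair, on a single thread.
    """
    if n_ops <= 0:
        return 0
    return n_ops + FP_DEPENDENT_STALL * (n_ops - 1)


if __name__ == "__main__":
    for ghz in (1.2, 1.6):
        print(ghz, "GHz:", peak_issue_rate_ginstr(ghz), "G instr/s peak")
    print("chain of 4 dependent FP ops:", dependent_fp_chain_cycles(4), "cycles")
```

In this model a chain of 4 dependent operations takes 22 cycles, 18 of which are stalls. Those idle cycles are the kind of gap that the other hardware threads on the core can fill.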

On-Chip Memory Subsystem

L1 Instruction Cache (Per Core)

- 16KB, 32B Cache Line, 8-way Set Associative
- with Prefetching

L1 Data Cache (Per Core)

- 8KB, 16B Cache Line, 4-way Set Associative
- Write through, with store buffer
- 3 cycle load latency (stall pipeline if followed by dependent instr)
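
The size, line length and associativity above determine how many sets each L1 cache has and which addresses compete for the same set. The sketch below shows the standard arithmetic; the function names are illustrative, and it treats addresses as plain integers without distinguishing virtual from physical addresses.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number of sets, offset bits, index bits) for a set associative cache."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def set_index(addr, line_bytes, sets):
    """Set that a byte address maps to."""
    return (addr // line_bytes) % sets


if __name__ == "__main__":
    print("L1 data:", cache_geometry(8 * 1024, 16, 4))    # (128, 4, 7)
    print("L1 instr:", cache_geometry(16 * 1024, 32, 8))  # (64, 5, 6)
    # Addresses 2048 bytes apart (128 sets x 16B lines) share a data cache set.
    print([set_index(a, 16, 128) for a in (0, 2048, 4096, 6144, 8192)])
```

With 4 ways per set, a loop that touches five or more addresses spaced 2048 bytes apart would be expected to evict lines from the L1 data cache even though the total data touched is tiny.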

L2 Data Cache (Shared)

- 4MB, 64B Cache Line, 16-way Set Associative
- Write back
- Spread Across 8 x 512KB banks
- Cache interleaved across the 8 banks to reduce contention
- Simultaneous access to each bank
- Each bank connected to cores by high bandwidth crossbar
- Crossbar (90GB/s write, 198GB/s read)
- 26 cycle load latency (stall pipeline if followed by dependent instr)
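
The banked layout can be illustrated with a little arithmetic. The sketch below checks the per-bank figures and shows one possible interleaving. The mapping used in `bank_of` (consecutive 64B lines go to consecutive banks) is an assumption made here for illustration; the table only states that the cache is interleaved across the banks.

```python
L2_BYTES = 4 * 1024 * 1024
L2_LINE_BYTES = 64
L2_WAYS = 16
L2_BANKS = 8


def bank_size_bytes():
    return L2_BYTES // L2_BANKS


def sets_per_bank():
    return bank_size_bytes() // (L2_LINE_BYTES * L2_WAYS)


def bank_of(addr):
    """Bank holding a byte address, ASSUMING line-granularity interleaving."""
    return (addr // L2_LINE_BYTES) % L2_BANKS


if __name__ == "__main__":
    print("bank size (KB):", bank_size_bytes() // 1024)
    print("sets per bank:", sets_per_bank())
    # A sequential sweep touches the banks in rotation under this assumption.
    print([bank_of(a) for a in range(0, 640, 64)])
```

Under this assumed mapping a sequential sweep through an array spreads its accesses evenly over all 8 banks, whereas a stride of 512 bytes would keep hitting the same bank.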

TLB

- 128-entry Data-TLB
- 64-entry Instr-TLB
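
The amount of memory a TLB can map without a miss (its "reach") is the number of entries times the page size. The table does not state a page size, so the 8KB value used below is an assumption of this sketch, as are the names.

```python
DTLB_ENTRIES = 128
ITLB_ENTRIES = 64
ASSUMED_PAGE_BYTES = 8 * 1024   # assumption: not stated in the table


def tlb_reach_bytes(entries, page_bytes=ASSUMED_PAGE_BYTES):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_bytes


if __name__ == "__main__":
    print("Data-TLB reach (KB):", tlb_reach_bytes(DTLB_ENTRIES) // 1024)
    print("Instr-TLB reach (KB):", tlb_reach_bytes(ITLB_ENTRIES) // 1024)
```

With 8KB pages the Data-TLB would cover 1MB, which is less than the 4MB L2 cache. If that assumption holds, an array sweep could start to show TLB miss costs before the array outgrows the L2 cache.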

Memory Controller

- 4 x 667 MHz dual channel FB-DIMM controllers
- Each paired with 2 x L2 Cache bank

I/O

- Support for DMA
- 2x 10-Gigabit Ethernet Ports
- PCI Express Port

FPU = Floating Point Unit
LSU = Load Store Unit
ALU = Arithmetic Logic Unit
SPU = Security Processing Unit



References:
Niagara II: The Hydra Returns
COMP8320 Lecture 3: T2