Superscalar performance in a multi threaded microprocessor

Gunther, BK (1993) Superscalar performance in a multi threaded microprocessor. PhD thesis, University of Tasmania.

Available under University of Tasmania Standard License.

Abstract

Multithreaded processors, having hardware support for the concurrent
execution of fine-grained threaded computations, are noted for their latency
tolerance and low-cost synchronization. Multithreading is a technique for
improving the utilization of processing elements (PEs) in parallel processing
systems, thereby reducing cost/performance ratios. With increasing integrated
circuit densities it is becoming feasible to integrate several PEs
onto a single die, and further diminish the physical dimensions of parallel
systems. However, by eliminating the artificial on-chip PE boundaries and
sharing expensive resources in a more tightly coupled multithreaded architecture,
even greater performance can be achieved from similar hardware.
A multithreaded processor architecture (Concurro) was designed for possible
microprocessor implementation with the objective of multiple instruction
issues per cycle (sustained superscalar performance) by means of multithreading.
This thesis considers the trade-offs necessary for such architectures
to achieve high throughput and hardware utilization under
scalability and cost constraints. A detailed simulation study was carried
out to characterize the architecture and evaluate the impact of implementation
decisions. The key to efficiency in Concurro is asynchronous, zero-time
context switching among a limited set of contexts, promoting effective use
of the storage hierarchy. A 64-bit, register-based, load/store instruction set
architecture is augmented with thread manipulation primitives and I-structure
synchronization operations. Novel cache architectures and controller
algorithms were designed for enhancing latency tolerance in the
processor, while maximizing utilization of the most costly resources.
When tested on a variety of numerical and integer workloads, Concurro
was able to sustain superscalar instruction issue rates for multithreaded
operation, yet showed scalar RISC performance on single-thread code. Even
with a simple threading strategy it was frequently possible to extract full
utilization from functional units or the instruction cache. The architecture
showed size scalability to an order of magnitude while remaining binary
compatible across these configurations. Performance of large configurations
was shown to be limited ultimately by the bandwidth available from critical
shared resources. With an appropriate memory system Concurro attained
supercomputer-level floating point throughput operating out of uncached
memory. The hardware requirements for this performance are expected to
be comparable with those of VLIW machines with similar datapaths.
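
The latency-tolerance argument in the abstract can be illustrated with a toy model. The sketch below is not the thesis's simulator and does not model Concurro itself; it assumes a single-issue pipeline, a fixed run length between memory accesses (`run_cycles`), a fixed memory latency (`latency_cycles`), and zero-cost switching among `num_contexts` hardware contexts. All of these names and parameters are hypothetical and chosen only for illustration.

```python
def utilization(num_contexts, run_cycles, latency_cycles, total_cycles=10_000):
    """Fraction of cycles in which a single-issue pipeline does useful work.

    Toy model (an assumption, not the thesis's simulator): each context
    runs for `run_cycles`, then stalls for `latency_cycles` on a memory
    access; switching to another ready context costs zero cycles.
    """
    # ready_at[i] is the first cycle at which context i may issue again.
    ready_at = [0] * num_contexts
    # remaining[i] is the number of cycles left in context i's current run.
    remaining = [run_cycles] * num_contexts
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        # Prefer the current context; otherwise take any other ready one.
        candidates = [current] + [i for i in range(num_contexts) if i != current]
        chosen = next((i for i in candidates if ready_at[i] <= cycle), None)
        if chosen is None:
            continue  # every context is waiting on memory: idle cycle
        current = chosen
        busy += 1
        remaining[chosen] -= 1
        if remaining[chosen] == 0:
            # Run finished: the context stalls for the memory latency.
            ready_at[chosen] = cycle + 1 + latency_cycles
            remaining[chosen] = run_cycles
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, utilization(n, run_cycles=10, latency_cycles=30))
```

Under these assumptions, utilization grows roughly linearly with the number of contexts until the pipeline saturates, which is one reason a limited set of contexts can be enough to hide a given latency.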

Item Type: Thesis (PhD)
Additional Information:

Copyright the author - The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s).

Date Deposited: 25 Jun 2012 06:21
Last Modified: 11 Mar 2016 05:55