Exploration within the Network-on-Chip Paradigm
Wolkotte, Pascal Theodoor (2009) Exploration within the Network-on-Chip Paradigm. thesis.
|Abstract:||A general purpose processor used to consist of a single processing core, which performed and controlled all tasks on the chip. Its functionality and maximum clock frequency grew steadily over the years. Due to the continuous increase of the number of transistors available on-chip and the operational clock frequency, it became impossible to reach every function within the chip in a single clock cycle. Furthermore, centralized control becomes hard with the increase in functionality. This lead to the split of the processing into a set of independent processing cores integrated into a single chip.
These multi-core architectures will rely on a well designed on-chip communication architecture. Global wires and bus-based systems need to be replaced to overcome the problem of wiring and the single point of arbitration. This is introduced as the Network-on-Chip (NoC) paradigm. Most of the communication architectures classified as a NoC are a network of routers on-chip, but the paradigm embodies a broader scope. The paradigm enables the sharing of on-chip wiring resources for multiple communication streams to reduce the total wiring required. Furthermore, it enables concurrent communication of concurrently handled data packets. The latter is in contrast to the central arbitration and single communication channel in bus-based systems.
In this thesis we explore the paradigm by implementation and characterization of multiple NoC router architectures. The scope of the communication architecture is the embedding in a heterogeneous multi-core System-on-Chip (SoC) for streaming applications. Six streaming applications, which are used in mobile devices, are analysed. Their common communication characteristics and specific bandwidth requirements are presented. One of the major constraints of these applications is the requirement of Quality of Service (QoS) for the interprocess communication.
Based on application analysis we propose a circuit switched router architecture as opposed to a more flexible packet switched router architecture. The reason for this architecture is the observation that communication patterns in the applications are static. The circuit switched network is integrated in an ARM based heterogeneous reconfigurable multi-core SoC realized in a 0.13 μm CMOS technology.
Besides this architecture, an existing packet switched router architecture, that also offers QoS, is improved and compared with the circuit switched router. Next to the exploration of those two router designs, two other packet switched routers, designed at the University of Cambridge, are included in the in-depth comparison. The four routers are placed and routed in 90 nm CMOS technology. The required buffering dominates the resource usage of all packet switched routers, which is significantly reduced in a circuit switched architecture. However, the latter pays a penalty by a larger required crossbar and reduced flexibility.
The four routers are also compared for their latency performance and energy consumption. For latency the packet switched networks are simulated with popular synthetic traffic scenarios. The circuit switched router has a deterministic latency, due to the congestion free routes.
The latency analysis shows the higher network utilization for NoCs using virtual channel flow control over wormhole flow control. Furthermore, the allocation mechanisms used in the improved packet switched router, cause a higher latency for randomly distributed packets compared to the router with speculation logic that is tailored for this type of traffic. Despite its higher latency for random traffic, the packet switched network is able to give end-to-end latency guarantees for specific connections, due to deterministic arbitration, as is shown in this thesis.
For the power analysis we compared the four routers using various traffic scenarios. One of the first observations is a high power consumption in idle mode, where no data is transported. The clock-tree and the connected synchronous elements consume the majority of the power. A minor part is the static power, which is directly related to the router's required chip area. Automatic insertion of fine-grain clock gating tremendously reduces this idle dynamic power consumption. With clock-gating, both the static and dynamic component have an equal share in the idle power at a clock frequency of 200 MHz.
The increase in dynamic power consumption is directly related to the number of packets that are transported over the network and the amount of bit flips, i.e. activity, in the payload. Transportation of random payload, i.e. 25% activity, requires almost a factor three more in comparison with a payload of constant values, i.e. all bits inactive. Random activity is observed in the analysed streaming applications for most of the intermediate data. The buffer size has no influence on the packet's dynamic energy consumption, due to the fine-grain clock gating, which makes the packet switched routers as energy efficient as the circuit switched router. Most of the difference in energy consumption between the routers, is caused by the different crossbar dimensions and the extra bits in a packet which are required for routing and allocation. The larger crossbar is required for the circuit switched router to add flexibility, and for the improved packet switched router to enable QoS. A marginal increase in energy consumption is caused by the network congestion.
During the design of the heterogeneous SoC architecture as well as the evaluation of the packet switched routers, we were hampered by the prohibitive simulation times of the architecture's bit and cycle accurate models. Motivated by simulation speed-ups of an FPGA in a Hardware-in-the-Loop (HIL) simulation, we developed a framework to simulate large many-core architectures on a single FPGA. Instead of the instantiation of the whole architecture in parallel in the FPGA, the individual cores are evaluated sequentially. Each core is modified such that the core's internal state and combinational functionality are separated.
As all cores in a homogeneous many-core architecture are identical, we can construct a single hyper core, that embodies all combinational functionality of a single core. The state of the whole architecture, stored in the FPGA's memory blocks, is updated sequentially by offering a core's old state to the hyper core and store its new state. Using the sequential simulation approach in an FPGA, we are able to simulate two to three orders of magnitude faster compared to cycle and bit-accurate simulations in software.
Electrical Engineering, Mathematics and Computer Science (EEMCS)
|Link to this item:||http://purl.utwente.nl/publications/60363|
|Export this item as:||BibTeX|
Daily downloads in the past month
Monthly downloads in the past 12 months
Repository Staff Only: item control page
Metis ID: 263714