Microarchitectural Techniques of Advanced Processors
IN THIS CHAPTER, YOU WILL LEARN
- How to define performance.
- The meaning of IPC and techniques to get a higher IPC.
- Hazards that occur in a scalar pipeline and solutions for them.
- Different stages of a superscalar pipeline.
- Hazards that plague a superscalar pipeline.
- Dynamic scheduling of instructions in super scalar architectures.
- Concepts of branch prediction and register renaming.
- Pentium Pro microarchitecture.
- The VLIW concept.
Computers have been around for quite sometime now. Whenever we plan to buy a new computer, the first aspect that we take into consideration is ‘performance’. We insist that the new product should have a higher level of performance than the one we possess now. It is in this context that it is necessary to be very clear about what we mean by ‘performance’. From a computational point of view, the most important component in a computer is its processor. Therefore, let us discuss what is meant by the word ‘performance’ with respect to the ‘microprocessor’ that serves as the brain of any computer.
15.1 | Enhancing Performance – Why Pipelining?
For most of our activities, speed is an aspect of paramount importance. Performance is judged in terms of the rate at which work is done. Let us define it as follows:
Performance = work/time.
For a processor, the work involved is that of ‘executing’ instructions. Thus, ‘speed of execution’ is equivalent to performance. If we define execution time as time/program, then the aim of designing more advanced processors is to reduce the factor ‘time/program’, which becomes equivalent to enhancing ‘performance’.
There is a classical equation that relates time/program to the various parameters of program execution. This is called the ‘Iron Law’ of processor performance. It is stated as follows:
It is obvious that performance is a product of three fractions that are not really independent of one another. This implies that while trying to improve performance by reducing the factor ‘time/program’, it is important to concentrate on ‘reducing’ all three factors. This may not be always possible, because of the ‘inter-relations’ between the factors.
Let us examine this point in more depth by analyzing the effect of each of the factors that comprise the Iron law.
Trying to reduce the factor ‘time/cycle’ indicates that a high clock frequency is to be used. The maximum frequency possible to be used is dependent on the technology used for the processor. For quite sometime, technologies supporting higher and higher frequencies were devised and used, but this has reached a limit. The highest clock frequencies possible to be used have saturated to around 3 GHz as of now, and trying to go ahead with the agenda of higher clock rates is not being seen as a real performance advantage. Why so? Higher clock frequencies cause higher power dissipation. Since ‘power’ has become a very important factor, increasing the clock frequency of the processor still further may not turn out to be the best bet.
How many cycles are needed for executing a particular instruction is based on the implementation of the processor in hardware. For example, on an average, 8086 needs four cycles for a memory read cycle, while 80486 needs two, and Pentium just one. RISC architectures are aimed at getting one instruction executed every cycle. CISC-based processors need different number of cycles for different instructions. Cycles per instruction (CPI) can be improved by better ‘microarchitectural’ design, which we will see shortly.
We would like to have short programs, i.e., short code length. How can instructions/program be reduced? This factor is dependent on the Instruction Set Architecture (ISA) which is defined as the ‘processor as viewed by a compiler designer or assembly program developer’. However, compilers have different levels of efficiency. The same program may be converted to assembly language programs of different lengths. Therefore, having an efficient compiler is the first step in reducing the factor ‘instructions/program’.
A compiler designer must be able to write programs with the smallest code length, but he is constrained by the available instruction set and registers of the processor. For example, a RISC processor has only simple instructions and will need more instructions for a given program. This same program may be written with less number of instructions if the instruction set has complex instructions. Here itself a direct dependence between the factors of the Iron Law can be seen, because a complex instruction is likely to need more cycles for its execution. Thus, simply trying to use a processor with complex instructions to reduce the program length may turn out to be a losing game as far as overall performance is concerned.
From the above discussion, we can surmise that improving the performance of a processor is not a trivial thing. Processor design for improved performance is a highly challenging task involving various tradeoffs and techniques.
15.1.1 | Microarchitectural Design
The term ‘microarchitecture’ is defined as the combination of elements and interconnections required to implement an ISA. When a processor with a specific ISA has to be designed for ‘high’ performance, it is the ‘microarchitecture’ that can be modified or improved. For example, to design a new and improved x86 processor, improvements in the microarchitecture have to be sought along with the use of better technologies that support high clock frequencies. The end result of these techniques should be a processor with a decreased ‘cycles/instruction’ ratio (CPI).
Note 1 The word ‘architecture’ means ‘ISA’; that is, the processor viewed in terms of its instruction set and register set. The ISA defines the ‘hardware software interface’. For example, there is the x86 architecture (ISA) that is different from the ARM architecture (ISA). The word ‘microarchitecture’ is the ‘implementation’ of hardware to get the required ISA. However, in many cases, we may find the word architecture used in place of microarchitecture. Further, whatever confusion that may arise then, should be cleared based on the context of the usage.
CPI and IPC
CPI and IPC are two commonly used terms to refer to processor performance. IPC stands for ‘Instructions per cycle’ and is the reciprocal of CPI. Whenever new techniques for improved microarchitectures are devised, the objective is stated to be an increased value of IPC, i.e., a reduced CPI.
If we examine Intel’s x86 processors, we find an improvement in IPC from generation to generation. Intel has concentrated on developing microarchitectures with steadily increasing values of IPC. This is the case with all other processor manufacturers as well. Performance is measured on the basis of how successful this endeavor is.
We now examine the techniques for obtaining a high IPC.
Techniques for Improving IPC
In the beginning of processor evolution, the objective was to get an IPC of 1. That is, to ensure that a throughput of at least one instruction per cycle is obtained. It was realized that such a figure could be obtained by the technique of pipelining.
The next step was to devise techniques that would increase the IPC count still further, that is, to make it possible to have more than one executed instruction delivered at the output in every cycle. The idea of ‘replication’ came up then.
It was pipelining that came first, and when the gains with pipelining reached saturation, the second technique of replication was taken up. Let us make a thorough study of pipelining first and then move on to the next technique.
15.1.2 | Pipelining
The basic idea behind pipelining must be familiar to you, but here we will take a deep and more detailed look at the whole idea of pipelining and its implementation.
A processor pipeline is just like an assembly line in a factory where each worker does just one piece of job repeatedly. Each worker is ‘specialized’ only in one aspect of the total work cycle. For the assembly line to work efficiently, each worker must be able to do his job without being affected by the needs, requirements and limitations of other workers in the assembly line. In short, each worker in the line should be able to perform his task independently.
Now, let us see a typical processor pipeline. Pipelines may be of any number of stages and the functional division into stages can also vary. For our reference, we will use a pipeline in which the activities corresponding to the complete sequence can be broken down into the five subtasks, i.e., a five-stage pipeline. It is easier to understand pipelining by using the instance of a RISC processor that has simple instructions; they can be any of the following instructions: ALU instructions, load /store instructions or branch instructions. We will consider such a pipeline.
Further, for RISC processors, the operands for ALU instructions are available in registers and that memory is accessed for load (read from memory) and store (write into memory) instructions only. We also assume that there is an Instruction cache (I-cache) from where instructions are fetched, and a Data cache (D-cache) is the ‘memory’ that is accessed for load and store operations. Such a cache structure is typical for x86 processors starting from Pentium, which has separate on-chip caches for data and instructions.
188.8.131.52 | Stages of the Pipeline
Figure 15.1 shows the five stages of our reference pipeline, and Table 15.1 explains the operations (subtasks) performed in each stage. In the Instruction Fetch (IF) stage, the instruction is fetched from memory, i.e., the I-cache. In the ID stage, instructions are decoded and ALU instructions read their operands from registers. Then, in the Execute (EX) stage ALU operations are done. For a load/store operation, only address calculation is done in the EX stage. Memory (MEM) is the stage in which memory (D-cache) is accessed. D-cache is accessed only in the MEM stage of the pipeline, and this stage is used only by load/store instructions. The result of any operation is written to the specified destination register in the Writeback (WB) stage.
Figure 15.1 | A five-stage pipeline
Table 15.1 Functions Performed by the Stages of Reference Pipelines
Now that we have clearly defined the suboperations of an instruction, we are in a position to delve into the intricacies imposed by pipelining.
Figure 15.2 shows an unpipelined processor at work. One instruction completes all these five suboperations and produces its result. After that, the next instruction comes in, and then the next and so on. Since the processor completes the execution of one instruction in five cycles, a ‘CPI = 5’ is the performance of the unpipelined processor.
Figure 15.2 | An unpipelined processor executing instructions
Now, take a second look at the whole scenario, as shown in Fig. 15.3. Obviously, each subtask needs only one cycle. If these subtasks are overlapped in time, and if in every cycle, a new instruction enters the ‘pipeline’ it leads to a situation in which one instruction is completed in every cycle. This corresponds to a CPI of 1. Such a figure is, as you would have surmised, too good to be true, because implementing a pipeline is fraught with difficulties and problems.
Figure 15.3 | A pipelined processor executing instructions
The following points should be understood with respect to a k-stage pipeline:
- It takes k-cycles to ‘fill up‘ a pipeline.
- An instruction takes the same time to execute with or without pipelining, but there is an overall speedup with pipelining.
- Instructions enter into, and come out of the pipeline k-times faster than in the unpipelined case.
- k instructions are in operation at any time within a pipeline
- Additional delays correspond to the time for filling up and draining the pipeline.
15.1.3 | Speedup due to Pipelining
With a k-stage pipeline, we expect one instruction per cycle to be the throughput, i.e., a CPI = 1. For an unpipelined processor, the throughput is 1 instruction per k-cycles, i.e., CPI = k.
Therefore, in the ideal case,
The Speedup due to pipelining = CPI without pipelining /CPI with pipelining = k/1
This turns out to be equal to k, which is the number of stages in the pipeline.
However, these calculations are for the ideal case. In practice, a pipeline is a hardware structure with a number of registers that need to be clocked synchronously. The additional hardware needed for pipelining introduces issues like clock skew, setup and hold times for each stage and many additional delays. In effect, pipelining adds a latency to the clock. If the clock period without pipelining is ‘t’, this latency adds a factor say ‘t1’ to it. Thus, the overall clock frequency is reduced. A numerical example will make the point clear.
Consider an unpipelined processor with a clock period of 2 ns. This processor is now re-modeled with a five-stage pipeline, which adds 0.2 ns latency to the clock period. What are the old and new CPIs? Calculate the ideal and actual speedups obtained.
We assume that without pipelining, one instruction is delivered at every five cycles. For an ideal pipeline, one instruction is delivered at every cycle.
Instruction execution time without pipelining = 5 × 2 = 10 ns
Instruction execution time with pipelining = 2 ns
Speedup = Instruction execution time without pipelining/instruction execution time
= 10/2 = 5 = Number of pipeline stages
CPI = 5 (without pipelining) and 1 (with pipelining)
Latency = 2 ns
Clock period = 2 + latency = 2 + 0.2 = 2.2 ns
Instruction execution time = 2.2 ns
Speedup = instruction execution time without pipelining/instruction execution time with pipelining
= 10/2.2 = 4.45
Thus, we see that the speedup has reduced from 5 to 4.45 because of the non-ideal nature of the pipeline.
What are the difficulties, and what are the problems that are caused due to pipelining?
The difficulty is of needing ‘additional hardware’, and the problems that arise are ‘pipeline hazards’.
Let us discuss these issues one by one.
15.1.4 | Additional Hardware
The pipeline shown in Fig. 15.1 is clocked and the intermediate results from one stage have to be forwarded to the next stage in every cycle, while the data from the previous stage has to be clocked in. This cannot be done without intermediate storage between stages. Thus, we need inter-stage buffers. Figure 15.4 shows a typical pipeline ‘datapath’ with buffers placed between stages.
Figure 15.4 | Pipeline stages with inter-stage buffers
Note 2 While designing a pipeline, one mandatory requirement is that the different stages should be balanced. This means that all stages should have the same or almost the same latency, as they are to be clocked synchronously. This obviously means that the pipeline rate is determined by the latency of the slowest stage.
i) Multiported Registers
From Fig. 15.1, we can observe that five cycles after the pipeline is filled, there are five instructions simultaneously in the pipeline always. Assume a stream of ALU instructions, most of which have two source registers and one destination register. Consider the ‘register file’ that contains all the registers of the processor. For a stream of ALU instructions inhabiting the pipeline, at any time, two registers are being read (in the ID stage) and one register is being written to (in the WB stage). This entails the necessity of having a register file with at least two read ports and one write port. This amounts to needing additional hardware for multiporting the register file when pipelining is done. Multiporting becomes a serious issue when pipelining depths are increased.
ii) Runtime Resolution of Hazards
As we go into deep discussions, we will see that hazards that occur in a pipeline need to be resolved at ‘run time’. This also causes a penalty of extra hardware. We will see this in great detail soon.
15.1.5 | Pipeline Hazards
The whole idea of pipelining appears quite straightforward. Then, what can be the problems they create? Pipelining is very simple in principle, but the kind of instructions that pipelines have to process in sequence, and the inter-relations between these instructions cause issues typically designated as ‘pipeline hazards’.
There are three types of hazards identified:
- Structural Hazards
- Data Hazards
- Control Hazards
184.108.40.206 | Structural Hazards
This occurs when there are resource conflicts. For example, trying to read and write simultaneously from memory or the register file causes such a hazard.
The solution for structural hazards causes an overhead either of time or of extra hardware. What exactly does this statement mean?
If reading and writing are to be done simultaneously by two different pipeline stages from a register file that has only one read/write port, there are two possible solutions:
- Stall the pipeline so that the read and write actions are serialized. This method creates an overhead in time.
- Insert extra hardware into the register file so that there are separate read and write ports.
Since the basic aim of using pipelining is to get high speeds, the solution to all problems that besot pipelines has been to add more hardware. Thus, we have multiported register files and separate instruction and data caches. The latter is necessary because a structural hazard is possible if data and instruction caches are unified because both reading and writing into the same memory will cause a ‘structural hazard’. This will occur because in every cycle, memory is ‘read’ when instructions are fetched (in the IF stage) and memory read/write occurs in the MEM stage also. This problem has been alleviated by using separate instruction and data caches (as in the Pentium and later processors of Intel), such that the IF stage accesses only the I-cache and the MEM stage accesses only the D-cache.
220.127.116.11 | Data Hazards
These occur because there may be data ‘dependencies’ between instructions, which may be either ‘true’ dependencies or just ‘name’ dependencies. Pipelining causes a violation of these dependencies.
There are two methods for resolving the problem of data hazards:
Static resolution is the method by which compilers are assigned the duty of detecting and resolving the hazards. In the dynamic method, it is the ‘hardware’ (associated with pipelining) that solves the issues. In the second case, the problem is said to be resolved at ‘run time’ rather than at ‘compile time’, which is the case in static hazard resolution.
15.1.6 | Types of Data Hazards
Three types of data hazards are possible. Let us identify which of them are applicable for the kind of pipeline we have chosen as our reference.
To understand this, we identify J as a ‘leading instruction’ and K as a trailing instruction. This simply means in the instruction stream, J comes first, i.e., before K.
- Output Dependence (WAW –Write After Write)
This is a ‘name’ dependency. This occurs when an instruction K writes its operand (result) before J writes it. Look at the first two instructions of the following instruction stream.
J: SUB R1, R2, R3
K: ADD R1, R0, R4
L: MUL R5, R1, R2
A WAW hazard can happen ‘if’ Instruction J writes to R1 after instruction K writes to the same register. Then, the source operand R1 for ‘L’ becomes a ‘wrong’ operand.
But can this happen in our reference pipeline?
Can K perform writeback (WB) to R1 before J does it? The answer is ‘no’. Writeback occurs only in the WB stage of the pipeline and the Jth instruction will write into R1 first, and then only K can write in it. Our pipeline is ‘strictly sequential’ and such a hazard will not occur. This hazard is referred to as a ‘name’ dependency because it is due to the using of the same register named R1, that the hazard occurs. Note that there is no data flow from instruction J to K. The hazard is only a storage conflict.
- Anti-dependence (WAR – Write After Read)
This happens ‘if’ instruction K writes to R1 before instruction J reads it.
J: ADD R0, R1, R2
K: SUB R1, R3, R5
Here, we see R1 being used by both instructions – as a source in the first and as a destination in the next. The reading of R1 in instruction J will definitely occur earlier than the writing to R1 by instruction K. This is because our reference pipeline is ‘strictly sequential‘ in nature and also because reading of a register by J occurs in the ID stage, while writing into it by the next instruction K will be later and only in the WB stage. Reading of R1 by the first instruction will be accomplished much earlier than the writing into the same register by the next instruction. Therefore, we conclude that WAR hazards also cannot occur in the sequential pipeline that we use the reference pipeline. This is no as data flow from instruction J to K. That is why we call it ‘anti-dependence’. This point will be clear when we understand ‘true’ data dependence.
Note 3 Where do WAR and WAW hazards occur? They do not occur in the single line, scalar, sequential pipeline; that is the current topic of our discussion. However, there are kinds of pipelines in which the ‘strictly sequential’ nature of program flow is not adhered to. There are also pipelines that are not a ‘single’ stream of instructions in execution. WAR and WAW hazards can occur in such pipelines and will be discussed later.
- True Dependence (RAW-Read After Write)
This is a type of hazard that is always possible in a simple sequential pipeline. The hazard occurs when instruction K cannot execute because instruction J has not yet finished its writing into the same register (named as the ‘hazard register’). Let us see how exactly this happens.
Case 1: When the leading instruction is an ALU instruction.
See the following two instructions. Note that there is ‘data dependency’ between them.
J: ADD R1, R2, R3
K: SUB R4, R1, R5
K lags behind J by just one cycle. Instruction J has R1 as the destination register. After addition of R3 and R2, the sum is to have a ‘writeback’ into R1; however, this occurs only in the last stage of the pipeline. Instruction K has R1 as its source operand. When K is decoded in the ‘ID’ stage, it needs the new value of R1 computed by J. However, R1 has not yet been ‘written to‘ by J, because J is in the EX stage and has not yet reached the WB stage. What happens if K reads R1 now? It gets a stale value of R1. This is the hazard and it occurs because instruction K is dependent on instructions J for its data (source operand). This is what is meant by the term ‘data dependency’ between the two. Figure 15.5 elaborates the situation that when K is in the ID stage (and reading registers), J has reached only the EX stage. Register R1 is called the ‘hazard register’ in this instance.
Figure 15.5 | Case of a RAW hazard due to two ALU instructions
Case 2: When the leading instruction is a LOAD instruction
A ‘load’ instruction is one in which a data from a specified memory location is ‘loaded to a register’.
Consider the following two instructions:
J: MOV R1, [M]
K: ADD R3, R1.R2
In instruction J, the data from the memory address [M] is to be loaded into R1. Our pipeline does this only in the MEM stage. By that time, instruction K would have already passed the ID stage (where reading of R1 is done) and is already in the EX stage. Obviously, it has read a ‘stale’ value of R1.
In this case, the trailing instruction is fully dependent on the leading instruction to supply the correct value of data – this is also a case of ‘data dependency’.
15.1.7 | Solutions for RAW Hazards
What are the possible solutions to this?
- Stall the Trailing Instruction
This is the very obvious and the simplest solution. For case 1, if the trailing instruction is stalled until the leading instruction has traversed the ‘WB’ stage, this hazard cannot occur. This amounts to a pipeline stall of three cycles, since the WB stage and the ID stage are three cycles apart.
For case 2, the trailing instruction should be stalled until the leading instruction traverses the WB stage where the content accessed from D-cache is written into the specified register. This also needs a three-cycle stall.
Figure 15.6 shows the idea of stalling, where a trailing instruction is not allowed to proceed beyond fetching. Essentially, once the trailing instruction is decoded, it should not proceed until the leading instruction completes its critical stage. Pipeline stalls have the name ‘bubbles’.
Figure 15.6 | Illustration of a three-cycle pipeline stall
Stalling the pipeline is easy, but it defeats the very purpose of pipelining, which is to get a speedup over non-pipelined architectures. In any instruction stream, such data dependencies occur throughout and if stalling is done continuously, it leads to serious performance degradation.
Note 4 In any instruction stream, there are instructions that are fully independent. Consider case 1 where hazards of ALU instructions have been discussed.
Suppose the dependent instruction K is not the one immediately following J, but there is an ‘independent’ instruction between J and K. Then, the number of stall cycles needed is 2. In general, for the ALU, the number of stall cycles needed is 3 − n, where ‘n’ is the number of independent instructions that appear in between the ‘dependent’ leading and trailing instructions. In the case where K comes just after J, n = 0.
- Forwarding Paths
Now, take a look at our pipeline again with respect to the following two instructions:
J: ADD R1,R2,R3
K: SUB R4,R1,R5
It is clear that when the Kth instruction is in the ID stage, the Jth instruction is in the EX stage and is performing the ADD operation. Therefore, just after this stage, the result of the ADD operation is available at the output of the EX stage. This can be immediately handed over to the trailing instruction K. This allows K to perform its subtraction immediately without waiting to take it from R1 after J has done its ‘writeback’ of R1.
Figure 15.7 shows this ‘forwarding technique’ which is also called ‘bypassing’. The Jth instructions performs its EX operation, and the result is immediately forwarded to the EX stage, so that the Kth instruction, when it reaches there, can use it as one of its operands. Thus, for the two dependent ALU instructions, the stall cycles are reduced to 0. What if there are independent instructions in between the J and K instructions? Other forwarding paths are necessary and are shown in Fig. 15.8, for n = 2 and n = 1.
Figure 15.7 | Forwarding path for two back-to back ALU instructions
Figure 15.8 | Forwarding paths for ALU instructions for n = 0, 1 and 2
The forwarding path corresponding to n = 0 is called the ‘critical forwarding path’. At any time, all the forwarding paths are present and any of them can be activated on detecting a hazard.
J: MOV R1,[M]
Now, consider the case of J being a load instruction.
When the leading instruction J is a load instruction, the required data from memory is obtained only at the end of the MEM stage. Till then, the required data is in the data cache. Once it is taken from there, it is available for the next instruction to use. However, the trailing instruction K will have to wait at least for one cycle for the data to be brought into the pipeline, from D-cache.
In essence, let us be clear that when the leading instruction is a load instruction, there is no way to avoid a ‘stall’ of one cycle to allow the data to be brought from memory. Once the data is available at the output of the MEM stage, there is no need to wait for it to be written into R1 by the WB stage. From the output of the MEM stage, a forwarding path can take the data to the EX stage, which is then used as the source operand for instruction K (see Fig. 15.9).
Figure 15.9 | Forwarding path for the case of a leading load instruction
For the following sets of instructions, indicate the presence of RAW hazards and say where forwarding paths are necessary. Assume that instructions are processed in the order 1, 2 and 3.
- 1: ADD R2, R1, R3
2: SUB R4, R3, R2
3: ADD R4, R1, R2
- 1: MOV R2, [M]
2: SUB R5, R3, R4
3: ADD R0, R2, R1
- RAW hazard between 1 and 2. The critical forwarding path is to be used. R2 is the hazard register.
- RAW hazard between 1 and 3. Forwarding path from the MEM stage to EX is needed (see Fig. 15.10a).
Figure 15.10a | The two forwarding paths needed for (i)
Figure 15.10b | Forwarding path for (ii)
- RAW hazard between 1 and 3. A forwarding path between the MEM and EX stage is needed. No stall cycles are needed even though the leading instruction is a load instruction. This is because there is one independent instruction between the hazard causing instructions (see Fig. 15.10b).
In Fig. 15.8, when n = 1, why is a forwarding path needed even after writeback to the hazard register has occurred?
In this case, by the time the WB into the hazard register has occurred, the trailing instruction has already completed the ID stage and cannot read it from the ‘specific‘ register. Therefore, the data needed for the trailing instruction has to be forwarded from the WB stage.
15.1.8 | Detecting a RAW Hazard
How can we detect whether a RAW hazard is present in an instruction stream?
For a RAW hazard, there is a register called the hazard register. We have seen that a register becomes a hazard register when it is the destination in a leading instructions and a source in a trailing instruction. In Section 15.1.6, R1 is the ‘hazard’ register. Consider the critical case wherein the Jth instruction is in the EX stage and the (J+1)th instruction (the trailing instruction) is in the ID stage. If the decoded instructions at both these stages are compared, the register code of the hazard register is found in both the instructions – in the Jth instruction as a destination and at the (J+1)th instruction as a source.
This information can be obtained at the buffers appearing physically at the input of the register file. The presence of the same register code may be detected by using comparators. This is how an RAW hazard is identified between a Jth and a (J+1)th instruction. Similar comparisons may be done with respect to J and (J+2)th instructions and so on.
Once the comparator indicates a RAW hazard, a ‘forwarding’ path may be activated (see Section 15.1.7). It can also initiate a pipeline stall as it may be necessary when a leading load instruction is involved (see Section 15.1.6). This method of detecting RAW hazards and resolving it using hardware techniques is called ‘pipeline interlock’. This is the technique used for the dynamic resolution of data dependencies.
Only the idea behind pipeline interlock is presented here, because diagrams of pipeline interlock corresponding to all the forwarding paths are quite complex.
15.1.9 | Control Hazards
Control hazards comprise the hazards incurred when the flow of control changes – this can occur due to interrupts, exceptions or branch instructions. Here, we consider the simplest case of how to handle the hazard due to a branch instruction. Recollect that a branch may be unconditional or conditional.
In the case of unconditional branching, a new value of PC is needed. When the instruction is decoded, it is known that an unconditional branch instruction has come into the pipeline. This instruction needs a ‘target address’ and this address is to be loaded into PC. The next instruction is to be ‘fetched’ from this new target address. Fetching of an instruction is always in the IF stage.
When a branch is to be ‘taken’, it means that the instructions fetched following the branch instruction are not needed anymore. A new set of instructions from the target address onwards, is to be fetched. There may be different addressing modes to calculate the target address. These calculations will be over by the end of the EX stage. However, the new target address will be available in the PC only in the MEM stage, because that is the earliest time available (after address calculation is done) for PC updating. Therefore, the pipeline needs to be stalled for three cycles (for the PC to be updated) (see Fig. 15.11).
Figure 15.11 | Forwarding path for a control instruction
In conclusion, it can be stated that a stall of three cycles in inevitable when branching is done. Just after the MEM stage, a new ‘instruction fetch’ can start. Therefore, the path between the MEM and the IF stage is the best option for ‘forwarding’ and can be identified as the ‘critical forwarding path’ for branch instructions.
For conditional branch instructions, there is an additional constraint. The condition needs to be evaluated, and then only the decision can be made as to whether branching needs to be done or not. The evaluation of the condition may be done in the EX stage or there can be specialized hardware for it. Either way, once it is decided that branching is needed, a new sequence of instructions are to be fetched. All instructions fetched after the branch instruction have to be discarded. This works out exactly as discussed in the previous paragraph. However, the additional point here is that ’branching’ is conditional. If the branch condition is not ‘true’, the normal sequence of instructions can still be used.
One way of dealing with such a scenario is to assume one of the conditions to be true, or do branch prediction, and allow the pipeline to proceed normally without any stalling. When the condition is ‘actually’ evaluated later and our assumption is found to be wrong, the unwanted instructions that were executed will have to be to be discarded, and this amounts to ‘backing up to the previous state’. This is a ‘penalty’ equivalent to stalling (plus an additional overhead of having to restore the previous machine state). However, with good techniques, statistically speaking, branch prediction leads to a big saving in stall cycles. This is the essence of branch prediction that will be discussed in Section 15.2.7.
On reviewing branching, another aspect comes to light. On encountering a conditional branch instruction, there is the possibility of three stall cycles. Instead of doing such a stall, the pipeline can be injected with instructions that are totally independent. Thus, instead of stalling the pipeline for three cycles, three independent instructions may be processed in the pipeline. This technique is called ‘delayed branching’ and it has been used in many RISC machines. However, this requires the help of the compiler to identify independent instructions.
15.1.10 | CPI Degradation due to Pipeline Stalls
We started with the expectation that pipelining could bring down the CPI to 1. However, in practice, this is found to be impossible, because of inevitable and unavoidable stalls that affect the pipeline. This tends to increase the CPI, and finally, we are left with an average CPI that is greater than 1. The extent of degradation will depend on the ‘mix’ of the different kinds of instructions. Let us analyze this matter.
When stall cycles occur in the pipeline, the resultant CPI is calculated as follows.
Resultant CPI = Ideal CPI + Penalty due to stalls
Consider an unpipelined processor with a clock cycle of 1 ns. Assume the following instruction mix: 20% branch instructions, 10% load/store instructions and 70% ALU instructions. These types of instructions use 5, 4 and 4 cycles, respectively. Calculate the average CPI.
For the unpipelined processor,
Average CPI = 0.2 × 5 + 0.1 × 4 +0.7 × 4 = 4.2
In Example 15.4, if the processor is pipelined and ‘forwarding’ techniques are not used, what is the resultant CPI if 20% of the ALU instructions and 10% of the load instructions cause RAW hazards? Further, assume that 50% of the branch instructions are ‘taken’ branches.
If the processor is pipelined, the ideal CPI = 1.
The stalls due to hazards increase the CPI by adding a penalty factor to the ideal CPI.
RAW hazards add three stall cycles for ALU instructions (taking the extreme case of two ‘back to back’ data dependent instructions), three stall cycles for load and three for branches.
Of the 70% ALU instructions, 20% cause a three-cycle penalty. Therefore, we get the following:
Penalty due to ALU instructions = 0.2 × 0.7 × 3 = 0.42
Similarly, load instructions cause a three-cycle penalty. Therefore, the following can be obtained.
Penalty due to load instructions = 0.1 × 0.1 × 3 = 0.03
Fifty percentage of the branch instructions cause a three-cycle stall. Therefore, we calculate it as follows:
Penalty due to branch instructions = 0.5 × 0.2 × 3 = 0.3
Total penalty = 0.75.
Resultant CPI = ideal CPI + penalty due to stalls = 1 + 0.74 = 1.74
Thus, the CPI has improved due to pipelining, though it has not reached the ideal figure of 1.
Consider a case where the instruction mix is 50% ALU, 30% load/store and 20% branch instructions. Let us assume forwarding is done for ALU and load/store instructions if RAW hazards are detected. Assume that all the ALU and load/store instructions cause RAW hazards and also that branches are always ‘taken’.
ALU stall cycles = 0 (because of forwarding)
Load stall cycles = 1 (in spite of forwarding)
Branch stall cycles = 3 (all of them take a new branch)
Resultant CPI = Ideal CPI + effect of stall cycles
=1 + 0.5 × 0 + 0.3 × 1 + 0.2 × 3 = 1 + 0.3 + 0.6 = 1.9
Thus, we see that the average CPI is high due to RAW hazards
Now, let us change the statistics by saying that only 10% of the instructions are ‘RAW hazard causing’ load/store instructions. Further, assume that there are no stalls due to branches because the technique of delayed branching is resorted. What is the improvement in the CPI?
Resultant CPI = 1 + 0.5 × 0 + 0.1 × 1 + 0.2 × 0 =1.1
Note how the CPI has improved due to clever techniques employed by the ‘microarchitecture designer’.
15.1.11 | Superpipelining
In this, each pipeline stage is divided into several substages, and this is essentially deep pipelining. This is helpful because some stages in a pipeline may not require as much latency as other stages. Because of this subdivision, the overall clock rate of the ‘super’ pipeline increases and at any time, there are more instructions in the pipeline. The downside to this is that more data dependency issues have to be handled at any time. The hardware also becomes more complex. In spite of all this, there are many superpipelined architectures that perform remarkably well. Pentium Pro and x86 processors following it are all superpipelined processors.
15.2 | Replication – The Superscalar Concept
In Section 15.1, it was mentioned that the second method of improving performance by achieving a low CPI is ‘replication’. What exactly can be replicated? There are two possibilities. In one case, the entire processor pipeline or parts of it are replicated. This makes a processor a superscalar processor. There is another type of replication, and that is the idea behind ‘multicore’ processors, where there are multiple processor cores residing on the same chip. Recollect that a ‘core’ is the computing engine of a processor that includes the ALU and registers. In this part of the chapter, we discuss superscalar techniques. Multicore processors are covered in Chapter 16.
15.2.1 | Superscalar Processors
Pipelining gave way to ‘superpipelining’ and trying to go further on that path proved right, the ‘law of diminishing returns’. This means that the hardware budget became very high, but proportionately improved performance was not obtained. Pipelining had reached saturation, and a new direction in thinking was sought.
Replicating parts of the processor pipeline was the next big concept and that idea led to the advent of superscalar processors. Pentium was the first x86-based superscalar processor of Intel. Figure 15.12 shows the Pentium pipeline.
Figure 15.12 | The Pentium pipeline
There are two integer execution units. In short, two instructions can be in execution at any time. It is implicit that two integer instructions are also to be fetched and decoded in parallel. In short, the whole pipeline is replicated by a factor of 2. Upto the execution unit, fetching and decoding are done in a common section, but two instructions are to be handled in parallel. In general, a superscalar processor can have a replication of order N. For Pentium, N = 2.
We know that the objective of microarchitectural enhancements is to get an increased IPC, which translates to a reduced CPI. With scalar pipelining, the highest IPC is 1. With a superscalar processor of order 2, IPC becomes 2. Thus, IPC doubles, though this is the idealized figure.
Scalar pipelining is a parallelizing technique where parallelism is in the temporal, i.e., time domain. With this, we try to get more activities to be done ‘at the same time’. The idea of replication is to improve performance by incorporating ‘spatial parallelism’. When multiple pipelines are used, the benefits of both spatial and temporal parallelism are sought. However, problems also increase in the same order.
While superscalar processing was accepted into processor architecture, a few new techniques were also added to solve the ‘new’ issues that emerged. These new techniques replaced the older and tame norms of in-order execution, pipeline stalling and bypassing. New and adventurous methods like branch speculation, register renaming and out-of-order execution were put into practice. The discussion here on superscalar processing is not just about replicating processor pipelines. It is more about these new approaches to the microarchitectural design of advanced processors.
The objective of superscalar techniques is to get multiple instructions to be executed simultaneously, i.e., to increase Instruction Level Parallelism (ILP). In essence, a superscalar machine is a processor that has multiple pipelines. If there are N pipelines working in parallel, it is a superscalar processor of order N. For Pentium, N = 2. There are other processors with higher values for N.
15.2.2 | Multiple Execution Units
Programs are normally written with code lines ‘in order’ such that a sequential order of execution is expected. Normally, fetching and decoding of instructions can be in a common section of the pipeline, i.e., fetching and decoding need not be done in two ‘physically’ separate pipelines. However, it may be necessary that for N = 2, two instructions each are fetched and decoded in each cycle. After this, the decoded instructions are sent to separate executions units. This is the most logical method of implementing Superscalar Architecture (SSA). This is exactly how it is done in Pentium (see Fig. 15.12).
Figure 15.13 shows a superscalar pipeline of order 4. Four instructions are fetched and decoded simultaneously. They are then issued to four separate execution units.
Figure 15.13 | A superscalar pipeline of order 4
Now, consider the execution units. They should have the capability to ‘compute’, but is it necessary that they all be identical?
Refer to the Pentium pipeline (see Section14.16.2). There are two separate pipes for handling integer computation and they are called the U and V pipes. However, U is more ‘capable’ than V. That is how it has been designed. This means that the integer execution units of Pentium are not really identical. This brings us to the concept of diversified pipelines.
15.2.3 | Diversified Pipelines
What are diversified pipelines?
They are pipelines in which the execution units are different. We can then refer to them as ‘functional units’ where each unit caters to one specific function.
The diversification may be in terms of separating out the instruction stream into instruction types such as ALU, Load/Store, MEM and FP (Floating Point). Or there can be functional units for ALUs such as ADD, MULT and FP. In effect, when there are different types of functional units in the execution stage, the pipeline becomes a diversified pipeline.
Figure 15.14 shown such a pipeline.
Figure 15.14 | A diversified pipeline
15.2.4 | Dynamic Pipelines
In Section 15.2.3, it was seen that after decoding, different functional units handle different kinds of instructions. It should be clear that the latency of each of the functional units is likely to be different. If there are separate multiplication and addition units, it should be understood that there is dedicated hardware for each of these operations. Then, the function of addition will definitely be done faster than multiplication. Because of the different latencies in the units, completion of instruction execution occurs at different time instants. Therefore, the order in which results come out from the different functional units may not be in the order of the program sequence. This is so, even if the instructions were issued to the functional units at the same time. This means that instruction completion may be ‘out of order’ even if issuing instructions is ‘in order’.
Therefore, after execution in different functional units, the results have to be put in order. For this, there is a reorder buffer. Assume that the results from the functional units are collected in a common buffer and the final results are sent out in order. Let us visualize this. Figure 15.15 shows the reorder buffer after the functional units.
Figure 15.15 | Pipeline with a reorder buffer
18.104.22.168 | Out-Of-Order Issuing of Instructions
In the above discussion, we assumed that the issuing of instructions to different execution units was ‘in order’. However, is it really necessary to issue instructions to execution units in this way? But then, should the program sequence be disturbed? Let us discuss this matter.
Consider a case where there is only one functional unit for addition, and that this ADD unit is already in use when a new ADD instruction arrives. The new instruction has to wait because of the non-availability of a functional unit for it. Meanwhile, there is a multiply instruction trailing behind the ADD instruction, which is totally independent of the ADD instruction. The ‘MULT’ functional unit is free, and then it seems quite logical to allow the trailing multiply instruction to be issued to the ‘MULT’ unit. The point now is that the ‘MULT’ instruction has been issued to its execution unit before the instruction ahead of it has been issued for execution. This scenario amounts to be a case of ‘out-of-order’ issue of instructions along with ‘out-of-order’ execution.
Thus, there can be two cases for out-of-order instruction execution.
- In-order issue, out-of-order completion
- Out-of-order issue, out-of-order completion
In general, pipelines in which out-of-order execution takes place are called dynamic pipelines. Intel’s Pentium Pro architecture is the first processor of the x86 series that inculcated this principle in its design.
Out-of-order execution proved to be a major step in improving performance. However, along with this, a new set of problems came up for which solutions were needed. We will analyze the problems and then look up the solutions.
15.2.5 | Stages of a Superscalar Pipeline
The stages of a superscalar pipeline are the same as a scalar pipeline; fetch, decode, execute, access memory and writeback are the functions to be performed.
22.214.171.124 | Fetch
Consider a processor with a superscalar order of N. This means that N instructions are to be fetched. What are the issues to be handled while doing multiple parallel fetches?
Instruction fetch mechanisms involve the process of how instructions are fetched from memory and delivered to the decoder. Instructions are fetched from the instruction cache, and if N instructions are to be fetched, it is best to have the cache to be arranged as rows with ‘N’ instructions, so that in one cycle when one row is fetched, we get the N instructions that are needed by the processor. This works fine normally but when a branch instruction comes, this order gets disturbed. When the branch instruction is decoded and if it is an unconditional jump or call or a ‘taken’ conditional branch, it means that a new series of instructions from a new address is to be fetched. This new set may not be aligned to be in one ‘row’ that can be fetched simultaneously in one cycle. This misalignment problem can definitely be solved, but extra hardware is involved.
Another matter to be taken care of is that fetching has to be synchronized with the branch prediction hardware, because as we will see later, if a prediction for a branch has ‘failed’, fetching from a new location in the I-cache has to be initiated.
126.96.36.199 | Decode
The multiple instructions fetched have now to be decoded. Decoding is a more involved step, and it is after this stage that the decoded instructions are issued to the execution stage. Decoding RISC instructions is relatively simple, because all instructions are of the same length. For the scalar pipeline discussed in Section 15.1, it was seen that after decoding, registers are read and operands are obtained. However, for a superscalar pipeline, dependencies between instructions have to be identified here itself. It is based on this that ‘issuing’ of instructions to parallel execution units can be done.
Figure 15.16 shows that for a superscalar processor, a buffer is needed before the execution stage, where instructions wait until they are issued. There is a lot of matter to be covered regarding this buffer. We will discuss it in Section 15.2.6. Here, while learning about the decode stage, the point to note is that data dependencies have to be detected just after decoding and then handled. For scalar pipelines to detect dependencies between the register operands, a number of comparators are needed (see Section 15.1.8). When ‘parallel pipelines’ are involved, the complexity of the hardware for this will be very high. The number of comparators needed increases exponentially as the order of superscalar processing. Therefore, superscalar processors handle these problems in a different way.
Figure 15.16 | Decode units of scalar and superscalar architectures
Branch instructions are another source of complexity. The normal sequence of instruction execution is disrupted by branches and such a situation is detected at the decoder. This information must be conveyed to the fetch unit to change the sequence of fetching. Because of all these complexities, the decoding stage of a superscalar processor is quite a complex hardware stage even for RISC processors.
The situation is much worse for CISC. Why? In CISC processors, all instructions are not of the same length. Consider the x86 ISA where instructions of 1, 2, 3, 4 and 5 bytes length are present. When decoding is to be done, the point at which an instruction begins and ends is not clear and this has been sorted out. This makes the decoding complex.
Some later generation processors have resorted to a type of architecture in which CISC instructions are implemented in an RISC format. This is for the Pentium Pro architecture. Here, the decoded instructions are converted to ‘micro-ops’ that are similar to the simple instructions of RISC. The hardware structure that does this conversion is part of the decoder.
All in all, we conclude that the decoder part of a superscalar processor is a complex piece of hardware. It is because of this that many CISC processors take two to three cycles for decoding. Observe that the Pentium pipeline (see Fig. 15.12) shows two decode stages. Pentium Pro has four cycles for decoding. To alleviate the complexity of the decoder unit, a technique called pre-decoding has been devised. This does not reduce the decoder complexity as a whole – instead it transfers a part of decoding to the input of the I-cache where a pre-decoder is present. Partially decoded instructions enter the I-cache. The I-cache sends partially decoded instructions to the decoder and the decoder has a simple task now. Pre-decoding has been used in many superscalar processors. Though it adds a few extra bits to the pre-decoded stream, in the overall estimate, a less complex decoder is obtained. Figure 15.17 shows the pre-decode unit at the input side of the I-cache.
Figure 15.17 | A pre-decode unit at the input of the I-cache
188.8.131.52 | The Execution Stage
After decoding, the instructions have to be ‘issued’ to different execution units. Before doing that, dependencies have to be verified. There may be dependencies between the decoded instructions. If this is a case of ‘true dependency’, then appropriate solutions must be ensured (see Section 15.1.6). Other dependencies like ‘output’ or ‘anti-dependencies’ will have to removed by the technique of register renaming (see Section 15.2.8). Besides these, there are structural and control hazards that need to be taken care of. We now examine these issues in great detail.
What are the problems encountered when decoded instructions are to be issued to execution units?
Up to the decoding of instructions, everything in the pipeline is more or less smooth. Once it is time for execution, various hazards are to be dealt with. Some of these hazards are the same as those observed in scalar pipelines. However, for SSAs, multiple instructions have to be taken care of all the time. Therefore, the effects of hazards get magnified. The various hazards that pertain to SSAs are as follows:
- Structural Hazards
When instructions are decoded, it is possible that some functional units are not free, as they are already in use. This is a structural hazard. This may seriously impede the speed of execution. The solution for this is to have enough number of functional units. A right mix of functional units depends on the statistics of the instruction types in a program stream. The designer has to have a rough estimate of the application arena of the processor to decide this.
- Data Dependencies
We have already discussed RAW hazards in Section 15.1.6. RAW hazards are caused by ‘true’ data dependence, because there is ‘data flow’ between the instructions involved. For scalar pipelines, simple techniques like stalling and forwarding are used. However, for super scalar architectures, the degree of data dependence is greater. More comprehensive techniques have been sought for this, which we will discuss in Section 15.2.6.
- Anti- and Output Dependencies
In Section 15.1.6, WAW and WAR hazards were briefly touched upon. They are hazards, but there is no data flow between the instructions involved in the hazard. They are ‘storage conflicts’ and involve the use of the same register by more than one instruction. Such hazards are eliminated by ‘register renaming’, which is explained in Section 15.2.8.
- Control Hazards
In SSAs, when branch instructions occur, control flow may change and cause problems similar to that in scalar pipelines. To alleviate the effect of this problem, ‘branch prediction’ has been adopted. This is dealt with, in Section 15.1.9.
As of now, we have discussed the superscalar pipeline stages of fetching and decoding. After decoding, operands are to be obtained (from registers and/or memory) and then these instructions are issued to different execution units. The stages after decoding constitute a ‘dynamic execution core’ in which some of the above mentioned hazards are dynamically handled. Before we discuss specific topics such as register renaming and branch prediction, let us try to figure out how ‘execution’ is handled in SSAs. In the execution stage, multiple instructions are to be dynamically scheduled and handed over to different functional units. How this is done is a very interesting part of the understanding of advanced computer architecture. We will discuss it for an extremely simple case of two instructions being scheduled to two different kinds of functional units.
15.2.6 | Dynamic Scheduling of Instructions in Superscalar Architectures
Consider a superscalar processor of order 2. In this, two instructions are to be executed simultaneously. However, execution can be done only if the operands of both the instructions are available.Because of data dependencies between instructions, operands for them may not be available even if the functional units for execution are free for use. A structure named ‘reservation stations’ is used to store operands as and when they become available.
Reservation stations are buffers that store operands until they are passed on to the functional units for execution. They are also referred to as ‘shelving buffers’. Reservation stations may be centralized or distributed. Figure 15.18 shows a distributed reservation station in which each functional unit has its own reservation station. A central reservation station is also possible, where all operands are stored and used as and when they become available. Figure 15.19 shows a centralized reservation station.
Figure 15.18 | Distributed reservation stations
Figure 15.19 | A centralized reservation station
Now, let us delve deeper into the working mechanism of a functional unit in tandem with a reservation station. Let us assume that we have one ‘add’ and one ‘multiply’ functional unit (designated as A and B, respectively) and each has its dedicated reservation station that can hold the two source operands needed for two instructions. Further, assume that an ‘add’ instruction needs two cycles for execution, while ‘multiply’ requires three cycles.
Assume an instant when a particular instruction is ready to be executed, because it has found that the required functional unit is available for it. For example, let it be an ‘add’ instruction. This instruction looks for two operands. If they are available, they should be in the reservation station of the ‘add’ unit. In the reservation stations, if the operands are not explicitly available, it is because the result of a previous instruction is awaited. In that case, a ‘tag’ to where the operands will be obtained later is written in the Reservation Station (RS). Therefore, the ‘add’ unit waits until the operands are available, but this waiting does not cause a pipeline stall – it does not halt the progress of the pipeline activities. Instructions trailing behind the ‘add’ instruction are allowed to be decoded and issued to functional units that are free and available.
How is pipeline stall avoided?
When an instruction’s operand is passed on to a particular reservation station, it is considered as ‘issued’ for execution. Even if a functional unit is not available as yet, the operands can wait in the reservation station without incurring a structural hazard. This is because a reservation station is considered to be part of a functional unit. Let us see how this leads to an ‘out-of-order’ execution.
Figure 15.20 shows an ‘ADD’ and ‘MULT’ units with their respective reservation stations where operands are stored, the operands being OP1 and OP2. In the ‘tag’ position, either the operand or the address of the unit from which the operands will be obtained later is entered, if the operands are currently not available.
Figure 15.20 | Two functional units with their dedicated reservation stations
Consider the following two instructions such that J appears first.
J: ADD R4, R0, R1
K: MUL R2, R1, R3
When they are decoded, they can be dispatched simultaneously to the ADD unit and MULT units, respectively, as both units are free to be used. They have the necessary operands also; therefore, both the ADD and MULT units start executing at time T = 0. The ‘ADD’ unit A needs two cycles and the ‘MULT’ unit B needs three cycles to complete execution.
Assume that the contents of registers are as follows: R0 = 21, R1 = 45, R3 = 54. Let us observe the contents of the two reservation stations after the first cycle. The RS of the ADD unit A holds the operands of instruction J and the RS of the MULT unit B holds the operands of instruction K. See Fig. 15.21.
Figure 15.21 | The contents of reservation stations A and B after one cycle
Now the next set of instructions comes in the very next cycle. They are L and M that are decoded and ready.
L: ADD R3, R4, R0
M: MUL R1, R4, R2
Now, examine the data dependency between the two sets of instructions.
Instruction L needs the operand R4, which is the result of instruction J. The question is, has J been completed by then? Since J needs two cycles, obviously the operand R4 is still pending. The functional unit A is also not available for the second ADD instruction. However, the pipeline is not stalled because in the reservation station (RS) of the ADD unit, a tag is marked corresponding to the operand R4 of instruction L. This tag specifies where the operand will come from later. In this example, the tag is of the functional unit A itself. Thus, for instruction L in the ADD unit, the tag of A is inserted instead of OP1, and the content of R0 is inserted as OP2 (see Fig. 15.22). When addition is completed in instruction J, the sum is broadcasted to the RS through a ‘common data bus (CBB)’ (not shown in the diagram) and is also written to R4. This happens only after the second cycle.
Figure 15.22 | Contents of the reservation stations in the second cycle
Now consider instruction M. In the same cycle, instruction M has also been decoded and is awaiting execution. This is a multiply instruction that needs the ‘MULT’ functional unit. But is it free now? Multiply is a complex instruction that takes three cycles to execute, and currently, instruction K is using this unit. The non-availability of the MULT unit also amounts to a structural hazard. However, since we consider the reservation station to be part of a functional unit, structural hazard is avoided by storing the operands for the instruction in its RS. However, there are other problems as well.
Are both operands available? The operand R4 will be obtained only after instruction J is completed. Therefore, the ‘tag’ of the ‘ADD’ unit, i.e., ‘A’ is written in the space for OP1. The other source operand R2 will be available only after K is completed. The RS marks the operand as awaiting the result from the functional unit B. This operand gets a ‘tag’ of the MULT unit ‘B’ instead of the actual operand R2 (see Fig. 15.22). In short, both OP1 and OP2 of instruction M are ‘awaited’ and only their tags are written in the reservation station. Instruction M will start execution only after three cycles by which time, both its operands are obtained.
What advantage has come out of using such reservation stations?
The total pipeline is not stalled, as already the two sets of instructions have been triggered to start execution as and when their respective functional units are free and their operands are available. The ‘data dependency’ between the instructions has not caused a stall of the pipeline. The concept of a forwarding path (see Section 15.1.7) is used here, by having a common data bus (CDB), that forwards the result of computations directly to reservation stations. Instructions trailing behind these two sets of instructions can be decoded and handed over to other functional units.
The above discussed ideas are adapted from the Tomasulo’s algorithm, which was a classical algorithm formulated in 1967. This algorithm (in modified form) is used in all superscalar processors. In this, the concept of reservation stations and a common data bus is used to avoid structural and RAW hazards. Register renaming (see Section 15.2.8) has also been incorporated, as the reservation stations are actually renamed registers where operands are temporarily kept.
Instructions are executed in their functional units and their results may be in buffers until ‘writeback’ to the destination registers occur. For store instructions, data must be written into the D-cache. When these actions are done, we say that the instruction is ‘retired’.
A ‘reorder buffer’ forms part of the dynamic execution core of a superscalar processor. We have just discussed the out-of-order execution engine. However, finally, results have to be written back to registers in program order. For this, there is a reorder buffer, which is a structure having knowledge of all instructions that are being executed or have completed execution.
In practice, the reservation stations, functional units and the reorder buffer have to operate in unison because any instruction needs to be issued to a functional unit, keep its operands in reservation stations, and finally arrange its results in a reorder buffer. Processors may have the reservation stations and reorder buffer as one physical structure, where operands enter at one end and results stream out from the other end. This is a complex structure called an ‘instruction window’. This window must have information about the status of all instructions, which may be in one of the following states: waiting to be executed, being executed or completed execution.
Interrupts and Exceptions
During the course of execution of multiple instructions, disruptions may occur in the form of interrupts and exceptions. The former refers to requests from I/O devices to which the processor must respond. Exceptions are error generated interrupts. In both cases, the execution of the current instructions must be stopped at the point of these ‘disruptions’. The machine state, i.e., the context of the processor must be saved, and it must be retrievable once the interrupting program has been processed. Interrupts and exceptions are acknowledged and serviced only when the currently executing instructions are completed, i.e., have finished their execution. To allow context retrieval after interrupts have been processed, it is necessary for completion of instructions to be in program order. This reordering is taken care of by a reorder buffer that follows the execution units.
Next, we will see two important and new techniques that have made superscalar processors very efficient.
15.2.7 | Branch Prediction
It was in the Pentium processor that Intel first introduced the concept of ‘branch prediction’. Later, processors incorporated more sophisticated branch prediction techniques that served to improve performance to a much larger degree.
Let us see why branch prediction is necessary in the first place. As per statistics, a general purpose program (in contrast to scientific programs) contains instructions of which around 20% are branches. Most of the branches are conditional also. We have already seen what the effects of branching are, for the case of a scalar processor, and we can easily realize that this high number can seriously degrade performance.
What are the effects of a conditional branch instruction?
There are following two issues:
- Conditional Branch instructions cause pipeline stalls because it has to wait for the condition to be evaluated and the branch address to be calculated.
- When a branch is ‘taken’, instructions are to be fetched from a new address. All the ‘pre-fetched’ instructions are useless and the pipeline is to be flushed.
Such problems are aggravated for superscalar processing. If around 20% are branch instructions, it implies that every fifth instruction is a branch. For a superscalar processor of order 5, when five instructions are fetched simultaneously, each set is likely to have at least one branch instruction. Thus, almost every set of instructions is affected.
Branching cannot be avoided, because the power of computing is ingrained in the ‘decision making’ attributed to branches.
What is predicted?
- When a condition is associated with branching, the condition is evaluated and if it is true, a branch is ‘taken’ (T); otherwise the program sequence remains unaltered, i.e., the branch is ‘not taken’ (N). Branch prediction is meant to make a decision before the condition is actually evaluated. This saves time because condition evaluation can be done only in the next cycle. Therefore, execution has to be stalled until the evaluation is done. To quicken the process, a prediction is done and program execution proceeds in the direction of the predicted path.
Meanwhile, the condition is ‘actually’ evaluated. If it is realized that prediction was wrong, the processor state is retraced back to the point where the wrong decision was made, and the alternate path is taken. This, of course, is a major performance loss. However, the gains obtained by a correct prediction are quite high. Successful prediction rates better than 90% are achievable and hence branch prediction is used in all superscalar processors.
- Prediction is not limited to whether it is a T or N branch. The ‘target address’ is also predicted. The assumption is that in loops, the same target is used repeatedly. Also, many times, the same loops are used repeatedly and then, the target address that is written in the form of history can be used without doing address calculation again.
Consider the following example:
This is an x86 program in which the instruction named ‘LOOP’ decrements register CX and branches to ‘NEXT’. The ‘program counter’ value is changed to the address of NEXT, if CX is not equal to 0. When the condition of CX = 0 is reached, the loop is exited. It is easy to realize that NEXT needs to be ‘taken’ 999 times and is ‘not taken’ just once. Lets us see how prediction proceeds here.
Consider the first time, the LOOP instruction is encountered. There is no previous history for this branch. Instead of waiting for condition evaluation, one of the two outcomes is predicted and the program may proceed, for example, by branching to NEXT. The history table records it as a taken (T) branch.
For the simple case, the prediction is based on a single previous event, which is considered as the relevant history. Once it is established that ‘T’ is the correct path, the prediction henceforth is always a ‘T’. This is fine and up to the time that CX = 0, the prediction works fine and the branch is always taken. However, in the last instance, when CX = 0, the prediction fails. What about the statistics of the prediction success? It is very high. A lot of time and effort was saved by not having to wait for the result of condition evaluation all the 1000 times.
Prediction is also applied to the ‘target’. In this case, once the target address is predicted to be ‘NEXT’, it is perfect until the time that the loop is exited.
This example has been used to establish the success statistics of branch prediction. All loops are not this simple, and therefore, tracing of the history may be extended to more events of the past.
Branch Prediction Unit
There are two tables in a branch prediction unit – one is a Branch History Table (BHT) containing the history of whether previous branches were taken or not (T or N) and the other is the Branch Target Buffer (BTB) where previously used target addresses are stored. Figure 15.23 shows both the BHT and BTB of a branch prediction unit.
Figure 15.23 | A typical branch prediction unit (FSM: Finite State Machine)
It is highly probable that previously used targets are likely to be used again. The BTB is the buffer that contains the PC values for taken branches, i.e., the history of target addresses. The BHT and the BTB together form the ‘branch prediction logic’ for a superscalar processor. Note that the table is indexed (pointed) by the PC values, which are marked in the figure as Branch Instruction Address (BIA). (Actually, only a few of the lower bits of the PC value is used.) The Finite State Machine (FSM) shown in the figure predicts the direction of branching based on previous history.
The branch prediction unit is not a standalone unit. It needs to be aligned to other units in the processor microarchitecture. In the dynamic execution core (Section 15.2.6), the instruction window needs information about the speculation done. Branches that have been predicted are known as being in the ‘speculated’ state. When the branch condition is actually verified and the prediction is found to be correct, the branch is then considered to be in the resolved state – if it is a misprediction, the branch is ‘invalid’. For misprediction, the branch unit has to be aligned with the fetch unit to go backward to the point of misprediction and take the correct path.
184.108.40.206 | Static and Dynamic Prediction
Prediction may be static or dynamic. In static prediction, all prediction is made at compile time. There is no way to change the prediction if the program behavior changes. One way of using static prediction is to take a decision that either a ‘taken’ or ‘not taken’ branching be done ‘always’. A second method is to decide on T or N, depending on the sign of displacement. For example, deciding beforehand that if the displacement is positive (a forward branch), the prediction is ‘T’ otherwise it is ‘N’.
These methods are simple but there is no intelligence used and the results may not be very encouraging.
220.127.116.11 | Dynamic Prediction
Dynamic prediction is an intelligent speculation technique done at run time and it is based entirely on the history of branching. The history used may be of one or many previous events. Let us have a look at the simplest types of predictors.
One-bit Dynamic Predictor
This predictor ‘remembers’ only one previous event, whether it was T or N. This is recorded in the BHT. Prediction is based on this single bit history. Later, after evaluation of the condition, if it is seen that the prediction is wrong, this bit is inverted and this is taken as the new history. Now let us discuss a real life situation that can help to understand the whole idea of branch prediction.
Assume you go to college every day of the week except Sunday. On Sunday, you go to a video gaming centre in the city. Assume your navigator is a predictor that decides where you have to go. Further, assume that ‘T’ is for college and ‘N’ is for the video parlor.
Observe Table 15.2 (it is a circular), which repeats every week. On Monday, there is a history of what has been done on Sunday; thus, there is always available a history of what was done the previous day. Note how the prediction proceeds. If the previous history is ‘T’, the actual decision is also T and vice versa. The column marked ‘actual’ correspond to what is the actual action taken.
Table 15.2 An Illustration for the Case of a Single Bit Predictor
How many times does the prediction fail?
Prediction fails twice on a Monday and on a Sunday. Now, let us compare the statistics of this to that of a two-bit predictor.
Let us consider a two-bit predictor that looks at the last two events that occurred.
Here, the history of two previous events is saved as history bits. Only when two consecutive Ts actually occur, the next prediction will change to a ‘T’. If branching occurs continually in the T direction and then an N occurs, the next history bits will be TN. However, the prediction will be changed to N only if two consecutive Ns occur.
The state diagram of Fig. 15.24 has four states TT, (strongly taken) TN (weakly taken), NN (strongly not taken) and NT (weakly not taken). Let us try this technique on our previous example.
Figure 15.24 | State transition diagram for a two-bit predictor
Table 15.3 shows the example of our daily travel and this can be used to illustrate a two-bit predictor. Note that a wrong decision occurs only once; that is, on a Sunday. This is because on that day, the two previous history bits were TT, but the actual is N. On Monday, the history changes to TN, but the prediction still is T. (Only if two Ns occur together, will the prediction change to N).
Table 15.3 An Illustration for the Case of a Two Bit Predictor
It seems obvious that we can use more history bits for better prediction. However, in practice, it has been found that two-bit predictors perform well enough. For more advanced processors, there are better predictors that trace the pattern of previously taken branches and use more advanced algorithms for prediction. Such predictors result in extremely good prediction rates.
15.2.8 | Register Renaming
This is the part of the microarchitecture that takes care of output and name dependencies.
In section 15.1.6, WAR and WAR hazards were mentioned. Let us discuss them once more.
- Case 1: WAW hazard
J: SUB R1, R2, R3
K: ADD R1, R0, R4
- Case 2: WAR hazard
L: ADD R0, R1, R2
M: SUB R1, R3, R5
Take a look at the instruction sets J, K and L, M. It is obvious in both these cases, the register R1 the ‘hazard register’, does not do any data transfer between the instructions, that is, there is no data flow between J and K or L and M. For case 1, we need to ensure a sequential order of writing into R1 that should not be disturbed. In the second case, we should ensure that the reading of R1 should occur before it is written into by M. In an out-of-order execution core, this logical sequence may get disturbed and that is how WAW and WAR hazards occur.
On looking at the instruction sequences, it is clear that there is nothing special about register R1. The programmer has simply used the same register for storage and that is the cause of the problem. If another register is used instead of R1 in one of the instructions of each set, no such havoc will occur. This is the idea behind register renaming; however, the question is whether there are enough registers available. The physical registers seen by the ISA are called ‘architectural registers’. They are limited in number and having more of them will only allow programmers to use still more of them in programs with no respite for such hazards.
The solution used for this is a technique named ‘register renaming’: that is to have shadow registers corresponding to the registers that cause hazards. These shadow registers are not seen by the ISA. They are available in the microarchitecture and take care of situations when WAR and WAW hazards occur. In essence, the hazard registers are replaced temporarily by these shadow registers.
How is shadowing of the architectural registers done?
There is a set of shadow registers called the ‘Rename Register File’ and there is a mapping between the Architectural Register File (ARF) and the Rename Register File (RRF). This mapping is not a one-to-one mapping between the two register files. Usually, the mapping between the two is dynamic. This means that for a register, say R1 in the architectural set, the rename register used in one instance need not be the same as that used in another instance.
Figure 15.25 gives the concept of the mapping performed between the RRF and the ARF. Note that both the files have tags for registers. If a register does not contain the required operand, a tag value that notates the address of the unit from where data will be dispatched is entered. The ‘B’ or busy bit in the ARF indicates that the data in the corresponding register is stale, and hence an update is pending. Therefore, as long as the busy bit is active, no data reading is allowed from the register. The V or valid bit in the RRF indicates that the data can be read directly from the corresponding register in the RRF for any computation. (If the V bit is inactive, it means that the data is not to be read because it has not yet been updated with the correct data.) The ARF will also be updated with this data.
Figure 15.25 | The mapping scheme for register renaming
When data is to be stored in any architectural register, it is first put in a register in the RRF. Any instruction needing the data can take it from the RRF without waiting until data is stored in the ARF. Later, the content of the register in the RRF should be copied into the ARF. Once this is done, the V bit can be de-activated.
What are the issues to be taken care of when the renaming technique is being implemented?
Let us examine a typical problem.
Assume that the architectural register R1 is to be read from. First, it has to be verified if there is a WAR hazard, which means whether there is a pending write to this register. This pending write will be indicated by a ‘busy’ bit for the corresponding register. This means that reading now from the architectural register will not give the right operand. In that case, the rename register corresponding to the architectural R1 is checked.
Let us examine this case in detail.
Case 1: There is a pending write to register R1.
- This may be because the write operation has not been done here, but there is a rename register allocated to R1 into which the write operation has already been done. Therefore, for our read instruction, the required operand may be fetched from the corresponding re-name register.
- However, it may be that the rename register also has not been written with the required data. This is verified by checking the valid bit of the rename register.
- If the required operand is not available either in the ARF or RRF, then our read operation has to wait. Meanwhile, the ‘tag’ of the chosen register is forwarded to the RS.
Case 2: A destination register has to written to.
In this case, first its corresponding rename register is updated and then only the architectural register is written to. This can be done in the same cycle.
In many processors, there is a separate RRF; but in some cases, the RRF is part of the reorder buffer. Register renaming is a very comprehensive technique with different kinds of rename register structures. More detailed reading of this is advised, as more discussion here is beyond the scope of this chapter.
With this, we conclude our discussion on the general organization and techniques of advanced processors. The topic is much too vast to be covered in a single chapter, and as such, only an introduction to the different ideas involved has been attempted here.
Next, we will see the case of a specific superscalar processor incorporating the techniques we have just learned. The P6 (Pentium Pro) architecture was launched in 1995 and it was one of the very successful architectures of Intel. It was used in the processors Pentium II and Pentium III. Even though new and more advanced processors are the current state of the art, P6 has been chosen here to show a direct implementation of most of the concepts discussed in this chapter. Processors with more advanced microarchitectures are discussed in Chapter 17.
15.3 | Pentium Pro (P6) Architecture
This is an IA-32 architecture, which means that the internal registers are 32 bits long. It has a 12-stage superpipelined implementation, whose efficiency is 33% higher than that of Pentium. This directly gives the ability to operate at higher clock frequencies compared to its predecessor. P6 is frequently designated as a CISC processor with RISC-like features. This is because of its decoding technique of converting instructions to micro-ops (simple RISC-like instructions) before executing them.
P6 is considered to have a superscalar order of 3, which means that on an average, it can handle three instructions at a time, i.e., three instructions can be fetched, decoded, executed and retired per clock cycle.
Figure 15.26 shows a conceptual block diagram of the microarchitecture of P6. Note that the execution unit is an ‘out-of-order’ unit while the front end and retire units are ‘ in-order’ units. Observe also that the branch prediction unit is linked to the fetch unit as mentioned in Section 18.104.22.168.
Figure 15.26 | Block diagram illustrating the P6 microarchitecture
Figure 15.27 shows that there is an ‘instruction window/pool’ that takes into account all the stages of instruction processing, which comprise the following.
Figure 15.27 | Instruction window
- Fetch/decode unit: this is an in-order unit takes instructions from the I-cache, and decodes them into a series of micro-ops (see Section 22.214.171.124).
- Dispatch/execute unit: this is an out-of-order scheduler for the micro-ops that takes into account all the data and structural hazards and allows execution after resolving them.
- Retire unit: this is also an in-order unit. It has knowledge of when instructions are completed so that they can be ‘committed’ or ‘retired’. This unit must not only take note of which micro-ops are complete, but must also arrange the results in the original program sequence.
The retirement unit of P6 is capable of retiring three micro-ops per clock. This unit includes the reorder buffer (see Section 15.2.8) where the technique of ‘register renaming’ is implemented along with ‘reordering’.
Figure 15.27 shows that that the dynamic scheduling is represented in terms of an ‘instruction pool/window’ into which instructions are fed in and taken out once execution is done.
The key feature of the P6 microarchitecture is an innovative out-of-order execution technique that has been designated as ‘dynamic execution’. Dynamic execution includes three data-processing concepts:
- Branch prediction
- Dynamic scheduling
- Speculative execution
Branch prediction as discussed in Section 15.2.7 is aimed at delivering high performance in superscalar microarchitectures. P6 implements a highly optimized Yeh and Patt (1991) branch prediction algorithm to predict the direction of the instruction stream through multiple levels of branches, procedure calls and returns.
Dynamic scheduling is the scheme in which the flow of data is analyzed at run time to determine data and register dependencies and to pave the way for out-of-order instruction execution (see Section 15.2.6). The out-of-order execution core handles many instructions and executes these instructions in the order that optimizes the use of the processor’s multiple execution units, without compromising data integrity.
Speculative execution is the result of the abovementioned two techniques. It refers to the processor’s ability to execute instructions that lie beyond a conditional branch that has not yet been resolved, and then to commit the results in the order of the original instruction stream. The processor’s out-of-order execution core uses data-dependency analysis to execute all the available instructions in the instruction window. Then, it stores these results in temporary registers before retiring them. Figure 15.28 shows a more detailed view of the implementation of these concepts.
Figure 15.28 | A Detailed view of the dynamic execution core of P6
Figure 15.28 shows a comprehensive picture of the P6 architecture. The following points make the understanding of the block diagram more complete.
- There are five different functional units in the Execution stage.
- The Register Alias table is part of the Register Rename logic.
- The decoding logic has three sections for catering to different levels of complexity of the instructions involved.
- After decoding into micro-codes, a sequencer sends them to the execution unit.
- The Retirement Register File is the same as the ARF of the processor, because when an instruction is ‘retired’, the results are stored in the registers of the processor.
- The rename register file is included in the reorder buffer.
- The memory reorder buffer is for reordering data in memory; it is part of the memory subsystem and interfaces the processor’s out-of-order engine with the memory.
15.4 | VLIW Architecture
The term VLIW coined by the computer architect Joseph Fisher means ‘Very Long Instruction Word’.
This is a type of architecture meant to exploit instruction level parallelism. VLIW-based processors are similar to superscalars in that they have many execution units; in general, they have more functional units than superscalars. The difference between these two classes is the way ILP has been exploited. It is obvious that when many instructions execute simultaneously in different functional units, it is to be ensured that data dependency between them should be nil. In the case of SSAs, ensuring this is the duty of the processor hardware, whereas compliers do this for VLIW processors.
This means that the hardware for VLIWs is much less complex, but the compiler has to be very smart; further, it should be an optimized compiler that knows all about the processor’s microarchitecture, ISA, etc. It must be able to take a stream of sequential instructions and convert it to an instruction word. This instruction word is presented to many different functional units for concurrent execution.
The idea seems quite simple, but the compiler needs to be updated for each new VLIW architecture. This is one major drawback for them. The other major drawback is that it might not be possible always to find sufficient number of independent instructions to keep all the functional units busy. This amounts to waste of resources. VLIW as an idea is widely used in DSP processors that are specialized for signal processing applications. Besides this, there have been processors whose basic architecture can be said to be of the VLIW flavor.
Intel’s first 64-bit processor, Itanium processor, is one such type. The Itanium-1 processor is Intel’s first implementation of the IA-64 ISA. IA-64 is an ISA for the Explicitly Parallel Instruction Computing (EPIC) style of VLIW developed jointly by Intel and HP. It is a 64-bit, 6 issue VLIW processor with four integer units, four multimedia units, two load/store units, two extended precision floating point units and two single precision floating point units. This processor running at 800 MHz has a 10-stage pipeline.
Intel had expected Itanium to perform as a replacement of its x86 architecture, but that did not happen because Itanium did not perform as expected. As such, the x86 architecture retained its importance with Intel opting to produce 64-bit processors based on the x86 superscalar architecture.
Itanium did not get entry into the desktop market, but it has gained acceptance in servers and enterprise systems.
KEY POINTS OF THIS CHAPTER
- Performance enhancement is the motivation factor for processor microarchitectural research.
- Ideally, scalar pipelining brings the IPC to a value of 1.
- Pipeline hazards bring the IPC to low values.
- Extra hardware is needed to resolve hazards.
- True data dependency is the only data hazard that can occur in a scalar pipeline.
- Forwarding is the technique used for resolving true data dependency.
- Superscalar processing is aimed at taking the IPC to values above 1.
- In SSAs, multiple instructions are fetched, decoded, issued and executed.
- Data dependencies have to be resolved before instructions are issued.
- In SSAs, WAR, WAW and RAW hazards can occur.
- Dynamic scheduling of instructions is done to resolve dependencies in superscalar processors.
- An instruction window forms part of the dynamic execution core.
- Pentium Pro is the first processor of Intel that has an ‘out-of-order’ execution core.
- Itanium-1 is the non-x86 processor of Intel based on the VLIW principle.
- Define performance and explain its dependence on the following:
- Semiconductor technology
- Processor microarchitecture
- Why is it necessary for pipeline stages to be uniform?
- What additional hardware is required for pipelining?
- Distinguish between the different types of hazards that can occur in a pipeline?
- Why is dynamic hazard resolution preferred? What are the additional complexities incurred due to this?
- Which is the only type of data hazard that occurs in a scalar pipeline?
- What is a forwarding path in the context of scalar pipelines?
- How many stall cycles are avoided by forwarding in the case of two back to back ALU instructions?
- How do control hazards occur?
- Is delayed branching used regularly for avoiding control hazards?
- In superscalar processors, why are many functional units used?
- In out-of-order execution philosophy, are instructions to be issued in order?
- Why is decoding a very tedious process for superscalar processors?
- Explain the concept of ‘dynamic scheduling’ of instructions.
- How does the use of reservation stations solve the problem of structural hazards?
- What is the concept of an ‘instruction window’ in superscalar processors, specifically the P6 processor?
- What are the items ‘predicted’ in the branch prediction technique?
- What kind of hazards does register renaming attempt to solve?
- What is the key concept involved in the VLIW architecture?
- What is special about the Itanium processor?
- Show how a leading ALU instruction causes a RAW hazard in a scalar pipeline. How much is the penalty incurred in this case? Explain.
- Identify the RAW hazards present in this set of instructions.
MOV R2, [M]
ADD R2, R3, R4
ADD R7, R2, R1
- For question No. 2, draw forwarding paths for reducing the effect of the hazards identified. How much of saving is obtained due to forwarding?
- Consider a pipelined processor. The relative frequencies of operations are 40% for ALU, 20% for branch, and 40% for load/store operations.
Assume that there are no penalty cycles for 40% of the branch instructions, while ALU instructions cause a two-cycle penalty. Load/store, and 60% of the branch operations cause a three-cycle penalty. What is the CPI obtained?
- Answer the following questions with respect to the Tomasulo’s algorithm, from which the idea of ‘dynamic scheduling‘ originated.
Assume the following:
- The initial contents of the registers are as follows: R0 = 7, R4 = 67, R8 = 34, R2 = 12.
- Subtraction and addition takes two cycles for completion and use the same functional unit.
- Division and multiplication take three cycles for completion and use the same functional unit.
- Two instructions are issued every cycle.
- What is the function of a reservation station?
- What are ‘tags’ and how are they used in this algorithm?
- What is ‘CDB’ and how does it make itself useful in this setup?
- In which cycles do the instructions B and C, D start their execution?
- For the four instructions given below, show the contents of the RS after the first, second and third cycles.
A: R8 = R0 − R4
B: R2 = R0 / R8
C: R8 = R8 + R0
D: R0 = R8 * R2