Overview of Hitachi SH7750 manual; Section 8, Pipelining
This is merely an overview of the information contained in
Section 8. It will hopefully aid in understanding how the
pipelining mechanisms of the processor work.
In a few cases, the processor operation has been oversimplified such that
some boundary cases are not correctly described; this has been done in
the name of clarity.
General layout of a CPU instruction (Figure 8.1, #1-#4):
I D EX/SX NA/MA S
I - Instruction Fetch
D - Instruction Decode / Register Read
EX/SX - Operation
NA/MA - Memory Access (if any)
S - Result Writeback
Each stage takes 1 cycle.
In an instruction pipeline that adheres strictly to the above model,
a MOV R0,R1 or an ADD R0,R1 would read its operands at D stage,
perform the addition (or nothing in the MOV case) at the E stage,
and update the destination register at the S stage.
This would however incur a latency of 3 cycles on any ALU operation,
which is far too much to be practical (to avoid resource conflicts,
the code would have to be paired like FPU code).
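To illustrate where the 3-cycle figure comes from, here is a toy Python sketch of the strict model. All names are hypothetical; this is just arithmetic on stage positions, not a description of the hardware itself:

```python
# Strict 5-stage model, no forwarding/shortcuts (illustrative only).
# Stage position within an instruction: I=0, D=1, EX=2, MA=3, S=4.
D_STAGE, S_STAGE = 1, 4

def stage_cycle(issue_cycle, stage_pos):
    """Cycle in which a given stage executes, for an instruction
    whose I-stage is at issue_cycle (one stage per cycle, no stalls)."""
    return issue_cycle + stage_pos

def strict_alu_latency():
    """In the strict model a result is written at the S stage but
    read at the D stage, so a dependent instruction must trail the
    producer by S - D = 3 cycles."""
    return S_STAGE - D_STAGE
```

In other words, an instruction issued in cycle 0 writes its result in cycle 4, and a consumer's D stage must not occur before then.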
Therefore, some 'shortcuts' are implemented in the processor:
* 1-step operations (Figure 8.2, #1), which operate purely on Rns,
have the result of the operation available after the EX stage (so
the latency is 1 cycle for those instructions).
* If a MOV Rm,Rn is executed in the first pipe, then the result of the
MOV can be fed directly into the second pipe as well --
therefore the latency of the MOV is effectively 0 cycles;
an instruction in the 2nd pipe that is started in the same cycle
as the MOV will not stall.
This holds true for all MOVs that don't access memory
(FMOVs, and also FLDI0/FLDI1).
* Normal loads (Figure 8.2, #2) have the result of the load
available after the MA stage (so the latency is 2 cycles for those
instructions).
Consider the 'latency' of an instruction to be a fairly accurate
measure of how many cycles it takes *in your source code* before
the result is available. That is, if you have an instruction with a
3-cycle latency at one place, you should not access its result during
the following 0-6 instructions [the exact amount depends on how well
the following instructions pair, etc]. This is not entirely true due to
some kinds of resource conflicts (particularly when mixing FPU and
CPU ops); to be 100% certain, you need to follow the
instruction execution patterns (Figure 8.2) and look for collisions.
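The latencies above can be turned into a tiny stall calculator. This is a hypothetical sketch of the simplified single-pipe model only (it ignores pairing and resource conflicts); the latency numbers come from the text, not from measurement:

```python
# Result latencies from the text (simplified model):
# 1-step EX ops, forwarded register MOVs, and normal loads.
LATENCY = {
    "alu":  1,  # result available after EX
    "mov":  0,  # register-to-register MOV: forwarded, effectively free
    "load": 2,  # result available after MA
}

def stall_cycles(producer, distance):
    """Stall cycles for a consumer issued `distance` cycles after
    the producer (0 = same cycle, 1 = the next cycle, ...)."""
    return max(0, LATENCY[producer] - distance)
```

For example, using a loaded value in the very next cycle costs one stall cycle, while an ALU result is already free one cycle later.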
General layout of an FPU instruction (Figure 8.1, #5-#7):
I D [F0] F1 F2 FS
I - Instruction Fetch
D - Instruction Decode / Register Read
F0 - Computation 0 (for some operations only)
F1 - Computation 1
F2 - Computation 2
F3 - Computation 3 (for fdiv/fsqrt only; replaces F1
but is multi-cycle)
FS - Result Writeback
Each stage takes 1 cycle, except for F3.
FP ops will read their data during D stage, perform their computations
during F0-F2, and write back the result during FS.
Few of the 'shortcuts' implemented in the CPU core exist here (sometimes
the result can be forwarded to another execution unit in the same cycle
as it is written, but I do not know the exact behaviour).
No FP op can pass another FP op; they will all go through
the F0-FS stages in the order they are issued by the processor.
Most single-precision computations (Figure 8.2, #36) adhere to the
following execution pattern:
I D F1 F2 FS
Double-precision computations need multiple D cycles, because the FP
register file cannot provide the data quickly enough.
For example, this is FADD/FMUL/FSUB (Figure 8.2, #39):
I D d d d d F1 F2 FS
(The extra 'd' slots above are due to the internal pipelining.)
FIPR has the F0 cycle (Figure 8.2, #42):
I D F0 F1 F2 FS
FTRV has both multiple read cycles and the F0 cycle (Figure 8.2, #43):
I D d d d F0 F1 F2 FS
Since F0-FS are different stages, one FP op can execute in each stage
at any given time. The theoretical peak FP throughput is therefore
one FP op per cycle.
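Under that idealized view (one FP op entering per cycle, no stalls, an F0-FS pipeline four stages deep), the total time for a burst of ops is just the usual pipeline-fill formula. A hypothetical sketch:

```python
def fp_burst_cycles(n_ops, depth=4):
    """Idealized cycle count for n FP ops issued back-to-back
    through a `depth`-stage F0..FS pipeline: (depth - 1) cycles
    of pipeline fill, then one op completing per cycle."""
    return n_ops + depth - 1
```

A single op takes the full 4 stages, but ten back-to-back ops finish in 13 cycles rather than 40, which is where the one-op-per-cycle peak comes from.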
Since there are no shortcuts (well, this needs investigation) in the
FP core, these are the rules for FP register resource allocation (Chapter 8.3):
* The result of an FP op is available for reading after the FS stage.
* The result of an FP op is available for writing after the F1 or F2
stage (depends on the instruction).
* The source operands of an FP op are available for writing immediately after
they are no longer being read by a previous instruction (after the D and
possible extra 'd' stages are done). Exception: FTRV (longer).
FP ops generally write back their result one 32-bit register at a
time. Therefore, the first half of a double-precision operation is usually
available one cycle before the second half.
FP ops which update FPSCR generally update it one cycle after the result
is written back.
In practice, the latencies (when considered in the same way as the CPU
latencies above) are as follows:
* Latency for result from a normal FP op is number of stages from D to FS stage.
[More complex rules for double-precision operations; see
Instruction Execution Patterns (Table 8.2) or Execution Cycles (Table 8.3)]
* When trying to write to the result of an FP op, latency is reduced by
1 or 2 cycles.
* When trying to write an operand of an FP op, latency is:
* 0 cycles for single-precision computations
* 2 cycles for double-precision FMUL/FADD/FSUB
* 5 cycles for FTRV
Parallel-executability (Table 8.2):
The processor has two instruction pipelines, but most resources are shared.
1. CO group cannot execute in parallel with any group
2. MT group can execute in parallel with all groups [except CO]
3. other groups can execute in parallel with all groups,
except itself [and CO]
The processor is capable of fetching two instructions from the instruction
cache at any given cycle. If they meet the above criteria, then they will
be issued during the same cycle, essentially running in parallel.
The test is performed after the I-stage (right after the instructions
have been fetched).
* If the test passes, both instructions enter the D-stage at the same time.
Normal resource conflicts may cause other stalls.
* If the test fails, the second instruction will wait until the first
instruction has finished in the D-stage. (Most instructions only
take one cycle there, but some control instructions lock the D-stage for
a while to ensure serial execution.)
The processor will issue another instruction as soon as possible; if the
parallel-executability test fails, the processor will fetch another
instruction and try to pair it with the delayed instruction.
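The three group rules above can be summarized as a small predicate. A sketch in Python, assuming the group names from the manual (MT, CO, and the remaining groups such as EX/BR/LS/FE):

```python
def can_pair(a, b):
    """True if instructions from groups a and b may be issued in
    the same cycle, per the three parallel-executability rules."""
    if "CO" in (a, b):
        return False   # rule 1: CO never pairs with anything
    if "MT" in (a, b):
        return True    # rule 2: MT pairs with all remaining groups
    return a != b      # rule 3: other groups pair, but not with themselves
```

Note that rule 2 as stated does not exclude MT from pairing with itself, so two MT instructions may issue together; any other group paired with its own kind fails the test.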