Dreamcast Emulation Developer - The source for dreamcast development.

SH4 Pipelining overview
Overview of Hitachi SH7750 manual; Section 8, Pipelining
--------------------------------------------------------
v0.7, 2000-08-10


This is merely an overview of the information contained in
Section 8. It will hopefully aid in understanding how the
pipelining mechanisms of the processor work.

In a few cases, the processor operation has been oversimplified such that
some boundary cases are not correctly described; this has been done in
the name of clarity.




General layout of a CPU instruction (Figure 8.1, #1-#4):

  I D EX/SX NA/MA S

  I	- Instruction Fetch
  D	- Instruction Decode / Register Read
  EX/SX	- Operation
  NA/MA	- Memory Access (if any)
  S	- Result Writeback

  Each stage takes 1 cycle.

In an instruction pipeline that adheres strictly to the above model,
a MOV R0,R1 or an ADD R0,R1 would read its operands at the D stage,
perform the addition (or nothing in the MOV case) at the EX stage,
and update the destination register at the S stage.

This would however incur a latency of 3 cycles on any ALU operation,
which is far too much to be practical (to avoid resource conflicts,
the code would have to be paired like FPU code).

Therefore, some 'shortcuts' are implemented in the processor:

* 1-step operations (Figure 8.2, #1) which operate purely on Rns,
  have the result of the operation available after the EX stage (so
  the latency is 1 cycle for those instructions).

* If a MOV Rm,Rn is executed in the first pipe, then the result of the
  MOV can be fed directly into the second pipe as well --
  therefore the latency of the MOV is effectively 0 cycles;
  an instruction in the 2nd pipe that is started in the same cycle
  as the MOV will not stall.
  This holds true for all moves that don't access memory
  (including FMOVs, and also FLDI0/FLDI1).

* Normal loads (Figure 8.2, #2) have the result of the load
  available after the MA stage (so the latency is 2 cycles for those
  instructions).
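
The three shortcuts above can be summarised as a small lookup table.
The Python sketch below is only an illustration: the labels (mov_reg,
alu_1step, load) and the helper stall_cycles are made up for this
example and do not come from the manual, and the model ignores pairing
rules and all other resource conflicts.

```python
# Result latencies in cycles, as described in the list above.
# The keys are hypothetical labels, not names from the manual.
RESULT_LATENCY = {
    "mov_reg": 0,    # register-to-register MOV: result forwarded to the other pipe
    "alu_1step": 1,  # 1-step operation on general registers: ready after EX
    "load": 2,       # normal load: ready after MA
}


def stall_cycles(producer, cycles_between):
    """Cycles a dependent instruction has to wait.

    cycles_between is how many cycles after the producer the consumer
    is issued (0 means the same cycle). Simplified model: only the
    result latency is considered.
    """
    return max(0, RESULT_LATENCY[producer] - cycles_between)


print(stall_cycles("mov_reg", 0))    # consumer issued together with the MOV
print(stall_cycles("load", 1))       # consumer issued one cycle after a load
```

In this model a consumer issued in the same cycle as a MOV Rm,Rn never
waits, while a consumer issued directly after a load waits one cycle.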


Consider the 'latency' of an instruction to be a fairly accurate
measure of how many cycles it takes *in your source code* before
the result is available. That is, if you have an instruction with a
3-cycle latency at one place, you should not access its result during
the following 0-6 instructions [the exact number depends on how well
the following instructions pair, etc.]. This is not entirely true due to
some kinds of resource conflicts (particularly when mixing FPU and
CPU ops); to be 100% certain, you need to follow the
instruction execution patterns (Figure 8.2) and look for collisions.





General layout of an FPU instruction (Figure 8.1, #5-#7):

  I D [F0] F1 F2 FS
           F3

  I	- Instruction Fetch
  D	- Instruction Decode / Register Read
  F0	- Computation 0 (for some operations only)
  F1	- Computation 1
  F2	- Computation 2
  F3	- Computation 3 (for fdiv/fsqrt only; replaces F1
			 but is multi-cycle)
  FS	- Result Writeback

  Each stage takes 1 cycle, except for F3.

FP ops will read their data during D stage, perform their computations
during F0-F2, and write back the result during FS.
Unlike in the CPU core, few 'shortcuts' are implemented here (sometimes
the result can be forwarded to another execution unit in the same cycle
as it is written, but I do not know the exact behaviour).


No FP op can pass another FP op; they will all go through
the F0-FS stages in the order they are issued by the processor.


Most single-precision computations (Figure 8.2, #36) adhere to the
following execution pattern:

  I D F1 F2 FS

Double-precision computations need multiple D cycles, because the FP
register file cannot provide the data quickly enough.
For example, this is FADD/FMUL/FSUB (Figure 8.2, #39):

  I D d d d d  F1 F2 FS

  (The slot marked  above is due to the internal pipelining)

FIPR has the F0 cycle (Figure 8.2, #42):

  I D F0 F1 F2 FS

FTRV has both multiple read cycles and the F0 cycle (Figure 8.2, #43):

  I D d d d F0 F1 F2 FS


Since F0-FS are different stages, one FP op can execute in each stage
at any given time. The peak FP throughput is therefore at a theoretical
maximum of one FP op per cycle.
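
To make the one-op-per-cycle claim concrete, here is a minimal Python
sketch that lays out back-to-back single-precision ops using the
I D F1 F2 FS pattern from above. It assumes that op k is issued in
cycle k and that nothing ever stalls; the function names are made up
for this example.

```python
PATTERN = ["I", "D", "F1", "F2", "FS"]  # single-precision pattern from above


def pipeline_chart(num_ops):
    """Return one row per op; row[c] is the stage occupied in cycle c.

    An empty string means the op is not in the pipeline in that cycle.
    Assumption: op k is issued in cycle k, with no stalls.
    """
    total_cycles = num_ops + len(PATTERN) - 1
    rows = []
    for k in range(num_ops):
        row = [""] * total_cycles
        for offset, stage in enumerate(PATTERN):
            row[k + offset] = stage
        rows.append(row)
    return rows


def ops_per_cycle(num_ops):
    """Average throughput: ops completed divided by cycles used."""
    return num_ops / (num_ops + len(PATTERN) - 1)


# Print a staircase diagram for four ops.
for row in pipeline_chart(4):
    print(" ".join(f"{stage:<2}" for stage in row))
```

Each column of the chart holds every stage at most once, and the
average throughput approaches one op per cycle as the run of ops gets
longer.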

Since there are no shortcuts (well, this needs investigation) in the
FP core, these are the rules for FP register resource allocation (Section 8.3):
* The result of an FP op is available for reading after the FS stage.
* The result of an FP op is available for writing after the F1 or F2
  stage (depends on the instruction).
* The source operands of an FP op are available for writing immediately after
  they are no longer being read by a previous instruction (after the D and
  possible extra 'd' stages are done). Exception: FTRV (longer).

FP ops generally write back their result one 32-bit register at a
time. Therefore, the first half of a double-precision operation is usually
available one cycle before the second half.

FP ops which update FPSCR generally update it one cycle after the result
registers.

In practice, the latencies (when considered in the same way as the CPU
instructions) become:
* The latency for the result of a normal FP op is the number of stages
  from the D stage to the FS stage.
  [More complex rules for double-precision operations; see
   Instruction Execution Patterns (Figure 8.2) or Execution Cycles (Table 8.3)]
* When trying to write to the result of an FP op, latency is reduced by
  1 or 2 cycles.
* When trying to write an operand of an FP op, latency is:
  * 0 cycles for single-precision computations
  * 2 cycles for double-precision FMUL/FADD/FSUB
  * 5 cycles for FTRV



Parallel-executability (Table 8.2):

The processor has two instruction pipelines, but most resources are shared.

1. The CO group cannot execute in parallel with any group.
2. The MT group can execute in parallel with all groups [except CO].
3. The other groups can execute in parallel with all groups
   except themselves [and CO].
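
The three rules translate directly into a small predicate. In the
Python sketch below, the group names other than CO and MT (EX, BR, LS,
FE) are the ones used in the manual's table and are not defined in
this overview, so treat them as an assumption; the function name
can_pair is made up for this example.

```python
# Instruction groups. Only CO and MT are named in this overview; the
# other four names are assumed from the manual's table.
GROUPS = {"MT", "EX", "BR", "LS", "FE", "CO"}


def can_pair(first, second):
    """Return True if two instruction groups may issue in the same cycle."""
    if first not in GROUPS or second not in GROUPS:
        raise ValueError("unknown instruction group")
    if "CO" in (first, second):   # rule 1: CO pairs with nothing
        return False
    if "MT" in (first, second):   # rule 2: MT pairs with anything but CO
        return True
    return first != second        # rule 3: others pair unless same group


print(can_pair("MT", "MT"))
print(can_pair("EX", "EX"))
print(can_pair("EX", "LS"))
```

Note that the test only looks at the groups; two instructions that pass
it can still stall on ordinary resource conflicts.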

The processor is capable of fetching two instructions from the instruction
cache at any given cycle. If they meet the above criteria, then they will
be issued during the same cycle, essentially running in parallel.

The test is performed after the I-stage (right after the instructions
have been fetched). 
* If the test passes, both instructions enter the D-stage at the same time.
  Normal resource conflicts may cause other stalls.
* If the test fails, the second instruction will wait until the first
  instruction has finished in the D-stage. (Most instructions only
  take one cycle there, but some control instructions lock the D-stage for
  a while to ensure serial execution.)

The processor will issue another instruction as soon as possible; if the
parallel-executability test fails, the processor will fetch another
instruction and try to pair it with the delayed instruction.
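
The issue behaviour described above can be approximated with a greedy
loop. The Python sketch below is a simplified model under stated
assumptions: it ignores resource stalls and the D-stage locking done by
some control instructions, it assumes one issue slot per cycle at
minimum, and the group names other than CO and MT are assumed from the
manual's table. Both function names are made up for this example.

```python
def can_pair(first, second):
    """Parallel-executability test on instruction groups (simplified)."""
    if "CO" in (first, second):
        return False
    if "MT" in (first, second):
        return True
    return first != second


def issue_cycles(groups):
    """Count the cycles needed to issue a list of instruction groups.

    Greedy model: the oldest unissued instruction always issues; the
    next one issues with it if the test passes, otherwise it is delayed
    and becomes the first candidate of the following cycle.
    """
    cycles = 0
    i = 0
    while i < len(groups):
        if i + 1 < len(groups) and can_pair(groups[i], groups[i + 1]):
            i += 2   # both instructions enter the D-stage together
        else:
            i += 1   # second instruction waits and is retried
        cycles += 1
    return cycles


print(issue_cycles(["EX", "LS", "EX", "EX", "LS"]))
```

In the example, the first EX/LS pair issues together, the EX/EX pair
fails the test, and the delayed EX then pairs with the following LS.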