DC Memory organization
Dreamcast main memory subsystem
v1.2, 2001-03-15

Note: Table 3.2.1 contains the essential timing figures. Remember that
      the cycles given are buscycles (100MHz), not SH4 core cycles!

Table of Contents

0.	Introduction
0.1	Quickstart
1.	Dreamcast main memory configuration
2.	SDRAM operation
2.1	  Overview
            Figure 2.1.1: Model of an SDRAM chip
2.2	  Commands
2.3	  Read/write operation
2.4	  Aborting and pipelining bus transactions
3.	SDRAM in the Dreamcast
3.1	  Access philosophy
3.2	  Access timing in RASDown mode
	    Table 3.2.1: Common access timings
	    Figure 3.2.1: CPU burst read, no row active
	    Figure 3.2.2: CPU burst read, row hit
	    Figure 3.2.3: CPU burst read, row miss
	    Figure 3.2.4: CPU burst write, no row active
	    Figure 3.2.5: CPU burst write, row hit
	    Figure 3.2.6: CPU burst write, row miss
	    Figure 3.2.7: DMA reads, row hit
	    Figure 3.2.8: DMA writes, row hit
3.3	  Optimizing memory access patterns
3.4	  Accessing other memory areas
4.	  References

0. Introduction

  SDRAMs are essentially DRAMs on steroids, with a minor interface on top
of them that allows some degree of pipelining.

  SDRAMs are designed to transfer data in fixed-size bursts. Initiating a
transfer takes some cycles, but after that the burst of data is transferred
very quickly (one data item per cycle). Subsequent accesses may also
be faster if they are in the vicinity of each other.

  Getting good performance out of an SDRAM-based memory subsystem requires
the programmer to pay attention to the size and location of data items.
Ensuring that data objects are as small as possible, and that related objects
are stored either close to each other, or really far from each other (see
Chapter 3.3), is essential for high-bandwidth operation.

All cycle values referred to in this document are bus cycles, not CPU core
cycles.

1. Dreamcast main memory configuration

  The Dreamcast has 16MB of main memory, supplied in the form of two 8MB SDRAM
chips. These chips are clocked at the SH4's full bus frequency, 100MHz.

  Each of the two main memory SDRAM chips in the Dreamcast is
512k x 32 bits x 4 banks large. The two chips are connected in tandem to the
CPU's 64bit bus - one chip handles the upper 32 bits of each access, the other
the lower 32 bits - so the configuration is (from the CPU's point of view)
identical to having a single 512k x 64 bits x 4 banks SDRAM chip.
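
  The sizes above can be cross-checked with a little arithmetic. This is
only a sketch; the constant names are made up for illustration and the
figures are the ones quoted in this chapter.

```python
# Figures quoted in the text: 512k words x 32 bits x 4 banks per chip,
# two chips side by side on the 64bit bus.
WORDS_PER_BANK = 512 * 1024
BANKS = 4
CHIP_WIDTH_BITS = 32
CHIPS = 2

chip_bytes = WORDS_PER_BANK * (CHIP_WIDTH_BITS // 8) * BANKS
total_bytes = chip_bytes * CHIPS
bus_width_bits = CHIP_WIDTH_BITS * CHIPS
# Bank size of the combined "single 64bit chip" view.
bank_bytes = total_bytes // BANKS

print(chip_bytes // 2**20, "MB per chip")
print(total_bytes // 2**20, "MB total on a", bus_width_bits, "bit bus")
print(bank_bytes // 2**20, "MB per bank")
```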

2. SDRAM operation

2.1 Overview

                      Address/control bus
  |                           |||                                            |
  |    +++--------------------+++---+++--------------------- ... ---+++      |
  |    |||                          |||                             |||      |
  |    |||         Bank 0           |||        Bank 1               ...      |
  |    |||      +-----------+       |||      +-----------+                   |
  |  +------+   |   Row 0   |     +------+   |   Row 0   |                   |
  |  | Bank |---|   Row 1   |     | Bank |---|   Row 1   |   ...   Bank k-1  |
  |  | ctrl |   |    ...    |     | ctrl |   |    ...    |                   |
  |  +------+   |  Row n-1  |     +------+   |  Row n-1  |                   |
  |             +-----------+                +-----------+                   |
  |               |||||||||                    |||||||||                     |
  |             +-----------+                +-----------+                   |
  |             |Sense Amps |                |Sense amps |                   |
  |             +-----------+                +-----------+        .........  |
  |               |||||||||    +--------+      |||||||||          |||||||||  |
  |               +++++++++----|Bus ctrl|------+++++++++---- ... -+++++++++  |
  |                            +--------+                                    |
                                Data bus

         Figure 2.1.1: Model of an SDRAM chip: n rows x m bits/row x k banks

  An SDRAM chip has one set of address/control pins, and one set of data pins.
These make up the SDRAM's interface to the outside world. (See fig. 2.1.1)
  Inside the chip, there are several (usually two or four) identical
submodules called 'banks'.
  Each bank contains three parts: Bank control circuitry, a memory array,
and a row of sense amplifiers.
The memory array, in turn, is made up of a set of rows of bits. A row can be
subdivided into a set of words, each of which is the size of the data bus.
The length of a row is usually equal to (or a bit smaller than) the number
of rows in the memory array.

  Each bank covers one segment of the addressable space; that is, the highest
1 or 2 bits of the address are used to select the bank. The row address is
given by the following address bits, and finally the column index is given
by the lowest-order address bits.

  Before a given memory cell can be accessed, the appropriate row must be
activated in the right bank. Activation connects it to the sense amplifiers,
which will translate the small (0.5v or less) charges in the memory array to
acceptable CMOS levels. Once the row is active, its contents can be
read/written over the data bus.
  One row can be active in each bank, independent of the other banks.

  Activation takes a few cycles. If another row is already active in that bank,
then that row must first be deactivated. That procedure also takes a few
cycles.

  When reading/writing to the SDRAM, the bus control module will transfer data
between the appropriate portions of the active row in the bank in question
and the data bus. The bus control module allows the rows to be many
times longer than the width of the data bus.

  In addition to this, the memory cells leak charge; inactive rows leak slowly,
active rows leak quickly. Therefore, the rows periodically need to be
rewritten (typically once every 4ms). A row must not be open too long
either (typically max 100us). The normal refresh handles both these issues.

2.2 Commands

  A few different commands can be sent to the SDRAM on the control/address
bus. Here is a subset:

  REF		      - Deactivates all active rows, and refreshes one row in
		        each bank (the SDRAM keeps an internal counter which
			specifies which row is next to be refreshed).
  ACTIV(bank, row)    - Activates the given row in the given bank. No other
                        row may be active in that bank at the time.
  PRE(bank)	      - 'Precharges', that is, deactivates the currently active
		        row in the given bank. If no row is active in the bank,
		        this is a no-op.
  PALL		      - Precharges all banks.
  READ(bank, column)  - Reads a burst of data from the currently active row
                        in the given bank, starting at the specified column.
		        The data will be written to the databus.
  WRIT(bank, column)  - Writes a burst of data to the currently active row
                        in the given bank, starting at the specified column.
		        The data will be read from the databus.
  READA(bank, column) - Reads a burst of data, then precharges the bank.
  WRITA(bank, column) - Writes a burst of data, then precharges the bank.
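
  The row state rules behind these commands can be modelled with a toy
class. This is a sketch only: the class and method names are invented for
illustration, data and timing are ignored, and a real SDRAM does not report
errors like this.

```python
class Bank:
    """Toy model of the active-row state of one SDRAM bank."""

    def __init__(self):
        self.active_row = None  # None means no row is active

    def activ(self, row):
        # ACTIV requires that no other row is active in this bank.
        if self.active_row is not None:
            raise RuntimeError("ACTIV with a row already active; PRE first")
        self.active_row = row

    def pre(self):
        # PRE deactivates the active row; a no-op if none is active.
        self.active_row = None

    def read(self, column):
        # READ needs an active row; returns which (row, column) is accessed.
        if self.active_row is None:
            raise RuntimeError("READ with no active row")
        return (self.active_row, column)

    def reada(self, column):
        # READA is a READ followed by a precharge of the bank.
        accessed = self.read(column)
        self.pre()
        return accessed


bank = Bank()
bank.activ(5)
print(bank.read(3))    # row 5 stays active afterwards
print(bank.reada(1))   # row 5 is closed afterwards
print(bank.active_row)
```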

2.3 Read/write operation

  READ/READA commands initiate a series ('burst') of consecutive reads from
the active row. There is no need for more communication after the command has
been sent; all reads will be carried out without any handshaking.
  Each read ('beat') will deliver one full set of data to the data bus
(8/16/32/64 bits, depending on the data bus width of the SDRAM), and then
update the reading address within the row in anticipation of the next beat.

  The number of beats in a burst has to be set by the CPU when the SDRAM is
initialized. Normal values are 1, 2, 4, 8 beats, or enough beats to deliver
the whole row.

  The reading address must be aligned to an integer word boundary. As the read
progresses, the next address is computed in one out of a few different ways.
The most commonly used scheme is the sequential burst sequence: The access
address is increased to the next word address modulo (burst length) -- that is,
if the current burst length is 4, and (access address) modulo 4 == 3, then
the next access address will be 3 words back, rather than 1 word ahead.
This mode allows CPUs to begin fetching a cacheline at an arbitrary position
within the line - if the word at offset #2 into a 4-word cacheline is
immediately needed by the CPU, it can fetch the words in order 2, 3, 0, 1 by
starting a fetch at word 2.
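
  The wrap-around order can be sketched as follows. The function name is
made up for illustration; it only computes the order in which word
addresses are visited.

```python
def sequential_burst_order(start_word, burst_length):
    """Word addresses visited by a sequential burst starting at start_word.

    The burst stays inside the burst_length-aligned block that contains
    start_word and wraps around at the end of that block.
    """
    base = start_word - (start_word % burst_length)
    return [base + (start_word + beat) % burst_length
            for beat in range(burst_length)]


print(sequential_burst_order(2, 4))
print(sequential_burst_order(7, 4))
```

  Starting at word 2 of a 4-word line gives the order 2, 3, 0, 1 described
above; starting at word 7 (offset 3 in the second block) gives 7, 4, 5, 6.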

  Write operation is identical to read operation, except that the SDRAM will
read from the data pins, rather than write to them. Also, as there is no
write command that writes only a subset of a word, there is one
disable/enable signal available per byte; the bus controller can assert
only some of those signals, thereby telling the SDRAM to update only some
of the bytes of the next word to be written.
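
  The effect of the per-byte signals can be illustrated like this. It is a
sketch for a 64bit word; the function name and the convention that bit i of
the mask enables byte i are assumptions, not the actual signal polarity of
the SDRAM.

```python
def masked_write(old_word, new_word, byte_enables):
    """Merge new_word into old_word, updating only the enabled bytes.

    Bit i of byte_enables set means byte i (bits 8*i .. 8*i+7) is written.
    This bit assignment is a hypothetical convention for illustration.
    """
    result = old_word
    for byte in range(8):
        if byte_enables & (1 << byte):
            mask = 0xFF << (8 * byte)
            result = (result & ~mask) | (new_word & mask)
    return result


# Update only the two lowest bytes of a 64bit word.
print(hex(masked_write(0x1111111111111111, 0xAABBCCDDEEFF0011, 0b00000011)))
```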

2.4 Aborting and pipelining bus transactions

  A burst read/write sequence can be prematurely aborted by issuing another
read/write command to the SDRAM which will result in a new bus access before
the first burst sequence has finished.

  When issuing a read/write, the bank number and column address given on
the address bus will be latched into the SDRAM. Once the first data word is
read/written via the data bus, the contents of the address and control buses
are no longer needed. Another command can then be given while one of the banks
is still reading/writing via the data bus, as long as that command will not
directly interfere with the ongoing bus transaction.
  Suitable commands can be row activation/deactivation in another bank, or
a read/write command for one of the banks. (The read/write command must not
be sent too soon however, or the ongoing bus transaction will be aborted before
it has finished.)
  By overlapping data bus transfers and command issuing, the SDRAM can reach
throughput rates closer to its theoretical maximum: 1 data word per cycle.

3. SDRAM in the Dreamcast

  As the Dreamcast has two 32bit SDRAMs in parallel, and the bus to the SDRAMs
is 64bit wide, this chapter will assume the simplified view that there is a
single 64bit SDRAM chip in the system.
  The fictitious '64bit SDRAM' has 2kB/row, 2048 rows, and 4 banks. This means
that each bank spans 4MB of memory.

  An address can be split into bank-, row- and column-bits in the following
way:

  bbrr rrrr rrrr rccc cccc c000             <-- 24bit memory address
   |         |        |      | 
  bank      row    column   sub-8byte position
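
  The bit layout above translates directly into shifts and masks. The
sketch below uses field widths counted from the diagram (2 bank bits,
11 row bits, 8 column bits, 3 byte bits); the function name is invented,
and only the low 24 bits of the address are looked at.

```python
BANK_BITS, ROW_BITS, COLUMN_BITS, BYTE_BITS = 2, 11, 8, 3


def split_address(addr):
    """Split a 24bit main memory offset into (bank, row, column, byte)."""
    addr &= (1 << 24) - 1
    byte = addr & ((1 << BYTE_BITS) - 1)
    column = (addr >> BYTE_BITS) & ((1 << COLUMN_BITS) - 1)
    row = (addr >> (BYTE_BITS + COLUMN_BITS)) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (BYTE_BITS + COLUMN_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, column, byte


print(split_address(0x000800))  # first byte of row 1 in bank 0
print(split_address(0x400000))  # first byte of bank 1
```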

  The SH4 has a bus state controller, which operates in parallel with the
rest of the SH4 core.

The SH4 bus runs at 100MHz, which is half the SH4 core frequency.

3.1 Access philosophy

  The SH4 will always access the SDRAM in 32-byte chunks (4-beat bursts) -
even when doing non-cached reads/writes. [When doing non-cached writes, the
SH4 will tell the SDRAM what data to ignore using some bus control signals.]
Because all DMA uses the SH4's on-board DMA controller to generate the
addresses, DMA also accesses the SDRAM in 32-byte chunks.
  Thus there are only two operations which are of interest to the programmer:
the 32-byte read, and the 32-byte write.

  The bus state controller keeps track of which banks have open
rows, and omits unnecessary PRE/ACTIV commands whenever possible.

  The bus state controller can be in two modes.

  In the first mode ('RASDown', which is set up by the BootROM code),
read/write commands are issued as READs/WRITs, and thus leave the accessed
row active after the operation. The bus state controller keeps track of which
rows are open in which banks, and omits unnecessary PRE/ACTIV commands
whenever possible (when there is a 'row hit'). The only way that rows become
deactivated is via 'row misses' and the periodically issued REF commands.
  This mode gives better performance, unless the row hit rate is exceptionally
low.

  In the second mode, read/write commands are issued as READAs/WRITAs, and
thus deactivate the row after the operation. This means that one ACTIV command
must be issued before each READA/WRITA. This mode has lower maximum
throughput, but may be faster when executing algorithms that have a poor
row hit/miss ratio.

  The bus state controller will not pipeline CPU accesses, only DMA
accesses. This means that when the CPU requests a memory access, the
bus state controller will wait until the data bus is idle before issuing
any commands to the SDRAM.

Normally, the bus state controller will run in RASDown mode. The
BootROM code sets up the bus state controller to operate in this mode.

3.2 Access timing in RASDown mode

ACTIV(bank, row) takes 3 cycles.
PRE(bank) takes 2 cycles.
READ(column) takes 3 cycles for setup, and then 4 cycles during which
             the data arrives (one beat per bus cycle).
WRIT(column) begins writing data immediately: 4 cycles to receive
             the data. However, the WRIT command must come at least
             2 cycles into the operation (some control bus signals
             may be delayed that much since the previous bus access).

  The setup time of one CPU access can not be overlapped with the data
access time of a previous CPU access. However, DMA accesses will pipeline
in this fashion, and since they are massively sequential, there will be a
lot of row hits which results in a transfer speed close to 8 bytes/cycle.

  The values in Table 3.2.1 indicate that the maximum read speed for the CPU
would be about 457MB/s, the maximum write speed about 533MB/s, and the maximum
DMA speed (both read and write) 800MB/s.
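
  As a rough cross-check, the row hit cycle counts from Table 3.2.1 can be
turned into peak rates. The sketch assumes back-to-back row hits with no
refresh in between, and 1MB = 10^6 bytes; the names are illustrative only.

```python
BUS_HZ = 100_000_000   # bus clock, from the text
BURST_BYTES = 32       # one 4-beat burst on the 64bit bus

# Row hit cycle counts taken from Table 3.2.1.
ROW_HIT_CYCLES = {"CPU read": 7, "CPU write": 6, "DMA read/write": 4}


def peak_mb_per_s(cycles_per_burst):
    """Peak rate in MB/s, with 1MB = 10**6 bytes (an assumption)."""
    return BURST_BYTES * BUS_HZ / cycles_per_burst / 1e6


for name, cycles in ROW_HIT_CYCLES.items():
    print(f"{name}: {peak_mb_per_s(cycles):.0f} MB/s")
```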

  When the CPU or DMA requests a memory access to a given address, the SH4
will issue different commands to the SDRAM:

* If the correct row is active in the bank in question ('row hit'),
  the READ/WRIT is sent directly to the SDRAM.
* If no row is active in the bank in question ('no row active'), an ACTIV
  command is sent, followed by the READ/WRIT. This case occurs
  very rarely in the Dreamcast.
* If another row is active in the bank in question ('row miss'), a PRE command
  is first issued to close that row. This is followed by an ACTIV to activate
  the appropriate row, and finally the READ/WRIT is given.
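
  The three cases can be turned into a small cost model for CPU accesses.
This is a sketch under simplifying assumptions: it uses the cycle counts
from Table 3.2.1, ignores refresh and the extra idle cycle mentioned in the
table note, and all names are invented for illustration.

```python
# CPU burst cycle counts per case, taken from Table 3.2.1.
READ_CYCLES = {"hit": 7, "no row active": 10, "miss": 12}
WRITE_CYCLES = {"hit": 6, "no row active": 7, "miss": 9}


def classify(open_rows, bank, row):
    """Return which of the three cases an access to (bank, row) falls into."""
    if bank not in open_rows:
        return "no row active"
    return "hit" if open_rows[bank] == row else "miss"


def total_cycles(accesses):
    """Sum bus cycles for a list of ('r' or 'w', address) CPU accesses."""
    open_rows = {}  # bank number -> currently active row
    total = 0
    for kind, addr in accesses:
        bank = (addr >> 22) & 0x3     # 4MB per bank
        row = (addr >> 11) & 0x7FF    # 2kB per row
        case = classify(open_rows, bank, row)
        total += (READ_CYCLES if kind == "r" else WRITE_CYCLES)[case]
        open_rows[bank] = row         # RASDown: the row stays active
    return total


# Two read streams in the same bank but different rows...
same_bank = [("r", 0x000000), ("r", 0x000800), ("r", 0x000020), ("r", 0x000820)]
# ...versus the same streams placed in different banks.
two_banks = [("r", 0x000000), ("r", 0x400000), ("r", 0x000020), ("r", 0x400020)]
print(total_cycles(same_bank), total_cycles(two_banks))
```

  In this model the interleaved streams cost 46 cycles when they share a
bank and 34 cycles when they sit in separate banks.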

See table 3.2.1 for cycle timings of the different cases.

  Figure 3.2.5 deserves a comment: According to the SDRAM specification, one
of the control signals is delayed by two cycles, so the bus state controller
must assert the signal two cycles before the bus transaction begins.
  Since the CPU is unable to predict what that signal should be set to in
advance of the bus transaction, the bus state controller has to idle for
two cycles while the control signal in question is propagating through
the SDRAM.
  (DMA accesses, on the other hand, are long sequences of increasing
addresses. The control signal can then be predicted ahead of time, and the
gap between bus transactions eliminated.)

  Table 3.2.1: Common access timings

  CPU burst read, no row active   10 cycles      (Fig. 3.2.1)
  CPU burst read, row hit          7 cycles      (Fig. 3.2.2)
  CPU burst read, row miss        12 cycles      (Fig. 3.2.3)
  CPU burst write, no row active   7 cycles      (Fig. 3.2.4)
  CPU burst write, row hit         6 cycles      (Fig. 3.2.5)
  CPU burst write, row miss        9 cycles      (Fig. 3.2.6)
  DMA burst reads, row hit         4 cycles each (Fig. 3.2.7)
  DMA burst writes, row hit        4 cycles each (Fig. 3.2.8)

  Note: If a row miss happens during the first cycle after a write,
        the bus state controller will idle for a cycle before
        sending the PRE command.

    Chart coding:

    **** marks the time when a row/column address, or PRE command
         is being sent
    ---- and .... mark when burst beats 1,2,3,4 transfer the data
    ++++ marks when address/data of other memory accesses are
         being performed (only used in the DMA figures)

    Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Row address     ************  |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address    |   |   | ****************************  |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data arrives      |   |   |   |   |   |  ----....----.... |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.1: CPU burst read, no row active

                      0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address  ****************************  |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data arrives      |   |   |  ----....----.... |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.2: CPU burst read, row hit

    Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Precharge       ********  |   |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Row address       |   | ************  |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address    |   |   |   |   | ****************************  |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data arrives      |   |   |   |   |   |   |   |  ----....----.... |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.3: CPU burst read, row miss

                      0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Row address     ************  |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address    |   |   | ****************  |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs      |   |   |  --- ... --- ...  |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.4: CPU burst write, no row active

                      0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Idle cycles     ********  |   |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address    |   | ****************  |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs      |   |  --- ... --- ...  |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.5: CPU burst write, row hit

                      0   1   2   3   4   5   6   7   8   9  10  11  12
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Precharge       ********  |   |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Row address       |   | ************  |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address    |   |   |   |   | ****************  |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs      |   |   |   |   |  --- ... --- ...  |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.6: CPU burst write, row miss

                     -1   0   1   2   3   4   5   6   7   8   9  10  11
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 1  | ****************  |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 2  |   |   |   |   | ++++++++++++++++  |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 3  |   |   |   |   |   |   |   |   | ++++++++++++++++
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Read data 0      ++++++++++++++++ |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Read data 1       |   |   |   |  ----....----.... |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
    Read data 2       |   |   |   |   |   |   |   |  ++++++++++++++++ |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.7: DMA burst reads, row hit

                      -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 0 ++++  |   |   |   |   |   |   |   |   |   |   |   |
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 1   |   |   |   | ****  |   |   |   |   |   |   |   |
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Column address 2   |   |   |   |   |   |   |   | ++++  |   |   |   |
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs 0    ++++++++++++++++ |   |   |   |   |   |   |   |   |
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs 1     |   |   |   |  ----....----.... |   |   |   |   |
                       |   |   |   |   |   |   |   |   |   |   |   |   |
    Data departs 2     |   |   |   |   |   |   |   |  ++++++++++++++++ |
                       |   |   |   |   |   |   |   |   |   |   |   |   |

	Figure 3.2.8: DMA burst writes, row hit

3.3 Optimizing memory access patterns

* Align your data structures such that they span as few cache lines as
  possible (16-byte align vectors, 32-byte align matrices).

* If you have a group of related data items, which are going to be accessed
  at roughly the same time, and there will not be many accesses elsewhere
  in that bank, then put them into the same row by 2kB-aligning the group
  of items; this avoids some row activations/deactivations.

* If performing some kind of streaming operation (wading through lots of
  data and performing some simple operation on it), put each data stream
  into a separate memory bank; this avoids lots of unnecessary row
  activations/deactivations.

* Remember that the cache is direct-mapped. If you are interleaving accesses
  to two arrays that both are aligned to at least 16kB, there is a risk of
  cache thrashing. Offset one array some bytes to the side to solve that
  problem.

* Use PREF to fetch data some cycles before you'll be accessing it. If you
  don't prefetch appropriately, a cache miss later on will stall the entire
  SH4 pipeline until the data is available (8+ CPU-cycles from SDRAM).

* If you are creating a data stream from scratch -- not modifying existing
  data at those locations -- then use MOVCA to allocate cache lines without
  causing memory reads.
  This avoids reading in dummy data (due to cache misses & line fills)
  which is soon going to be overwritten anyway.
  If you're not going to access the data shortly either, use the
  Store Queues to write out the data.

* When you are writing out a data stream, and you will not overwrite it
  in the near future, use the OCBWB instruction to force the data to be
  written back to memory.
  The two main reasons for triggering cache write-backs manually are that
  you can avoid memory contention to some degree, and spurious cache
  writebacks later on may cause lots of SDRAM active row switching.
  Again, Store Queues are an alternative.

* Keep in mind that Store Queues bypass the cache: If you have previously
  read from a memory area, and subsequently write to it using Store Queues,
  use the OCBI instruction to invalidate the corresponding cache lines.
  Otherwise the cache might contain stale data.
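
  The alignment and placement advice above comes down to plain address
arithmetic. The sketch below assumes 32-byte cache lines and a 16kB
direct-mapped cache, as the list implies, plus the 2kB row size from
chapter 3; the helper names are invented for illustration.

```python
LINE_BYTES = 32          # cache line size
CACHE_BYTES = 16 * 1024  # direct-mapped cache size implied by the 16kB remark
ROW_BYTES = 2 * 1024     # SDRAM row size


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def same_row(addr_a, addr_b):
    """True if both addresses fall into the same SDRAM row of the same bank."""
    return addr_a // ROW_BYTES == addr_b // ROW_BYTES


def cache_conflict(addr_a, addr_b):
    """True if two different lines compete for the same cache entry."""
    index_a = (addr_a % CACHE_BYTES) // LINE_BYTES
    index_b = (addr_b % CACHE_BYTES) // LINE_BYTES
    return index_a == index_b and addr_a // LINE_BYTES != addr_b // LINE_BYTES


print(hex(align_up(0x1234, LINE_BYTES)), hex(align_up(0x1234, ROW_BYTES)))
print(cache_conflict(0x0000, 0x4000))           # both 16kB-aligned: thrashing risk
print(cache_conflict(0x0000, 0x4000 + LINE_BYTES))  # offset by one line: no conflict
```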

3.4 Accessing other memory areas

When switching between writing to the SDRAM, and writing to other memory
areas (either to other memory, or to memory-mapped devices), idle cycles
may be inserted.

A 32-byte burst to the Tile Accelerator seems to usually take roughly
9 cycles, give or take max 2. The Tile Accelerator might be busy (this
happens mainly when starting/ending a new list type, or when submitting
degenerate/invalid primitives); then the access will stall for a while
(delays up to 500 cycles have been observed).

4. References

+ The bus state controller and SDRAM interface of the SH4 are well
  described in the SH7750 Hardware Manual.

+ A detailed description of how to set up an SH4-SDRAM system is
  found in Application Note #92, named "SH-4 Interface to SDRAM".

+ The KM432S2030CT-G8 SDRAM (which is used for main memory in some
  Euro DCs, at least) documentation is available from Samsung.

