SH2 on the 32X
- NTSC: 23.01136 MHz
- PAL: 22.801467 MHz
The SuperH ABI passes the first four parameters in r4 to r7 and any others on the stack. Registers r8 to r14, pr, macl and mach MUST be preserved if you use them, normally by pushing them on the stack (r15 is the stack pointer). r0 to r7 may be freely changed without saving. A result of 4 bytes or less is returned in r0; a result of up to 8 bytes is returned in r0 and r1.
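As a small illustration of the convention above, here is a C sketch whose comments note where a compiler following this ABI would place each value. The function names are made up for the example, and the register comments simply restate the rules above rather than actual compiler output.

```c
/* Hypothetical example functions; comments restate the ABI rules above. */

/* a -> r4, b -> r5, c -> r6, d -> r7, e -> passed on the stack.
   The 4-byte result comes back in r0. */
int sum5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}

/* a -> r4, b -> r5. The 8-byte result comes back in r0 and r1. */
long long mul_wide(int a, int b)
{
    return (long long)a * (long long)b;
}
```

A hand-written assembly routine called from C would read its fifth argument from the stack and must leave r8 to r14, pr, macl and mach as it found them.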
The SH-2 and 32X hardware is all big-endian.
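Data prepared on a little-endian PC therefore has to be stored most-significant byte first. A minimal sketch of byte-order-independent helpers in C follows; the names `be32_load` and `be32_store` are illustrative and not taken from any 32X SDK.

```c
#include <stdint.h>

/* Hypothetical helper names; not part of any 32X SDK. */

/* Read a 32-bit value stored most-significant byte first. */
uint32_t be32_load(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value most-significant byte first. */
void be32_store(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

Because the helpers work byte by byte, they give the same result on the build machine and on the SH2.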
The SH2 has a five-stage pipeline: each instruction takes five cycles to complete (with a few exceptions). However, a new instruction can enter the pipe on every cycle, so after the five cycles for the first instruction, each further instruction completes one cycle after the previous one, for an effective cycle count of one. Conditional branching can result in the pipe being flushed, meaning four more cycles. You'll want to read the pipeline section of the Hitachi SH2 Programming Manual for details (section 7). In general, though, you can count most instructions as one cycle long... as long as the code is cached and makes no outside memory fetches/stores.
The 32X hardware manual tells you how many cycles it takes to read or write the various blocks in the SH2 address map. For example, reading SDRAM takes 12 cycles since it does a burst read, but a write takes only 2 cycles since writes are not burst. A burst read fetches 8 words (one cache line) in one go of 12 cycles, or 1.5 cycles per word on average (the fastest that non-cache memory can be read). However, even a read of a single uncached word still does a burst read: 8 words are read in 12 cycles, and the other 7 are tossed out. So reading an uncached word in SDRAM is the slowest thing you can do on the SH-2s. Keeping the burst reads in mind is one of the key things when designing 32X code for maximum speed.
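The arithmetic above can be put into a small cost model. This is only a sketch using the 12-cycle and 8-word figures quoted in the text; the function names are invented, the cached case assumes the data starts on a cache line boundary, and cache-hit cycles are not counted.

```c
#include <stdint.h>

/* Figures taken from the text above; the names are made up. */
enum {
    SDRAM_BURST_READ_CYCLES = 12, /* one 8-word burst read */
    SDRAM_BURST_WORDS       = 8   /* 16-bit words per burst (one cache line) */
};

/* Uncached reads: every single access pays for a full burst. */
uint32_t sdram_uncached_read_cycles(uint32_t accesses)
{
    return accesses * SDRAM_BURST_READ_CYCLES;
}

/* Cached sequential reads starting on a line boundary:
   one burst per cache line touched (cache-hit cycles not counted). */
uint32_t sdram_line_fill_cycles(uint32_t words)
{
    uint32_t lines = (words + SDRAM_BURST_WORDS - 1) / SDRAM_BURST_WORDS;
    return lines * SDRAM_BURST_READ_CYCLES;
}
```

Under this model, eight separate uncached word reads cost 96 cycles of bus time, while the same eight words fetched as one cache line cost 12.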
The test code by bakemono is based on variations of the following loop. It handles four 32-bit words per iteration, with some loop overhead but not too much. (Note that the index is scaled so +1 is the next 32-bit word.) When all memory accesses hit the cache, this takes 11 cycles per iteration.
```
st02:
    mov [r1].d,r2
    mov [r1+1].d,r2
    mov [r1+2].d,r2
    mov [r1+3].d,r2
    add 16,r1
    cmpeq 1,r0
    bfs st02
    add -1,r0
```
Sparse tests only access one 32-bit word every 16 bytes.
Conclusions:
- reading SDRAM is always a 12-cycle burst (confirmed)
- writes to VRAM are 5 cycles per word (after the FIFO is filled)
- cartridge ROM is horrendously slow at 8 cycles per word
- sequential writes to SDRAM are nice and fast
- reading from ROM in the cached region also causes a "burst" but with no timing advantage
Comments on the last point by Chilly Willy:
The ROM is physically accessed by the bus controller as non-burst single reads of a set period (which can be lengthened as needed to wait until the bus arbiter grants access, for example, if the 68000 has been granted the bus to do a read/write, or the other SH2 has the bus). So whether you try to use burst or not makes no difference, as it will always just do "regular" single reads or writes. Where the difference comes in is when you make it do a "burst" access to the cache. Although it doesn't do the burst timing, it still fills a cache line. For sparse access tests, this wastes time since you are doing SPARSE accesses, and the cache doesn't help. Uncached sparse reading will be faster as you're not doing extra reads to fill the cache when it won't be used.
If you don't do sparse reads, the cache WILL help, so the extra cycles spent reading the cache line finally pay off. You might say "But, filling the cache line does the same access cycle as regular reads, so how can it be saving time?" Filling a cache line is handled by the bus interface controller, and has no extra overhead associated with the decode and processing of commands by the SH2. It simply does back to back accesses to fill the cache line. The SH2 will then do its normal read, but straight from the cache, which is a special access that completes much more quickly than a normal read cycle not from the cache. The cache was designed so that accessing it doesn't disrupt the pipeline... no stalls waiting on data read/write.
The hardware division unit can work in parallel with the rest of the CPU.
When a read or write instruction that accesses the division unit is issued while the division unit is operating, the read or write instruction is continuously extended until the operation ends. This means that instructions that do not access the division unit can be parallel-processed.
For 64:32 bit division, the quotient is accessible from two registers: DVDNT and DVDNTL.
The divider's state can't be saved/restored, so make sure that no function used by interrupt handlers uses the divider.
Each SH2 processor has a Direct Memory Access Controller with two channels. Each channel lets you set a source from which to fetch data, a destination to store the data to, a count of how much data to transfer, and a control register. The control register tells the channel whether to increment or decrement (or neither) the source and destination, how big the data units to be transferred are (byte/word/long/16 bytes), and whether to generate an interrupt when the transfer is done; it also reports whether the transfer is done and whether there was an error.
Note that the DMA in the SH2 can use the SDRAM burst read described above when put in 16-byte mode. If you're trying to get the best speed from DMA, put the source data on 16-byte boundaries and use the 16-byte transfer size.
For a 16-byte transfer, the address is incremented by +16 regardless of the SM1 and SM0 values.
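To make the control register settings concrete, here is a hedged C sketch that builds a channel control word for a 16-byte-unit memory-to-memory copy. The macro and function names are hypothetical, and the bit positions are an assumption based on the SH7604 CHCR register description, so check them against the hardware manual before relying on them.

```c
#include <stdint.h>

/* Hypothetical names. Bit positions are an assumption based on the
   SH7604 CHCR register description; check them against the manual. */
#define CHCR_DM_INC   (1u << 14)  /* DM1:DM0 = 01, increment destination */
#define CHCR_SM_INC   (1u << 12)  /* SM1:SM0 = 01, increment source */
#define CHCR_TS_16B   (3u << 10)  /* TS1:TS0 = 11, 16-byte transfer unit */
#define CHCR_AR       (1u << 9)   /* auto-request (memory to memory) */
#define CHCR_IE       (1u << 2)   /* interrupt when the transfer ends */
#define CHCR_DE       (1u << 0)   /* enable the channel */

/* Control word for a 16-byte-unit memory-to-memory copy. */
uint32_t chcr_copy16(int irq_on_done)
{
    uint32_t v = CHCR_DM_INC | CHCR_SM_INC | CHCR_TS_16B | CHCR_AR | CHCR_DE;
    if (irq_on_done)
        v |= CHCR_IE;
    return v;
}

/* True if an address sits on a 16-byte boundary. */
int is_aligned16(uint32_t addr)
{
    return (addr & 15u) == 0;
}
```

A caller would check `is_aligned16()` on the source address before choosing the 16-byte unit, and fall back to a smaller transfer size otherwise.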
The internal cache bus width isn't specified directly, but a couple of things allow you to assume it either IS 32 bits or is fast enough not to matter. The HW manual says it takes one cycle to fetch the data for the CPU regardless of the size requested. It also says the cache data bus uses four longwords to fill a cache line AND that the cache data bus is what the CPU reads to get the data. The cache data width is therefore effectively 32 bits.
The timing sequence when the CPU accesses a peripheral is called a bus cycle, and takes a minimum of 4 clocks on the 68000 and 2 clocks on the SH2*. In addition, wait time arises on the CPU side because the peripheral and the CPU run at different speeds. 1 wait means that the access takes the minimum bus cycle + 1 clock. Accesses from the 68000 and SH2 to every 32X block require waits (as shown below), depending on the kind of access and the operating state.
* Besides taking a wait signal from outside, the SH2 can insert waits through its built-in bus state controller, but once the boot ROM has run only the external wait is used.
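The relationship between wait counts and total access time described above can be written down directly. This is a minimal sketch; the constant and function names are illustrative only.

```c
/* Minimum bus cycle lengths quoted above, in clocks. Names are made up. */
enum { BUS_MIN_68K = 4, BUS_MIN_SH2 = 2 };

/* Total clocks for one access: minimum bus cycle plus wait states. */
unsigned bus_access_clocks(unsigned min_cycle, unsigned waits)
{
    return min_cycle + waits;
}
```

For example, `bus_access_clocks(BUS_MIN_SH2, 6)` comes to 8 clocks, and a 68000 access with no waits stays at 4.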
CPU | min wait | max wait |
---|---|---|
SH2 (Read/Write) | 6 | 15 |
68K (Read/Write) | 0 | 5 |
CPU | min wait | max wait |
---|---|---|
SH2 (Read) | 5 | 12 |
SH2 (Write) | 1 | 3 |
68K (Read) | 2 | 4 |
68K (Write) | 0 | 0 |
The write access times for the SH2 frame buffer assume continuous accesses with no idle cycles in between. When idle cycles are inserted between accesses, the next access time is shortened by the number of idle cycles inserted (but it cannot be shorter than the minimum cycle of 3 clocks).
A 4-word FIFO buffers frame buffer writes. A write therefore takes 5 clocks if the FIFO is full and 3 clocks if it is not.
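As a rough way to estimate a run of frame buffer writes from these two figures, the sketch below assumes the FIFO starts empty, fills after four writes, and then stays full. That ignores any draining during the first writes, so it slightly overestimates; the names are invented for the example.

```c
/* Figures from the text above; names are illustrative. */
enum { FB_FIFO_DEPTH = 4, FB_WRITE_FAST = 3, FB_WRITE_FULL = 5 };

/* Rough clocks for n back-to-back frame buffer writes, assuming the FIFO
   starts empty and stays full once the first four words are queued. */
unsigned fb_write_clocks(unsigned n)
{
    unsigned fast = n < FB_FIFO_DEPTH ? n : FB_FIFO_DEPTH;
    return fast * FB_WRITE_FAST + (n - fast) * FB_WRITE_FULL;
}
```

With this model, four writes cost 12 clocks and ten writes cost 42.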
CPU | min wait |
---|---|
SH2 (Read/Write) | 5 ~ 64 μsec |
68K (Read) | 2 ~ 64 μsec |
68K (Write) | 3 ~ 64 μsec |
A wait of 64 μsec means that the access has to wait for one display line. (If the CPU and the VDP compete for access to the palette, the CPU side has to wait for one line.)
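To get a feel for how expensive a one-line wait is, the sketch below converts a duration in microseconds into whole SH2 clocks using the clock rates listed at the top of this page. The macro and function names are made up for the example.

```c
#include <stdint.h>

/* SH2 clock rates listed at the top of this page, in Hz. */
#define SH2_HZ_NTSC 23011360u
#define SH2_HZ_PAL  22801467u

/* Whole SH2 clocks elapsed in the given number of microseconds. */
uint32_t usec_to_sh2_clocks(uint32_t hz, uint32_t usec)
{
    return (uint32_t)(((uint64_t)hz * usec) / 1000000u);
}
```

A 64 μsec wait works out to 1472 whole clocks on NTSC and 1459 on PAL.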
CPU | wait (const) |
---|---|
SH2 (Read/Write) | 5 |
68K (Read) | 2 |
68K (Write) | 0 |
CPU | wait (const) |
---|---|
SH2 (Read/Write) | 1 |
68K (Read/Write) | 0 |
CPU | wait (const) |
---|---|
SH2 (Read) | 1 |
The 32X SDRAM is specialized for line replacement on an SH2 cache miss: reads transfer in 8-word burst mode*, while writes transfer in 1-word single mode. Access time is fixed at the following values:
Op | time |
---|---|
Read | 12 Clock / 8 Words |
Write | 2 Clock / 1 Word |
* An 8-word burst read takes data in batches of 8 words starting from the specified word address. Because 8 words correspond to a single cache line, this matches the line replacement that happens on a cache miss. But when the SDRAM is read using cache-through, even if the data to be read is only a single word, the SH2's SDRAM access is fixed at an 8-word burst read, and the access takes that much time.
32X Technical Bulletin #32 - SH2 Internal IO Register Access Cycles - [1994-12-08]
Module Name | Minimum Number of Cycles |
---|---|
BSC | 3 |
DMAC | 3 |
DIV | 3 |
UBC | 3 |
INTC | 4 |
MDC (CCR, SBYCR) | 4 |
FRT | 11 |
WDT | 11 |
SCI | 11 |
Access to the internal I/O is done in the following sequence:
- A wait occurs if the bus is determined to be busy 1 cycle after the internal I/O access begins.
- Internal I/O access occurs after the bus master completes the use of the bus.
- After access to the internal I/O is completed, bus access is enabled for the other bus master on hold.
Therefore, the access time to the internal I/O = Wait time + minimum number of cycles
When cycle stealing, the bus is released for each access. During burst transfers, the bus is released after 1 burst is completed.
For example, when the slave side has the bus right, the master side's internal I/O access waits until the slave side releases the bus right.
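The rule above (access time = wait time + minimum number of cycles) can be sketched as a table lookup in C. The minimum cycle counts come from the table above; the identifiers are invented, and the wait time has to be supplied by the caller since it depends on what the other bus master is doing.

```c
/* Minimum cycle counts from the table above; identifiers are made up. */
typedef enum {
    IO_BSC, IO_DMAC, IO_DIV, IO_UBC, IO_INTC, IO_MDC, IO_FRT, IO_WDT, IO_SCI,
    IO_MODULE_COUNT
} io_module;

const unsigned io_min_cycles[IO_MODULE_COUNT] = {
    [IO_BSC] = 3, [IO_DMAC] = 3, [IO_DIV] = 3, [IO_UBC] = 3,
    [IO_INTC] = 4, [IO_MDC] = 4,
    [IO_FRT] = 11, [IO_WDT] = 11, [IO_SCI] = 11
};

/* Access time = cycles spent waiting for the bus + module minimum. */
unsigned io_access_cycles(io_module m, unsigned wait_cycles)
{
    return wait_cycles + io_min_cycles[m];
}
```

For instance, `io_access_cycles(IO_FRT, 0)` gives 11, and a divider access that first waits 5 cycles for the bus gives 8.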