Optimizations

From "Programming the 32x FAQ" by Toshiyasu Morita

Types of optimizations

There are several types of optimization applicable to the 32x:

CPU optimizations
Bus bandwidth optimization
Salvaging wasted time
Using special hardware effectively

CPU optimizations

This is basic processor-specific optimization.

Example - Align memory access on longword boundaries

The SH2 has a free bus access cycle on every longword instruction, so try to align your memory reads and writes on longword boundaries. In particular, try to insert register-to-register operations between multiply-and-accumulate instructions.

Related problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100163

There's a workaround for this problem, which allows loops and locations to be aligned to cache boundaries in the .data segment. Normally doing something like this:

       .align 2
do_col_loop_low:
        swap.w  r2,r0

would zero-fill the padding, which does not work on the SH-2 when the padding can be executed, since nop is encoded as 0x0009 there. .align does allow you to specify a custom fill value as the second argument:

       .align 2, 9

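However, the fill value of .align is a single byte, so this produces opcodes of 0x0909, which are also invalid. The solution that works is to use the .p2alignw directive, which treats the fill value as a two-byte word: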

       .p2alignw 2, 0x0009
do_col_loop_low:
        swap.w  r2,r0

Example - Try to keep the SH2s off the bus

When writing data to the frame buffer in 256 color mode, try to accumulate at least two pixels in a register, and then do a single word or longword write.
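The pixel-pairing idea can be sketched in C as follows. This is only an illustration under stated assumptions: `pack_pixels` and `write_span_words` are hypothetical names, `dst` stands in for a word-aligned frame buffer line, and the packing order assumes the SH2's big-endian layout (left pixel in the high byte).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 8-bit palette indices into one 16-bit word.
   Assumes big-endian pixel order: the left pixel goes in the high byte. */
static inline uint16_t pack_pixels(uint8_t left, uint8_t right)
{
    return (uint16_t)(((uint16_t)left << 8) | right);
}

/* Copy `count` 8-bit pixels from `src` to the word-aligned destination `dst`,
   issuing one 16-bit write per pixel pair instead of two byte writes.
   `dst` is a stand-in for a frame buffer line; `count` must be even. */
static void write_span_words(volatile uint16_t *dst, const uint8_t *src,
                             size_t count)
{
    assert(count % 2 == 0);
    for (size_t i = 0; i < count / 2; i++) {
        dst[i] = pack_pixels(src[2 * i], src[2 * i + 1]);
    }
}
```

The same approach extends to four pixels per longword write when the destination is longword-aligned.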

If you have a routine which performs many small writes and your CPU is in split-cache mode (2k cache/2k RAM), then try to accumulate your small writes in the on-chip RAM, and then write the data to SDRAM as one big block. This avoids "handing off" the bus back and forth between the SH2s, which costs 2-3 clock cycles.

The performance penalty due to bus contention has been measured at about 6-10%.
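A minimal sketch of the staging approach, in C. Everything here is an assumption for illustration: `stage_t`, `stage_put`, `stage_flush`, and the 256-byte `STAGE_SIZE` are hypothetical, and on real hardware the staging buffer would have to be placed in the 2k on-chip RAM (for example through the linker script), which this host-side sketch does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STAGE_SIZE 256 /* hypothetical size; must fit in the on-chip RAM */

typedef struct {
    uint8_t  buf[STAGE_SIZE]; /* stand-in for a buffer in on-chip RAM */
    size_t   used;            /* bytes currently staged */
    uint8_t *dest;            /* next SDRAM address to write to */
} stage_t;

static void stage_init(stage_t *s, uint8_t *dest)
{
    assert(dest != NULL);
    s->used = 0;
    s->dest = dest;
}

/* Write everything staged so far to the destination as one block. */
static void stage_flush(stage_t *s)
{
    memcpy(s->dest, s->buf, s->used);
    s->dest += s->used;
    s->used = 0;
}

/* Stage one byte; only touches the destination when the buffer is full. */
static void stage_put(stage_t *s, uint8_t value)
{
    if (s->used == STAGE_SIZE) {
        stage_flush(s);
    }
    s->buf[s->used++] = value;
}
```

The caller is expected to call `stage_flush` once at the end so that a partially filled buffer is also written out.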

Assembly optimizations

http://cyberwarriorx.com/sh2-assembly-optimizations

Bus bandwidth optimization

Since the three processors of the 32x share the ROM, and the two SH2s share SDRAM, there tends to be much bus contention. Reducing bus usage by performing fewer reads and writes helps considerably.

Example - Move 68000 code into RAM

The 68000 fights with the SH2 for ROM access, and tends to hog the ROM since it is slow at accessing memory. Also, since the 68000 has no cache, if the code is in ROM it will access the ROM for every single instruction it executes! The 68000 running code in ROM can seriously bring a 32x to its knees.

Salvaging wasted time

There is typically a lot of wasted time in programs waiting for certain hardware events to happen. Typically these circumstances aren't very obvious.

Example - Salvaging frame buffer swap time

When the frame buffer bit is toggled, it takes time (until the next VBLANK) for the frame buffer to change. Most games immediately busy-wait for the bit to change state, which wastes that time. There is usually quite a bit of CPU time which can be recovered if the game code flow is reordered in this fashion:

1. Toggle frame buffer bit
2. Perform AI - player movement, enemy movement
3. Perform math (if 3-D game)
4. Busy wait to make sure frame buffer has swapped
5. Write to frame buffer
6. Go to stage 1
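The effect of this ordering can be illustrated with a small host-side simulation in C. Nothing here touches real 32x registers: `fb_sim_t`, `fb_toggle`, `fb_tick`, `fb_swapped`, and `run_frame` are hypothetical names, and time is modeled as abstract ticks until VBLANK. On hardware, the toggle and the poll would be accesses to the frame buffer control register instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated display state (stand-in for the real frame buffer bit). */
typedef struct {
    int requested;       /* buffer the program asked to display (0 or 1) */
    int displayed;       /* buffer currently shown */
    int ticks_to_vblank; /* simulated time left until the swap takes effect */
} fb_sim_t;

/* Stage 1: request a swap; it only takes effect at the next VBLANK. */
static void fb_toggle(fb_sim_t *fb, int vblank_delay)
{
    assert(vblank_delay > 0);
    fb->requested ^= 1;
    fb->ticks_to_vblank = vblank_delay;
}

/* Advance simulated time by one tick; the swap lands when VBLANK arrives. */
static void fb_tick(fb_sim_t *fb)
{
    if (fb->ticks_to_vblank > 0) {
        fb->ticks_to_vblank--;
    }
    if (fb->ticks_to_vblank == 0) {
        fb->displayed = fb->requested;
    }
}

static bool fb_swapped(const fb_sim_t *fb)
{
    return fb->displayed == fb->requested;
}

/* Run one frame in the reordered fashion and return the number of ticks
   spent busy-waiting. `work_ticks` models the AI and math stages. */
static int run_frame(fb_sim_t *fb, int vblank_delay, int work_ticks)
{
    int wasted = 0;
    fb_toggle(fb, vblank_delay);          /* stage 1 */
    for (int i = 0; i < work_ticks; i++) {
        fb_tick(fb);                      /* stages 2-3 overlap the pending swap */
    }
    while (!fb_swapped(fb)) {             /* stage 4 */
        fb_tick(fb);
        wasted++;
    }
    /* stage 5: it is now safe to draw into the back buffer */
    return wasted;
}
```

In this model, a frame with no work between the toggle and the wait spends the whole delay busy-waiting, while a frame whose AI and math take longer than the delay does not wait at all.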

Using special hardware effectively

There are many useful bits of hardware in the 32x; some are:
