Optimizations
From "Programming the 32x FAQ" by Toshiyasu Morita
There are several types of optimization applicable to the 32x:
- CPU optimizations
- Bus bandwidth optimization
- Salvaging wasted time
- Using special hardware effectively
CPU optimizations are the basic processor-specific optimizations.
The SH2 has a free bus access cycle on every longword-aligned instruction, so try to place the instructions that read or write memory on longword boundaries. In particular, try to insert register-to-register operations between multiply-and-accumulate instructions.
Related problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100163
There's a workaround for this problem, which allows loops and locations to be aligned to cache boundaries in the .data segment. Normally doing something like this:
.align 2
do_col_loop_low:
swap.w r2,r0
would zero-fill the padding, which of course would not work on the SH-2, since nop is encoded as 0x0009 there. .align does allow you to specify a custom fill value as the second argument:
.align 2, 9
However, the fill value is applied one byte at a time, so this produces opcodes of 0x0909, which are also invalid. The solution that works is to use the .p2alignw directive, which takes a two-byte fill value:
.p2alignw 2, 0x0009
do_col_loop_low:
swap.w r2,r0
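To make the difference between the two fill styles concrete, here is a small host-side C sketch. It only models the padding bytes; it is not assembler code, and the function names (pad_with_byte, pad_with_word, opcode_at) are hypothetical. It assumes a big-endian target, as the SH-2 is.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of a byte-wise fill, like the second argument of .align:
   every padding byte receives the same 8-bit value. */
void pad_with_byte(uint8_t *pad, size_t len, uint8_t fill)
{
    for (size_t i = 0; i < len; i++)
        pad[i] = fill;
}

/* Model of a 16-bit fill, like .p2alignw on a big-endian target:
   each pair of padding bytes holds the fill word, high byte first. */
void pad_with_word(uint8_t *pad, size_t len, uint16_t fill)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        pad[i]     = (uint8_t)(fill >> 8);
        pad[i + 1] = (uint8_t)(fill & 0xFF);
    }
}

/* Read back the index-th big-endian 16-bit opcode from the padding. */
uint16_t opcode_at(const uint8_t *pad, size_t index)
{
    return (uint16_t)((pad[2 * index] << 8) | pad[2 * index + 1]);
}
```

With a two-byte gap, the byte-wise fill of 9 reads back as the opcode 0x0909, while the word fill of 0x0009 reads back as a valid nop.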
When writing data to the frame buffer in 256 color mode, try to accumulate at least two pixels in a register and then do a single word or longword write.
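A minimal C sketch of this idea follows. The names pack2 and write_span are hypothetical, and the sketch assumes that the left pixel of each pair sits in the high byte of the word (big-endian layout), that the destination is word-aligned, and that the pixel count is even.

```c
#include <stdint.h>
#include <stddef.h>

/* Combine two 8-bit palette indices into one 16-bit word.
   Assumption: the left pixel occupies the high byte. */
static inline uint16_t pack2(uint8_t left, uint8_t right)
{
    return (uint16_t)(((uint16_t)left << 8) | right);
}

/* Copy `count` pixels (assumed even) to a word-aligned destination,
   issuing one word write per pixel pair instead of two byte writes. */
void write_span(volatile uint16_t *dst, const uint8_t *src, size_t count)
{
    for (size_t i = 0; i + 1 < count; i += 2)
        *dst++ = pack2(src[i], src[i + 1]);
}
```

The same pattern extends to four pixels per longword write; the word version is shown only because it is the shortest to read.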
If you have a routine which performs many small writes and your CPU is in split-cache mode (2k cache/2k RAM) then try to accumulate your small writes in the on-chip RAM, and then write the data to SDRAM as one big block. This avoids "handing off" the bus back and forth between the SH2s which costs 2-3 clock cycles.
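The staging approach can be sketched in C roughly as below. This is an illustration under assumptions, not code from the FAQ: stage_t, stage_init, stage_put, stage_flush and the 64-longword buffer size are all hypothetical, and on real hardware the staging buffer would have to be placed in the on-chip RAM (for example through the linker script), which the sketch does not do.

```c
#include <stdint.h>

#define STAGE_WORDS 64u  /* hypothetical size; must fit in the on-chip RAM */

typedef struct {
    uint32_t  buf[STAGE_WORDS]; /* would live in on-chip RAM on real hardware */
    unsigned  used;             /* number of longwords currently staged */
    uint32_t *dst;              /* next destination longword in SDRAM */
} stage_t;

void stage_init(stage_t *s, uint32_t *dst)
{
    s->used = 0;
    s->dst = dst;
}

/* Write everything staged so far to SDRAM as one contiguous block. */
void stage_flush(stage_t *s)
{
    for (unsigned i = 0; i < s->used; i++)
        s->dst[i] = s->buf[i];
    s->dst += s->used;
    s->used = 0;
}

/* Record one small write locally; touch SDRAM only when the buffer fills. */
void stage_put(stage_t *s, uint32_t value)
{
    s->buf[s->used++] = value;
    if (s->used == STAGE_WORDS)
        stage_flush(s);
}
```

The caller is expected to call stage_flush once more at the end so that a partially filled buffer also reaches SDRAM.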
The performance penalty due to bus contention has been measured at about 6-10%:
http://cyberwarriorx.com/sh2-assembly-optimizations
Since the three processors of the 32x share the ROM, and the two SH2s share the SDRAM, there tends to be much bus contention. Reducing bus usage by performing fewer reads and writes helps considerably.
The 68000 fights with the SH2 for ROM access, and tends to hog the ROM since it is slow at accessing memory. Also, since the 68000 has no cache, if the code is in ROM it will access the ROM for every single instruction it executes! The 68000 running code in ROM can seriously bring a 32x to its knees.
There is typically a lot of wasted time in programs waiting for certain hardware events to happen. Typically these circumstances aren't very obvious.
When the frame buffer bit is toggled, the frame buffer does not actually change until the next VBLANK. Most games immediately busy-wait for the bit to change state, which wastes that time. There is usually quite a bit of CPU time which can be recovered if the game code flow is reordered in this fashion:
- Toggle frame buffer bit
- Perform AI - player movement, enemy movement
- Perform math (if 3-D game)
- Busy wait to make sure frame buffer has swapped
- Write to frame buffer
- Go back to the first step (toggle frame buffer bit)
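The benefit of this ordering can be illustrated with a small host-side simulation in C. Nothing here touches real 32x registers: sim_t, the tick counts, and the function names are hypothetical, and time is modelled as an abstract number of ticks until the swap takes effect.

```c
/* Simulated state: ticks left until the requested swap takes effect,
   and how many ticks were spent doing nothing but polling. */
typedef struct {
    unsigned ticks_to_vblank;
    unsigned wasted_polls;
} sim_t;

/* Useful work consumes simulated time while the swap is pending. */
static void do_work(sim_t *s, unsigned ticks)
{
    s->ticks_to_vblank =
        (ticks >= s->ticks_to_vblank) ? 0 : s->ticks_to_vblank - ticks;
}

/* Busy-wait until the swap has happened, counting each idle poll. */
static void wait_for_swap(sim_t *s)
{
    while (s->ticks_to_vblank > 0) {
        s->ticks_to_vblank--;
        s->wasted_polls++;
    }
}

/* Naive order: toggle, wait immediately, then do AI and math. */
unsigned frame_naive(unsigned vblank_delay, unsigned ai, unsigned math)
{
    sim_t s = { vblank_delay, 0 };  /* toggle frame buffer bit */
    wait_for_swap(&s);              /* all of the delay is wasted */
    do_work(&s, ai);
    do_work(&s, math);
    return s.wasted_polls;
}

/* Reordered: toggle, do AI and math, and only then wait. */
unsigned frame_reordered(unsigned vblank_delay, unsigned ai, unsigned math)
{
    sim_t s = { vblank_delay, 0 };  /* toggle frame buffer bit */
    do_work(&s, ai);                /* player and enemy movement */
    do_work(&s, math);              /* 3-D math, if any */
    wait_for_swap(&s);              /* only the remainder is wasted */
    /* writing to the frame buffer would happen here */
    return s.wasted_polls;
}
```

In this model, a swap that takes 100 ticks costs the naive loop 100 idle ticks, while the reordered loop with 30 ticks of AI and 20 ticks of math idles for only the remaining 50.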
There are many useful bits of hardware in the 32x; some are: