JCAP Log #9: Video Part 4

Connor Spangler edited this page May 16, 2019 · 35 revisions

NES CPU-PPU Setup

Video System Implementation

We finally have all the information we need to fully implement a VGA arcade graphics system, minus one final critical consideration. The main and cog RAM sizes and their constraint of the graphics representation solution have already been addressed; however, the Propeller 1 core clock as it pertains to the pixel clock represents one final technical hurdle to overcome, and it will ultimately define the high-level architecture of the system.

Decisions

Thanks to the robust community behind the Propeller 1 microcontroller, a vast amount of communal knowledge can be drawn upon for our own design decisions. Nowhere is this better showcased than in video display. Dozens of developers have created hundreds of different solutions to display a wide variety of video types, resolutions, refresh rates, and other variations. What can be learned from these implementations is that video display of high complexity and quality simply cannot be accomplished in a single cog. This is largely a constraint imposed by the generation of pixel data itself. Some solutions split the scanlines into groups which are assigned to different cogs, while others interlace individual scanlines generated by individual cogs.

In our case, with two layers of indirection and sprite effects to implement, we'll be forced to use a different paradigm altogether: a scanline driver. With this method, one cog is the "display" cog. Its sole job is to take pixel data from main RAM and display it via the video generator circuit. N cogs are then spooled up as "render" cogs. Their job is to generate interleaved scanlines of pixels which are then requested sequentially by the display cog. The choice of this methodology is a direct result of simply doing the math...

Colors


8-bit Color Palette

A critical constraint posed by the "indirect" method of using waitvid discussed in Video Part 2 is that each series of 16 pixels can only have 4 colors: 2 bits per pixel addressing one of the four color bytes. We need 16 colors per 8x8 pixel tile, and even if we only push out 8 pixels per waitvid, we're still restricted to a 4-color palette. The solution to this problem is novel: simply switch the color palette with the pixel palette. By populating the color palette with the colors of the next four pixels, we can directly display them by waitviding each color sequentially, i.e. waitvid pixels, #%%3210. This new paradigm works perfectly at giving us "full color", but requires more waitvids per screen, a timing issue that will need to be addressed.
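To make the palette swap concrete, here is an illustrative Python model (not Propeller code) of the two paradigms. The decode order and function names are assumptions for the sketch; the point is just how the "direct" trick trades more waitvids for arbitrary colors.

```python
# Illustrative model of the two waitvid paradigms described above.
# "Indirect": each 2-bit pixel indexes one of 4 palette color bytes.
# "Direct": the colors long itself carries the next 4 pixel colors,
# and the fixed pixel pattern %%3210 plays them back in order.

def waitvid_indirect(colors_long, pixels_long, n=16):
    """Traditional use: each 2-bit pixel selects one of 4 color bytes."""
    palette = [(colors_long >> (8 * i)) & 0xFF for i in range(4)]
    return [palette[(pixels_long >> (2 * i)) & 0b11] for i in range(n)]

def waitvid_direct(colors_long):
    """Palette-swap trick: emit the 4 colors in the colors long directly."""
    pixels_3210 = 0b11_10_01_00          # the #%%3210 immediate
    return waitvid_indirect(colors_long, pixels_3210, n=4)

# Indirect: 16 pixels, but at most 4 unique colors.
frame = waitvid_indirect(0xAA_BB_CC_DD,
                         0b11_10_01_00_11_10_01_00_11_10_01_00_11_10_01_00)
assert len(set(frame)) <= 4

# Direct: every waitvid emits 4 arbitrary colors, so a 16-pixel run can
# use up to 16 unique colors at the cost of 4x the waitvids.
run = []
for colors in (0x01_02_03_04, 0x05_06_07_08, 0x09_0A_0B_0C, 0x0D_0E_0F_10):
    run += waitvid_direct(colors)
assert len(set(run)) == 16
```

The model glosses over real waitvid bit ordering and timing; it only demonstrates the color-count trade-off.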

But another drawback also reveals itself here: if we're firing waitvids to synchronize all of our color and sync signals, then we're going to have to reserve 2 of the 8 video generator pins for the syncs, leaving us with only 6 bits of color (64 colors). Right? If we were lazy, then yes, we could simply drive the entire VGA protocol from a single video generator VGroup and be done with it at a 64 color limit. But, we're better than that, and a solution isn't that complex.


8-bit HSync Calculations

Figuring It Out

The fundamental issue here is that in order to maintain deterministically correct VGA signal timing, every single waitvid must be either blocked or just-in-time. Every single waitvid spits out pixels for an amount of time we specify, and as the wait in waitvid implies, when the instruction is executed it will wait for the video generator to finish spitting out the previous pixels before it immediately starts spitting out the next. If the video generator isn't busy, then the waitvid will execute without delay. What this means is: as long as every waitvid we execute ends up being forced to wait (or ends up being executed at the exact last pixel clock from the previous waitvid), we will always be 100% certain of the timing of the various parts of the VGA signal. If we miss, and wait too long before firing a "next" waitvid, then our signal will be skewed by however much that delay is.

The inevitable significance of this fact is that in order to guarantee signal integrity, we will need to control all 8 colors and both syncs with waitvid on the same cog (side note: we could do it with waitcnts controlling the sync pins, but because the system clock (104 MHz) isn't an even multiple of the pixel clock (25.175 MHz), this solution would not be 100% deterministic). In comes an interesting quirk/feature of the video generators: you can change vcfg on the fly (i.e. during pixel display) and the changes will be immediately applied. This is in contrast to vscl, which is only latched at waitvid execution. What this means is that we can be tricky and change the VGroup controlled by the video generator during display, essentially switching output from the color pins to the sync pins and vice versa. Combine this with some appropriate outa control of the sync pins, and we have ourselves a 100% deterministic 8-bit color VGA signal.

Nanoseconds


Nanoseconds (xkcd)

There's a Relevant xkcd for Everything

It is in no way shape or form an exaggeration to say that the timing of this video system on the Propeller 1 comes down to single nanoseconds. Let's look at the numbers to find out why...

Our 640x480 @ 60 Hz VGA pixel clock is 25.175 MHz, which means we're displaying a pixel every ~40 nanoseconds, or a group of 4 every ~160 nanoseconds. Using our "direct" method of pixel output discussed above - displaying 4 at a time - we'll need to have a waitvid being blocked every 40*4=160 nanoseconds. Between each waitvid, we also need to perform a rdlong to retrieve the next 4 pixels from main RAM. We're ruling out using a djnz to loop through the instructions; instead, we generate all instructions into a monolithic region of scancode, as (you're about to see) we don't have time to perform a jump, and we have the space in cog RAM to generate the scancode. A worst-case waitvid takes 7 clock cycles from execution to pixels being pushed out of the video generator. A worst-case rdlong takes 23 cycles; however, because the intermediate waitvids are only 7 cycles, we're always hitting the best case of 8 cycles.

Assuming an 80 MHz core clock, where each clock cycle is 12.5 ns, our reading and printing routine takes (8+7) x 12.5 = 188 nanoseconds. That means we're blowing our 160 ns deadline! Our 2-instruction routine cannot be any more efficient, at least not without resorting to some nasty hacks that are difficult to understand and implement (a no-go for a project intended to be easily worked off of by all). With this execution routine already at its minimal cycle count, the only other possible way to meet our deadlines is to somehow execute the instructions faster. Fortunately, this is incredibly easy to implement - we can simply increase the core clock speed.
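The deadline math above can be sanity-checked in a few lines. This is just arithmetic on the figures quoted in the text (25.175 MHz pixel clock, 7-cycle waitvid, 8-cycle hub-synchronized rdlong), comparing candidate core clocks:

```python
# Back-of-the-envelope check of the scanline timing argument.
PIXEL_CLOCK_HZ = 25_175_000
CYCLES_PER_ITERATION = 8 + 7            # rdlong + waitvid, in clock cycles
PIXELS_PER_WAITVID = 4

deadline_ns = PIXELS_PER_WAITVID * 1e9 / PIXEL_CLOCK_HZ   # ~158.9 ns

for core_clock_hz in (80_000_000, 104_000_000):
    routine_ns = CYCLES_PER_ITERATION * 1e9 / core_clock_hz
    verdict = "meets" if routine_ns <= deadline_ns else "BLOWS"
    print(f"{core_clock_hz / 1e6:.0f} MHz: {routine_ns:.1f} ns per 4 pixels "
          f"-> {verdict} the {deadline_ns:.1f} ns deadline")
```

At 80 MHz the 15-cycle routine takes 187.5 ns and misses the ~158.9 ns deadline; at 104 MHz it takes ~144 ns and fits.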


P1 Clock Speed Range

Max P1 Clock Speed vs. Temperature

The standard, "most stable" combination of crystal oscillator and internal PLL results in the so-far assumed system clock of 80 MHz. However, the P1 has been proven to be stable at clock speeds as high as 120 MHz (see diagram above)! For the sake of guaranteed stability (and based on community recommendations), we can choose an ideal clock speed of 104 MHz. At this clock rate, our 15-cycle routine takes roughly 144 ns, giving us more than enough time to read and display each long within the ~160 ns deadline.

Bytes

Concerns about the size of our data, and thus our ability to sensibly store it alongside the remaining resources necessary for the system's operation, manifest in a few places: the scanline buffer in each render cog, the main RAM scanline buffer they write to and the display cog reads from, the tile map which represents the screen area, and the scancode buffer which the display cog uses to read the main RAM scanline buffer longs and display them. As discussed in Video Part 3, given a 640x480 screen with a 1:1 tile map, we'd need over 9.5 kB of main RAM to store a single tile map. Taking up nearly a third of our 32 kB of memory for a single map is not optimal. In the render cogs, we need to generate 640/4=160 longs, so we'd have to allocate roughly a third of cog RAM to that buffer. This is mirrored in main RAM, where it's less of a problem. As for the scancode, we would need 160*2=320 longs of buffer, or almost 2/3rds of cog RAM. That is pretty untenable.
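These figures can be reproduced with some quick arithmetic. A 2-byte tile map entry and a 496-long usable cog RAM are assumptions here (consistent with the sizes quoted above), not values stated in this section:

```python
# Rough sizes behind the memory concerns above, assuming an 8x8-pixel
# tile grid over a 640x480 screen and 2 bytes per tile map entry.

TILES_X, TILES_Y = 640 // 8, 480 // 8          # 80 x 60 tiles
tile_map_bytes = TILES_X * TILES_Y * 2         # ~9.6 kB of 32 kB main RAM

scanline_longs = 640 // 4                      # 160 longs per scanline buffer
scancode_longs = scanline_longs * 2            # rdlong + waitvid per 4 pixels

COG_RAM_LONGS = 496                            # 512 longs minus special regs
print(f"tile map: {tile_map_bytes} bytes")
print(f"scanline buffer: {scanline_longs} longs "
      f"({scanline_longs / COG_RAM_LONGS:.0%} of cog RAM)")
print(f"scancode: {scancode_longs} longs "
      f"({scancode_longs / COG_RAM_LONGS:.0%} of cog RAM)")
```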

The solution to this problem isn't found in adding more indirection or increasing anything like clock speed, but in reducing what isn't needed. Classic arcade games were far lower resolution than anything we see today. Horizontal and vertical resolutions of less than 400 and 300 were the norm, viewable on CGA-compatible 15 kHz monitors. What this means for us is we can both drastically reduce our memory footprint AND develop a more faithful classic arcade graphics system by using upscaling to fit a smaller graphics system output to our modern screen resolution. Upscaling is exactly what it sounds like: scaling an image up from a lower resolution to a higher one. This involves duplicating or stretching pixels to fill a larger visible area with the same amount of unique data. By leveraging upscaling, we can render data at 320x240 (1/2 resolution) while displaying it at full 640x480.

What's more, implementing this modification is trivial thanks to the nature of the P1 video generator. As discussed previously in Video Part 2, we can "stretch" pixels in memory over multiple physical pixels on the screen by changing vscl. So, to implement 2x upscaling in the horizontal dimension, we simply modify vscl to display each pixel on the wire for 2 physical pixels' worth of time, displaying them twice as wide. In the vertical dimension, 2x upscaling is as simple as only incrementing the scanline pointer to the next scanline every other render, therefore displaying each line twice. Just like that, we have an image 2x upscaled to 640x480 while cutting our render cog buffers in half, our display cog scancode buffer in half, and our tile map footprint by a factor of 4.
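The savings claimed at the end of that paragraph check out arithmetically. A small sketch (2 bytes per tile map entry assumed, as before) comparing full-resolution and half-resolution footprints:

```python
# Effect of 2x upscaling on the buffers tallied above: rendering at
# 320x240 and stretching to 640x480 via vscl (horizontally) and by
# repeating each scanline (vertically).

def footprint(width, height):
    tiles = (width // 8) * (height // 8)
    return {
        "tile_map_bytes": tiles * 2,           # 2 bytes/entry (assumed)
        "scanline_longs": width // 4,
        "scancode_longs": (width // 4) * 2,
    }

full = footprint(640, 480)
half = footprint(320, 240)

assert half["scanline_longs"] * 2 == full["scanline_longs"]   # buffers halved
assert half["scancode_longs"] * 2 == full["scancode_longs"]   # scancode halved
assert half["tile_map_bytes"] * 4 == full["tile_map_bytes"]   # tile map / 4
```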

Propellers


Multi-Prop Setup

Example Multi-Prop Setup

The final question remains: how do we set up our display and render cogs? All of the previously analyzed factors come into play to answer this question, and then some. First, let's define a graphics baseline:

Graphics Baseline

    Tiles Per Scanline             40 tiles
    Sprite Attribute Table Size    64 entries
    Sprites Per Line               16 sprites

These values - chosen partly based on features of the NES - are the most critical in determining how much time it takes to render a single scanline, with some extra emphasis on sprites-per-scanline in order to support more complex games. After implementing the basic tile/sprite graphics system described in Video Part 3, it was determined through timing analysis that a minimum of 5 render cogs was necessary to support this baseline without blowing our render deadline (still generating pixel data when the display cog needs it). Great, then we can run 7 cogs for rendering and display and then cram the input and sound system and game and everything else into the last cog and call it a day!


Tile/Sprite System in Action

Yeah, no. With the P1's shedding of interrupts and adoption of multiprocessing comes the difficulty of running multiple time-sensitive processes in a single cog. There's no way - without creating an absolute abomination of spaghetti code - to poll inputs at the right time, play sounds at the right time (which themselves depend on precise timing to produce accurate PWM signals), and around all of that execute game code which has to sequence these subroutines seamlessly, all within the same cog. So within the constraints of the hardware as we have defined it thus far, something would have to give to make our system viable: reducing our graphics fidelity, removing sound altogether, or throwing out our serialized input system and hard-wiring all inputs. Needless to say, none of these things are going to happen. Yes, we could take a hit in graphics quality, but honestly at some point the result simply becomes underwhelming. This means, inevitably, we need more hardware.

As it turns out, this problem of handling so many subsystems in (relatively) under-powered semiconductor space is an old and solved one by none other than the NES itself: splitting the workload between two processors. The idea is simple: a primary CPU runs our game, input, and sound code while a secondary GPU runs our graphics code. The video data required for each frame is sent from the CPU to the GPU during the vertical sync period, and is used in the rendering and display of each frame. By offloading the graphics work to the GPU, we can also add another cog to the render pool to increase the amount of processing we can do.


Single-Ended NCO Counter Mode

In this scenario, we now have to deal with transmitting data from one P1 to the other. Fortunately, there are myriad ways to skin the cat of high-speed serial data transmission. The simplest of these loads a long to be transmitted and, using the phsx register of a counter in single-ended NCO mode, rotates the long to place each bit on the wire every 4 clock cycles (with the reception side taking longer, however, due to the need to both poll AND store the bits). Thanks to the work of Marko Lukat, though, we have an even faster method which utilizes a counter module on the reception side as well, both polling AND storing the data in the same instruction (details and the original source of his implementation can be found here). With his solution, we can achieve transmission speeds approaching 26 Mbps.
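The ~26 Mbps ceiling falls straight out of the clocking described above (one bit every 4 system clock cycles at 104 MHz):

```python
# Maximum serial bit rate: one bit on the wire every 4 system clocks.
SYS_CLOCK_HZ = 104_000_000
CLOCKS_PER_BIT = 4

bitrate_bps = SYS_CLOCK_HZ / CLOCKS_PER_BIT
print(f"{bitrate_bps / 1e6:.0f} Mbps")
```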

But we have a lot of data: between the tile map (2.4 KB), tile palettes (up to 8.192 KB), color palettes (~0.5 KB), and sprite palettes (also up to 8.192 KB), we would have almost 20 KB of data to transmit if we sent it all. Additionally, all of this data would need to be sent during a blanking period on the screen, specifically the longest one: the vertical sync period. If there are 6 render cogs, then we would also need to start rendering at least 6 lines before active video starts to give them time to generate. Based on the timing of the 640x480 VGA standard, this means we would have 0.318 ms + 0.064 ms + 1.049 ms - (31.778 μs * 6) ≈ 1.239 ms (vertical front porch + sync pulse + back porch - 6 lines of head start) to transmit all the data necessary. At our maximum transmission speed of 26 Mbps, it would take 5 times that long to transmit our 20 KB of data! This simply won't do.
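That budget calculation can be verified directly. The line period and blanking line counts below follow the standard 640x480@60 VGA timing (31.778 μs per line; 10-line front porch, 2-line sync, 33-line back porch):

```python
# Vertical blanking budget vs. transmit time for the full ~20 KB payload.
LINE_US = 31.777557100298
RENDER_HEAD_START_LINES = 6            # one line per render cog

budget_us = (10 + 2 + 33 - RENDER_HEAD_START_LINES) * LINE_US
print(f"budget: {budget_us / 1000:.3f} ms")          # ~1.239 ms

payload_bits = 20 * 1024 * 8           # ~20 KB of video data
transmit_us = payload_bits / 26e6 * 1e6
print(f"transmit: {transmit_us / 1000:.3f} ms "
      f"(~{transmit_us / budget_us:.1f}x over budget)")
```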

Fortunately, we can mitigate this greatly by realizing that we will never need to modify the tile and sprite palettes. These elements always remain the same, so they don't need to be refreshed or reloaded every frame, and we can therefore store them statically on the GPU. The color palettes we may modify on the fly to achieve some advanced animations, so we still want to send those every frame. The tile map and SAT need to be sent every frame as they are dynamic structures. But by removing the sprite and tile palettes, we cut ~16.5 KB of data from our stream. This results in a payload of only ~3 KB, which can be transmitted in time at speeds as low as ~20 Mbps, well within our transfer system specs.
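A quick check of the reduced payload, using the rounded figures from the text; the 4-byte SAT entry size is an assumption for illustration, not a figure from this section:

```python
# Per-frame payload with tile/sprite palettes kept statically on the GPU.
payload_bytes = 2400 + 512 + 64 * 4    # tile map + color palettes + SAT
payload_bits = payload_bytes * 8       # (4-byte SAT entries assumed)

budget_s = 1.239e-3                    # vertical sync budget from the text
min_bps = payload_bits / budget_s
print(f"~{payload_bytes / 1024:.1f} KB -> needs ~{min_bps / 1e6:.1f} Mbps")
```

Roughly 3 KB needs about 20 Mbps, comfortably under the ~26 Mbps link ceiling.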

Conclusion

These past 4 sections have covered an immense amount of ground: how video is represented on the screen electronically in terms of various output standards, how it can be generated with our Propeller microcontroller, how high-fidelity graphics can be generated and stored with limited memory available, and how we can display our graphics with similar constraints. All of these considerations and more take inspiration for their solutions from the extensive history of working with limited hardware to create robust effects. Incorporating these paradigms into our project results in a graphics system with capabilities even greater than the NES and arcade systems of the time.