LFSR CPU running Forth

Author: Richard James Howe
License: 0BSD / Public Domain
Email: mailto:howe.r.j.89@gmail.com
Repo: https://github.com/howerj/lfsr-vhdl

This project contains a CPU written in VHDL for an FPGA using a Linear Feedback Shift Register (LFSR) instead of a Program Counter, this was sometimes done to save space as a LFSR requires fewer gates than an adder, however on an FPGA it will make very little difference as the units that make an FPGA (Slices/Configurable Logic Blocks) have carry chains in them. The saving would perhaps be more apparent if making the system out of 7400 series ICs, or if laying transistors out by hand.

See https://github.com/howerj/lfsr for more information.

The system contains a fully working Forth interpreter.

The project currently works in simulation (it outputs the startup message "eForth 3.3" with a new line) and accepts input (try typing "words" when the simulation in GHDL is running).

Here is an example session of the simulator running:

And the VHDL running under GHDL for a limited number of cycles:

CPU Resource Utilization

The system runs can run at 151.768MHz (on a Spartan-6) according to the timing report. The CPU itself is quite small, here is a cut-down report on the resources consumed for commit 7dc4c9b7e03082364b09540bb2d97105d2858d0b:

+------------------------------------------------------------+
| Module     | Slices | Slice Reg | LUTs  | BRAM/FIFO | BUFG | 
+------------------------------------------------------------+
| top/       | 3/53   | 9/105     | 1/151 | 0/8       | 1/1  | 
| +system    | 0/27   | 0/47      | 0/85  | 0/8       | 0/0  | 
| ++bram     | 0/0    | 0/0       | 0/0   | 8/8       | 0/0  | 
| ++cpu      | 27/27  | 47/47     | 85/85 | 0/0       | 0/0  | 
| +uart_rx_0 | 12/12  | 24/24     | 39/39 | 0/0       | 0/0  | 
| +uart_tx_0 | 11/11  | 25/25     | 26/26 | 0/0       | 0/0  | 
+------------------------------------------------------------+
No LUTRAM/BUFIO/DSP48A1/BUFR/DCM/PLL_ADV used

The above is indicative only as the actual resources used may vary from commit to commit, but also because of the tool chain and FPGA targeted.

Even given that though it is clear that this system is small. The CPU only occupies 27 slices, which is only a little larger than the bit-serial CPU available at https://github.com/howerj/bit-serial, and this 16-bit CPU is much faster.

Building

To build you will need make. To make the C Virtual Machine you will need a C compiler of your choosing.

To run the C VM type:

make run

You should be greeted by the message eforth 3.3. Type bye and hit ENTER to quit (CTRL-D will not work, this is not a bug). Type words for a list of defined Forth words. An example session:

: ahoy cr ." GOODBYE, CRUEL WORLD!" cr ;
ahoy
2 2 + . cr
bye

This is not a Forth tutorial. For a Forth tutorial look elsewhere. Try "the internet". I am sure they have something.

Making the simulation requires GHDL:

make simulation

There is a configuration file for the simulation in the file tb.cfg that allows changing many test bench parameters without recompiling the test bench. GHDL also allows you to set generics via the command line, which is done in the makefile. For example make simulation DEBUG=2 sets the top level debug generic to 2, bear in mind when changing the generics in tb.vhd that the makefile can override these constants!

To build for an FPGA you will need Xilinx ISE 14.7:

make synthesis implementation bitfile

The system has not been tested on an FPGA at the moment. If you have any luck, let me know.

Instruction Set and System Design

A short summary of the instruction set and design:

The machine is an accumulator based machine.
All instructions are 16-bits in length.
The top four bits determine the instruction.
The lowest four bits are the operand parameter.
Addresses are in number of 16-bit cells, not bytes.
The program counter is 8-bits in size and is advanced with a LFSR.
The top most bit determined whether the operand is used directly or whether or it loaded first.

The instruction set has been carefully chosen so that it should be very simple to implement in a traditional manner, or in a bit-serial fashion (much like my other CPU project at https://github.com/howerj/bit-serial that runs on an FPGA). It has also been designed so that it should be possible to implement in 7400 series logic ICs (excluding ROM and RAM) as well as on an FPGA. There may be no savings in logic on an FPGA due to the fact that slices have a built in carry chain (usually), but a comparison could be made when the system is implemented on an FPGA (This has been done, there is very little difference).

As the Program Counter does not use addition, it was decided that addition should be removed from the instruction set. It would be a bit weird to worry about all the gates the Program Counter is using and then just include an adder elsewhere. The ADD instruction is sorely missed and makes the Forth code more complex and slows it down. The only other instruction that feels like it should be present (to me) is bit-wise OR, in practice it is not used that much within the interpreter so it not missed, it is reimplemented using bitwise AND and INVERT (XOR against all bits set).

The instruction layout is:

+----------+---------------+-------------------+
|  BIT 15  | BITS 12 to 14 |    BITS 0 to 11   |
+----------+---------------+-------------------+
| INDIRECT | INSTRUCTION   | VALUE   / ADDRESS |
+----------+---------------+-------------------+

The INDIRECT flag determined whether bits 0 to 11 are treated as a value (not set) or an address (INDIRECT is set).

The INSTRUCTION field is 3-bits in size, the instructions are:

0 : XOR   : ACC = ACC ^ ARG
1 : AND   : ACC = ACC & ARG
2 : LLS1  : ACC = ARG << 1
3 : LRS1  : ACC = ARG >> 1
4 : LOAD  : ACC = MEM[ARG]
5 : STORE : MEM[ARG] = ACC
6 : JUMP  : PC = ARG
7 : JUMPZ : IF (ACC == 0) { PC = ARG }

All instructions advance the program counter except JUMP, and (conditionally) JUMPZ. All instruction affect or use the accumulator except the jump instructions. MEM consists of a linear array of 16-bit values.

There is one special address, address 0. This address is never incremented after an instruction is run as this is a lock up state for the LFSR, the current instruction will be executed indefinitely. A way to exit this condition is for the first instruction to be a jump to address 1, however any non-zero address will do. The system starts up executing from address zero. Conditional jumps could be used to determine whether to reset or halt the system if needed. Alternatively a LFSR that used XNOR could have been used (the lockup state for which is all ones) but it was not.

ACC is the 16-bit accumulator.

ARG is either the zero extended 12-bit operand (the lowest 12-bits of the instruction) as is, or the 16-bit value MEM[operand] if the topmost bit of the instruction is set.

Input and Output is memory mapped, reading from a negative address (high bit set) causes a byte to be read or output. This is triggered from reading or writing to any negative address, if multiple peripherals are to be added the address will have to be decoded correctly.

For the purposes of simulation JUMP will cause the CPU to halt if the jump address is the same as the program counter. This will not be implemented in hardware.

The program counter uses a 8-bit LFSR to advance, that means only 256 16-bit values can be directly addressed by this CPU, this is not a limitation that matters for the purpose of the software running on this system, a complete Forth Programming Language image, as the Virtual Machine that supports Forth can be written in under 256 instructions for this system. That Forth Virtual Machine can address more memory by using LOAD/STORE to access values outside the 256 instruction range.

This instruction set might change depending on the implementation, or when it comes to implementation, to make things easier and smaller still.

State machine diagrams

The following can be used to create a state-machine diagram with Graphviz, this can be done online, see https://dreampuf.github.io/GraphvizOnline/:

digraph LfsrCpu {
	fetch -> fetch [label = "pause = 1\nor next instr"];
	fetch -> indirect;
	fetch -> load;
	fetch -> store;
	indirect -> fetch;
	indirect -> load;
	indirect -> store;
	load -> load [label = "ihav = 0\nand input"];
	load -> next;
	store -> next;
	store -> store [label = "obsy = 1\nand output"];
	next -> fetch;
}

Alternatively, here is a veritable feast of a PNG file for both your lovely eye balls to take in and gaze upon with wonder, or to just look at, whatever:

Tracing

Edit tb.cfg and ensure that the number of clocks to run for is over 90000000 (or equal to), and that we are running in interactive mode, the following configuration file should work:

Clocks.. # Number of clock cycles to run for
90000000
Forever. # If greater than zero run forever
0
Interact # Is this session interactive?
2
InWaitMs # Time to wait before getting user input
10
UartRep. # UART reporting level, if non zero `report` UART chars
1
LogFor.. # Extra debug logging for this many cycles
100
1Line... # Exit after getting one line of input if non zero
1
UChDelay # Delay (ms) between sending UART characters
1
CRLF.EOF # Line ending CRLF (1) or LF (0)?
0

Then run the command:

echo "bye" | make simulation DEBUG=2 > vhdl.txt

And then:

echo "bye" | DEBUG=1 ./lfsr lfsr.hex 2> c.txt

And compare with:

vimdiff vhdl.txt c.txt

You will notice differences, however they should be minor, obvious, and mostly related to VHDL report messages. The instructions executed should be identical so long as the input is identical, to make sure this is the case the line endings also need to be made to be identical for both inputs.

Empirically the system averages 2.4 clock cycles (or 2.4 state changes) to execute a single instruction, this was measured by executing the default Forth image and feeding it a line of text with the quit/halt word "bye" in it. This was calculated from the following table, which contains the number of times the instruction has been encountered, whether the instruction is the indirect variant (marked with an 'i'), the instruction, the number of cycles needed to execute that instruction and the cycles multiplied by instruction cycle length.

+--------+--------+----------+---------+---------+
| cycles | indir? | instr?   | cyc/ins |         |
+--------+--------+----------+---------+---------+
| 369492 |        |  a_jmp   | 1       | 369492  |
| 241070 |        |  a_jmpz  | 1       | 241070  |
| 880098 |        |  a_load  | 3       | 2640294 |
| 1450   |        |  a_lsr1  | 1       | 1450    |
| 880388 |        |  a_store | 3       | 2641164 |
| 206102 | i      |  a_and   | 2       | 412204  |
| 65755  | i      |  a_jmp   | 2       | 131510  |
| 32042  | i      |  a_jmpz  | 2       | 64084   |
| 53057  | i      |  a_load  | 4       | 212228  |
| 173287 | i      |  a_lsl1  | 2       | 346574  |
| 17053  | i      |  a_store | 4       | 68212   |
| 173513 | i      |  a_xor   | 2       | 347026  |
+--------+--------+----------+---------+---------+
Total Instructions executed: 3093307 
Total Instruction cycles: 7475308

To Do / Future Directions

Improve the documentation; the addition of timing diagrams, terminal capture of the simulation running, state-machine diagrams and the like would all help. Much of the documentation from the sister project could be incorporated into this one.
Optimize the CPU core:
- The Input/Output system could be reworked, making I/O truly memory mapped.
See https://github.com/howerj/lfsr for more information and suggestions.
The number of these devices that could fit on one device could be quite large, limited perhaps by the number of Block RAMs available. Modifications would have to be made to program and interact with a matrix of these CPUs, but it might be a neat thing to do. They could take their input from the same UART and the same starting commands or boot image, synchronization could be partially achieved with pause, ihav and obsy signals. Each CPU would need an identifier assigned to it. It might be a good idea to have a way for each CPU to talk to its immediate neighbors. Most of these problems could be solved with a adjustments to the I/O system and the addition of peripherals and pseudo peripherals.
The lfsr.vhd file is quite configurable with many generics, the tool-chain could be made to generate Forth images that deal with the variants.
A Dual-Port version of the CPU would be faster, although require more resources, this CPU in its current form can share one half a Dual Port Block RAM allowing very fast memory mapped I/O.
The eForth image could be turned into ROM sections (the Forth Virtual Machine and the initial Forth dictionary could go here) and sections that can be stored in RAM. This ROM/RAM version would also execute faster, provided only the Forth VM was to be stored in ROM.
A bit-serial version of this CPU could be made, it might be smaller, it would certainly be slower.
A simulation written in VHDL using components based off of real 7400 series devices could be made. This should make it easier to build a physical version of this system out of 7400 series ICs, and will most likely be faster than other digital simulators. This would involve making components for each 7400 series IC we would want to use, sticking to the common ones (NAND gates, counters, multiplexors, flip flops, comparators, decoders) would be best, as some of the more exotic devices (such as ALUs, adders, ...) can be harder to find.
The article https://www.fpgarelated.com/showarticle/797.php, or Inside the Spartan-6: Using LUTs to optimize circuits by Victor Yurkovsky, publish June 24, 2015, contains information about optimizing LUT usage by explicitly instantiating and placing LUTs and other FPGA logic. If we really want to shrink this system, we could do the same here. There is also an interesting article about outputting VGA signals in only 7 slices: https://www.fpgarelated.com/showarticle/42.php.
The instruction set could be played around with, there are some really interesting instructions such as execute this register as an instruction (which is appropriate for a register machine and not an accumulator machine, adding a full rotate and removing the shift left/right by one instructions, or perhaps my favorite, using a Lookup Table (LUT) to determine what the output should be for two inputs which would allow all two input logic gates to be implemented, see https://arnaud-carre.github.io/2024-10-06-vpternlogd/ for an example of a three input version. Setting the LUT would require setting a 4-bit register for a 2-input LUT, which could be tacked on to the JUMP instruction as we have bits to spare (we only use 8-bits).
LFSR, and the related Cyclic Redundancy Checks (CRC), have many uses, Pseudo Random Number Generation (PRNG), test vector generation and checking, Built In Self Tests (BIST), and more. Using LFSR for test vector generation, in both a test bench and in hardware, would allow us to query the CPU with many test patterns and efficiently compare this to a result, in hardware this could be part of a BIST peripheral that sets a pin high on success.
This project would be a good candidate for the Tiny Tapeout project, amongst my others projects. Tiny Tapeout allows designers to cheaply make their own physical silicon. The interface would be a problem and would require shifting data in and out N-bits at a time as there are a limited number of pins available to interface with external memory (both RAM and ROM ICs would be required). This would mean we would need to be able to pause the CPU in any state, or other modifications would need to be made. Perhaps https://github.com/howerj/bit-serial would make a better candidate?

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
.gitignore		.gitignore
LICENSE		LICENSE
lfsr-0.gif		lfsr-0.gif
lfsr-1.gif		lfsr-1.gif
lfsr-sm.png		lfsr-sm.png
lfsr.c		lfsr.c
lfsr.hex		lfsr.hex
lfsr.vhd		lfsr.vhd
makefile		makefile
readme.md		readme.md
signals.tcl		signals.tcl
system.vhd		system.vhd
tb.cfg		tb.cfg
tb.vhd		tb.vhd
top.ucf		top.ucf
top.vhd		top.vhd
uart.vhd		uart.vhd
util.vhd		util.vhd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LFSR CPU running Forth

CPU Resource Utilization

Building

Instruction Set and System Design

State machine diagrams

Tracing

To Do / Future Directions

References

About

Releases

Packages

Languages

License

howerj/lfsr-vhdl

Folders and files

Latest commit

History

Repository files navigation

LFSR CPU running Forth

CPU Resource Utilization

Building

Instruction Set and System Design

State machine diagrams

Tracing

To Do / Future Directions

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages