65HE06

Prototype implementation of a pipelined 16 bit accumulator CPU, inspired by the 6502. HE means Half-word Extended, with a 32 bit word.

Design

Goals

No addressing mode of the original 6502 is left behind, but improving modes or collapsing several of them into one is allowed
No new registers: only registers already implemented in the original 6502 or later in the 65816 are allowed. Enhancing or expanding them is allowed
No support for self-modifying code or BCD arithmetic
Support for both 16 bit words and 8 bit bytes
Opcodes fit in 16 bits
Arguments fit in 16 bits
The final core is synthesizable and written in Verilog
Aim for fewer than 2 clocks per instruction

Instruction Encoding

5 bit opcode
3 bit accumulator
1 bit save flags
7 bits left for the family-specific encoding

Instruction Family	Fixed Encoding	Specific encoding
Immediate Operand	fffffaaas	0010000
Register Operand	fffffaaas	0000rrr
Indexed Operand	fffffaaas	w10jj00
Indirect Indexed Operand	fffffaaas	w11jjyy
Predicated Add Register	11101aaas	cccnrrr
Predicated Add Immediate	11111aaas	cccnrrr

R(3) is the source register; the flow is OP(A, R) -> A
W(1) is the width: 0 selects a word, 1 selects a byte
C(3) is the index of the tested bit among the first 8 bits of the Status Flags
N(1) is the value the selected bit must have for the predicate to hold
J(2) is the index register for the indexed and reindexed (indirect indexed) modes
Y(2) is the post-index register for the reindexed (indirect indexed) mode
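
The field layout above can be exercised with a small decoder. The following is a minimal sketch in Python, not the Verilog decoder of this repository. It assumes that the encoding strings in the table are written most significant bit first, so the fixed part (fffffaaas) occupies the upper 9 bits of the word and the family-specific part the lower 7. The bit ordering, the function name decode and the dictionary keys are illustrative assumptions.

```python
# Sketch of a 65HE06 instruction-word decoder.
# ASSUMPTION: bit 15 is the leftmost character of the encoding strings
# in the table (fffffaaas followed by the 7 family-specific bits).

PREDICATED = {0b11101: "predicated add register",
              0b11111: "predicated add immediate"}


def decode(word):
    """Split a 16-bit instruction word into its named fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")

    fields = {
        "opcode": (word >> 11) & 0x1F,    # fffff
        "acc": (word >> 8) & 0x7,         # aaa
        "save_flags": (word >> 7) & 0x1,  # s
    }
    low = word & 0x7F                     # family-specific 7 bits

    if fields["opcode"] in PREDICATED:    # cccnrrr
        fields.update(family=PREDICATED[fields["opcode"]],
                      c=(low >> 4) & 0x7, n=(low >> 3) & 0x1, r=low & 0x7)
    elif low == 0b0010000:                # immediate operand
        fields.update(family="immediate")
    elif (low >> 3) == 0b0000:            # 0000rrr
        fields.update(family="register", r=low & 0x7)
    elif (low >> 4) & 0x3 == 0b10 and low & 0x3 == 0:   # w10jj00
        fields.update(family="indexed", w=low >> 6, j=(low >> 2) & 0x3)
    elif (low >> 4) & 0x3 == 0b11:        # w11jjyy
        fields.update(family="indirect indexed", w=low >> 6,
                      j=(low >> 2) & 0x3, y=low & 0x3)
    else:
        fields.update(family="unassigned")
    return fields


# Opcode 00001, accumulator 000, save flags 1, family bits 1 11 10 01:
# a byte-wide indirect indexed access with J = 2 and post-index Y = 1.
print(decode(0b00001_000_1_1111001))
```

Under this reading, bits 5 and 4 of the family-specific part are enough to tell the four non-predicated families apart, which keeps the first decode step cheap.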

Register Set

4 Accumulators: A, B, SF, PC
4 Indexes: S, X, Y, Z
Z is conventionally 0

Remapping original addressing modes

#imm is extended to 16 bit
Abs is remapped as Z + k16
Abs, X is remapped as X + k16
Abs, Y is remapped as Y + k16
(Abs) is remapped as (Z + k16), Z
ZP is remapped as Z + k16
ZP, X is remapped as X + k16
ZP, Y is remapped as Y + k16
(ZP, X) is remapped as (X + k16), Z
(ZP), Y is remapped as (Z + k16), Y
(ZP) is remapped as (Z + k16), Z
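
The remapping above reduces every memory mode of the 6502 to two shapes: indexed (J + k16) and indirect indexed ((J + k16), Y). A minimal sketch of the resulting address calculation follows. It assumes 16-bit wraparound on every addition and reads the notation (J + k16), Y in the 6502 sense, i.e. a pointer is fetched at J + k16 and the post-index is added to it. The names REMAP, effective_address and read_pointer are hypothetical.

```python
MASK16 = 0xFFFF  # ASSUMPTION: addresses wrap around at 16 bits

# Original 6502 mode -> (index register J, post-index register or None),
# transcribed from the remapping list above.
REMAP = {
    "Abs": ("Z", None), "Abs,X": ("X", None), "Abs,Y": ("Y", None),
    "(Abs)": ("Z", "Z"),
    "ZP": ("Z", None), "ZP,X": ("X", None), "ZP,Y": ("Y", None),
    "(ZP,X)": ("X", "Z"), "(ZP),Y": ("Z", "Y"), "(ZP)": ("Z", "Z"),
}


def effective_address(mode, k16, regs, read_pointer):
    """Compute the operand address for a remapped 6502 addressing mode.

    regs maps register names to values; read_pointer(address) returns the
    16-bit pointer stored at that address (only used by indirect modes).
    """
    j, post = REMAP[mode]
    address = (regs[j] + k16) & MASK16
    if post is None:                      # indexed: J + k16
        return address
    # indirect indexed: (J + k16), Y
    return (read_pointer(address) + regs[post]) & MASK16


regs = {"S": 0x01FF, "X": 4, "Y": 2, "Z": 0}   # Z is conventionally 0
pointers = {0x0010: 0x2000}                    # toy pointer table
print(hex(effective_address("Abs,X", 0x1000, regs, pointers.__getitem__)))
print(hex(effective_address("(ZP),Y", 0x0010, regs, pointers.__getitem__)))
```

With Z held at 0, the absolute and zero-page modes become the same case as the indexed ones, which is why the list collapses so far.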

Implementation 1

5 stage variable length, multi cycle pipeline (Failure)

Stages

IF: instruction fetch; fetches 16 bits from memory
ID: instruction decode. Fetches the optional argument, issues operations and registers to the back end, and keeps track of busy registers (registers whose value is still being calculated). Prevents operations whose predicate failed from entering the back end. Reissues registers to the back end for the reindexed mode. Creates the uOP that flows through the pipeline. Stalls instructions if the PC needs to be updated, or if flags are required but currently busy.
AGU/ALU0: early exit for simple instructions, address calculation for memory operands. Multi stage
Load/Store Unit: loads and stores values from/to memory
ALU1: ALU for operations on memory values

Problems

ID does too much. It needs to keep an enormous amount of state, while the last three stages of the pipeline have almost none. IF does almost nothing.
IF fetches 16 bits per clock, but 32 bit instructions are common. It is pointless to pipeline the core to get at most 0.5 op/cycle on frequent operations. A simpler implementation with a single microcoded late stage that does everything is already capable of 0.5 IPC for the common opcodes (see my other repository, 6516). There is no point in wasting a ton of resources to gain nothing.
With registers checked at the ID stage, there is a guaranteed full pipeline flush every time a register is written, since 3 of the 8 registers are required to be not busy. Consider the C mem* functions: every implementation will be something like LD A, (S, src), Y; ST A, (S, dest), Y; ADD Y, #1. In that sequence everything stalls for multiple stages (1st: 1 IF + 3 ID + 2 AGU + 2 MEM; 2nd: 1 ID + 2 AGU + 1 MEM).
There is no writeback to memory, because writeback is impossible with this configuration. Most read-modify-write opcodes are not needed because their performance is bad (e.g. "LSR B, mem; ADD A, B; ST B, mem;" is faster than "LSR mem; ADD A, mem;"), but atomic operations are popular and register spills come at a great cost.
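
The fetch-bandwidth problem in the list above is plain arithmetic and can be checked in a few lines. This is a sketch under two assumptions: fetch is the only bottleneck, and the core issues at most one instruction per clock. The function name fetch_bound_ipc is hypothetical.

```python
from fractions import Fraction


def fetch_bound_ipc(fetch_bits_per_clock, instruction_bits):
    """Upper bound on instructions per clock imposed by fetch width alone.

    ASSUMPTION: single-issue core, so the bound never exceeds 1.
    """
    return min(Fraction(1), Fraction(fetch_bits_per_clock, instruction_bits))


# 16 bit opcode + 16 bit argument fetched 16 bits at a time
print(fetch_bound_ipc(16, 32))   # 1/2
# the same instruction with two 16 bit words fetched per clock
print(fetch_bound_ipc(32, 32))   # 1
```

No amount of back-end pipelining lifts the first bound, which is the motivation for the wider fetch discussed in the conclusion.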

Example pipeline run with this configuration: a memory transfer between two arrays, with the pointers on the stack and the index in Y.

OP	1	2	3	4	5	6	7	8	9	10	11	12	13	14
LD A, (S, src), Y	IF	ID	ID	ID ALU0	ALU0	ALU0	MEM	MEM	ALU1
ST A, (S, dest), Y	-	-	IF	IF	ID	ID	ID	ID	ID	ID	ID ALU0	ALU0	ALU0	MEM

A single memory transfer in 14 clock cycles is something that even the original 6502 could do, and without the two channels required here. Indirect indexed mode cannot be so slow that it is useless. Since the register set is limited and memory accesses are frequent, loads and stores must not stall the pipeline.

Conclusion

The pipeline needs to be redesigned from scratch. Either a register renaming scheme is required, or the register file needs to be placed near the execution units and the busy/not busy check deferred or avoided. IF is required to implement its own ALU and requires a single port to handle predicated instructions that store to PC. A prefetch is needed to push two 16 bit words per clock into IF. Model of the next prototype (prefetch is considered external for the moment):

IF: fetches two 16 bit words, then sends a complete instruction to ID
ID: transforms a 16 bit opcode into a 32 bit one, expanding all fields, and presents it to the execution unit when needed. ID also issues the expanded opcode back to IF to let it handle predication
ALU: fetches the expanded opcode, selects inputs and calculates the output; the destination is itself or memory. Keeps the list of busy registers. The WB pseudo operation is when memory writes to the register file while the ALU is not writing back to the register file (e.g. it is writing to the address)
MEM: fetches address and data from the ALU, pushes the result back to the register file

Proposed Schedule for the new pipeline

OP	1	2	3	4	5	6	7	8	9
LD A, (S, src), Y	IF	ID	ALU	MEM	ALU	MEM	ALU
ST A, (S, dest), Y	-	IF	ID	ALU	MEM	WB	-	ALU	MEM

Implementation 2

4/8 Stage Multi cycle interleaved Micro-Executed Pipeline

Given the aforementioned requirements and conclusions, a new pipeline is being developed. The interleaved execution of memory operations simplifies state tracking and improves performance by delaying the stall until the last possible moment. The new pipeline is composed of:

IF: fetches two 16 bit words and feeds them to ID
ID: uses the opcode to generate at most 3 uOps per cycle and feeds them to the Microcore Reservation Stations (A & B)
Microcore SCHED: selects the next uOp and loads it into the uOp register. uOps that require busy registers wait. Note that the Main RS doesn't check for busy registers
Microcore ALU: executes the selected uOp and starts a memory cycle if required
Microcore MEM: executes writes, executes loads and saves the result in the temporary register of the corresponding Reservation Station (register TA for RSA, register TB for RSB)

The Microcore repeats stages 3-5 until the execution is complete. During main memory cycles, the otherwise wasted cycle is used to perform useful ALU operations. A new operation is loaded when needed. During stalls, ID feeds a NOP.
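
The SCHED behaviour described above (take the next uOp, but wait while it needs a busy register) can be modelled as a toy in-order issue step. This is only a sketch of the idea, not the logic of the Verilog core. The uOp tuple layout (name, source registers, destination register), the uOp names and the function sched_step are all hypothetical, and strict in-order issue from a single reservation station is an assumption.

```python
from collections import deque


def sched_step(station, busy):
    """Return the uOp issued this cycle, or None when the stage stalls.

    station: deque of (name, sources, dest) tuples in program order.
    busy:    set of register names whose value is still being computed.
    ASSUMPTION: only the oldest uOp is considered (in-order issue).
    """
    if not station:
        return None                  # nothing to issue, ID would feed a NOP
    sources = station[0][1]
    if busy & set(sources):
        return None                  # a source register is busy: wait
    return station.popleft()         # load it into the uOp register


# Hypothetical uOps for LD A, (S, src), Y on reservation station A
rsa = deque([("load_ptr", ("S",), "TA"),
             ("add_index", ("TA", "Y"), "TA")])
busy = {"Y"}                         # e.g. Y is still being updated

print(sched_step(rsa, busy))         # pointer load does not need Y
print(sched_step(rsa, busy))         # stalls until Y is ready
busy.discard("Y")
print(sched_step(rsa, busy))         # now the index add can issue
```

The point of the model is that the stall happens only at the uOp that actually reads the busy register, not at decode time as in Implementation 1.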

OP	1	2	3	4	5	6	7	8	9	10	11
SUB? Y, #1	IF	ID	SCHED	ALU
LD A, (S, src), Y	-	IF	ID	SCHED	ALU	MEM	ALU	MEM	ALU
ST A, (S, dest), Y	-	-	IF	ID	SCHED	ALU	MEM	SCHED	SCHED	ALU	MEM
BNE SUB	-	-	-	IF	ID	ID	ID	ID	ID	SCHED	ALU

Performance so far

A 100 byte transfer using indirect indexed mode completes in 800 clocks; using direct indexed mode it requires only 600 clocks. Current performance is two to three times that of the original 6502, so the large implementation did improve performance in a meaningful way.
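
The figures above translate into clocks per byte as follows. The 6502 side of the comparison is an assumption: it uses the documented NMOS 6502 cycle counts for a textbook copy loop (load, store, decrement index, taken branch) with no page crossings, which may not be the exact loop the measurement was compared against. Only clock counts are compared, not clock frequencies.

```python
BYTES = 100

# Measured totals quoted above
he06_clocks = {"indirect indexed": 800, "direct indexed": 600}

# ASSUMPTION: textbook 6502 copy loops, documented cycle counts,
# no page crossing, branch taken.
loop_6502 = {
    "indirect indexed": 5 + 6 + 2 + 3,  # LDA (zp),Y; STA (zp),Y; DEY; BNE
    "direct indexed": 4 + 5 + 2 + 3,    # LDA abs,X; STA abs,X; DEX; BNE
}


def clocks_per_byte(total_clocks, n_bytes=BYTES):
    """Average clocks spent per transferred byte."""
    return total_clocks / n_bytes


def speedup(mode):
    """6502 cycles per byte divided by 65HE06 clocks per byte."""
    return loop_6502[mode] / clocks_per_byte(he06_clocks[mode])


for mode in he06_clocks:
    print(mode, clocks_per_byte(he06_clocks[mode]), round(speedup(mode), 2))
```

Under these assumptions the ratio comes out at about 2 for indirect indexed and a little above 2 for direct indexed, in line with the estimate in the paragraph above.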

TODO

Evaluate further code speedup.
