v80.c: v80 assembler in c89 #13

Merged: 14 commits, Sep 19, 2024


gvvaughan
Collaborator

started work on #4

@gvvaughan
Collaborator Author

@Kroc please feel free to scribble any feedback or suggestions all over this PR, it's far from ready to merge at the moment!

@Kroc marked this pull request as draft on August 7, 2024, 20:52
@gvvaughan
Collaborator Author

gvvaughan commented Aug 7, 2024

  • Next task is to rewrite the grammar comment to be line oriented to see if I can reduce the amount of lexical book-keeping compared to the token based grammar I've half implemented so far...

@gvvaughan
Collaborator Author

Pasting my questions and @Kroc's answers here for easy reference:

  1. I have a 32 byte static token buffer for everything right now (label names, const names, numbers etc) to help enforce the token length limit, but presumably we want to handle strings of arbitrary length?
    There are strings in v80 and they are 'arbitrary' in length, but line-length in v80 is hard capped at 127 cols to limit memory usage on 8-bit systems and the C implementation should enforce this too so that source code written on PC will assemble on Z80.

When v80 encounters a string, it simply writes the bytes to the code-segment one by one so the string is never stored anywhere whole -- with one exception: the file-name of an include .i statement is captured whole, but because CP/M doesn't have subfolders, the length of this is known to be limited. At the moment expressions are not allowed in include file-names, but this might be supported in the future.
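
The 127-column cap described above could be enforced with a check along these lines. This is only a minimal sketch: `line_length_checked` and `V80_MAX_COLS` are hypothetical names for illustration, not identifiers from v80.c.

```c
#include <stddef.h>

#define V80_MAX_COLS 127  /* hard cap stated in the answer above */

/* Hypothetical helper: returns the length of the line starting at `src`
 * (excluding the newline), or -1 if it exceeds V80_MAX_COLS columns. */
static int line_length_checked(const char *src)
{
    size_t n = 0;

    while (src[n] != '\0' && src[n] != '\n') {
        if (++n > V80_MAX_COLS)
            return -1;          /* line too long: caller reports an error */
    }
    return (int)n;
}
```

Rejecting the line instead of truncating it keeps PC-authored sources assemblable on the Z80 build, which is the stated reason for the cap.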

  2. When parsing expressions following .b, if the result doesn't fit in one byte, do you mask off the low eight bits? Mask and right shift (but then that's the same as .w)? Write big-endian order bytes? Bail out with an error? Something else?

It's an error -- when v80 encounters .b it sets a 'parameter size' variable for how many bytes (1, in this case) that expressions must fit into. If an expr > $ff then it's an error. Note that with .w using a string is an error, you can't have an ASCII string expanded to words.

“errors.txt” contains all possible errors in v80 and an explanation of what causes them so it’s a good source of detail on parsing behaviour
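
A rough sketch of the "parameter size" rule for .b and .w might look like the following. The function name and error codes are hypothetical (v80's real errors are listed in errors.txt), and the little-endian word order is an assumption based on the Z80 target.

```c
/* Hypothetical result codes for illustration only. */
enum { EMIT_OK = 0, EMIT_RANGE = -1 };

/* Writes `value` into `out` as `size` bytes (1 for .b, 2 for .w),
 * low byte first. Returns EMIT_RANGE instead of silently truncating
 * a value that does not fit the parameter size. */
static int emit_param(unsigned long value, int size, unsigned char *out)
{
    if (size == 1 && value > 0xFFUL)
        return EMIT_RANGE;
    if (size == 2 && value > 0xFFFFUL)
        return EMIT_RANGE;

    out[0] = (unsigned char)(value & 0xFF);
    if (size == 2)
        out[1] = (unsigned char)((value >> 8) & 0xFF);
    return EMIT_OK;
}
```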

  3. Are values (literal and/or resulting from expressions) limited to 16 bits by the assembler? Or in principle could I configure an ISA for a 32-bit machine?

Yes, v80 is limited to a 16-bit number internally for everything. Considering that v80 can only output bytes or words to the code-segment, 32-bit results don't actually have a practical use! Note that v80 allows underflow but errors on overflow! This is so that the negate operator can work, because a number like -7 is a negate unary operator followed by the positive number 7.
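
One way to model "underflow wraps, overflow is an error" in C is sketched below. `add16` and `neg16` are hypothetical names, and how v80 itself detects overflow is not described here, so treat this as an illustration of the rule only.

```c
/* Adds two 16-bit values. Returns 0 and stores the sum,
 * or returns -1 if the sum no longer fits in 16 bits. */
static int add16(unsigned long a, unsigned long b, unsigned long *result)
{
    unsigned long sum = (a & 0xFFFFUL) + (b & 0xFFFFUL);

    if (sum > 0xFFFFUL)
        return -1;              /* overflow: report an error */
    *result = sum;
    return 0;
}

/* Unary negate wraps modulo 65536, so negating 7 gives $FFF9. */
static unsigned long neg16(unsigned long a)
{
    return (0x10000UL - (a & 0xFFFFUL)) & 0xFFFFUL;
}
```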

  4. Seems like the parser should be line oriented? Or can, say, an incomplete expression continue on a new line?

For memory and parsing-simplification reasons, expressions are limited to one line; the entire parser is line-orientated to allow for parsing a file larger than memory allows. v80 is 335KB of code which obviously doesn't fit into 64 KB of RAM :P

But you have to understand that v80 is purposefully limited to fit into 8-bit hardware and that a C89 version shouldn't be assembling code that can't be assembled on real 8-bit hardware otherwise that defeats the point!

  5. Would you be interested in discussing using a context free grammar to simplify the implementation, so we don't have to track indentation levels for conditionals, whether tokens are the first on a new line or not for constants and labels etc?

v80 is not trying to be an ideal assembler; it's trying to be minimal so that it can support many systems. Things like context-free grammars, macros etc. are features for a better, more language-orientated assembler (hopefully written in v80) -- v80 exists to bootstrap 8-bit software on 8-bit machines instead of relying on PC-only toolchains. Ergo, it has no goal to be anything more than a brutally simple assembler that acts as the bedrock of a broader range of 8-bit software. If an 8-bit computer can't modify and assemble its own software then it might as well be proprietary. An 8-bit computer that can only run software that has to be compiled on a PC is not a real computer and v80 aims to break that cycle by allowing code on a PC to also assemble on 8-bit hardware.

@gvvaughan
Collaborator Author

@Kroc 'nother question about local labels (possibly leading to reducing heap usage quite a bit):

  1. do you have documented support for jumping to local labels from outside of the non-local to which they apply?

    In my fantasyvm assembler I have gone back and forth on supporting that, but currently keep all the local labels in their own table without using the non-local prefix. The local labels table is reset every time a new non-local label is defined, and unresolved local label references throw an error at that point. The downside is that if you really do need to jump into a local label from outside the current non-local label's scope, you end up having to promote some of the locals to non-local and there can be a cascade of promotions around that area as a result. I'm thinking about adding persistent locals that are recorded in the non-local label table if I find it problematic later.

@Kroc
Owner

Kroc commented Aug 8, 2024

Local labels are simply appended to the last non-local label defined forming a complete label-name.
"release/readme.txt" documents each feature, are you referring to that?

1.4 Local Labels:
--------------------------------------------------------------------------------
Local labels can be "reused", as they automatically append themselves to
the last defined, non-local, label:

|   _local                  ; error: local label without label!
|
|   :label1
|   _first                  ; defines :label1_first
|   _second     jr _first   ; defines :label1_second, jumps to :label1_first
|
|   :label2
|   _first                  ; defines :label2_first
|   _second     jr _first   ; defines :label2_second, jumps to :label2_first

Note that the combined length of the local label name and its parent must not
exceed 31 characters, including label sigil:

|   :2345678901234567890    ; 20 chars
|   _234567890              ; 30 chars - OK
|   _23456789012            ; 32 chars - invalid symbol error!

It was done this way for ease of implementation, but I would like to add anonymous labels in the future or change the way local labels are implemented so that they don't take up so much heap space.
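
For the C version, the readme rule quoted above (local name appended to the last non-local label, 31 characters combined) could be sketched like this. `full_label_name` and `V80_MAX_SYMBOL` are hypothetical names, not taken from v80.c.

```c
#include <string.h>

#define V80_MAX_SYMBOL 31   /* combined limit, sigil included */

/* Builds the full name of local label `local` (e.g. "_first") under the
 * last non-local label `parent` (e.g. ":label1") into `out`, which must
 * hold at least V80_MAX_SYMBOL + 1 bytes.
 * Returns 0 on success, -1 if there is no parent or the name is too long. */
static int full_label_name(const char *parent, const char *local, char *out)
{
    size_t plen, llen;

    if (parent == NULL || parent[0] == '\0')
        return -1;                      /* local label without label */
    plen = strlen(parent);
    llen = strlen(local);
    if (plen + llen > V80_MAX_SYMBOL)
        return -1;                      /* invalid symbol: too long */

    memcpy(out, parent, plen);
    memcpy(out + plen, local, llen + 1);    /* copies the terminator too */
    return 0;
}
```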

@gvvaughan
Collaborator Author

gvvaughan commented Aug 8, 2024

Sort of. I wondered whether you want to be able to rely on, eg:

:nonlocal1
_local1
:nonlocal2
_local1             jr :nonlocal1_local1

And if that's not an explicit goal, I think there's some low hanging fruit in heap size savings with segregating local labels into a short-lived table that gets reset at every non-local label boundary. (and allowing local labels a full 31 characters since there's no longer any need to prepend the non-local label)

@Kroc
Owner

Kroc commented Aug 8, 2024

The heap in v80 cannot deallocate anything, ever! If a label gets added, it cannot ever be removed, because once something else gets added to the heap (like a deferred expression, a new constant), the heap cannot shrink without deleting something else important. The space cannot be reused because that creates a fragmentation problem that would take hundreds of bytes of code to work with. The heap is append only.
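
As an illustration of what "append only" means in practice, here is a tiny bump allocator in C. It is a sketch of the concept, not a translation of the Z80 code; `HEAP_SIZE`, `heap_top` and `heap_append` are made-up names, and alignment is ignored because the sketch only hands out bytes.

```c
#include <stddef.h>

#define HEAP_SIZE 1024      /* arbitrary size for illustration */

static unsigned char heap[HEAP_SIZE];
static size_t heap_top = 0; /* only ever grows */

/* Appends `size` bytes to the heap and returns their address,
 * or NULL when the heap is full. There is deliberately no free():
 * records added later sit directly after earlier ones. */
static void *heap_append(size_t size)
{
    void *p;

    if (size > HEAP_SIZE - heap_top)
        return NULL;
    p = &heap[heap_top];
    heap_top += size;
    return p;
}
```

Because each record sits directly after the previous one, releasing anything but the most recent record would leave a hole, which is the fragmentation problem mentioned above.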

Hope is not lost, however; we could have each label record include a sub-label linked list on the end of it, so that only the local label names are stored, attached to the parent label by a linked list. The downside to this would be greater complexity and code size in label searches.

- need a line-based parser to watch v80.v80, so instead of reading
  the next token from the input stream on demand, we buffer the
  next line
- redid the GRAMMAR to support a line-based parser
- factored out a better memory management API and built a getdelim
  and getline work-alike implementation with it
- the tokenizer now sets a start pointer into the buffered line,
  and a token length
- reworked the error messages to match errors.txt docs more
  closely -- can't resist including the current token in the error
  message for ease of use
- added support for nested .i, along with input file stack
  management
- added support for .a, along with a placeholder output stream
- redid most of the low-level string functions for consistency and
  robustness
- lost constants and labels support -- they need a do-over with
  the line-based parser
@gvvaughan
Collaborator Author

@Kroc Heap limitations make sense. For v80.c, I'll use the same "append local to non-local name" for symbol table entries as you, effectively supporting jumping to local labels from another non-local block.

Largely rewrote v80.c today to take into account your earlier answers. Any other feedback welcome as I make progress...

@gvvaughan
Collaborator Author

gvvaughan commented Aug 10, 2024

Hmm.. just occurred to me that you could have local symbols in their own linked list, and as long as each entry is the same size (32 bytes for the label name, and 4 bytes for the next entry pointer) and a zero length name marks the end of the list when searching, then there's no need to deallocate anything. When a new non-local label is encountered, we can error out for unresolved local label references, and then put a 0x0 tombstone at the head of the list. New local labels would then overwrite the entries from the local label list in place starting at the head (making sure that if the next entry was allocated, it gets a 0x0 tombstone) and reusing following entries until they are all used up, and then additional local labels get pushed onto the head of the list as before.

The size of the local labels list would only ever be 36bytes * largest-number-of-locals-in-a-single-scope. Surely much better for very large programs, which are the ones most likely to overflow the heap?
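
To make the idea concrete, here is a rough C sketch of such a reusable local-label list. All names are hypothetical, a static `pool` stands in for the append-only heap, and new entries are appended at the tail rather than pushed onto the head, purely to keep the sketch short. Duplicate definitions and unresolved references are not checked.

```c
#include <string.h>

#define NAME_MAX_LEN 31
#define POOL_SIZE    64     /* stands in for the append-only heap */

struct local {
    char name[NAME_MAX_LEN + 1];    /* "" is the tombstone / end marker */
    unsigned addr;
    struct local *next;             /* NULL if nothing was allocated after */
};

static struct local pool[POOL_SIZE];
static unsigned pool_used = 0;
static struct local *locals_head = NULL;

/* Called when a new non-local label is defined: forget all locals
 * by putting a tombstone at the head; nothing is deallocated. */
static void locals_reset(void)
{
    if (locals_head != NULL)
        locals_head->name[0] = '\0';
}

/* Searches the live entries only; stops at the first tombstone. */
static struct local *locals_find(const char *name)
{
    struct local *e;

    for (e = locals_head; e != NULL && e->name[0] != '\0'; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Defines a local label, reusing a dead entry when one exists.
 * Returns 0 on success, -1 on a bad name or when the pool is full. */
static int locals_define(const char *name, unsigned addr)
{
    struct local *e = locals_head, *prev = NULL;

    if (name[0] == '\0' || strlen(name) > NAME_MAX_LEN)
        return -1;
    while (e != NULL && e->name[0] != '\0') {   /* skip live entries */
        prev = e;
        e = e->next;
    }
    if (e == NULL) {                    /* no dead entry left: allocate */
        if (pool_used == POOL_SIZE)
            return -1;
        e = &pool[pool_used++];
        e->next = NULL;
        if (prev != NULL)
            prev->next = e;
        else
            locals_head = e;
    } else if (e->next != NULL) {
        e->next->name[0] = '\0';        /* move the tombstone along */
    }
    strcpy(e->name, name);
    e->addr = addr;
    return 0;
}
```

With this layout the pool never holds more entries than the largest number of locals seen in a single scope, which is the space saving being proposed.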

@gvvaughan
Collaborator Author

gvvaughan commented Aug 10, 2024

What remains:

  • fix any errors from cc -std=c89 -pedantic -D_POSIX_C_SOURCE=1
  • diagnose unprintable character literals as an error
  • don't rely on host library strtol availability
  • numbers at start of line set PC
  • write to code-segment
  • constant assignment and references
  • label assignment and references
  • local label assignment and references
  • .f keyword
  • .w keyword
  • .b keyword
  • .b strings
  • conditions
  • flush code-segment to output file
  • command line arguments
  • deferred expressions
  • forward references (separate pass ? or back-patch recorded holes in single pass?)
  • testsuite

Did I miss anything?

- fix a few little compilation failures when compiling with strict
  c89 mode only
@Kroc
Owner

Kroc commented Aug 11, 2024

Thinking about it, what I'm trying to get at is that changes to v80's design in Z80 code can take weeks, even months -- it took six months of meticulous crafting instruction-by-instruction and I'm not the fastest developer already. Given that the assembler is now self-assembling, I don't want to break it without careful consideration, and rewriting what already works is equally time consuming, so there have to be clear net wins.

This brings me on to instructions; I hadn't thought far enough ahead about a C version (I didn't actually think anybody would take up the offer), but the C version should reuse the instruction table binary so that this work isn't duplicated for every ISA -- v80 is unique in that support for different CPU instruction sets requires minimal code changes. The instruction set is encoded as a binary tree (see "isa_z80.v80") with a small amount of CPU-specific code to handle parameters ("v80_z80.v80"). However, I'm in the process of rewriting this table logic (see branch "v2") to both greatly simplify the instruction tables (see "is2_6502.v80" in branch "v2" for just how much simpler) and hopefully save more bytes, so you'll want to hold off on parsing instructions for the moment.

@gvvaughan
Collaborator Author

Oh, I didn't mean to imply you should change the algorithm, but I think it's definitely worth throwing an error when attempting to jump into the middle of a local label from another scope so that some space optimizations are still on the table in case you want to do that one day 😁

In the unlikely event that the C version catches up, I might bug you for some specs for the v2 tables then. I secretly want to add support for my fantasy vm ISA after all!

- added a line-wise tokenizer; keeping track of buffers and token
  start and end offsets by hand was too finicky
- minimally tested
- define UINT_MAX if compiler/headers don't have it
- set new global skipcol to UINT_MAX
- add indent field to Include struct
- new parse_condition sets skipcol if condition expression fails
- parse_file sets files->indent from the column of the first token
  as each line is tokenized, not parsing any new lines until the
  first one with an indent no more than skipcol, when skipcol is
  reset to UINT_MAX
- moved the line-too-long diagnostic to tokenize_line
- when tokenize_line reaches a comment, return what was already
  tokenized, potentially avoiding line-too-long failures for
  comments
- new diagnostic when a  string token is found where a
  (non-byte-)expression is expected
- expect a (non-byte-)expression after any keyword except .b, and
  also after a condition and when setting a constant
- exit with usage message for bad command line arguments
- new xfopen helper to open a FILE* or exit with a diagnostic
- do file extension substitution on input path to make an output
  filename if none was given, or fallback to v.out if there was
  no extension match in the table
- keep all opened File objects on a stack and ensure they are all
  closed before exit
- adjust grammar and implementation to allow multiple keywords on
  a single input line
- snprintf is a C99 addition, carefully use sprintf instead
- fix a variable declaration after a statement (C99 feature)
- adjust parser to work in two passes
- for parsing pass 1, don't emit bytes
- for parsing pass 2, don't set label addresses
- reset include stack and pc value before each pass
- elide __attribute__ annotations when __GNUC__ is not defined
- simplify extreplace a little
- fix a bug with closing files from the include stack
- fix a bug with ERR_BADVALUE being too eager in .b and .w args
- fix a bug with double for loop in .b argument parsing
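
The `skipcol` mechanism from the commit notes above can be illustrated in isolation. This is a simplified sketch: `should_parse` and `condition_failed` are hypothetical helpers, and it assumes a failed condition stores the column of its own line in `skipcol`, which the notes do not spell out.

```c
#include <limits.h>

/* UINT_MAX means "not skipping anything". */
static unsigned skipcol = UINT_MAX;

/* Called when a condition expression at column `indent` fails:
 * every following line indented deeper than this is skipped. */
static void condition_failed(unsigned indent)
{
    skipcol = indent;
}

/* Returns 1 if a line whose first token is at column `indent` should
 * be parsed, 0 if it belongs to a skipped conditional block. The first
 * line indented no more than skipcol ends the skipping. */
static int should_parse(unsigned indent)
{
    if (indent > skipcol)
        return 0;
    skipcol = UINT_MAX;
    return 1;
}
```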
@gvvaughan marked this pull request as ready for review on August 16, 2024, 01:45
@gvvaughan
Collaborator Author

I should add tests to flush out bugs in another PR, and I don't have any code to read the opcode tables yet - but the parser handles v1/isa_6502.v80 and v1/isa_z80.v80 and produces plausible looking binary output files, so it seems to be minimally functional.

What's the usual way of building an assembler that does opcode lookup in the tables? And do you have a spec for v2 tables I can implement?

- found some code that looks like `$ $ + 1 _label` in the cpm v80
  assembly files... changed the parser to support that as setting
  PC to the result of an expression (followed by a local label)
- fixed a bug in keyword parsing, where we should return a token
  that can't be parsed as part of the keyword arguments so the
  caller can try a different leg of the recursive descent
- don't attempt to close the standard streams
- diagnose number overflow at any point in evaluation of an expression
- can't close and reopen stdin, so remove '-' sentinel from command
  line
- use separate len and num fields in Token struct so that we always
  (for the duration of working with a specific line anyway) have the
  token text, even when there's a number value in the token now
- use a single T_COND, storing the condition type (= - ! +) in the
  newly available num field
- simplify parse_condition and parse_line accordingly
- new simpler err_fatal_token replaces both err_fatal_token_str and
  err_fatal_token_value
- simplify callers with new token_new_number and token_new_string
- simplify tokenize_line
- remove unused functions stack_zstreq, token_type and token_value
@Kroc
Copy link
Owner

Kroc commented Aug 16, 2024

Sorry for the slow response, I'm rather busy at home whilst my son is off school over summer. The process of parsing the instruction tables is covered by parseMnemonic in "v80_asm.v80" (

v80/v1/v80_asm.v80

Lines 1234 to 1377 in cebb049

:parseMnemonic
;===============================================================================
; parse an instruction into opcodes:
;
; the CPU-specific module (e.g. "v80_z80.wla") provides a binary tree,
; :opcodes, that this routine walks to match instruction names to opcodes
; and a CPU-specific set of flags that determines which parameters are required
;
; in: A first character of word to parse
; HL heap addr
; out: HL heap addr is advanced for any expressions deferred
; IY binary code is appended to the code-segment,
; IX and the virtual program-counter is advanced
; A, BC|DE (clobbered)
;-------------------------------------------------------------------------------
ex.DE.HL ; swap heap to DE for now
ld.HL :opcodes ; start at beginning of opcode tree
; the first character is already in A
;
set5.A ; force lowercase (see desc. below)
jr _0 ; jump into the parsing loop
;=======================================================================
; match; follow the branch:
;-----------------------------------------------------------------------
; once a character matches, the next two bytes are either
; an offset to the next branch to follow, or an opcode pair
;
_match inc.HL ; step over the matched character
ld.C*HL ; read the offset lo-byte | opcode-byte
inc.HL ; move to next byte in tree
ld.B*HL ; read the offset hi-byte | opcode-flags
bit7.B ; is hi-bit of hi-byte set?
jr?nz _opcode ; if so, this is an opcode
; add the offset to the current position to jump to the new branch:
; NOTE: the offset in the binary tree is reduced by 1 to compensate
; for adding from the hi-byte addr, rather than the lo-byte addr
;
adc.HL.BC
; if the hi-bit is set on the hi-byte, then it's an opcode + flag pair,
; not a jump! we branch away after the add to get a free flag-check
;
; TODO: this requires bit 6 of the opcode-flags to always be zero
; otherwise the ADC can overflow, voiding this check. this
; would leave us only 5 unique bits for any CPU
;
;jp?m , @opcode ; if hi-bit set, emit opcode
; get character from input file:
;-----------------------------------------------------------------------
_next call :readChar ; read from input file
cp #SPC + 1 ; is it whitespace? (hold carry...)
; force lowercase, without also affecting
; numbers / [most] punctuation:
;
; this essentially forces ASCII codes 64-95 (@A-Z[\]^_) to codes
; 96-127 (`a-z{|}~) which makes A-Z lowercase with the caveat that
; some punctuation cannot be differentiated "@"<->"`", "[]"<->"{}",
; "\"<->"|" and "^"<->"~" but we aren't using any of those in the
; instruction names anyway
;
; it also means that ASCII codes 0-31 (non-visible) are promoted
; to 32-64 (visible), but we have already checked for ASCII codes
; 32 (space) or below and this is signalled by the carry flag; so
; even though the below instruction would change tab into ")", we
; will undo this afterwards
;
set5.A ; force partial lowercase
jr?nc _0 ; was this a non-visible char before?
xor.A ; any whitespace = end-of-word (0)
_0 ld.BC 3 ; this is faster than INC HL x 3!
; compare with opcode tree:
;-----------------------------------------------------------------------
_cp cp*HL ; compare input char with tree char
jr?z _match ; characters match?
; if the hi-bit of the character from the opcode tree is set, it's
; either a continuation character (>128) or the end of a branch (=255)
;
bit7*HL ; check bit 7 of character
jr?nz _cont ; handle continuation char / end
; no match; try the next character:
;
_skip add.HL.BC ; skip 3 bytes in opcode tree
jr _cp ; compare next char in tree
;-----------------------------------------------------------------------
; handle continuation character / end-of-branch:
;
; a continuation character has no branch -- one character has to
; immediately follow another -- any mismatch is an unknown opcode
;
_cont or %10000000 ; *add* top bit to input char
cp*HL ; redo comparison with tree
inc.HL ; (move to next char in tree)
jr?z _next ; match, check next char
jp :errInvalIns ; error for continuation mismatch
;=======================================================================
; emit opcode(s):
;-----------------------------------------------------------------------
; if a branch ends in an opcode then no more characters must follow,
; with one exception -- an apostrophe can be appended to an instruction
; for indicating shadow registers. this is a crude hack as no check is
; made to ensure it's a register at the end, but it saves hundreds of
; extra branches in the opcode tree
;
_opcode and.A ; if the last char is already 0,
jr?z _ok ; then no further check is needed
_get call :readChar ; read one more character
cp '' ; if it is apostrophe,
jr?z _get ; then ignore and go again
cp #SPC + 1 ; is it whitespace (or eof)?
jp?nc :errInvalIns ; if not, invalid instruction!
_ok ex.DE.HL ; swap heap back to HL
; the flags byte is a set of flags for CPU-specifics and what, if any,
; kind of parameter is required. regardless of ISA, a "0" (with hi-bit
; removed) always indicates no-parameters
;
ld.A.B ; opcode flags byte
and %01111111 ; remove the top bit
; if flags byte is non-zero, analyse further (this routine is in
; the CPU-specific module, e.g. "v80_z80.v80" or "v80_6502.v80")
;
jp?nz :emitOpcode
; single opcode, no params:
;-----------------------------------------------------------------------
ld*IY.C [ 0 ] ; emit opcode byte
inc.IY ; move to next byte in code-segment
inc.IX ; increment virtual program-counter
ret
). Sorry that I don't have it better described somewhere, but it's a small amount of code; the tables themselves describe and demonstrate the structure so it's possible to use that alone as a guide. I'm getting near the end of the v2 instruction parser but have been struggling a lot with focus. The v2 parser is only guaranteed to make the instruction tables easier to read and write; performance is an unknown factor at the moment until I complete my prototype, so there's a small possibility v2 might be abandoned.
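
For anyone porting the routine, the character folding done by `cp #SPC + 1` / `set5.A` / `xor.A` in the listing above is roughly equivalent to the C below. `fold_char` is a hypothetical name; the tree walk itself is not reproduced here.

```c
/* Folds an input character the way parseMnemonic does before comparing
 * it with the opcode tree:
 *  - anything at or below space (whitespace, control codes) becomes 0,
 *    the end-of-word marker;
 *  - otherwise bit 5 is set, which maps ASCII 64-95 onto 96-127
 *    (so A-Z becomes a-z) and leaves digits and most punctuation alone. */
static int fold_char(int c)
{
    if (c <= ' ')
        return 0;
    return c | 0x20;
}
```

As the assembly comments note, the price of this shortcut is that "@" and "`", "[" and "{", and a few similar pairs become indistinguishable, which is harmless as long as instruction names avoid them.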

The "build.bat" script does some testing by building samples of the entire Z80/6502 instruction set and comparing against the same produced with WLA-DX; maybe this would be a starting point? I haven't examined the PR enough to know what the build requirements of your C version are and if/how this would work as part of the current, rather crude, system. I use a batch file only so that v80 can be built out-of-the-box without having to install any dependencies or deal with high up-front demands like requiring knowledge of Docker -- remember that whatever is required to build v80 is itself a dependency of the 8-bit software at the end of the pipeline and the goal is to get away from gigabytes of constantly evolving build infrastructure :P

@gvvaughan
Collaborator Author

No apologies necessary. I'm setting off on a 2-3 week road trip tomorrow, so any free time I would have had for coding will probably be spent on driving instead. Absolutely no hurry on anything from my perspective.

Build requirements for v80.c are a c89 C-compiler toolchain and a libc with support for stdio FILE* streams and a selection of c89 *printf calls (these could be coded around if it needs to build and run in an environment without stdio, but I'd rather not -- it's a lot of boring code) as well as stdlib.h for malloc, free and exit calls (could probably write a custom allocator if malloc and free are missing; managing without exit is probably a bit harder). If sys/param.h is available, it'll use the proper values for some constants, but has sensible fallbacks if not. If sys/stat.h is available, it'll check inode types when opening files for reading.

I was looking at your build.bat, and even though I enforce CP/M compatible filenames for .I arguments, you can pass any path to the compiled v80.c on the command line... I should probably take any directory prefix from the command line input file and prepend it to any filenames that come from .I args so you don't have to run it from the directory with the sources inside to find the include files.

I haven't tried building anything with WLA-DX or runcpm yet, so that's probably a good thing for me to get going to decide how to proceed, but I'd also like to write specific tests to exercise the tokenizer and parser in v80.c which probably needs a custom test harness anyway... which is why I don't want to pile that all on top of this PR.

I'm still not clear on how to assemble the *.v80 files to end up with a working assembler that contains the instruction lookup tables and the code that uses them to assemble instruction op-codes. It appears that that assembler needs to exist before it's possible to assemble the table lookup code?!?

And finally (for now ;-) ) -- I was thinking it might be easier to share the instruction opcode to binary mappings between v80.c and v80 proper if we define the instruction set separately somewhere that v80.c can load directly into a hash table, and I also provide some code to generate the lookup table sources (for v80 sources) rather than you hand coding them. That will let you tune the format for speed/space efficiency without the work of hand coding the tables too. WDYT?

@gvvaughan changed the title from "v80.c: work in progress for a v80 assembler in c89" to "v80.c: v80 assembler in c89" on Aug 16, 2024
- with _POSIX_C_SOURCE=1, use all local function implementations
- add preprocessor guards to use library functions as available
- unroll single use of TOKEN_TYPES x-macro
- defer to standard ctype functions and use them as available
- split xstrtou into two, and use standard strtoul library
  function if available
- replace uses of zstrncpy and non-standard zstrlcpy with standard
  strlcat and strlcpy when available, or interface compatible
  local implementations otherwise (note: it can take some coaxing
  with feature macros to get declarations out of the standard
  headers!)
- remove some newly unused functions
- add some section comments
- provide fallback dirname() function in case libgen.h is missing
- new global zincludedir
- save a copy of the directory of infile argument to zincludedir
  (or "." if argv[1] has no directory component)
- adjust parse_keyword_include and helpers to search zincludedir
- improve the option parser in v80.c, add a new `-i` option that
  preloads the symbol table with the named ISA
- use a hash table for the symbol table instead of a linked list
- new .m keyword support.  `.m instruction body tokens` stores
  `instruction` as a key in the symbol table with the rest of the
  tokenized line as its value
- new v1/tbl_6502.v80 defines the 6502 ISA using .m
- new v1/tbl_z80.v80 defines the Z80 ISA using .m
- when the parser encounters a (.m defined) instruction, it switches
  to parsing the associated macro body, usually injecting the bytes
  from the body into codesegment, evaluating expressions as necessary
  to calculate those bytes: Except for the following tokens
  + .b - consume a byte from the assembly source, evaluating an
    expression if necessary, and write the result to the codesegment
  + .w - consume a word from the assembly source, evaluating an
    expression if necessary, and write the resulting two bytes in
    little-endian order to the codesegment
  + .r - consume a word from the assembly source, evaluating an
    expression if necessary, treating that as a destination address,
    and write a single byte to the codesegment as a relative offset
    to that destination address
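
The `.r` case in the notes above boils down to a small range-checked subtraction. The sketch below is not the code from parser.c: `rel_offset` is a hypothetical name, and it assumes the offset is measured from the address following the offset byte, as Z80 `jr` and 6502 branch instructions do.

```c
/* Computes the one-byte relative offset for a `.r` operand.
 * `pc` is the address where the offset byte itself will be written and
 * `dest` is the destination address from the assembly source.
 * Returns 0 and stores the byte, or -1 if the destination is out of
 * reach (outside -128..+127). */
static int rel_offset(unsigned pc, unsigned dest, unsigned char *out)
{
    long delta = (long)dest - ((long)pc + 1);

    if (delta < -128 || delta > 127)
        return -1;
    *out = (unsigned char)(delta & 0xFF);   /* two's complement byte */
    return 0;
}
```

For example, a `jr` at $0100 that jumps to itself has its offset byte at $0101, so the stored value is -2, i.e. $FE.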
@gvvaughan
Collaborator Author

gvvaughan commented Aug 28, 2024

Had a couple of unexpected evenings to finish the code!

This implements the instruction tables for v80.c, as well as loading and parsing. It produces a sensible-looking (but untested) cpm_z80.com binary from the assembly sources, so it can now serve as a bootstrap mechanism.

I need to write some code to generate the isa_*.v80 tables for the v80 assembler from the tbl_*.v80 tables for the C assembler, and validate that the binary it generates runs and regenerates bit-identical content from itself when reassembling itself.

QQ: v80.c is becoming hard to navigate at this size when editing it, but also having everything in a single file makes it easier to compile. I'm tempted to pull the polyfills (for missing libc APIs) and maybe some of the data structures (linked lists, hash tables, perhaps the tokenizer) into individual pseudo-headers. That would mean adding -I$PWD/v1 to the compiler invocation to pull all that code back in (but still a single compilation unit), but would make editing and navigating the code a lot easier for me. Do you have a preference? I could be nudged either way quite easily...

@Kroc
Owner

Kroc commented Aug 30, 2024

QQ: v80.c is becoming hard to navigate at this size when editing it, but also having everything in a single file makes it easier to compile. I'm tempted to pull the polyfills (for missing libc APIs) and maybe some of the data structures (linked lists, hash tables, perhaps the tokenizer) into individual pseudo-headers. That would mean adding -I$PWD/v1 to the compiler invocation to pull all that code back in (but still a single compilation unit), but would make editing and navigating the code a lot easier for me. Do you have a preference? I could be nudged either way quite easily...

Thank you for your hard work! Yes, you should split the code where you are essentially "patching" the base C-functionality; I fully expect that additional replacement functions may be needed for certain combinations of operating system and compiler -- C89 compatibility was very variable in compilers even late into the 90s! Such monkey-patching and non-portable considerations shouldn't factor into the code of v80 itself so that others may have an easier time fixing for their choice of compiler/OS.

- add a more robust option processing loop
- support --version
- support -h, --help with some basic option help
- rename --isa to --include
- polyfill/ a new directory with replacements for likely candidates
  for missing system headers and apis.  Note: it's not a replacement
  for the system library, only fallbacks for apis used by this
  project
- error.c: error handling
- file.h, file.c: file handling
- stack.c: simple generic stack datatype
- hash.c: simple hash table datatype with buckets made with stack.c
- symtab.c: symbols and a symbol table for them made with hash.c
- token.h, token.c: token data type, and a line at a time tokenizer
- parser.c: recursive descent parser for v80 assembly using token.c
- main.c: command line processing, and driver for feeding the parser
- Makefile: simple rules for making versions of v80 from the above to
  check that it works with almost nothing from libc using c89, and also
  using c99 with optimized libc functions instead of polyfills.
- README.md: A little about how to build and use it all.
@gvvaughan
Copy link
Collaborator Author

gvvaughan commented Sep 4, 2024

Okay, all done @Kroc!

If I compile main.c from bootstrap to make a v80 executable on my machine:

$ cd bootstrap
$ cc -std=c89 -pedantic -ggdb3 -D_POSIX_C_SOURCE=1 -DNO_STRING_H -DNO_SYS_STAT_H -DNO_CTYPE_H -DNO_LIBGEN_H -DNO_SIZE_T -DNDEBUG -I. -o ./v80 main.c

And then use that to make a cpm_z80.com file for CP/M (note the use of the simplified tbl_z80.v80 table to populate the instruction lookup table):

$  ./v80 -i tbl_z80.v80 ../v1/cpm_z80.v80 v80c.com

It produces identical bytes after recompiling itself with ntvcm (according to vbindiff):

$ ntvcm -l ../bootstrap/v80c.com cpm_z80.v80

And also identical bytes to recompiling sources with your most recent v80.com release:

$ ntvcm -l ../release/v80.com cpm_z80.v80

Incidentally the byte encodings for the set* instructions are the same as the res* instructions in your v1/is2_z80.v80 file. I discovered and corrected those in my bootstrap/tbl_z80.com file when comparing binaries, but I haven't done a full audit to see if there are other typos in there.

If you like and merge this PR, I'll be happy to work on generating the is2_*.v80 files from the simpler tbl_*.v80 tables when you've finalized the format. Or to isa_*.v80 if you decide to abandon the v2 format.

Also, feel free to let me know if you have any suggestions for changes or improvements to what is already here.

@Kroc
Owner

Kroc commented Sep 4, 2024

Incidentally the byte encodings for the set* instructions are the same as the res* instructions in your v1/is2_z80.v80 file. I discovered and corrected those in my bootstrap/tbl_z80.com file when comparing binaries, but I haven't done a full audit to see if there are other typos in there.

I had seen this and fixed it, but maybe that was only on the v2 branch :/ I can't remember things straight. My son will be back to school next week and I'll focus on integrating your C version then. I think we should merge it in the current state to a separate branch; are you able to update the PR to use a different branch (or is this something I need to do?)

@gvvaughan
Collaborator Author

Cool! I can definitely do it if you tell me what branch you'd like me to retarget to. I think you might also be able to do it with the edit button near the very top of the PR page? Let me know whenever you're ready!

@Kroc changed the base branch from main to c on September 12, 2024, 15:31
@Kroc merged commit 5e25949 into Kroc:c on Sep 19, 2024