Whitespace Assembly Language to Whitespace transpiler
Run the program like ./wsm2ws.pl <filename>
with a filename as the first argument.
The Whitespace output will be written to a file in the same directory as the input file. If the filename ends in .wsm
the output file will have the filename <basename>.ws
, otherwise the filename will be <filename>.ws
. For example ./wsm2ws.pl test.wsm
will create or overwrite the file test.ws
in the current directory and ./wsm2ws.pl dir/test.pl
will create or overwrite the file dir/test.pl.ws
.
The program will output a human readable version of the code to STDOUT followed by a tokenised version of the code then the name of the output file. While the tokenised code appends the keywords corresponding to each instruction as provided in the input file the output may differ slightlly when it comes to labels and numbers.
$ cat examples/reverse.wsm
push 0 ; initialise stack with 0. state: [0]
input_loop:
dup ichar ; read character from stdin (stored at address n). state: [n] [n:<char>]
dup retr ; get character value from heap address n. state: [n, <char>]
jez "output_loop"; null character read so assume end of input and start output loop. state: [n]
add 1 ; increment n. state: [n+1]
jmp "input_loop" ; continue input loop
output_loop:
sub 1 ; decrement n. state: [n-1]
dup retr ; get character value from heap address n. state: [n, <char>]
ochar ; write character to stdout. state: [n]
jmp "output_loop" ; continue output loop
$ ./wsm2ws.pl examples/reverse.wsm
sssnnsstnsnstntssnstttntsnssstntsssnsntnnssnssstntsstsnsttttnssnsnn
sssn ; push 0
nsssn ; input_loop:
sns ; dup
tnts ; ichar
sns ; dup
ttt ; retr
ntsn ; jez "output_loop"
ssstn ; push 1
tsss ; add
nsnsn ; jmp "input_loop"
nssn ; output_loop:
ssstn ; push 1
tsst ; sub
sns ; dup
ttt ; retr
tnss ; ochar
nsnn ; jmp "output_loop"
See examples/reverse.ws for transpiled source
This program does not attempt to validate or verify the Whitespace code beyond ensuring correct syntax. It is up to the programmer to ensure the code will execute without errors. Problems with the example above include a missing end token, no way to leave the output loop, and there's no check to avoid executing retr with a negative heap address.
See the list of issues with the [enhancement] label.
Refer to the issue tracker (specifically issues with the [bug] label).
Keywords are generally derived from the first verb of the command description from the Whitespace Tutorial, with some abbreviations just to keep tokens to five characters or less.
A semicolon (;
) starts a comment, everything from that token to the end of the line will be ignored.
The following keywords are available. In fact, any tokens that start with a listed keyword can be used (with some exceptions noted below).
push
: Pushes a value to the stack.1dup
: Duplicates the top stack item.copy
: Copies the nth stack item to the top of the stack.1swap
: Swaps the top two stack items.
Synonyms:swp
pop
: Removes the top stack item.slide
: Removes the top n stack items, keeping the top item.1
add
: Additionsub
: Subtractionmul
: Multiplicationdiv
: Integer Divisionmod
: Modulo/Remainder
Synonyms:rem
Arithmetic commands can be followed by a number to use as the RHS of the operation. A push <number>
command will be inserted before the arithmetic command in the transpiled output. For example, sequences like push 5 sub 3
will be transpiled to push 5 push 3 sub
.
stor
: Stores the value of the top stack item at the address given by the next stack item.retr
: Retrieves the value at the address given by the top stack item and pushes it to the stack.
Heap access commands can be followed by a number to use as the heap address. For stor
commands a push
will be inserted into the transpiled output as described for arithmetic commands. retr
commands are a little more complicated. As the spec uses the top stack value as the value to be stored and the second from the top as the address an additional swap
command will be inserted between the push
and the retr
to maintain consistency with the stor
syntax.
label
: Declares a label.2call
: Call a subroutine, effectively a jump to a label that also marks the current location for a laterret
.2jmp
: Unconditionally jump to a label.2
Synonyms:jump
,goto
jez
: Jump to a label if the top stack item is 0.2
Synonyms:jz
jlz
: Jump to a label if the top stack item is negative.2
Synonyms:jn
ret
: Return to the location of the lastcall
command.
Note: Pattern matching for this command is actually/^ret(?!r)/
so thatretrieve
unambiguously matchesretr
end
: End the program.
Synonyms:exit
ochar
: Output the character given by the value of the top stack item.
Synonyms:putchar
onum
: Output the value of the top stack item.
Synonyms:putnum
ichar
: Read a character and store it at the address given by the top stack item.
Synonyms:getchar
inum
: Read a number and store it at the address given by the top stack item.
Synonyms:getnum
I/O commands can be followed by a number to use as the heap address (for input commands) or value (for output commands). In all cases a push
will be inserted into the transpiled output as described for arithmetic commands.
- These commands expect the next token to be a number as described below. If the next token doesn't look like a number a 0 value will be inserted and a warning printed to STDERR.
- These commands expect the next token to be a label as described below. As the spec allows for an empty label if the next token doesn't match the label rules an empty label is inserted and parsing continues with no warning.
Numbers can be written in any of the following formats:
- Integer (
[+-]?\d+
) - A sequence of digits.1 2 - Binary (
[+-]?0b[01]+
) - The string0b
followed by a sequence of0
and1
characters.1 3 4 - Octal (
[+-]?0[0-7]+
) - A0
character followed by a sequence of digits between0
and7
(inclusive).1 - Hex (
[+-]?0x[\da-f]+
) - The string0x
followed by a sequence of digits or the charactersa
tof
.1 4 - Character literal (
'\?.'
) - A single quoted character or escape sequence. Character literals will be converted to the corresponding ascii character code value.
- These formats can optionally be prefixed with a
+
or-
character to specify the sign. - To shorten output slightly the integer 0 has special case handling so it is encoded as an empty sequence instead of a single space character. To avoid this behaviour use the binary/octal/hex format
- Leading
0
digits are significant when used with binary numbers. - These formats are case insensitive.
Labels can be written in any of the following formats:
- Integer (
\d+
) - A sequence of 1 or more digits.1 2 - Binary (
0b[01]+
) - The string0b
followed by a sequence 1 or more of0
and1
characters.1 3 4 - Octal (
0[0-7]+
) - A0
character followed by a sequence of 1 or more digits between0
and7
(inclusive).1 - Hex (
0x[\da-f]+
) - The string0x
followed by a sequence of 1 or more digits or the charactersa
tof
.1 4 - Quoted string (
"[^"]*"
) - A sequence of 0 or more non-double-quote characters surrounded by double-quotes.5
- While labels are not numeric, for ease of representing labels in WSM syntax unsigned numerical formats are allowed. The binary representation of the number will be translated to a sequence of spaces and tabs to make a valid label.
- To shorten output slightly the integer 0 has special case handling so it is encoded as an empty sequence instead of a single space character. To avoid this behaviour use the binary/octal/hex format
- Leading
0
digits are significant when used with binary numbers. - These formats are case insensitive.
- The following rules apply when using strings for labels:
- When a string is used for a label a sequence of space/tab characters (followed by a newline) will be generated for each reference in the Whitespace output such that the most frequently referenced strings receive the shortest unused sequence. Using the reverse program as an example
output_loop
gets the empty sequence because it has three references in the code andinput_loop
gets a length 1 sequence because it has two references. If there was another command using an implicit label (or the integer0
as a label) the empty sequence would be unavailable, sooutput_loop
would get a length 1 sequence (e.g.sn
) andinput_loop
would get the next unique sequence of length 1 or greater (e.g.tn
). - To support the short syntax for declaring a label when a
short_label:
token is encountered the text (excluding the:
) will be treated as if it was a quoted string so that flow control statements can refer back to that location. For exampleshort_label: jump "short_label"
is equivalent tolabel "short_label" jump "short_label"
and both commands will have the same generated label in the Whitespace output.
- When a string is used for a label a sequence of space/tab characters (followed by a newline) will be generated for each reference in the Whitespace output such that the most frequently referenced strings receive the shortest unused sequence. Using the reverse program as an example