# Monthly Archives

4 Articles

## Sweet16-GP Assembler – Part 3

This entry is part 8 of 8 in the series Sweet16-GP CPU: A Complete Development Cycle

Last time we left off with the first pass of our assembler creating an intermediate representation for our assembly code. Today we are going to begin the second pass. So let’s get on with it!

Our first pass completes with a list of code that contains all the information we need to assemble the program. So why did we use a two-pass design instead of a single pass? The answer is that a single-pass assembler has to do a lot more work to handle forward references to labels whose values are not yet known. With a two-pass design we can forgo those complexities and simply resolve any forward references in the second pass, after the first pass has already calculated each instruction’s address.
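To make the forward-reference problem concrete, here is a minimal sketch of the two-pass idea. The tuple layout, the `SIZES` table and the function names are assumptions made for this illustration only; they are not the data structures our assembler uses.

```python
# Hypothetical mini-assembler: each source line is (label, mnemonic, operand).
# The instruction sizes are illustrative assumptions.
SIZES = {'BRA': 2, 'SET': 3, 'HALT': 1}

def resolve_labels(program):
    """Pass 1: walk the program once, recording the address of every label."""
    symbols = {}
    address = 0
    for label, mnemonic, _operand in program:
        if label:
            symbols[label] = address
        address += SIZES[mnemonic]
    return symbols

def substitute(program, symbols):
    """Pass 2: every label is known now, so a forward reference is a plain lookup."""
    resolved = []
    for _label, mnemonic, operand in program:
        resolved.append((mnemonic, symbols.get(operand, operand)))
    return resolved

program = [
    (None, 'BRA', 'end'),     # forward reference: 'end' is defined two lines later
    (None, 'SET', 0xFFDE),
    ('end', 'HALT', None),
]
symbols = resolve_labels(program)
print(symbols)                       # {'end': 5}
print(substitute(program, symbols))  # [('BRA', 5), ('SET', 65502), ('HALT', None)]
```

A single-pass design would reach `BRA end` before it knew where `end` lives, and would have to go back and patch the instruction later.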

So what work do we need to accomplish in the second pass? Well, we need to keep track of each instruction’s location in memory. We will do this by maintaining a symbol in the symbol table called “pc”, for program counter. We will also need to replace any symbols with their values stored in the symbol table. Then we will need to encode the instruction in hex values. So let’s see how to accomplish this:

```
# Second pass -- Assembly (this loop lives inside assemble())
for lineno, pc, value, register, mode, mnemonic, icode in objcode:
    print(f'Pass 2 : {lineno, pc, value, register, mode, mnemonic, icode}')
    # Evaluate the string
    realvalue = 0  # default for instructions that carry no value
    try:
        # Make the current program counter available as the symbol "pc"
        symbols['pc'] = pc

        if value != '':
            realvalue = eval(value, symbols)

            if isinstance(realvalue, str):
                realvalue = ord(realvalue) & 0xff

            if not isinstance(realvalue, int):
                raise TypeError("Integer expected in {0}".format(value))
    except Exception as e:
        print("{0:4d} : Error : {1}".format(lineno, e), file=sys.stderr)
        realvalue = 0
```

Referring to the code above, we first loop over objcode, extracting the values we assembled in the first pass. On each iteration we store the current value of the program counter in the symbol table. Then we check the value field. Any value we stored here is a string, so we use eval() to convert it to an integer. If the value is a symbol, eval() converts it to its value because we pass the symbol table along with the expression. If the value is the string representation of a number, eval() simply returns that number. If eval() returns a string instead, as it does for a quoted character, we call ord() to convert it to an integer and “AND” the result with 0xFF to ensure it remains a byte-sized value. Now we should have an integer in “realvalue”. The last if statement raises a TypeError if this is not the case. Finally, we wrap all of this in a try/except block that prints an error message to stderr if any issue arises and sets the “realvalue” variable back to zero.
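If the behaviour of eval() with a symbol table is not obvious, the following small sketch may help. The `evaluate` helper and the contents of `symbols` are assumptions made for this example; the logic mirrors the checks described above. Keep in mind that eval() will execute any Python expression, so this approach is only reasonable for source files you trust.

```python
symbols = {'end': 3, 'pc': 0}   # hypothetical symbol table contents

def evaluate(value, symbols):
    """Evaluate an operand string in the style of the second pass."""
    # Pass a copy: eval() adds a '__builtins__' entry to its globals dict.
    result = eval(value, dict(symbols))
    if isinstance(result, str):
        result = ord(result) & 0xff       # single quoted character -> byte value
    if not isinstance(result, int):
        raise TypeError("Integer expected in {0}".format(value))
    return result

print(evaluate('222', symbols))      # 222
print(evaluate('end', symbols))      # 3
print(evaluate('end + 1', symbols))  # 4
print(evaluate("'A'", symbols))      # 65
```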

Our next step is to take the values and data we extracted from objcode and assemble all of it into the actual values that represent the instruction. Some entries in the opcode table are functions, so we need a way to tell them apart from the plain opcode strings. For this we will use Callable, an abstract base class that lets isinstance() test whether an object can be called. It lives in the collections.abc module (the old collections.Callable alias was removed in Python 3.10), so our first order of business is to import it.

`from collections.abc import Callable`

Now that we have Callable we can encode our actual instruction code:

```
    # (still inside the second-pass loop)
    # Encode the instruction. If register is present,
    # it must be added to the base opcode value stored
    # in icode[0]
    opcode = int(icode[0], 16)
    if register != '':
        opcode += int(register, 16)
    encode = [op(pc, realvalue) if isinstance(op, Callable) else op for op in icode]
    encode[0] = opcode

    print(lineno, pc, encode)
    execode.append((lineno, pc, encode))

# After the loop has processed every instruction
return execode
```

Here we first extract the instruction opcode and convert it from a hexadecimal string to an integer. Then we check whether the register field is empty. If not, we have a register value that must be added to the instruction opcode. The next step is to encode the extra data values. We do this by looping over the values in “icode”; if a value is Callable, i.e. a stored function, we call it with the pc and “realvalue” and store the result in the list variable “encode”. We then update the opcode in encode and print the results as a simple sanity check. Finally, we append a tuple containing the line number, the pc value and the encode list to the instruction list execode. Once we have iterated over all the instructions in our program, we return the execode list to the caller.
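The encoding step can be tried in isolation with a sketch like the one below. The helpers `low_byte` and `high_byte` are hypothetical stand-ins for VALUE_L and VALUE_H, and unlike the article's versions they return integers rather than strings; `encode_instruction` is simply the loop body above wrapped in a function for the sake of the example.

```python
from collections.abc import Callable

def low_byte(pc, value):       # hypothetical stand-in for VALUE_L
    return value & 0xff

def high_byte(pc, value):      # hypothetical stand-in for VALUE_H
    return (value >> 8) & 0xff

def encode_instruction(icode, register, pc, realvalue):
    """Turn one opcode-table entry into the list of values to place in memory."""
    opcode = int(icode[0], 16)            # base opcode is stored as a hex string
    if register != '':
        opcode += int(register, 16)       # the register index is added to the opcode
    encoded = [op(pc, realvalue) if isinstance(op, Callable) else op for op in icode]
    encoded[0] = opcode
    return encoded

print(encode_instruction(['0x08', low_byte, high_byte], '1', 0, 0x00fe))  # [9, 254, 0]
print(encode_instruction(['0x00'], '', 3, 0))                             # [0]
```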

We now have a list of encoded instructions with a little extra info attached. If you run the code above, the final results in execode will look something like this:

`[(2, 0, [8, '222', '255']), (3, 3, [0])]`

What we see here is a list of tuples where the first entry in each tuple is the line number in the source file on which the instruction was found. The second value is the program counter value at the start of the instruction, and so where to store the first byte of the instruction. The last value in the tuple is a list of the values to be placed in memory. Here, our first instruction has an opcode of 8 and is followed by the values 222 and 255. Recall that the first byte is the low-order byte. If we look up the opcode 8 we will see it is the SET instruction. Since this instruction requires a register index to be added to it, and the opcode is unchanged, we can deduce that the register is register 0, or ACC. The value to be loaded into register 0 is (255 << 8) + 222 = 65502, or 0xFFDE. So the complete instruction is SET R0, 0xFFDE. The second instruction is even easier to decode. The opcode 0x00 is the HALT instruction and takes no arguments.
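The hand decoding above can be double-checked with a few lines of Python. The `decode_set` function is a hypothetical helper written for this check, not part of the assembler, and it only understands the three-value SET encoding.

```python
def decode_set(encoded):
    """Split an encoded SET instruction into (register index, 16-bit value)."""
    opcode, low, high = (int(b) for b in encoded)
    register = opcode - 0x08      # SET's base opcode is 0x08
    value = (high << 8) | low     # the low-order byte is stored first
    return register, value

register, value = decode_set([8, '222', '255'])
print(register, hex(value))   # 0 0xffde
```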

OK, let’s see the code as we have it now:

```"""Sweet16-GP Assembler"""
import sys
from collections.abc import Callable

# Exception used for errors
class AssemblyError(Exception):
pass

# Functions used in the creation of object code (used in the table below)
def VALUE_L(pc: int, value):
val = value_to_int(value) & 0xff
return str(val & 0xff)

def VALUE_H(pc: int, value: int) -> int:
val = (value_to_int(value) & 0xff00) >> 8
return str(val & 0xff)

def LABEL_L(pc, value):
return (value) & 0xff

def LABEL_H(pc, value):
return ((value) & 0xff00) >> 8

def OFFSET(pc, value):
return ((value - pc)) & 0xff

def value_to_int(value):
# convert value to int from various bases
if isinstance(value, str):
if value[0:2].lower() == '0x':
return int(value, 16)
elif value[0:2].lower() == '0c':
return int(value[2:], 8)
elif value[0:2].lower() == '0b':
return int(value, 2)
else:
print('Invalid value.')
elif isinstance(value, int):
return value
# else:
#     raise ValueError("Expected numerical value, got: {0}".format(value))

def register_to_int(reg):
if isinstance(reg, str):
if reg.startswith('@R'):
return int(reg[2:], 16)
elif reg.startswith('R') and reg.endswith(','):
return int(reg[1:-1], 16)
elif reg.startswith('R'):
return int(reg[1:])
else:
raise ValueError("Cannot parse the register: '{0}'".format(reg))

opcode_table = {
# TODO: Get DATA implemented

'HALT': {
'implicit': ['0x00'],
},
'BRA': {
'immediate': ['0x01', VALUE_L],
'offset': ['0x01', OFFSET],
},
'BRC': {
'immediate': ['0x02', VALUE_L],
'offset': ['0x02', OFFSET],
},
'BRZ': {
'immediate': ['0x03', VALUE_L],
'offset': ['0x03', OFFSET],
},
'BRN': {
'immediate': ['0x04', VALUE_L],
'offset': ['0x04', OFFSET],
},
'BRV': {
'immediate': ['0x05', VALUE_L],
'offset': ['0x05', OFFSET],
},
'BSR': {
'immediate': ['0x06', VALUE_L, VALUE_H],
'offset': ['0x06', LABEL_L, LABEL_H],
},
'RTS': {
'implicit': ['0x07'],
},

'SET': {
'direct': ['0x08', VALUE_L, VALUE_H],
},
'LD': {
'register': ['0x10'],
'indirect': ['0x20'],
},
'ST': {
'register': ['0x18'],
'indirect': ['0x28'],
},
'LDD': {
'indirect': ['0x30'],
},
'STD': {
'indirect': ['0x38'],
},
'POP': {
'indirect': ['0x40'],
},
'STP': {
'indirect': ['0x48'],
},
'ADD': {
'register': ['0x50'],
},
'SUB': {
'register': ['0x58'],
},
'MUL': {
'register': ['0x60'],
},
'DIV': {
'register': ['0x68'],
},
'AND': {
'register': ['0x70'],
},
'OR': {
'register': ['0x78'],
},
'XOR': {
'register': ['0x80'],
},
'NOT': {
'register': ['0x88'],
},
'SHL': {
'register': ['0x90'],
},
'SHR': {
'register': ['0x98'],
},
'ROL': {
'register': ['0xA0'],
},
'ROR': {
'register': ['0xA8'],
},
'POPD': {
'indirect': ['0xE0'],
},
'CPR': {
'register': ['0xE8'],
},
'INC': {
'register': ['0xF0'],
},
'DEC': {
'register': ['0xF8'],
},
}

def parse_value(val: str) -> str:
""" Try to parse a value, which can be an integer
in base 2, 8, 10, 16 or a register.

Returns: A string representation of the integer
value in base 10, or empty string if the value
cannot be converted to an integer.
On Error: return original value.
"""
# convert value to int from various bases
if isinstance(val, str):
# Get any possible prefix
val = val.strip(' ')
base = val[0:2]
if val.isnumeric():
return str(int(val, 10))
if base == '0x':
return str(int(val, 16))
elif base == '0c':
return str(int(val[2:], 8))
elif base == '0b':
return str(int(val, 2))
elif val.__contains__('R'):
return ''
else:
return val

""" Example inputs:
['HALT']                    : implicit mode
['BRA', '0x1F']             : immediate mode
[LD, R2]                    : register
[STD, @R3]                  : indirect
['SET', 'R0,', '0xFFFE']    : direct
['BYTE' '0x1f']             : immediate
['WORD' '0x3BFC']           : immediate
['STRING' 'This is a test'] : immediate
['BRC' 'end']               : offset
"""
numfields = len(fields)
if numfields == 1:
return 'implicit'
elif numfields == 2:
# if fields[0] in directives:
#     return 'immediate'
if fields[1].__contains__('@R'):
return 'indirect'
elif fields[1].__contains__('R') and fields[1].__contains__(','):
return 'direct'
elif fields[1].__contains__('R'):
return 'register'
elif parse_value(fields[1]).isnumeric():
return 'immediate'
elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
return 'offset'
elif numfields == 3 and parse_value(fields[2]).isnumeric():
return 'direct'
elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
return 'direct'
else:
raise ValueError('Expected numeric value got: {expected}'.format(expected=fields[2]))

def parse_register(field: str) -> str:
""" Examples:
R2
@R3
R2, 0x1F
"""
register = field.upper()
if register.__contains__('R'):
pos = field.find('R')
register = field[pos + 1:pos + 2]
return register

def parse_opcode(line: str, symbols) -> tuple:
""" Break the line into its constituent parts.
Locate any labels and store their definition.
Then locate the instruction's addressing mode,
and stores it in mode. Calculate the instruction's
actual opcode value.

Returns: tuple(value, register, mode, objcode) where
value is a string to be evaluated in the second pass.
register is the register value or empty string if no
register exists. "mode" is the addressing mode and
objcode is a dict containing the base opcode value,
and a list of functions needed to process the
instruction
"""
fields = line.split(None, 1)
nofields = len(fields)
if nofields > 1:
extra = fields[1].split(',')
if len(extra) > 1:
fields[1] = extra[0]
fields.extend(extra[1:])
nofields = len(fields)
mnemonic = fields[0]
register = ''
value = ''

""" Examples:
HALT
BRA 0x1F
SET R0, 0xFFFE
LD R2
STD @R3
"""
# Get register and value
if nofields > 1:
register = parse_register(fields[1])

if register == '' and nofields > 1:
value = parse_value(fields[1])

if nofields > 2:
register = parse_register(fields[1])
value = parse_value(fields[2])

# Get all addressing modes for this mnemonic
opcodemodes = opcode_table.get(mnemonic)
if not opcodemodes:
raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

# Get the address mode used in the instruction
objcode = opcodemodes.get(mode)
if not objcode:
raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

return value, register, mode, mnemonic, list(objcode)

def parse_lines(lines, symbols):
""" Determine mnemonic, register, value, and address mode"""
for lineno, line in enumerate(lines, 1):
# Handle labels
label, *colon, statement = line.rpartition(":")
try:
# parse the line into ir (intermediate representation)
data = lineno, label, parse_opcode(statement, symbols) if statement \
else (None, None, None, None, None)
yield data
except AssemblyError as e:
print("{0:4d} : Error : {1}".format(lineno, e))

def assemble(lines, lc=0):
"""Breaks the line into its constituent parts.
it then locates the instruction's addressing mode,
stores it in mode, and calculates the instruction's
actual opcode value and length. """
objcode = []
symbols = {}

# Pass 1 : Parse instructions and create intermediate code
for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
# Try to evaluate numeric labels and set the lc (location counter)
if label:
try:
lc = int(eval(label, symbols))
except (ValueError, NameError):
symbols[label] = lc

# Store the resulting object code for later
# expansion and adjust the location counter lc.
if icode:
objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
lc += len(icode)

# Second pass -- Assembly
execode = []
for lineno, pc, value, register, mode, mnemonic, icode in objcode:
# Evaluate the string
realvalue = 0
try:
symbols['pc'] = pc

if value != '':
realvalue = eval(value, symbols)

if isinstance(realvalue, str):
realvalue = ord(realvalue) & 0xff

if not isinstance(realvalue, int):
raise TypeError("Integer expected in {0}".format(value))
except Exception as e:
print("{0:4d} : Error : {1}".format(lineno, e), file=sys.stderr)
realvalue = 0

# Encode the instruction. If register is present,
# it must be added to the base opcode value stored
# in icode[0]
opcode = int(icode[0], 16)
if register != '':
opcode += int(register, 16)
encode = [op(pc, realvalue) if isinstance(op, Callable) else op for op in icode]
encode[0] = opcode

execode.append((lineno, pc, encode))
return execode

def replace_register_names(line):
""" Replace register names with 'R' + register index.
Example: ACC becomes R0. STATUS becomes R6."""
line = line.replace('ACC', 'R0')
line = line.replace('RETSTACK', 'R4')
line = line.replace('COMP', 'R5')
line = line.replace('STATUS', 'R6')
line = line.replace('PC', 'R7')
return line

def strip_lines(lines):
""" Takes a sequence of lines and strips comments and blank lines."""
for line in lines:
comment_index = line.find(";")
if comment_index >= 0:
line = line[:comment_index]
line = line.strip()
line = replace_register_names(line)
yield line

if __name__ == '__main__':
text = ';This is a comment\n' \
'start:  SET  ACC, 0xFFDE  ; This is also a comment\n' \
'end:    HALT  ; end of program'
lines = text.split('\n')

# Remove comments and blank lines
lines = strip_lines(lines)
# Assemble code
print(assemble(lines))```

Recall back in our emulator we added a routine to load a hex-file:

```"""
Load a program assembled into a hex file and store in memory.     File format is: each line contains 18 bytes. First two bytes are the beginning address of the data in the current line. The following 16 bytes are the memory values. All values are in hexadecimal. If the 0x character pair is used, it will be stripped. All values are space delimited. The file must end with a new line.
"""```

Since this is the format our emulator expects, we are going to write a function to write out our data into this format:

```
def write_rom_code(execode):
    last_pc = 0
    rom = []

    for lineno, pc, ecode in execode:
        if last_pc > pc:
            # This instruction overlaps bytes already written; skip it
            continue
        # Fill any gap before this instruction with zero bytes
        while last_pc < pc:
            rom.append('0x00')
            last_pc += 1
        for byte in ecode:
            byte = '0x%.2x' % int(byte)
            rom.append(byte)
            last_pc += 1

    # Pad with zero bytes up to the next 16-byte boundary
    while last_pc % 16 != 0:
        rom.append('0x00')
        last_pc += 1

    return rom
```

As you can see, the process is pretty straightforward. The code writes the instruction bytes from the execode parameter and, if they do not end on a 16-byte boundary, it writes 0x00 from the end of the code to the next 16-byte boundary. This is accomplished in the while loop near the end of the function. Also note that we convert the integer values to hexadecimal strings. This is typical and comes in handy when transferring programs across a serial or JTAG channel to load them into hardware.
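As an aside, the number of padding bytes can also be computed directly instead of counting up in a loop. This is only an alternative sketch; `pad_to_boundary` is a hypothetical helper and is not used by the assembler.

```python
def pad_to_boundary(rom, boundary=16, fill='0x00'):
    """Return a copy of rom padded with fill up to the next multiple of boundary."""
    padding = -len(rom) % boundary    # 0 when the length is already aligned
    return rom + [fill] * padding

rom = ['0x08', '0xde', '0xff', '0x00']
padded = pad_to_boundary(rom)
print(len(padded))    # 16
```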

Add this function just above the final if statement (`if __name__ == '__main__':`) and then change that if statement to read as follows:

```
if __name__ == '__main__':
    text = ';This is a comment\n' \
           'start:  SET  ACC, 0xFFDE  ; This is also a comment\n' \
           'end:    HALT  ; end of program'
    lines = text.split('\n')

    # Remove comments and blank lines
    lines = strip_lines(lines)
    # Assemble
    code = assemble(lines)
    # Convert to hex file format
    print(write_rom_code(code))
```

Now if we run the program we get a list of ROM values that always end on a 16 byte boundary. Note that this raw ROM format does not include the leading two byte address values:

`['0x08', '0xde', '0xff', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00']`

Now that we have a raw ROM format, we need a function to convert it to our hex file format. Since the function takes its file names from the command line and builds the output path, we will need the sys and os modules.

```
def rom2hexfile(rom):
    col = 0
    row = 0
    max_cols = 16
    max_rows = int(len(rom) / max_cols)
    idx = 0

    # Generate output filename
    # If none given use input filename's
    # path and base name with a '.hex'
    # extension appended.
    if len(sys.argv) > 2:
        # we should have an outfile name
        filename = sys.argv[2]
    else:
        head, tail = os.path.split(sys.argv[1])
        parts = tail.split('.')
        filename = os.path.join(head, parts[0] + '.hex')

    with open(filename, 'w') as ofh:
        for b in rom:
            if col == 0:
                # Each row starts with its address, low byte first
                ofh.write('{:02x} {:02x} '.format((idx & 0xff), (idx & 0xff00) >> 8))
                ofh.write('{:02x} '.format(int(b, 16)))
            else:
                ofh.write('{:02x} '.format(int(b, 16)))
            col += 1
            if col >= max_cols:
                row += 1
                col = 0
                ofh.write('\n')
            if row >= max_rows:
                ofh.write('\n')
                break
            idx += 1
```

The code above writes the hex file to a file with the same name as the asm file we load. Well, actually, we haven’t loaded an asm file yet. We’ve been passing our assembly code as a string to our assemble function. The above code depends on a main() function we’ve yet to write and filename values passed as command-line arguments. So let’s see our main function:

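```
def main():
    if len(sys.argv) < 2:
        print('usage: {0} infile.asm [outfile.hex]'.format(sys.argv[0]), file=sys.stderr)
        return

    filename = sys.argv[1]

    with open(filename) as source:
        # Remove comments and blank lines
        lines = strip_lines(source)
        rom = write_rom_code(assemble(lines))

    rom2hexfile(rom)
    print(rom)
```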

Now that we are accepting assembly language files from the command line, we need a simple assembly language file. Below is a simple assembly program to add two numbers. Save it in a file named add.asm (all our assembly language files will have a file extension of “.asm”) in the same folder as the assembler.

```
;
; On Exit:
; Save in a file named add.asm
;
start:      SET R1, 0x00fe      ; initialize pointers.
            SET R0, 0x01        ; load ACC with 1
            ADD R1              ; ACC gets 0xfe + 0x01 = 0xff
end:        HALT                ; STOP Execution
```

Now we need to change our “if name equals main” statement to call our main() function:

```
if __name__ == '__main__':
    import sys, os
    main()
```

Now if you run this code from the command line, you should get the following output:

```python3 assembler_03.py add.asm
['0x09', '0xfe', '0x00', '0x08', '0x01', '0x00', '0x51', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00', '0x00']```

The assembler also writes the file add.hex, which contains the same bytes on a single line:

`00 00 09 fe 00 08 01 00 51 00 00 00 00 00 00 00 00 00 `

As you can see we added the two address bytes to the rom format. This hex file should now be loadable into the emulator.
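To convince ourselves that the line really follows the format described earlier, we can parse it back. The `parse_hex_line` function below is a hypothetical helper written for this check; it is not the emulator's actual loader, which is not shown here.

```python
def parse_hex_line(line):
    """Parse one hex-file line: two address bytes (low byte first), then the data bytes."""
    # int(tok, 16) also accepts an optional 0x prefix
    fields = [int(tok, 16) for tok in line.split()]
    address = fields[0] | (fields[1] << 8)
    return address, fields[2:]

address, data = parse_hex_line('00 00 09 fe 00 08 01 00 51 00 00 00 00 00 00 00 00 00 ')
print(address, len(data), data[:7])
# 0 16 [9, 254, 0, 8, 1, 0, 81]
```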

Now let’s see it all put together for completeness:

```"""Sweet16-GP Assembler"""
import sys
from collections.abc import Callable

# Exception used for errors
class AssemblyError(Exception):
pass

# Functions used in the creation of object code (used in the table below)
def VALUE_L(pc: int, value):
val = value_to_int(value) & 0xff
return str(val & 0xff)

def VALUE_H(pc: int, value: int) -> int:
val = (value_to_int(value) & 0xff00) >> 8
return str(val & 0xff)

def LABEL_L(pc, value):
return (value) & 0xff

def LABEL_H(pc, value):
return ((value) & 0xff00) >> 8

def OFFSET(pc, value):
return ((value - pc)) & 0xff

def value_to_int(value):
# convert value to int from various bases
if isinstance(value, str):
if value[0:2].lower() == '0x':
return int(value, 16)
elif value[0:2].lower() == '0c':
return int(value[2:], 8)
elif value[0:2].lower() == '0b':
return int(value, 2)
else:
print('Invalid value.')
elif isinstance(value, int):
return value
# else:
#     raise ValueError("Expected numerical value, got: {0}".format(value))

def register_to_int(reg):
if isinstance(reg, str):
if reg.startswith('@R'):
return int(reg[2:], 16)
elif reg.startswith('R') and reg.endswith(','):
return int(reg[1:-1], 16)
elif reg.startswith('R'):
return int(reg[1:])
else:
raise ValueError("Cannot parse the register: '{0}'".format(reg))

opcode_table = {
# TODO: Get DATA implemented

'HALT': {
'implicit': ['0x00'],
},
'BRA': {
'immediate': ['0x01', VALUE_L],
'offset': ['0x01', OFFSET],
},
'BRC': {
'immediate': ['0x02', VALUE_L],
'offset': ['0x02', OFFSET],
},
'BRZ': {
'immediate': ['0x03', VALUE_L],
'offset': ['0x03', OFFSET],
},
'BRN': {
'immediate': ['0x04', VALUE_L],
'offset': ['0x04', OFFSET],
},
'BRV': {
'immediate': ['0x05', VALUE_L],
'offset': ['0x05', OFFSET],
},
'BSR': {
'immediate': ['0x06', VALUE_L, VALUE_H],
'offset': ['0x06', LABEL_L, LABEL_H],
},
'RTS': {
'implicit': ['0x07'],
},

'SET': {
'direct': ['0x08', VALUE_L, VALUE_H],
},
'LD': {
'register': ['0x10'],
'indirect': ['0x20'],
},
'ST': {
'register': ['0x18'],
'indirect': ['0x28'],
},
'LDD': {
'indirect': ['0x30'],
},
'STD': {
'indirect': ['0x38'],
},
'POP': {
'indirect': ['0x40'],
},
'STP': {
'indirect': ['0x48'],
},
'ADD': {
'register': ['0x50'],
},
'SUB': {
'register': ['0x58'],
},
'MUL': {
'register': ['0x60'],
},
'DIV': {
'register': ['0x68'],
},
'AND': {
'register': ['0x70'],
},
'OR': {
'register': ['0x78'],
},
'XOR': {
'register': ['0x80'],
},
'NOT': {
'register': ['0x88'],
},
'SHL': {
'register': ['0x90'],
},
'SHR': {
'register': ['0x98'],
},
'ROL': {
'register': ['0xA0'],
},
'ROR': {
'register': ['0xA8'],
},
'POPD': {
'indirect': ['0xE0'],
},
'CPR': {
'register': ['0xE8'],
},
'INC': {
'register': ['0xF0'],
},
'DEC': {
'register': ['0xF8'],
},
}

def parse_value(val: str) -> str:
""" Try to parse a value, which can be an integer
in base 2, 8, 10, 16 or a register.

Returns: A string representation of the integer
value in base 10, or empty string if the value
cannot be converted to an integer.
On Error: return original value.
"""
# convert value to int from various bases
if isinstance(val, str):
# Get any possible prefix
val = val.strip(' ')
base = val[0:2]
if val.isnumeric():
return str(int(val, 10))
if base == '0x':
return str(int(val, 16))
elif base == '0c':
return str(int(val[2:], 8))
elif base == '0b':
return str(int(val, 2))
elif val.__contains__('R'):
return ''
else:
return val

""" Example inputs:
['HALT']                    : implicit mode
['BRA', '0x1F']             : immediate mode
[LD, R2]                    : register
[STD, @R3]                  : indirect
['SET', 'R0,', '0xFFFE']    : direct
['BYTE' '0x1f']             : immediate
['WORD' '0x3BFC']           : immediate
['STRING' 'This is a test'] : immediate
['BRC' 'end']               : offset
"""
numfields = len(fields)
if numfields == 1:
return 'implicit'
elif numfields == 2:
# if fields[0] in directives:
#     return 'immediate'
if fields[1].__contains__('@R'):
return 'indirect'
elif fields[1].__contains__('R') and fields[1].__contains__(','):
return 'direct'
elif fields[1].__contains__('R'):
return 'register'
elif parse_value(fields[1]).isnumeric():
return 'immediate'
elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
return 'offset'
elif numfields == 3 and parse_value(fields[2]).isnumeric():
return 'direct'
elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
return 'direct'
else:
raise ValueError('Expected numeric value got: {expected}'.format(expected=fields[2]))

def parse_register(field: str) -> str:
""" Examples:
R2
@R3
R2, 0x1F
"""
register = field.upper()
if register.__contains__('R'):
pos = field.find('R')
register = field[pos + 1:pos + 2]
return register

def parse_opcode(line: str, symbols) -> tuple:
""" Break the line into its constituent parts.
Locate any labels and store their definition.
Then locate the instruction's addressing mode,
and stores it in mode. Calculate the instruction's
actual opcode value.

Returns: tuple(value, register, mode, objcode) where
value is a string to be evaluated in the second pass.
register is the register value or empty string if no
register exists. "mode" is the addressing mode and
objcode is a dict containing the base opcode value,
and a list of functions needed to process the
instruction
"""
fields = line.split(None, 1)
nofields = len(fields)
if nofields > 1:
extra = fields[1].split(',')
if len(extra) > 1:
fields[1] = extra[0]
fields.extend(extra[1:])
nofields = len(fields)
mnemonic = fields[0]
register = ''
value = ''

""" Examples:
HALT
BRA 0x1F
SET R0, 0xFFFE
LD R2
STD @R3
"""
# Get register and value
if nofields > 1:
register = parse_register(fields[1])

if register == '' and nofields > 1:
value = parse_value(fields[1])

if nofields > 2:
register = parse_register(fields[1])
value = parse_value(fields[2])

# Get all addressing modes for this mnemonic
opcodemodes = opcode_table.get(mnemonic)
if not opcodemodes:
raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

# Get the address mode used in the instruction
objcode = opcodemodes.get(mode)
if not objcode:
raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

return value, register, mode, mnemonic, list(objcode)

def parse_lines(lines, symbols):
""" Determine mnemonic, register, value, and address mode"""
for lineno, line in enumerate(lines, 1):
# Handle labels
label, *colon, statement = line.rpartition(":")
try:
# parse the line into ir (intermediate representation)
data = lineno, label, parse_opcode(statement, symbols) if statement \
else (None, None, None, None, None)
yield data
except AssemblyError as e:
print("{0:4d} : Error : {1}".format(lineno, e))

def assemble(lines, lc=0):
"""Breaks the line into its constituent parts.
it then locates the instruction's addressing mode,
stores it in mode, and calculates the instruction's
actual opcode value and length. """
objcode = []
symbols = {}

# Pass 1 : Parse instructions and create intermediate code
for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
# Try to evaluate numeric labels and set the lc (location counter)
if label:
try:
lc = int(eval(label, symbols))
except (ValueError, NameError):
symbols[label] = lc

# Store the resulting object code for later
# expansion and adjust the location counter lc.
if icode:
objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
lc += len(icode)

# Second pass -- Assembly
execode = []
for lineno, pc, value, register, mode, mnemonic, icode in objcode:
# Evaluate the string
realvalue = 0
try:
symbols['pc'] = pc

if value != '':
realvalue = eval(value, symbols)

if isinstance(realvalue, str):
realvalue = ord(realvalue) & 0xff

if not isinstance(realvalue, int):
raise TypeError("Integer expected in {0}".format(value))
except Exception as e:
print("{0:4d} : Error : {1}".format(lineno, e), file=sys.stderr)
realvalue = 0

# Encode the instruction. If register is present,
# it must be added to the base opcode value stored
# in icode[0]
opcode = int(icode[0], 16)
if register != '':
opcode += int(register, 16)
encode = [op(pc, realvalue) if isinstance(op, Callable) else op for op in icode]
encode[0] = opcode

execode.append((lineno, pc, encode))
return execode

def replace_register_names(line):
""" Replace register names with 'R' + register index.
Example: ACC becomes R0. STATUS becomes R6."""
line = line.replace('ACC', 'R0')
line = line.replace('RETSTACK', 'R4')
line = line.replace('COMP', 'R5')
line = line.replace('STATUS', 'R6')
line = line.replace('PC', 'R7')
return line

def strip_lines(lines):
""" Takes a sequence of lines and strips comments and blank lines."""
for line in lines:
comment_index = line.find(";")
if comment_index >= 0:
line = line[:comment_index]
line = line.strip()
line = replace_register_names(line)
yield line

def write_rom_code(execode):
last_pc = 0
rom = []

for lineno, pc, ecode in execode:
if last_pc > pc:
# This instruction overlaps bytes already written; skip it
continue
# Fill any gap before this instruction with zero bytes
while last_pc < pc:
rom.append('0x00')
last_pc += 1
for byte in ecode:
byte = '0x%.2x' % int(byte)
rom.append(byte)
last_pc += 1

while last_pc % 16 != 0:
rom.append('0x00')
last_pc += 1

return rom

def rom2hexfile(rom):
col = 0
row = 0

max_cols = 16
max_rows = int(len(rom) / max_cols)
idx = 0
filename = 'a.hex'

# Generate output filename
# If none given use input filename's
# path and base name with a '.hex'
# extension appended.
if len(sys.argv) > 2:
# we should have an outfile name
filename = sys.argv[2]
else:
head, tail = os.path.split(sys.argv[1])
parts = tail.split('.')
filename = os.path.join(head, parts[0] + '.hex')

with open(filename, 'w') as ofh:
for b in rom:
if col == 0:
ofh.write('{:02x} {:02x} '.format((idx & 0xff), (idx & 0xff00) >> 8))
ofh.write('{:02x} '.format(int(b, 16)))
else:
ofh.write('{:02x} '.format(int(b, 16)))
col += 1
if col >= max_cols:
row += 1
col = 0
ofh.write('\n')
if row >= max_rows:
ofh.write('\n')
break;
idx += 1
ofh.close()

def main():

if len(sys.argv) >= 2:
filename = sys.argv[1]

# Remove comments and blank lines
lines = strip_lines(open(filename))

if 1:
rom = write_rom_code(assemble(lines))
rom2hexfile(rom)
print(rom)

if __name__ == '__main__':
import sys, os
main()```

OK, that’s it for today. We now have a working assembler. However, there are a few features I would like to add. So next time we will add a few simple assembler directives and complete the assembler. Then we will develop a few simple assembly language programs for the Sweet16-GP.

Until next time, Keep coding!

## Sweet16-GP Assembler – Part 2

This entry is part 7 of 8 in the series Sweet16-GP CPU: A Complete Development Cycle

Last time we left our assembler with only two functions: the strip_lines() function, which removes comments and blank lines, and the rename_registers() function, which replaces the registers’ friendly names with the canonical names, i.e. Rx where x is a number between 0 and 7 (the register’s index value into the register file).

With these two pre-processing functions out of the way we are ready to start processing our assembly code. We will use a two-pass method of assembly. The first pass will break each line into its constituent parts, locate the instruction’s addressing mode, store it in a field called mode, and calculate the instruction’s actual opcode value and length. In addition it will store any labels in a symbol table to record their values for later use in the second pass of the assembler.

Our main assembler method will take our clean lines, one at a time, and parse them as described above. We will need a table of instructions that includes details about each instruction, including its addressing mode(s), opcode(s) and length. You may be wondering why the “(s)” above. This is because some of our instructions have multiple addressing modes and therefore multiple opcodes and perhaps even multiple instruction lengths. We will use our table to disambiguate the instruction format into the proper values.

We will implement our opcode_table as a dict. Each entry will be keyed by the instruction mnemonic. The value for each key in the table will be a dict keyed by the addressing mode. Each addressing mode will then have a list of values beginning with the instruction’s opcode. If the opcode requires a one- or two-byte value, the bytes are provided as VALUE_L and VALUE_H in the list. This description sounds much more complex than it really is. Refer to the code for our table below and re-read the above paragraph. It should all become clear.

```python
opcode_table = {
'HALT': {
'implicit': ['0x00'],
},
'BRA': {
'immediate': ['0x01', VALUE_L],
'offset': ['0x01', OFFSET],
},
'BRC': {
'immediate': ['0x02', VALUE_L],
'offset': ['0x02', OFFSET],
},
'BRZ': {
'immediate': ['0x03', VALUE_L],
'offset': ['0x03', OFFSET],
},
'BRN': {
'immediate': ['0x04', VALUE_L],
'offset': ['0x04', OFFSET],
},
'BRV': {
'immediate': ['0x05', VALUE_L],
'offset': ['0x05', OFFSET],
},
'BSR': {
'immediate': ['0x06', VALUE_L, VALUE_H],
'offset': ['0x06', LABEL_L, LABEL_H],
},
'RTS': {
'implicit': ['0x07'],
},

'SET': {
'direct': ['0x08', VALUE_L, VALUE_H],
},
'LD': {
'register': ['0x10'],
'indirect': ['0x20'],
},
'ST': {
'register': ['0x18'],
'indirect': ['0x28'],
},
'LDD': {
'indirect': ['0x30'],
},
'STD': {
'indirect': ['0x38'],
},
'POP': {
'indirect': ['0x40'],
},
'STP': {
'indirect': ['0x48'],
},
'ADD': {
'register': ['0x50'],
},
'SUB': {
'register': ['0x58'],
},
'MUL': {
'register': ['0x60'],
},
'DIV': {
'register': ['0x68'],
},
'AND': {
'register': ['0x70'],
},
'OR': {
'register': ['0x78'],
},
'XOR': {
'register': ['0x80'],
},
'NOT': {
'register': ['0x88'],
},
'SHL': {
'register': ['0x90'],
},
'SHR': {
'register': ['0x98'],
},
'ROL': {
'register': ['0xA0'],
},
'ROR': {
'register': ['0xA8'],
},
'POPD': {
'indirect': ['0xE0'],
},
'CPR': {
'register': ['0xE8'],
},
'INC': {
'register': ['0xF0'],
},
'DEC': {
'register': ['0xF8'],
},
}
```

As you can see, our instructions fall into four basic addressing modes. These are implicit (when the instruction itself indicates the addressing mode), immediate (when the value needed by the instruction directly follows it), indirect (when the register specified contains the value needed to complete the instruction), and register (when the instruction operates directly on the register specified). The table also uses two more keys, direct (for SET, where a register is followed by a value) and offset (for branches to a label). To tell the truth, these could be broken down even further, but that would only complicate things.
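To make the table layout concrete, here is a minimal lookup sketch. The `sample_table` excerpt and the `lookup()` helper are hypothetical names used only for illustration, and plain strings stand in for the VALUE_L and VALUE_H helper functions.

```python
# Hypothetical two-entry excerpt of the opcode table. Plain strings stand in
# for the VALUE_L / VALUE_H helper functions used in the real table.
sample_table = {
    'HALT': {'implicit': ['0x00']},
    'SET': {'direct': ['0x08', 'VALUE_L', 'VALUE_H']},
}

def lookup(mnemonic, mode):
    """Return (opcode, length in bytes) for a mnemonic/addressing-mode pair."""
    modes = sample_table.get(mnemonic)
    if modes is None:
        raise KeyError("Unknown opcode '{}'".format(mnemonic))
    template = modes.get(mode)
    if template is None:
        raise KeyError("Invalid addressing mode '{}' for {}".format(mode, mnemonic))
    # One list element per byte the instruction occupies in memory
    return template[0], len(template)

print(lookup('HALT', 'implicit'))  # ('0x00', 1)
print(lookup('SET', 'direct'))     # ('0x08', 3)
```

Because every list has one element per byte, the length of the list doubles as the instruction length, which is what the location counter relies on later.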

OK, now that we have our opcode_table, we are going to start writing our parser. Our parser is a little bit unorthodox. Since our input language is so constrained, we can forego all the formal theory and much of the discipline of parser development.

I do believe, however, that any developer worth their salt should write at least a simple parser at some point. Understanding even a very simple parser expands your understanding of how your tools work, whether you’re a web developer, AI master, ETL programmer, or Embedded Systems developer. If you call yourself a Software Engineer or Programmer and you have not written at least a simple parser, compiler, or interpreter, then you owe it to yourself to do so! To get you started I recommend Dr. Jack Crenshaw’s “Let’s Build A Compiler” and Ruslan’s “Let’s Build A Simple Interpreter” blog series. If you’re a seasoned developer, check out the book “Language Implementation Patterns”. For those mathematically oriented, check out the Dragon Book.

So what I am going to do is simply plow forward. We will tackle one issue at a time and add functions to handle each issue as it arises. Note that this technique won’t work if you’re trying to write a high-level language compiler or interpreter. There you will need all the formal methods and a good design before writing code. What we are going for here is a quick and simple bootstrap assembler. We can write better tools later. For now, we just need something to get us up and running.

OK, the first thing we are going to need is some set of functions to break up the line of source code we feed to our assembler into its various fields. We will create an entry point function called assemble which will do our heavy lifting by calling other functions.

We will need to produce some data about the current source line. This will include any label, register, addressing mode, and value. Additionally, we may also want to keep track of the original instruction mnemonic and any intermediate code we produce. So we will need a supporting function to handle this. We will also need some storage for any labels or symbols we encounter during parsing.

The code below is our entry point. It will be responsible for handling our two passes over our code and will call the functions that do the heavy lifting.

```python
def assemble(lines, lc=0):
    """Break each line into its constituent parts,
    locate the instruction's addressing mode, store it in mode,
    and calculate the instruction's actual opcode value and length."""
    objcode = []
    symbols = {}

    # Pass 1 : Parse instructions and create intermediate code
    for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
        # Try to evaluate numeric labels and set the lc (location counter)
        if label:
            try:
                lc = int(eval(label, symbols))
            except (ValueError, NameError):
                symbols[label] = lc

        # Store the resulting object code for later expansion
        # and advance the location counter by the instruction length
        if icode:
            objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
            lc += len(icode)
    return objcode
```
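The location counter is the heart of the first pass, so here is a stripped-down sketch of just that bookkeeping. The `program` list and the `assign_addresses()` helper are hypothetical, with strings standing in for the table's helper functions.

```python
# Hypothetical intermediate code: (mnemonic, icode) pairs,
# where icode has one list element per byte of the instruction.
program = [
    ('SET', ['0x08', 'VALUE_L', 'VALUE_H']),
    ('HALT', ['0x00']),
]

def assign_addresses(program, lc=0):
    """Pair each instruction with the address it will occupy."""
    placed = []
    for mnemonic, icode in program:
        placed.append((lc, mnemonic))
        lc += len(icode)  # advance by the instruction's length
    return placed

print(assign_addresses(program))  # [(0, 'SET'), (3, 'HALT')]
```

The three-byte SET lands at address 0 and pushes HALT to address 3.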

Note the call to parse_lines(); we will write this function next. But let’s step back and think about what we need it to do.

Given a line of code like:

`start:    SET   R1, 0xFA`

We can see we need to break up this line into its smaller parts. Luckily, this is pretty easy to do. Also, since each line will follow a similar pattern for the fields, we can easily deduce what each field contains.

```python
# Exception used for errors
class AssemblyError(Exception):
    pass

def parse_lines(lines, symbols):
    """ Determine line number, mnemonic, register, value, and address mode"""
    for lineno, line in enumerate(lines, 1):
        # Handle labels
        label, *colon, statement = line.rpartition(":")
        try:
            # parse the line into ir (intermediate representation)
            data = lineno, label, parse_opcode(statement, symbols) if statement else (None, None, None, None, None)
            yield data
        except AssemblyError as e:
            print("{0:4d} : Error : {1}".format(lineno, e))
```

Referring to the code above, we use the enumerate() function to keep track of our line numbers. One drawback of this approach is that the numbers are only correct if strip_lines() hands us one line for every line of the original source. If comment-only and blank lines were dropped entirely, they wouldn’t be counted, which is something we could fix by tracking line numbers in the strip_lines() function. But for now, we will do it this way.
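One way to keep the numbers tied to the original file is to number the lines before anything is stripped. The following is only a sketch of that idea. `strip_lines_numbered()` is a hypothetical variant, not part of the assembler as written.

```python
def strip_lines_numbered(lines):
    """Hypothetical variant of strip_lines(): yields (lineno, text) pairs
    and skips lines that are empty once comments are removed."""
    for lineno, line in enumerate(lines, 1):
        comment_index = line.find(';')
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        if line:
            yield lineno, line

source = [
    '; This is a comment',
    '',
    'start:  SET  R0, 0xFFDE  ; load the accumulator',
    'end:    HALT',
]
for lineno, text in strip_lines_numbered(source):
    print(lineno, text)
```

With this input the two instructions are reported as lines 3 and 4, matching their positions in the original source.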

Enumerate returns an integer value for the line number (lineno) and the line of source code. Next, we use rpartition(“:”) on the source line to break it into three parts. These parts are returned in a tuple containing the part before the separator (a colon in our case, which ends a label declaration), the separator, and the part after the separator. If the separator is not found, the first two parts are empty strings. So if our line does not contain a colon, we will get an empty string for the label and for the separator, followed by the whole line of code, which we save in the statement variable.
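A quick experiment shows the three cases described above. The variable names here are only for illustration.

```python
# str.rpartition() always returns a 3-tuple: (before, separator, after)
with_label = 'start:    SET   R1, 0xFA'.rpartition(':')
print(with_label)   # ('start', ':', '    SET   R1, 0xFA')

# No colon in the line: the first two parts come back empty
no_label = 'HALT'.rpartition(':')
print(no_label)     # ('', '', 'HALT')

# The starred name soaks up the separator, as in parse_lines()
label, *colon, statement = 'end: HALT'.rpartition(':')
print(repr(label), colon, repr(statement))   # 'end' [':'] ' HALT'
```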

Once we have extracted any label, we need to take our statement and parse it further into its various parts. If it turns out that our statement is empty, we return a default tuple of ‘None’ values; otherwise, we call parse_opcode(), passing in our statement and our symbol table so that any symbols found in our source line can be handled. If this all goes horribly wrong, parse_opcode() raises an exception and we print the error message. We defined a simple exception class above for this purpose.

Our next task is to implement a function to parse the line into opcodes. We called this function parse_opcode() above. The implementation is shown below:

```python
def parse_opcode(line: str, symbols) -> tuple:
    """ Break the line into its constituent parts.
    Locate the instruction's addressing mode and
    store it in mode. Look up the instruction's
    actual opcode value.

    Returns: tuple(value, register, mode, mnemonic, objcode)
    where value is a string to be evaluated in the second
    pass, register is the register index or an empty string
    if no register exists, mode is the addressing mode,
    mnemonic is the instruction mnemonic, and objcode is a
    list containing the base opcode value followed by any
    functions needed to complete the instruction.
    """
    fields = line.split(None, 1)
    nofields = len(fields)
    if nofields > 1:
        extra = fields[1].split(',')
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
            nofields = len(fields)
    mnemonic = fields[0]
    register = ''
    value = ''

    # Examples:
    #   HALT
    #   BRA 0x1F
    #   SET R0, 0xFFFE
    #   LD R2
    #   STD @R3

    # Get register and value
    if nofields > 1:
        register = parse_register(fields[1])

    if register == '' and nofields > 1:
        value = parse_value(fields[1])

    if nofields > 2:
        register = parse_register(fields[1])
        value = parse_value(fields[2])

    # Work out the addressing mode from the fields
    mode = parse_address_mode(fields)

    # Get all addressing modes for this mnemonic
    opcodemodes = opcode_table.get(mnemonic)
    if not opcodemodes:
        raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

    # Get the address mode used in the instruction
    objcode = opcodemodes.get(mode)
    if not objcode:
        raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

    return value, register, mode, mnemonic, list(objcode)
```

There is a lot going on here so let’s unpack it. First, we use the split() method to split the statement into two portions. The first parameter to split() gives the separator to split on (None means any run of whitespace) and the second parameter limits the number of splits. What we are doing here is separating the instruction mnemonic from the rest of the statement. Since some instructions (e.g. HALT) consist only of the mnemonic, we need to check the number of fields we get back. If it’s only one, then we have an instruction like HALT or RTS.

However, if we have more than one field returned from the split, we need to further decompose the line. We now have a mnemonic in fields[0] and the remainder of the line in fields[1]. We know we may have something like Rx, @Rx, or Rx, 0xFA following the instruction mnemonic. So the next move is to try to separate the remaining portion of the statement on the comma separator. If a comma is found, we get one of two cases back, i.e. (register, value) or (@register, value). These are stored in the extra variable. We then add the extra fields to the fields variable and assign mnemonic the value in fields[0], which contains the instruction mnemonic.
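The two-stage split is easy to try in isolation. Below is a small sketch that mirrors the splitting logic described above. `split_fields()` is a hypothetical helper name, not a function in the assembler.

```python
def split_fields(statement):
    """Hypothetical helper mirroring the field splitting in parse_opcode()."""
    fields = statement.split(None, 1)     # mnemonic, rest of the line
    if len(fields) > 1:
        extra = fields[1].split(',')      # operands separated by a comma
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
    return fields

print(split_fields('HALT'))            # ['HALT']
print(split_fields('STD @R3'))         # ['STD', '@R3']
print(split_fields('SET R0, 0xFFFE'))  # ['SET', 'R0', ' 0xFFFE']
```

Note the leading space that survives in the last field. This is why the value parser has to strip whitespace before looking at the prefix.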

Next, using the value of ‘nofields’, which indicates what we should expect next, we sort out the remaining fields and store their values in their respective variables. In each case, we call parse functions which perform a specific sub-task.

Our first sub-task is to locate any register in the fields. At this point we know if a register exists it should be in fields[1]. So we pass this value to the parse_register() function shown below:

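```python
def parse_register(field: str) -> str:
    """ Return the register index found in the field, or an
    empty string if the field holds no register.
    Examples:
        R2
        @R3
        R2, 0x1F
    """
    register = field.strip().upper()
    pos = register.find('R')
    # Only treat 'R' as a register prefix when a digit follows it
    if pos >= 0 and register[pos + 1:pos + 2].isdigit():
        return register[pos + 1:pos + 2]
    return ''
```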

The first thing we do is convert the field to upper case. Then we look for an ‘R’ and return the character following it, which should be the register index value (a single digit between 0 and 7 inclusive).

It’s possible that we didn’t have a register. In this case, there may be a value in the fields[1] position. So we check whether the call to parse_register() returned an empty string. If it did, we try to parse a value using the function parse_value(), passing in fields[1] as a parameter. This situation occurs with the branch instructions, where the mnemonic is directly followed by the offset value for the jump.

```python
def parse_value(val: str) -> str:
    """ Try to parse a value, which can be an integer
    in base 2, 8, 10, 16 or a register.

    Returns: A string representation of the integer
    value in base 10, or an empty string if the value
    is a register. Anything else (for example a label)
    is returned unchanged.
    """
    # convert value to int from various bases
    if isinstance(val, str):
        # Get any possible prefix
        val = val.strip(' ')
        base = val[0:2]
        if val.isnumeric():
            return str(int(val, 10))
        if base == '0x':
            return str(int(val, 16))
        elif base == '0c':
            # Python's int() does not know the '0c' prefix, so drop it
            return str(int(val[2:], 8))
        elif base == '0b':
            return str(int(val, 2))
        elif 'R' in val:
            return ''
        else:
            return val
```

As you can see above, the parse_value() function must handle many types of values. It first strips any surrounding spaces, then takes the first two characters of the value and tests whether they match any of the supported numerical base indicators. If a match is found, the value is converted to an integer, then converted back into a (base 10) string and returned to the caller.
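Python’s built-in int() accepts the “0x” and “0b” prefixes when given the matching base, but it has no notion of the “0c” octal prefix used here, so that prefix has to be removed by hand. The sketch below illustrates the conversion on its own. `literal_to_int()` is a hypothetical helper written for this example.

```python
def literal_to_int(text):
    """Hypothetical helper: convert a Sweet16-GP style literal to an int."""
    text = text.strip()
    prefix, digits = text[:2], text[2:]
    if prefix == '0x':
        return int(digits, 16)
    if prefix == '0c':
        return int(digits, 8)   # int('0c47', 8) would raise ValueError
    if prefix == '0b':
        return int(digits, 2)
    return int(text, 10)

print(literal_to_int('0xFFDE'))      # 65502
print(literal_to_int('0c47'))        # 39
print(literal_to_int('0b11101011'))  # 235
print(literal_to_int(' 65535'))      # 65535
```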

Next, we call parse_address_mode(), passing in all the fields as we may need more than one to identify the addressing mode. The parse_address_mode() function can only be understood in the context of the opcode_table. So refer to the table as you grok this code:

```python
def parse_address_mode(fields: list) -> str:
    """ Example inputs:
    ['HALT']                    : implicit
    ['BRA', '0x1F']             : immediate
    ['LD', 'R2']                : register
    ['STD', '@R3']              : indirect
    ['SET', 'R0', '0xFFFE']     : direct
    ['BRC', 'end']              : offset
    """
    numfields = len(fields)
    if numfields == 1:
        return 'implicit'
    elif numfields == 2:
        # Assembler directives are not defined yet, so this check is disabled:
        # if fields[0] in directives:
        #     return 'immediate'
        if '@R' in fields[1]:
            return 'indirect'
        elif 'R' in fields[1] and ',' in fields[1]:
            return 'direct'
        elif 'R' in fields[1]:
            return 'register'
        elif parse_value(fields[1]).isnumeric():
            return 'immediate'
        elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
            return 'offset'
    elif numfields == 3 and parse_value(fields[2]).isnumeric():
        return 'direct'
    elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
        return 'direct'
    else:
        raise ValueError('Expected numeric value got: {expected}'.format(expected=fields[2]))
```

The parse_address_mode() function uses the data contained in the fields to sort out the addressing mode of the instruction and return it to the caller, parse_opcode() which then stores this value in the mode variable.

Next we use the mnemonic value to get the instruction data from the opcode_table and then use the mode variable to get the proper opcode value and instruction format.

If all has gone well, we return a tuple containing the value, register, mode, and mnemonic variables and the object code taken from the opcode_table.

At this point our first pass is complete. We have all the data necessary to assemble our instruction code, with the possible exception of any forward-declared symbol.

We have one last issue before we can try this out. Our opcode_table has function names in some of the fields. We need to implement these helper functions.

All of these functions are pretty straightforward. VALUE_H, VALUE_L, LABEL_H, and LABEL_L simply take the value from the source code and separate it into its high byte or low byte. The OFFSET function calculates the offset from the current position.

```python
# Functions used in the creation of object code (used in the opcode table)
def VALUE_L(pc: int, value) -> str:
    val = value_to_int(value) & 0xff
    return str(val & 0xff)

def VALUE_H(pc: int, value) -> str:
    val = (value_to_int(value) & 0xff00) >> 8
    return str(val & 0xff)

def LABEL_L(pc, value):
    print(f'LABEL_L PC: {pc}, value: {value}')
    return (value) & 0xff

def LABEL_H(pc, value):
    print(f'LABEL_H PC: {pc}, value: {value}')
    return ((value) & 0xff00) >> 8

def OFFSET(pc, value):
    print(f'OFFSET: {value}')
    return ((value - pc)) & 0xff

def value_to_int(value):
    """Convert an int, or a string in base 2, 8, 10 or 16, to an int."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = value.strip()
        # The base prefix is the first two characters
        prefix = value[0:2].lower()
        if prefix == '0x':
            return int(value[2:], 16)
        elif prefix == '0c':
            return int(value[2:], 8)
        elif prefix == '0b':
            return int(value[2:], 2)
        elif value.isnumeric():
            return int(value, 10)
    raise ValueError("Expected numerical value, got: {0}".format(value))

def register_to_int(reg):
    if isinstance(reg, str):
        if reg.startswith('@R'):
            return int(reg[2:], 16)
        elif reg.startswith('R') and reg.endswith(','):
            return int(reg[1:-1], 16)
        elif reg.startswith('R'):
            return int(reg[1:])
        else:
            raise ValueError("Cannot parse the register: '{0}'".format(reg))
```
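As a quick check of the byte arithmetic these helpers rely on, here is a small worked example. The function names `low_byte()`, `high_byte()`, and `rel_offset()` are hypothetical stand-ins, and the addresses in the offset example are made up for illustration.

```python
def low_byte(value):
    """Keep only bits 0-7."""
    return value & 0xff

def high_byte(value):
    """Keep bits 8-15 and shift them down into a single byte."""
    return (value & 0xff00) >> 8

def rel_offset(pc, target):
    """Distance from pc to target, wrapped into one byte."""
    return (target - pc) & 0xff

print(hex(low_byte(0xFFDE)))          # 0xde
print(hex(high_byte(0xFFDE)))         # 0xff
print(hex(rel_offset(0x10, 0x14)))    # 0x4  (forward branch)
print(hex(rel_offset(0x10, 0x0A)))    # 0xfa (backward branch, i.e. -6)
```

Masking with 0xff turns a negative distance into its 8-bit two’s complement form, which is how a backward branch fits in a single offset byte.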

OK, I think we’re ready to try this! Below is the final code. Note that I made some adjustments to the code from the last installment so we could test what we’ve done so far.

```python
"""Sweet16-GP Assembler"""

# Exception used for errors
class AssemblyError(Exception):
    pass

# Functions used in the creation of object code (used in the table below)
def VALUE_L(pc: int, value) -> str:
    val = value_to_int(value) & 0xff
    return str(val & 0xff)

def VALUE_H(pc: int, value) -> str:
    val = (value_to_int(value) & 0xff00) >> 8
    return str(val & 0xff)

def LABEL_L(pc, value):
    return (value) & 0xff

def LABEL_H(pc, value):
    return ((value) & 0xff00) >> 8

def OFFSET(pc, value):
    return ((value - pc)) & 0xff

def value_to_int(value):
    """Convert an int, or a string in base 2, 8, 10 or 16, to an int."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = value.strip()
        # The base prefix is the first two characters
        prefix = value[0:2].lower()
        if prefix == '0x':
            return int(value[2:], 16)
        elif prefix == '0c':
            return int(value[2:], 8)
        elif prefix == '0b':
            return int(value[2:], 2)
        elif value.isnumeric():
            return int(value, 10)
    raise ValueError("Expected numerical value, got: {0}".format(value))

def register_to_int(reg):
    if isinstance(reg, str):
        if reg.startswith('@R'):
            return int(reg[2:], 16)
        elif reg.startswith('R') and reg.endswith(','):
            return int(reg[1:-1], 16)
        elif reg.startswith('R'):
            return int(reg[1:])
        else:
            raise ValueError("Cannot parse the register: '{0}'".format(reg))

opcode_table = {
'HALT': {
'implicit': ['0x00'],
},
'BRA': {
'immediate': ['0x01', VALUE_L],
'offset': ['0x01', OFFSET],
},
'BRC': {
'immediate': ['0x02', VALUE_L],
'offset': ['0x02', OFFSET],
},
'BRZ': {
'immediate': ['0x03', VALUE_L],
'offset': ['0x03', OFFSET],
},
'BRN': {
'immediate': ['0x04', VALUE_L],
'offset': ['0x04', OFFSET],
},
'BRV': {
'immediate': ['0x05', VALUE_L],
'offset': ['0x05', OFFSET],
},
'BSR': {
'immediate': ['0x06', VALUE_L, VALUE_H],
'offset': ['0x06', LABEL_L, LABEL_H],
},
'RTS': {
'implicit': ['0x07'],
},

'SET': {
'direct': ['0x08', VALUE_L, VALUE_H],
},
'LD': {
'register': ['0x10'],
'indirect': ['0x20'],
},
'ST': {
'register': ['0x18'],
'indirect': ['0x28'],
},
'LDD': {
'indirect': ['0x30'],
},
'STD': {
'indirect': ['0x38'],
},
'POP': {
'indirect': ['0x40'],
},
'STP': {
'indirect': ['0x48'],
},
'ADD': {
'register': ['0x50'],
},
'SUB': {
'register': ['0x58'],
},
'MUL': {
'register': ['0x60'],
},
'DIV': {
'register': ['0x68'],
},
'AND': {
'register': ['0x70'],
},
'OR': {
'register': ['0x78'],
},
'XOR': {
'register': ['0x80'],
},
'NOT': {
'register': ['0x88'],
},
'SHL': {
'register': ['0x90'],
},
'SHR': {
'register': ['0x98'],
},
'ROL': {
'register': ['0xA0'],
},
'ROR': {
'register': ['0xA8'],
},
'POPD': {
'indirect': ['0xE0'],
},
'CPR': {
'register': ['0xE8'],
},
'INC': {
'register': ['0xF0'],
},
'DEC': {
'register': ['0xF8'],
},
}

def parse_value(val: str) -> str:
    """ Try to parse a value, which can be an integer
    in base 2, 8, 10, 16 or a register.

    Returns: A string representation of the integer
    value in base 10, or an empty string if the value
    is a register. Anything else (for example a label)
    is returned unchanged.
    """
    # convert value to int from various bases
    if isinstance(val, str):
        # Get any possible prefix
        val = val.strip(' ')
        base = val[0:2]
        if val.isnumeric():
            return str(int(val, 10))
        if base == '0x':
            return str(int(val, 16))
        elif base == '0c':
            # Python's int() does not know the '0c' prefix, so drop it
            return str(int(val[2:], 8))
        elif base == '0b':
            return str(int(val, 2))
        elif 'R' in val:
            return ''
        else:
            return val

def parse_address_mode(fields: list) -> str:
    """ Example inputs:
    ['HALT']                     : implicit
    ['BRA', '0x1F']              : immediate
    ['LD', 'R2']                 : register
    ['STD', '@R3']               : indirect
    ['SET', 'R0', '0xFFFE']      : direct
    ['BYTE', '0x1f']             : immediate
    ['WORD', '0x3BFC']           : immediate
    ['STRING', 'This is a test'] : immediate
    ['BRC', 'end']               : offset
    """
    numfields = len(fields)
    if numfields == 1:
        return 'implicit'
    elif numfields == 2:
        # Assembler directives are not defined yet, so this check is disabled:
        # if fields[0] in directives:
        #     return 'immediate'
        if '@R' in fields[1]:
            return 'indirect'
        elif 'R' in fields[1] and ',' in fields[1]:
            return 'direct'
        elif 'R' in fields[1]:
            return 'register'
        elif parse_value(fields[1]).isnumeric():
            return 'immediate'
        elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
            return 'offset'
    elif numfields == 3 and parse_value(fields[2]).isnumeric():
        return 'direct'
    elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
        return 'direct'
    else:
        raise ValueError('Expected numeric value got: {expected}'.format(expected=fields[2]))

def parse_register(field: str) -> str:
    """ Return the register index found in the field, or an
    empty string if the field holds no register.
    Examples:
        R2
        @R3
        R2, 0x1F
    """
    register = field.strip().upper()
    pos = register.find('R')
    # Only treat 'R' as a register prefix when a digit follows it
    if pos >= 0 and register[pos + 1:pos + 2].isdigit():
        return register[pos + 1:pos + 2]
    return ''

def parse_opcode(line: str, symbols) -> tuple:
    """ Break the line into its constituent parts.
    Locate the instruction's addressing mode and
    store it in mode. Look up the instruction's
    actual opcode value.

    Returns: tuple(value, register, mode, mnemonic, objcode)
    where value is a string to be evaluated in the second
    pass, register is the register index or an empty string
    if no register exists, mode is the addressing mode,
    mnemonic is the instruction mnemonic, and objcode is a
    list containing the base opcode value followed by any
    functions needed to complete the instruction.
    """
    fields = line.split(None, 1)
    nofields = len(fields)
    if nofields > 1:
        extra = fields[1].split(',')
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
            nofields = len(fields)
    mnemonic = fields[0]
    register = ''
    value = ''

    # Examples:
    #   HALT
    #   BRA 0x1F
    #   SET R0, 0xFFFE
    #   LD R2
    #   STD @R3

    # Get register and value
    if nofields > 1:
        register = parse_register(fields[1])

    if register == '' and nofields > 1:
        value = parse_value(fields[1])

    if nofields > 2:
        register = parse_register(fields[1])
        value = parse_value(fields[2])

    # Work out the addressing mode from the fields
    mode = parse_address_mode(fields)

    # Get all addressing modes for this mnemonic
    opcodemodes = opcode_table.get(mnemonic)
    if not opcodemodes:
        raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

    # Get the address mode used in the instruction
    objcode = opcodemodes.get(mode)
    if not objcode:
        raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

    return value, register, mode, mnemonic, list(objcode)

def parse_lines(lines, symbols):
    """ Determine mnemonic, register, value, and address mode"""
    for lineno, line in enumerate(lines, 1):
        # Handle labels
        label, *colon, statement = line.rpartition(":")
        try:
            # parse the line into ir (intermediate representation)
            data = lineno, label, parse_opcode(statement, symbols) if statement \
                else (None, None, None, None, None)
            yield data
        except AssemblyError as e:
            print("{0:4d} : Error : {1}".format(lineno, e))

def assemble(lines, lc=0):
    """Break each line into its constituent parts,
    locate the instruction's addressing mode, store it in mode,
    and calculate the instruction's actual opcode value and length."""
    objcode = []
    symbols = {}

    # Pass 1 : Parse instructions and create intermediate code
    for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
        # Try to evaluate numeric labels and set the lc (location counter)
        if label:
            try:
                lc = int(eval(label, symbols))
            except (ValueError, NameError):
                symbols[label] = lc

        # Store the resulting object code for later
        # expansion and adjust the location counter lc.
        if icode:
            objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
            lc += len(icode)
    return objcode

def replace_register_names(line):
    """ Replace register names with 'R' + register index.
    Example: ACC becomes R0. STATUS becomes R6."""
    line = line.replace('ACC', 'R0')
    line = line.replace('RETSTACK', 'R4')
    line = line.replace('COMP', 'R5')
    line = line.replace('STATUS', 'R6')
    line = line.replace('PC', 'R7')
    return line

def strip_lines(lines):
    """ Takes a sequence of lines, strips comments and surrounding
    whitespace, and replaces register names. Comment-only and blank
    lines are yielded as empty strings."""
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        line = replace_register_names(line)
        yield line

if __name__ == '__main__':
    text = ';This is a comment\n' \
           'start:  SET  ACC, 0xFFDE  ; This is also a comment\n' \
           'end:    HALT  ; end of program'
    lines = text.split('\n')

    # Remove comments and replace register names
    lines = strip_lines(lines)
    print(assemble(lines))
```

If you run the above code you should get an output like the following:

```sweet16-articles-code/articles/part-07/assembler_02.py
[
(2, 0, '65502', '0', 'direct', 'SET', ['0x08', <function VALUE_L at 0x...>, <function VALUE_H at 0x...>]),
(3, 3, '', '', 'implicit', 'HALT', ['0x00'])
]```

As you can see, we have the intermediate representation for two instructions: all the info we will need to assemble these instructions into actual machine code. Reading the tuples, the first value (2) is the line number from the source. The second value (0) is the location counter at the start of the instruction. The next value (65502) is the value provided in the instruction, followed by the register index (0), the addressing mode (direct), and the mnemonic. The final field holds the opcode information from the opcode_table. Try this out on several instructions and see that the results make sense. Write some unit tests for the various functions and for the assembler as it stands.
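If the positional tuples become hard to read while experimenting, one option is to give the fields names. This is only a sketch. The `IRLine` type is a hypothetical addition, and strings stand in for the helper functions in the last field.

```python
from collections import namedtuple

# Hypothetical named view of one intermediate-representation entry,
# with the fields in the same order as the tuples shown above.
IRLine = namedtuple('IRLine', 'lineno lc value register mode mnemonic icode')

entry = IRLine(2, 0, '65502', '0', 'direct', 'SET',
               ['0x08', 'VALUE_L', 'VALUE_H'])

print(entry.mnemonic, entry.mode, hex(int(entry.value)))  # SET direct 0xffde
print(entry.lc + len(entry.icode))                        # 3
```

The second print is the address of the next instruction, which matches the location counter of the HALT entry in the output above.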

This has been a long post. Next time we will tackle the second pass of our assembler. Until then, keep coding!

## Sweet16-GP Assembler

This entry is part 6 of 8 in the series Sweet16-GP CPU: A Complete Development Cycle

In my first post I showed you the instruction set of the Sweet16-GP. Within the instruction set presented we saw examples of the Sweet16-GP assembly language. However, we didn’t talk about the assembly language at that time. It’s now time to discuss the Sweet16-GP’s assembly language.

Statements: Statements in the Sweet16-GP’s assembly language can be broken down into four basic parts: Label, Operation, Operand, and Comment. Of all these parts, only the Operation is required in all statements. Most statements, however, also require an operand. The operand can be a register, literal value, or label reference. The label at the beginning of the statement is the label declaration.

Comments: The assembly language used with the Sweet16-GP allows comments. Comments begin with a semicolon and end at the end of the line.

Labels: Labels give a human-readable name to the line. This name can then be used to reference the line at any point in the program. Labels must begin with a letter or underscore character and may contain alphanumeric and underscore characters. Label declarations must end with a colon. A label declaration occurs when the label is located at the beginning of a line. Labels are referenced when they are not the first item in the line. If a label declaration occurs on a line by itself, it is associated with the following line.
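The naming rule above maps directly onto a regular expression. The following is a sketch only. `LABEL_RE` and `is_valid_label()` are hypothetical names, and the assembler in this series does not actually perform this check.

```python
import re

# A letter or underscore first, then any mix of letters, digits, underscores
LABEL_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def is_valid_label(name):
    """Return True if name follows the label naming rule."""
    return LABEL_RE.fullmatch(name) is not None

for name in ['start', '_loop_01', '1st', 'my-label']:
    print(name, is_valid_label(name))
```

Here `start` and `_loop_01` pass, while `1st` (starts with a digit) and `my-label` (contains a hyphen) do not.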

Numbers: The assembler can accept numerical values in the following bases: 2, 8, 10, 16. These correspond to binary, octal, decimal, and hexadecimal. Binary values must begin with “0b”: a zero followed by a lowercase “b”. Octal values must begin with “0c”, and hexadecimal values must begin with “0x”. Decimal numbers are written normally. Note that all values must be integers. Real numbers are not supported.

Here are some examples of numerical values:

• 0b11101011 – binary
• 0c47 – octal
• 65535 – decimal
• 0fae – not allowed
• 34.92 – not allowed
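These rules can be checked mechanically. Below is a rough sketch of a validator built from the rules above. `NUMBER_RE` and `is_valid_number()` are hypothetical names, and accepting both upper- and lower-case hex digits is an assumption.

```python
import re

# Assumed pattern: binary, octal, hexadecimal (either digit case), or decimal
NUMBER_RE = re.compile(r'0b[01]+|0c[0-7]+|0x[0-9A-Fa-f]+|[0-9]+')

def is_valid_number(text):
    """Return True if text is an integer literal in a supported base."""
    return NUMBER_RE.fullmatch(text) is not None

for sample in ['0b11101011', '0c47', '65535', '0xFFDE', '0fae', '34.92']:
    print(sample, is_valid_number(sample))
```

The first four samples are accepted. `0fae` and `34.92` are rejected, as in the list above.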

Style: Assembly language code typically follows a simple coding style. Below is an example of a typical piece of Sweet16-GP assembly code:

```
;
; Test for BRN
; On Entry:
; On Exit: ACC = 0x8000, R1 = 0x0010, R2 = 0x01, PC = 0x13, N and V flags Set
;
start:  SET     R0,     0x7FF0      ; Load ACC with 0x7FF0
        SET     R2,     0x0001      ; Load R2 with 0x0001
        SET     R1,     0x0000      ; Load R1 with 0x0000
        SET     R3,     0x0000
calc:   ADD     R2                  ; ADD R2 to ACC and place results in ACC
        INC     R1                  ; Count how many times we Add R2 to ACC (R0)
        BRV     end                 ; Branch to HALT instruction
        BRA     calc                ; Data block FB
end:    HALT                        ; STOP Execution
```

As you can see, there is a comment block at the top that documents the code. It is suggested that each block of code be documented this way. Always include the conditions that are expected on entry into the code, and the exit state. List any modified registers, stack pointers, or memory locations that are left in an altered state by the program.

Declare your line labels at the beginning of the line. Make them human readable. If you need long labels, place them above the line they should be associated with. Always end a label declaration with a colon. Here are some examples of label declarations:

• start:
• end:
• _loop_01:
• _cond_true:
• _cond_false:
• loop_iter:

Instructions should all be placed in a single column after the label declarations, as shown above. White space should be used to place the instruction operands in a single column after the instructions. Finally, comments should all line up in a single column. There is nothing in the assembler to enforce this columnization. However, you’ll find your code much easier to read and grok if you follow these guidelines.

Notes: Many assemblers allow the operands to be mathematical expressions. They also typically offer special features, like a character that references the program counter at the point in the program at which it is used. The Sweet16-GP assembler does not include such features. Perhaps a later version will add these features. However, for this project I wanted to maintain simplicity. Lastly, the assembler also does not support macros or predefined routines.

Assemblers are unique in the parser/translator world in that their source (assembly) language has a nearly one-to-one relationship with its output language (machine code). This fact makes it simple to create basic assemblers using crude pattern matching parsers. Often regular expressions are employed to recognize the various language constructs.

The only difficulty encountered is usually label references, since labels can be referenced before they are declared. This, however, can be dealt with by breaking the parser into two parts and allowing each part to separately scan the source code.

Our assembler will use a simple two-pass process in which the first pass (scan of the source code) creates a simple intermediate representation. The second pass will take the results of the first pass and convert it to actual machine code. We will also use a table of opcodes and addressing modes to help us build our intermediate code.
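To see why two passes take the sting out of forward references, consider this toy sketch. It is not the assembler we are about to write. The `toy` program layout and the `two_pass()` function are made up for illustration, with instruction lengths supplied by hand.

```python
# Hypothetical toy program: (label or None, mnemonic, operand or None, length in bytes)
toy = [
    ('start', 'BRA', 'end', 2),   # 'end' is used before it is declared
    (None, 'INC', None, 1),
    ('end', 'HALT', None, 1),
]

def two_pass(program):
    symbols = {}
    lc = 0
    # Pass 1: record the address of every label
    for label, mnemonic, operand, length in program:
        if label:
            symbols[label] = lc
        lc += length
    # Pass 2: every label is known now, even those referenced early
    resolved = []
    lc = 0
    for label, mnemonic, operand, length in program:
        target = symbols.get(operand) if operand else None
        resolved.append((lc, mnemonic, target))
        lc += length
    return symbols, resolved

symbols, resolved = two_pass(toy)
print(symbols)    # {'start': 0, 'end': 3}
print(resolved)   # [(0, 'BRA', 3), (2, 'INC', None), (3, 'HALT', None)]
```

By the time the second loop reaches the BRA instruction, the address of `end` is already in the symbol table, so no back-patching is needed.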

The first thing we will need is a simple framework with a few lines of code for testing. Let’s get that set up. Our assembler will expect our assembly code in an iterable form as it will work line-by-line. So we’ll need to pass our code as a list of lines of assembly.

```python
"""Sweet16-GP Assembler"""

if __name__ == '__main__':
    text = """
;This is a comment
start:  SET  ACC, 0xFFDE  ; This is also a comment
end:    HALT  ; end of program
"""
    lines = text.split('\n')
```

Next, to make parsing easier, we will write a simple routine to remove comments from our assembly source code.

```python
def strip_lines(lines):
    """ Takes a sequence of lines and strips comments and blank lines."""
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        yield line
```

As you can see we simply loop over the lines in our text and if we find the semicolon character that begins a comment, we remove everything from the semicolon to the end of the line. Next, we call strip() on the line to remove all leading and trailing whitespace, then we yield the line to the caller.

If you’re not familiar with the yield keyword in Python 3, I suggest you look up Python 3 generators for a detailed explanation. In a nutshell, however, yield pauses the function on each iteration of the loop and returns the currently processed line to the caller. When the caller asks for the next line, strip_lines() resumes with the state that existed the last time yield was reached. In other words, the function remembers where it left off in the loop and continues from there on subsequent calls.
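A tiny generator makes this pause-and-resume behavior easy to see. The `countdown()` function below is a made-up example and not part of the assembler.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing after each value."""
    while n > 0:
        yield n      # execution stops here until the next value is requested
        n -= 1       # ...and picks up again on this line

gen = countdown(3)
print(next(gen))   # 3
print(next(gen))   # 2
print(list(gen))   # [1]
```

Each call to next() runs the function only as far as the next yield. The final list() call drains whatever is left.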

It’s time to test this. Change the code to read:

```
def strip_lines(lines):
    """ Takes a sequence of lines and strips comments and blank lines."""
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        # Skip lines that are empty once the comment is removed
        if line:
            yield line

if __name__ == '__main__':
    text = """
;This is a comment
start:  SET  ACC, 0xFFDE  ; This is also a comment
end:    HALT  ; end of program
"""
    lines = text.split('\n')

    # Remove comments and blank lines
    for line in strip_lines(lines):
        print(line)
```

If you run this code now you should get something like the following output:

```start:  SET  ACC, 0xFFDE
end:    HALT

Process finished with exit code 0```

Not bad for the few minutes it took us to write that. We can now take in assembly code and clean it of blank lines and comments.

Now we have one more step of preprocessing before we can move on to assembling our code. Recall that some registers can be referred to by name. For example, R0 is the accumulator and is also named ACC. This naming is great for reading and writing code, but we can make assembly easier if we replace these friendly names with the canonical register name Rx, where x is the numerical index into the register file. The best place to do this is within strip_lines(). However, to keep the code modular and easy to understand, we’ll write a separate function for this and call it from strip_lines().

```
def replace_register_names(line):
    """ Replace register names with 'R' + register index.
    Example: ACC becomes R0. STATUS becomes R6."""
    line = line.replace('ACC', 'R0')
    line = line.replace('RETSTACK', 'R4')
    line = line.replace('COMP', 'R5')
    line = line.replace('STATUS', 'R6')
    line = line.replace('PC', 'R7')
    return line
```

As you can see, we simply take a line of text and replace any register names found in the line with their canonical Rx names. Now let’s place the call to replace_register_names() in strip_lines().
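One caveat is that `str.replace()` matches substrings, so a label such as `ACCUM` would turn into `R0UM`. If that matters for your programs, one possible variation (not the code used in the rest of this series) is a regular expression with word boundaries. The `REGISTER_ALIASES` table and the function name below are hypothetical; the mapping itself is taken from replace_register_names().

```python
import re

# Alias table taken from replace_register_names(); the name is hypothetical.
REGISTER_ALIASES = {'ACC': 'R0', 'RETSTACK': 'R4', 'COMP': 'R5',
                    'STATUS': 'R6', 'PC': 'R7'}

# \b anchors each alias to whole words, so 'ACC' inside 'ACCUM' is left alone.
ALIAS_PATTERN = re.compile(r'\b(' + '|'.join(REGISTER_ALIASES) + r')\b')


def replace_register_names_strict(line):
    """Replace register aliases only where they appear as whole words."""
    return ALIAS_PATTERN.sub(lambda m: REGISTER_ALIASES[m.group(1)], line)


naive = 'ACCUM:  SET  ACC, 0xFFDE'.replace('ACC', 'R0')
strict = replace_register_names_strict('ACCUM:  SET  ACC, 0xFFDE')
print(naive)
print(strict)
```

The simple version is fine as long as no label or symbol contains a register name, which is a reasonable convention for small programs.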

```
def strip_lines(lines):
    """ Takes a sequence of lines and strips comments and blank lines.
    Also, calls replace_register_names on stripped lines."""
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        # Skip lines that are empty once the comment is removed
        if line:
            line = replace_register_names(line)
            yield line
```

Our new code now looks like this:

```
"""Sweet16-GP Assembler"""

def replace_register_names(line):
    """ Replace register names with 'R' + register index.
    Example: ACC becomes R0. STATUS becomes R6."""
    line = line.replace('ACC', 'R0')
    line = line.replace('RETSTACK', 'R4')
    line = line.replace('COMP', 'R5')
    line = line.replace('STATUS', 'R6')
    line = line.replace('PC', 'R7')
    return line

def strip_lines(lines):
    """ Takes a sequence of lines and strips comments and blank lines.
    Also, calls replace_register_names on stripped lines."""
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        # Skip lines that are empty once the comment is removed
        if line:
            line = replace_register_names(line)
            yield line

if __name__ == '__main__':
    text = """
;This is a comment
start:  SET  ACC, 0xFFDE  ; This is also a comment
end:    HALT  ; end of program
"""
    lines = text.split('\n')

    # Remove comments and blank lines, and replace register names
    for line in strip_lines(lines):
        print(line)
```

If you run the program again, you will see that the register name ACC has been replaced by R0. Exactly what we want. Your output should look like the following:

```start:  SET  R0, 0xFFDE
end:    HALT

Process finished with exit code 0```

That’s it for today. Next time we are going to write a function that takes our cleaned code and parses it line by line.