Sweet16-GP Assembler – Part 2

by SysOps
This entry is part 7 of 8 in the series Sweet16-GP CPU: A Complete Development Cycle

Last time we left our assembler with only two functions: the strip_lines() function, which removes comments and blank lines, and the rename_registers() function, which replaces the registers’ friendly names with the canonical names, i.e. Rx, where x is a number between 0 and 7 (the register’s index into the register file).

With these two pre-processing functions out of the way, we are ready to start processing our assembly code. We will use a two-pass method of assembly. The first pass will break each line into its constituent parts, locate the instruction’s addressing mode and store it in a field called mode, and calculate the instruction’s actual opcode value and length. In addition, it will store any labels in a symbol table to record their values for later use in the second pass of the assembler.

Our main assembler method will take our clean lines, one at a time, and parse them as described above. We will need a table of instructions that includes details about each instruction, including its addressing mode(s), opcode(s), and length. You may be wondering why the “(s)” above. This is because some of our instructions have multiple addressing modes and therefore multiple opcodes and perhaps even multiple instruction lengths. We will use our table to disambiguate the instruction format into the proper values.

We will implement our opcode_table as a dict. Each entry will be indexed (keyed) by the instruction mnemonic. The value for each key in the table will be a dict indexed by the addressing mode. Each addressing mode will then have a list of values that starts with the instruction’s opcode. If the instruction requires a one- or two-byte value, the list continues with the names of the helper functions that supply those bytes, such as VALUE_L and VALUE_H. This description sounds much more complex than it really is. Refer to the code for our table below and re-read the above paragraph. It should all become clear.

opcode_table = {
    'HALT': {
        'implicit': ['0x00'],
    },
    'BRA': {
        'immediate': ['0x01', VALUE_L],
        'offset': ['0x01', OFFSET],
    },
    'BRC': {
        'immediate': ['0x02', VALUE_L],
        'offset': ['0x02', OFFSET],
    },
    'BRZ': {
        'immediate': ['0x03', VALUE_L],
        'offset': ['0x03', OFFSET],
    },
    'BRN': {
        'immediate': ['0x04', VALUE_L],
        'offset': ['0x04', OFFSET],
    },
    'BRV': {
        'immediate': ['0x05', VALUE_L],
        'offset': ['0x05', OFFSET],
    },
    'BSR': {
        'immediate': ['0x06', VALUE_L, VALUE_H],
        'offset': ['0x06', LABEL_L, LABEL_H],
    },
    'RTS': {
        'implicit': ['0x07'],
    },

    'SET': {
        'direct': ['0x08', VALUE_L, VALUE_H],
    },
    'LD': {
        'register': ['0x10'],
        'indirect': ['0x20'],
    },
    'ST': {
        'register': ['0x18'],
        'indirect': ['0x28'],
    },
    'LDD': {
        'indirect': ['0x30'],
    },
    'STD': {
        'indirect': ['0x38'],
    },
    'POP': {
        'indirect': ['0x40'],
    },
    'STP': {
        'indirect': ['0x48'],
    },
    'ADD': {
        'register': ['0x50'],
    },
    'SUB': {
        'register': ['0x58'],
    },
    'MUL': {
        'register': ['0x60'],
    },
    'DIV': {
        'register': ['0x68'],
    },
    'AND': {
        'register': ['0x70'],
    },
    'OR': {
        'register': ['0x78'],
    },
    'XOR': {
        'register': ['0x80'],
    },
    'NOT': {
        'register': ['0x88'],
    },
    'SHL': {
        'register': ['0x90'],
    },
    'SHR': {
        'register': ['0x98'],
    },
    'ROL': {
        'register': ['0xA0'],
    },
    'ROR': {
        'register': ['0xA8'],
    },
    'POPD': {
        'indirect': ['0xE0'],
    },
    'CPR': {
        'register': ['0xE8'],
    },
    'INC': {
        'register': ['0xF0'],
    },
    'DEC': {
        'register': ['0xF8'],
    },
}

As you can see, our instructions fall into four basic addressing modes: implicit (when the instruction itself indicates the addressing mode), immediate (when the value needed by the instruction directly follows it), indirect (when the register specified contains the value needed to complete the instruction), and register (when the instruction operates directly on the register specified). The table also uses two more mode names: offset, for a branch whose target is given as a label, and direct, for a register followed by a value, as in SET. To tell the truth, these could be broken down further, but that would only complicate things.
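As a quick illustration of how the nested table is meant to be read, here is a minimal sketch of a lookup. The lookup() helper is hypothetical (it is not part of the assembler), and the excerpt uses plain strings in place of the VALUE_L and VALUE_H functions so that it runs on its own.

```python
# Excerpt of the table. Plain strings stand in for the helper
# functions (VALUE_L, VALUE_H) so this snippet is self-contained.
opcode_table = {
    'HALT': {'implicit': ['0x00']},
    'SET': {'direct': ['0x08', 'VALUE_L', 'VALUE_H']},
    'LD': {'register': ['0x10'], 'indirect': ['0x20']},
}


def lookup(mnemonic, mode):
    """Hypothetical helper: return (base opcode, length in bytes)."""
    entry = opcode_table[mnemonic][mode]
    # The first element is the base opcode. The number of elements
    # is the instruction length, which is how assemble() advances
    # its location counter.
    return entry[0], len(entry)


print(lookup('HALT', 'implicit'))  # ('0x00', 1)
print(lookup('LD', 'indirect'))    # ('0x20', 1)
print(lookup('SET', 'direct'))     # ('0x08', 3)
```

The same mnemonic (LD here) maps to different opcodes depending on the mode, which is exactly the disambiguation the table exists for.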

OK, now that we have our opcode_table, we are going to start writing our parser. Our parser is a little bit unorthodox. Since our input language is so constrained, we can forgo all the formal theory and much of the discipline of parser development.

I do believe, however, that any developer worth their salt should write at least a simple parser at some point. Understanding even a very simple parser expands your understanding of how your tools work, whether you’re a web developer, AI master, ETL programmer, or embedded systems developer. If you call yourself a software engineer or programmer and you have not written at least a simple parser, compiler, or interpreter, then you owe it to yourself to do so! To get you started I recommend Dr. Jack Crenshaw’s “Let’s Build A Compiler” and Ruslan’s “Let’s Build A Simple Interpreter” blog series. If you’re a seasoned developer, check out the book “Language Implementation Patterns”. For those mathematically oriented, check out the Dragon Book.

So what I am going to do is simply plow forward. We will tackle one issue at a time and add functions to handle each issue as it arises. Note that this technique won’t work if you’re trying to write a high-level language compiler or interpreter. There you will need all the formal methods and a good design before writing code. What we are going for here is a quick and simple bootstrap assembler. We can write better tools later. For now, we just need something to get us up and running.

OK, the first thing we are going to need is a set of functions to break up the line of source code we feed to our assembler into its various fields. We will create an entry point function called assemble which will do our heavy lifting by calling other functions.

We will need to produce some data about the current source line. This will include any label, register, addressing mode, and value. Additionally, we may also want to keep track of the original instruction mnemonic and any intermediate code we produce. So we will need a supporting function to handle this. We will also need some storage for any labels or symbols we encounter during parsing.

The code below is our entry point. It will be responsible for handling our two passes over our code and will call the functions that do the heavy lifting.

def assemble(lines, lc=0):
    """Breaks the line into its constituent parts.
        it then locates the instruction's addressing mode,
        stores it in mode, and calculates the instruction's
        actual opcode value and length. """
    objcode = []
    symbols = {}

    # Pass 1 : Parse instructions and create intermediate code
    for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
        # Try to evaluate numeric labels and set the lc (location counter)
        if label:
            try:
                lc = int(eval(label, symbols))
            except (ValueError, NameError):
                symbols[label] = lc

        # Store the resulting object code for later expansion
        if icode:
            objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
            lc += len(icode)
    return objcode

Note the call to parse_lines(); we will write this function next. But let’s step back and think about what we need it to do.

Given a line of code like:

start:    SET   R1, 0xFA

We can see we need to break up this line into its smaller parts. Luckily, this is pretty easy to do. Also, since each line will follow a similar pattern for the fields, we can easily deduce what each field contains.

# Exception used for errors
class AssemblyError(Exception):
    pass

def parse_lines(lines, symbols):
    """ Determine line number, mnemonic, register, value, and address mode"""
    for lineno, line in enumerate(lines, 1):
        # Handle labels
        label, *colon, statement = line.rpartition(":")
        try:
            # parse the line into ir (intermediate representation)
            data = lineno, label, parse_opcode(statement, symbols) if statement else (None, None, None, None, None)
            yield data
        except AssemblyError as e:
            print("{0:4d} : Error : {1}".format(lineno, e))

Referring to the code above, we use the enumerate function to keep track of our line numbers. One drawback to this approach is that comment and blank lines removed by strip_lines() won’t be counted, so the reported line numbers can drift from the original source. We could fix this by tracking line numbers in the strip_lines() function. But for now, we will do it this way.
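For the curious, here is a minimal sketch of what tracking line numbers during stripping could look like. The name strip_lines_numbered() is hypothetical and it leaves out the register renaming; it is one possible approach, not the version used in this series.

```python
def strip_lines_numbered(lines):
    """Yield (original line number, cleaned line) pairs,
    skipping lines that are blank or contain only a comment."""
    for lineno, line in enumerate(lines, 1):
        # Drop everything from the comment marker onward
        comment_index = line.find(';')
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        if line:
            yield lineno, line


source = [
    '; a comment',
    '',
    'start: SET R0, 1 ; trailing comment',
    'HALT',
]
print(list(strip_lines_numbered(source)))
# [(3, 'start: SET R0, 1'), (4, 'HALT')]
```

Because each cleaned line carries its original number, error messages would point at the right place in the source file even though the blank and comment lines are gone.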

enumerate() returns an integer value for the line number (lineno) and the line of source code. Next, we use rpartition(“:”) on the source line to break it into three parts. These parts are returned in a tuple containing the part before the separator (a colon in our case, which ends a label declaration), the separator itself, and the part after the separator. If the separator is not found, rpartition() returns two empty strings followed by the original string. So if our line does not contain a colon, we will get an empty string for the label and for the separator, followed by the whole line of code, which we save in the statement variable.
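The behavior of rpartition() described above is easy to check in isolation. In this small sketch, split_label() is a hypothetical wrapper used only for the demonstration.

```python
def split_label(line):
    """Return (label, separator, statement) for one source line."""
    return line.rpartition(':')


print(split_label('start:    SET   R1, 0xFA'))
# ('start', ':', '    SET   R1, 0xFA')
print(split_label('HALT'))
# ('', '', 'HALT')
print(split_label('loop:'))
# ('loop', ':', '')
```

Note that the statement keeps its leading whitespace. That is harmless here, because the later split on whitespace discards it.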

Once we have any label, we need to take our statement and further parse it into its various parts. If it turns out that our statement is empty, we return a default tuple of None values; otherwise, we make a call to parse_opcode(), passing in our statement and our symbol table so that any symbols found in our source line can be handled. If this all goes horribly wrong, parse_opcode() raises an exception and we print the error message. We defined a simple exception class above for this purpose.

Our next task is to implement a function to parse the line into opcodes. We called this function parse_opcode() above. The implementation is shown below:

def parse_opcode(line: str, symbols) -> tuple:
    """ Break the line into its constituent parts.
        Locate any labels and store their definition.
        Then locate the instruction's addressing mode,
        and stores it in mode. Calculate the instruction's
        actual opcode value.

        Returns: tuple(value, register, mode, mnemonic, objcode)
        where value is a string to be evaluated in the second
        pass, register is the register index or an empty string
        if no register exists, mode is the addressing mode,
        mnemonic is the instruction mnemonic, and objcode is a
        list holding the base opcode value followed by any
        functions needed to process the instruction
    """
    fields = line.split(None, 1)
    nofields = len(fields)
    if nofields > 1:
        extra = fields[1].split(',')
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
    nofields = len(fields)
    mnemonic = fields[0]
    register = ''
    value = ''

    """ Examples:
            HALT
            BRA 0x1F
            SET R0, 0xFFFE
            LD R2
            STD @R3  
    """
    # Get register and value
    if nofields > 1:
        register = parse_register(fields[1])

    if register == '' and nofields > 1:
        value = parse_value(fields[1])

    if nofields > 2:
        register = parse_register(fields[1])
        value = parse_value(fields[2])

    # Get address mode
    mode = parse_address_mode(fields)

    # Get all addressing modes for this mnemonic
    opcodemodes = opcode_table.get(mnemonic)
    if not opcodemodes:
        raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

    # Get the address mode used in the instruction
    objcode = opcodemodes.get(mode)
    if not objcode:
        raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

    return value, register, mode, mnemonic, list(objcode)

There is a lot going on here, so let’s unpack it. First, we use the split() method to split the statement into two portions. The first parameter to split() gives the separator; passing None splits on runs of whitespace. The second parameter limits the number of splits, here to one. What we are doing is separating the instruction mnemonic from the rest of the statement. Since some instructions (e.g. HALT) consist only of the mnemonic, we need to check the number of fields we get back. If it’s only one, then we have an instruction like HALT or RTS.

However, if we have more than one field returned from the split, we need to further decompose the line. We now have a mnemonic in fields[0] and the remainder of the line in fields[1]. We know we may have something like Rx, @Rx, or Rx, 0xFA following the instruction mnemonic. So the next move is to try to separate the remaining portion of the statement on the comma separator. If a comma is found, we get one of two cases: (register, value) or (@register, value). These are stored in the extra variable. We then add the extra fields to the fields variable and assign mnemonic the value in fields[0], which contains the instruction mnemonic.
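To see the two-stage split in action, the sketch below lifts the splitting logic into a stand-alone function. The name split_fields() is hypothetical; the body mirrors the first few lines of parse_opcode().

```python
def split_fields(statement):
    """Split a statement into [mnemonic, operand, ...] fields."""
    # Stage 1: peel the mnemonic off the front (split on whitespace, once)
    fields = statement.split(None, 1)
    if len(fields) > 1:
        # Stage 2: break the operand text apart on commas
        extra = fields[1].split(',')
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
    return fields


print(split_fields('HALT'))             # ['HALT']
print(split_fields('STD @R3'))          # ['STD', '@R3']
print(split_fields('SET R0, 0xFFFE'))   # ['SET', 'R0', ' 0xFFFE']
```

The last result shows that the value field keeps the space that followed the comma, which is why parse_value() strips spaces before looking at the prefix.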

Next, using the value of ‘nofields’, which indicates what we should expect next, we sort out the remaining fields and store their values in their respective variables. In each case, we call parse functions which perform a specific sub-task.

Our first sub-task is to locate any register in the fields. At this point we know if a register exists it should be in fields[1]. So we pass this value to the parse_register() function shown below:

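def parse_register(field: str) -> str:
    """ Return the register index as a string, or '' if
        the field does not name a register.

        Examples:
            R2
            @R3
            R2, 0x1F
    """
    register = field.strip().upper().lstrip('@')
    # A register operand is 'R' followed by its index digit (0-7)
    if register[:1] == 'R' and register[1:2].isdigit():
        return register[1]
    return ''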

The first thing we do is convert the field to upper case. Then we look for the ‘R’ and return the character following it, which should be the register index value (a single digit between 0 and 7 inclusive).

It’s possible that we didn’t have a register. In this case, there may be a value in the fields[1] position. So we check whether the call to parse_register() returned an empty string. If it did, we try to parse a value using the function parse_value(), passing in fields[1] as a parameter. This situation occurs with the branch instructions, where the mnemonic is directly followed by the offset value for the jump.

def parse_value(val: str) -> str:
    """ Try to parse a value, which can be an integer
        in base 2, 8, 10, 16 or a register.

        Returns: A string representation of the integer
        value in base 10, an empty string if the value
        is a register, or the original text (for example
        a label) if it cannot be converted to an integer.
    """
    # convert value to int from various bases
    if isinstance(val, str):
        # Get any possible prefix
        val = val.strip(' ')
        base = val[0:2].lower()
        if val.isnumeric():
            return str(int(val, 10))
        if base == '0x':
            return str(int(val, 16))
        elif base == '0c':
            # int() does not know the '0c' prefix, so slice it off
            return str(int(val[2:], 8))
        elif base == '0b':
            return str(int(val, 2))
        elif 'R' in val:
            return ''
        else:
            return val

As you can see above, the parse_value() function must handle many types of values. It first takes the first two characters of the value as a possible prefix and tests whether this matches any of the supported numerical base indicators. If a match is found, the value is converted to an integer, then cast back into a string (in base 10) and returned to the caller.
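The base handling can be tried out on its own with a small sketch like the one below. The function name to_decimal_string() is hypothetical. One detail worth knowing: Python’s int() accepts the ‘0x’ and ‘0b’ prefixes when the matching base is given, but ‘0c’ is this assembler’s own octal marker, so the sketch slices it off before converting.

```python
def to_decimal_string(text):
    """Convert a numeric literal to a base-10 string.
    Non-numeric text (such as a label) is returned unchanged."""
    text = text.strip()
    prefix = text[:2].lower()
    if text.isnumeric():
        return str(int(text, 10))
    if prefix == '0x':
        return str(int(text, 16))
    if prefix == '0c':
        # '0c' is not a prefix int() recognizes, so drop it first
        return str(int(text[2:], 8))
    if prefix == '0b':
        return str(int(text, 2))
    return text


for literal in ['42', '0xFA', '0c17', '0b1010', 'end']:
    print(literal, '->', to_decimal_string(literal))
# 42 -> 42
# 0xFA -> 250
# 0c17 -> 15
# 0b1010 -> 10
# end -> end
```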

Next, we call parse_address_mode(), passing in all the fields, as we may need more than one to identify the addressing mode. The parse_address_mode() function can only be understood in the context of the opcode_table. So refer to the table as you grok this code:

def parse_address_mode(fields: list) -> str:
    """ Example inputs:
                ['HALT']                    : implicit mode
                ['BRA', '0x1F']             : immediate mode
                ['LD', 'R2']                : register
                ['STD', '@R3']              : indirect
                ['SET', 'R0', '0xFFFE']     : direct
                ['BRC', 'end']              : offset
    """
    numfields = len(fields)
    if numfields == 1:
        return 'implicit'
    elif numfields == 2:
        # Directives are not implemented yet; re-enable this
        # check once a 'directives' collection is defined.
        # if fields[0] in directives:
        #     return 'immediate'
        if '@R' in fields[1]:
            return 'indirect'
        elif 'R' in fields[1] and ',' in fields[1]:
            return 'direct'
        elif 'R' in fields[1]:
            return 'register'
        elif parse_value(fields[1]).isnumeric():
            return 'immediate'
        elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
            return 'offset'
    elif numfields == 3 and parse_value(fields[2]).isnumeric():
        return 'direct'
    elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
        return 'direct'
    else:
        # Raise AssemblyError so parse_lines() can catch and report it
        raise AssemblyError('Expected numeric value got: {expected}'.format(expected=fields[2]))

The parse_address_mode() function uses the data contained in the fields to sort out the addressing mode of the instruction and return it to the caller, parse_opcode(), which then stores this value in the mode variable.

Next we use the mnemonic value to get the instruction data from the opcode_table and then use the mode variable to get the proper opcode value and instruction format.

If all has gone well, we return a tuple containing the value, register, mode, and mnemonic variables and the object code taken from the opcode_table.

At this point our first pass is complete. We have all the data necessary to assemble our instruction code, with the possible exception of any forward-declared symbols.
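To make the forward-declared symbol problem concrete, here is a small sketch of how a first pass records label addresses. The program list and first_pass() function are hypothetical stand-ins for the real parser output; the instruction lengths come from the opcode_table (BRA is two bytes, INC and HALT are one byte each).

```python
# Hypothetical pre-parsed program: (label, mnemonic, operand, length)
program = [
    (None, 'BRA', 'end', 2),    # refers to 'end' before it is defined
    ('loop', 'INC', 'R1', 1),
    ('end', 'HALT', None, 1),
]


def first_pass(program, lc=0):
    """Record the location counter value of every label."""
    symbols = {}
    for label, mnemonic, operand, length in program:
        if label:
            symbols[label] = lc
        # Advance the location counter by the instruction length
        lc += length
    return symbols


print(first_pass(program))  # {'loop': 2, 'end': 3}
```

When the BRA on the first line is parsed, ‘end’ is not yet in the symbol table. Once the whole program has been scanned its address is known, which is why resolving such operands has to wait for a second pass.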

We have one last issue before we can try this out. Our opcode_table has function names in some of the fields. We need to implement these helper functions.

All of these functions are pretty straightforward. They simply take the value from the source code and separate it into high bytes and low bytes in the case of VALUE_H, VALUE_L, LABEL_H, and LABEL_L. The OFFSET function calculates the offset from the current position.

# Functions used in the creation of object code (referenced by opcode_table)
def VALUE_L(pc: int, value) -> str:
    # Low byte of the value, as a decimal string
    val = value_to_int(value) & 0xff
    return str(val)


def VALUE_H(pc: int, value) -> str:
    # High byte of the value, as a decimal string
    val = (value_to_int(value) & 0xff00) >> 8
    return str(val & 0xff)
 
  
def LABEL_L(pc, value):
    print(f'LABEL_L PC: {pc}, value: {value}')
    return (value) & 0xff


def LABEL_H(pc, value):
    print(f'LABEL_H PC: {pc}, value: {value}')
    return ((value) & 0xff00) >> 8


def OFFSET(pc, value):
    print(f'OFFSET: {value}')
    return ((value - pc)) & 0xff


def value_to_int(value):
    # convert value to int from various bases
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = value.strip()
        prefix = value[0:2].lower()
        if prefix == '0x':
            return int(value, 16)
        elif prefix == '0c':
            # int() does not know the '0c' prefix, so slice it off
            return int(value[2:], 8)
        elif prefix == '0b':
            return int(value, 2)
        elif value.isnumeric():
            # parse_value() hands us base-10 strings such as '65502'
            return int(value, 10)
    raise ValueError("Expected numerical value, got: {0}".format(value))


def register_to_int(reg):
    if isinstance(reg, str):
        if reg.startswith('@R'):
            return int(reg[2:], 16)
        elif reg.startswith('R') and reg.endswith(','):
            return int(reg[1:-1], 16)
        elif reg.startswith('R'):
            return int(reg[1:])
        else:
            raise ValueError("Cannot parse the register: '{0}'".format(reg))
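The byte arithmetic behind these helpers can be checked with a few lines of plain Python. The function names below (low_byte, high_byte, relative_offset) are hypothetical and work on integers only; they are a sketch of the masking and shifting, not replacements for the helpers above.

```python
def low_byte(value):
    """Bits 0-7 of a 16-bit value."""
    return value & 0xFF


def high_byte(value):
    """Bits 8-15 of a 16-bit value."""
    return (value >> 8) & 0xFF


def relative_offset(pc, target):
    """Distance from pc to target, wrapped into a single byte."""
    return (target - pc) & 0xFF


print(low_byte(0xFFDE), high_byte(0xFFDE))  # 222 255
print(relative_offset(4, 10))               # 6
print(relative_offset(10, 4))               # 250
```

The last line shows why the mask matters: a backward branch of -6 wraps to 250 (0xFA), which is the two’s complement form of -6 in one byte.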

OK, I think we’re ready to try this! Below is the final code. Note I made some adjustments to the code from the last installment so we could test what we’ve done so far.

"""Sweet16-GP Assembler"""


# Exception used for errors
class AssemblyError(Exception):
    pass


# Functions used in the creation of object code (used in the table below)
def VALUE_L(pc: int, value) -> str:
    # Low byte of the value, as a decimal string
    val = value_to_int(value) & 0xff
    return str(val)


def VALUE_H(pc: int, value) -> str:
    # High byte of the value, as a decimal string
    val = (value_to_int(value) & 0xff00) >> 8
    return str(val & 0xff)


def LABEL_L(pc, value):
    return (value) & 0xff


def LABEL_H(pc, value):
    return ((value) & 0xff00) >> 8


def OFFSET(pc, value):
    return ((value - pc)) & 0xff


def value_to_int(value):
    # convert value to int from various bases
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = value.strip()
        prefix = value[0:2].lower()
        if prefix == '0x':
            return int(value, 16)
        elif prefix == '0c':
            # int() does not know the '0c' prefix, so slice it off
            return int(value[2:], 8)
        elif prefix == '0b':
            return int(value, 2)
        elif value.isnumeric():
            # parse_value() hands us base-10 strings such as '65502'
            return int(value, 10)
    raise ValueError("Expected numerical value, got: {0}".format(value))


def register_to_int(reg):
    if isinstance(reg, str):
        if reg.startswith('@R'):
            return int(reg[2:], 16)
        elif reg.startswith('R') and reg.endswith(','):
            return int(reg[1:-1], 16)
        elif reg.startswith('R'):
            return int(reg[1:])
        else:
            raise ValueError("Cannot parse the register: '{0}'".format(reg))


opcode_table = {
    'HALT': {
        'implicit': ['0x00'],
    },
    'BRA': {
        'immediate': ['0x01', VALUE_L],
        'offset': ['0x01', OFFSET],
    },
    'BRC': {
        'immediate': ['0x02', VALUE_L],
        'offset': ['0x02', OFFSET],
    },
    'BRZ': {
        'immediate': ['0x03', VALUE_L],
        'offset': ['0x03', OFFSET],
    },
    'BRN': {
        'immediate': ['0x04', VALUE_L],
        'offset': ['0x04', OFFSET],
    },
    'BRV': {
        'immediate': ['0x05', VALUE_L],
        'offset': ['0x05', OFFSET],
    },
    'BSR': {
        'immediate': ['0x06', VALUE_L, VALUE_H],
        'offset': ['0x06', LABEL_L, LABEL_H],
    },
    'RTS': {
        'implicit': ['0x07'],
    },

    'SET': {
        'direct': ['0x08', VALUE_L, VALUE_H],
    },
    'LD': {
        'register': ['0x10'],
        'indirect': ['0x20'],
    },
    'ST': {
        'register': ['0x18'],
        'indirect': ['0x28'],
    },
    'LDD': {
        'indirect': ['0x30'],
    },
    'STD': {
        'indirect': ['0x38'],
    },
    'POP': {
        'indirect': ['0x40'],
    },
    'STP': {
        'indirect': ['0x48'],
    },
    'ADD': {
        'register': ['0x50'],
    },
    'SUB': {
        'register': ['0x58'],
    },
    'MUL': {
        'register': ['0x60'],
    },
    'DIV': {
        'register': ['0x68'],
    },
    'AND': {
        'register': ['0x70'],
    },
    'OR': {
        'register': ['0x78'],
    },
    'XOR': {
        'register': ['0x80'],
    },
    'NOT': {
        'register': ['0x88'],
    },
    'SHL': {
        'register': ['0x90'],
    },
    'SHR': {
        'register': ['0x98'],
    },
    'ROL': {
        'register': ['0xA0'],
    },
    'ROR': {
        'register': ['0xA8'],
    },
    'POPD': {
        'indirect': ['0xE0'],
    },
    'CPR': {
        'register': ['0xE8'],
    },
    'INC': {
        'register': ['0xF0'],
    },
    'DEC': {
        'register': ['0xF8'],
    },
}

def parse_value(val: str) -> str:
    """ Try to parse a value, which can be an integer
        in base 2, 8, 10, 16 or a register.

        Returns: A string representation of the integer
        value in base 10, an empty string if the value
        is a register, or the original text (for example
        a label) if it cannot be converted to an integer.
    """
    # convert value to int from various bases
    if isinstance(val, str):
        # Get any possible prefix
        val = val.strip(' ')
        base = val[0:2].lower()
        if val.isnumeric():
            return str(int(val, 10))
        if base == '0x':
            return str(int(val, 16))
        elif base == '0c':
            # int() does not know the '0c' prefix, so slice it off
            return str(int(val[2:], 8))
        elif base == '0b':
            return str(int(val, 2))
        elif 'R' in val:
            return ''
        else:
            return val


def parse_address_mode(fields: list) -> str:
    """ Example inputs:
                ['HALT']                      : implicit mode
                ['BRA', '0x1F']               : immediate mode
                ['LD', 'R2']                  : register
                ['STD', '@R3']                : indirect
                ['SET', 'R0', '0xFFFE']       : direct
                ['BYTE', '0x1f']              : immediate
                ['WORD', '0x3BFC']            : immediate
                ['STRING', 'This is a test']  : immediate
                ['BRC', 'end']                : offset
    """
    numfields = len(fields)
    if numfields == 1:
        return 'implicit'
    elif numfields == 2:
        # Directives are not implemented yet; re-enable this
        # check once a 'directives' collection is defined.
        # if fields[0] in directives:
        #     return 'immediate'
        if '@R' in fields[1]:
            return 'indirect'
        elif 'R' in fields[1] and ',' in fields[1]:
            return 'direct'
        elif 'R' in fields[1]:
            return 'register'
        elif parse_value(fields[1]).isnumeric():
            return 'immediate'
        elif parse_value(fields[1]).isalnum() or parse_value(fields[1]).isalpha():
            return 'offset'
    elif numfields == 3 and parse_value(fields[2]).isnumeric():
        return 'direct'
    elif numfields == 3 and '(' in fields[2] and ')' in fields[2]:
        return 'direct'
    else:
        # Raise AssemblyError so parse_lines() can catch and report it
        raise AssemblyError('Expected numeric value got: {expected}'.format(expected=fields[2]))


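def parse_register(field: str) -> str:
    """ Return the register index as a string, or '' if
        the field does not name a register.

        Examples:
            R2
            @R3
            R2, 0x1F
    """
    register = field.strip().upper().lstrip('@')
    # A register operand is 'R' followed by its index digit (0-7)
    if register[:1] == 'R' and register[1:2].isdigit():
        return register[1]
    return ''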


def parse_opcode(line: str, symbols) -> tuple:
    """ Break the line into its constituent parts.
        Locate any labels and store their definition.
        Then locate the instruction's addressing mode,
        and stores it in mode. Calculate the instruction's
        actual opcode value.

        Returns: tuple(value, register, mode, mnemonic, objcode)
        where value is a string to be evaluated in the second
        pass, register is the register index or an empty string
        if no register exists, mode is the addressing mode,
        mnemonic is the instruction mnemonic, and objcode is a
        list holding the base opcode value followed by any
        functions needed to process the instruction
    """
    fields = line.split(None, 1)
    nofields = len(fields)
    if nofields > 1:
        extra = fields[1].split(',')
        if len(extra) > 1:
            fields[1] = extra[0]
            fields.extend(extra[1:])
    nofields = len(fields)
    mnemonic = fields[0]
    register = ''
    value = ''

    """ Examples:
            HALT
            BRA 0x1F
            SET R0, 0xFFFE
            LD R2
            STD @R3
    """
    # Get register and value
    if nofields > 1:
        register = parse_register(fields[1])

    if register == '' and nofields > 1:
        value = parse_value(fields[1])

    if nofields > 2:
        register = parse_register(fields[1])
        value = parse_value(fields[2])

    # Get address mode
    mode = parse_address_mode(fields)

    # Get all addressing modes for this mnemonic
    opcodemodes = opcode_table.get(mnemonic)
    if not opcodemodes:
        raise AssemblyError("Unknown opcode '{}'".format(mnemonic))

    # Get the address mode used in the instruction
    objcode = opcodemodes.get(mode)
    if not objcode:
        raise AssemblyError("Invalid addressing mode '{0}' for {1}".format(mode, mnemonic))

    return value, register, mode, mnemonic, list(objcode)


def parse_lines(lines, symbols):
    """ Determine mnemonic, register, value, and address mode"""
    for lineno, line in enumerate(lines, 1):
        # Handle labels
        label, *colon, statement = line.rpartition(":")
        try:
            # parse the line into ir (intermediate representation)
            data = lineno, label, parse_opcode(statement, symbols) if statement \
                else (None, None, None, None, None)
            yield data
        except AssemblyError as e:
            print("{0:4d} : Error : {1}".format(lineno, e))


def assemble(lines, lc=0):
    """Breaks the line into its constituent parts.
        it then locates the instruction's addressing mode,
        stores it in mode, and calculates the instruction's
        actual opcode value and length. """
    objcode = []
    symbols = {}

    # Pass 1 : Parse instructions and create intermediate code
    for lineno, label, (value, register, mode, mnemonic, icode) in parse_lines(lines, symbols):
        # Try to evaluate numeric labels and set the lc (location counter)
        if label:
            try:
                lc = int(eval(label, symbols))
            except (ValueError, NameError):
                symbols[label] = lc

        # Store the resulting object code for later
        # expansion and adjust the location counter lc.
        if icode:
            objcode.append((lineno, lc, value, register, mode, mnemonic, icode))
            lc += len(icode)
    return objcode


def replace_register_names(line):
    """ Replace register names with 'R' + register index.
        Example: ACC becomes R0. STATUS becomes R6."""
    line = line.replace('ACC', 'R0')
    line = line.replace('RETSTACK', 'R4')
    line = line.replace('COMP', 'R5')
    line = line.replace('STATUS', 'R6')
    line = line.replace('PC', 'R7')
    return line


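def strip_lines(lines):
    """ Takes a sequence of lines and strips comments and surrounding
        whitespace. Blank lines are yielded as empty strings so that
        line numbers still match the original source."""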
    for line in lines:
        comment_index = line.find(";")
        if comment_index >= 0:
            line = line[:comment_index]
        line = line.strip()
        line = replace_register_names(line)
        yield line


if __name__ == '__main__':
    text = ';This is a comment\n' \
           'start:  SET  ACC, 0xFFDE  ; This is also a comment\n' \
           'end:    HALT  ; end of program'
    lines = text.split('\n')

    # Remove comments (blank lines are kept so line numbers still match)
    lines = strip_lines(lines)
    print(assemble(lines))

If you run the above code you should get an output like the following:

 

sweet16-articles-code/articles/part-07/assembler_02.py
 [
  (2, 0, '65502', '0', 'direct', 'SET', ['0x08', <function VALUE_L at 0x...>, <function VALUE_H at 0x...>]), 
  (3, 3, '', '', 'implicit', 'HALT', ['0x00']) 
 ]

As you can see, we have the intermediate representation for two instructions: all the info we will need to assemble these instructions into actual machine code. Reading the tuples, the first value (2) is the line number from the source. The second value (0) is the location counter at the start of the instruction. The next value (65502) is the value provided in the instruction, followed by the register index (0), the addressing mode (direct), and the mnemonic. The final field holds the opcode information from the opcode_table: the base opcode followed by the helper functions that produce the remaining bytes. Try this out on several instructions and see that the results make sense. Write some unit tests for the various functions and for the assembler as it stands.
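As a starting point for those unit tests, here is a minimal sketch using plain assert statements. The strip_comment() function is a hypothetical stand-in that copies the comment handling from strip_lines(), so the snippet runs without importing the assembler. In a real project you would import the assembler’s own functions instead.

```python
def strip_comment(line):
    """Stand-in for the comment handling inside strip_lines()."""
    comment_index = line.find(';')
    if comment_index >= 0:
        line = line[:comment_index]
    return line.strip()


def test_full_line_comment():
    assert strip_comment(';This is a comment') == ''


def test_trailing_comment():
    assert strip_comment('end:    HALT  ; end of program') == 'end:    HALT'


def test_label_split():
    label, _, statement = 'end:    HALT'.rpartition(':')
    assert label == 'end'
    assert statement.split() == ['HALT']


# Run the tests directly; a runner could also collect the test_* functions
test_full_line_comment()
test_trailing_comment()
test_label_split()
print('all tests passed')
```

Each test exercises one small behavior, so a failure points straight at the function that broke.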

This has been a long post. Next time we will tackle the second pass of our assembler. Until then, keep coding!

 
