The plan

Our journey consists of 4 stations, each of them depending on the previous ones:

  1. The tokenizer (aka “Lexical Analysis”): converting input code - in LISP syntax - into an array of tokens.
  2. The parser (aka “Syntactic Analysis”): transforming an array of tokens into an Abstract Syntax Tree (AST).
  3. The emitter (aka “Code Generation”): string-ifying an AST into C-like code.
  4. The compiler (aka “You made it”): combining all the pieces together.

(The interactive code snippets are powered by a tool of mine named KLIPSE.)

The tokenizer

The tokenizer receives a string of code and breaks it down into an array of tokens.

There are three kinds of tokens:

  1. single-character tokens: ( and )
  2. multiple-character tokens: numbers like 123 or names like abcd
  3. strings: anything that starts with a " and ends with a " (no escaping!)
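
For example, the expression (add 2 "hi") breaks down into the following tokens - this is the shape our tokenizer will produce by the end of this station:

// tokenizer('(add 2 "hi")')
[
  { type: 'paren',  value: '(' },
  { type: 'name',   value: 'add' },
  { type: 'number', value: '2' },
  { type: 'string', value: 'hi' },
  { type: 'paren',  value: ')' }
]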

First, we are going to write individual tokenizers, each responsible for a single kind of token. Each tokenizer receives the code as a string and the current position, and returns:

  1. the number of characters it consumed
  2. the token as an object with two keys: type and value (or null when nothing matched)

Single-character token

Let’s write a generic function that tokenizes a single character:

tokenizeCharacter = (type, value, input, current) =>
  (value === input[current]) ? [1, {type, value}] : [0, null]

Here is the tokenizer for (:

tokenizeParenOpen = (input, current) => tokenizeCharacter('paren', '(', input, current)
tokenizeParenOpen('(', 0)

And here is the tokenizer for ):

tokenizeParenClose = (input, current) => tokenizeCharacter('paren', ')', input, current)
tokenizeParenClose(')', 0)
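
Both calls should evaluate to a match of length 1, while a character that is not a parenthesis yields [0, null]:

tokenizeParenOpen('(', 0)   // => [1, { type: 'paren', value: '(' }]
tokenizeParenClose(')', 0)  // => [1, { type: 'paren', value: ')' }]
tokenizeParenOpen('a', 0)   // => [0, null]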

Multiple character tokens

We will describe our multi-character tokens by means of regular expressions.

Here is a generic regexp tokenizer:

tokenizePattern = (type, pattern, input, current) => {
  let char = input[current];
  let consumedChars = 0;
  if (pattern.test(char)) {
    let value = '';
    while (char && pattern.test(char)) {
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars, { type, value }];
  }
  return [0, null];
}

And here is the number tokenizer:

tokenizeNumber = (input, current) => tokenizePattern("number", /[0-9]/, input, current)
tokenizeNumber("123aad", 0)

And the name tokenizer (in our language names are chains of letters):

tokenizeName = (input, current) => tokenizePattern("name", /[a-z]/i, input, current)
tokenizeName('hello world', 0)
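
Note that both tokenizers consume only the matching prefix of their input:

tokenizeNumber("123aad", 0)    // => [3, { type: 'number', value: '123' }]
tokenizeName('hello world', 0) // => [5, { type: 'name', value: 'hello' }]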

String tokenizer

A string is something that starts with a " and ends with a " (no escaping in our language!):

tokenizeString = (input, current) => {
  if (input[current] === '"') {
    let value = '';
    let consumedChars = 0;
    consumedChars++;
    let char = input[current + consumedChars];
    while (char !== '"') {
      if (char === undefined) {
        throw new TypeError("unterminated string");
      }
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars + 1, { type: 'string', value }];
  }
  return [0, null];
}
tokenizeString('"Hello World"', 0)

One last thing: we want to skip whitespace:

skipWhiteSpace = (input, current) => (/\s/.test(input[current])) ? [1, null] : [0, null]
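
It pretends to be a tokenizer that consumes one character but never produces a token:

skipWhiteSpace(' (add 1 2)', 0) // => [1, null]
skipWhiteSpace('(add 1 2)', 0)  // => [0, null]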

The tokenizer

Let’s put all our tokenizers into an array:

tokenizers = [skipWhiteSpace, tokenizeParenOpen, tokenizeParenClose, tokenizeString, tokenizeNumber, tokenizeName];

The code tokenizer goes over its input, trying all the tokenizers at the current position; when one of them matches, it will:

  1. push the token object
  2. update the current position

Here is the code:

tokenizer = (input) => {
  let current = 0;
  let tokens = [];
  while (current < input.length) {
    let tokenized = false;
    tokenizers.forEach(tokenizer_fn => {
      if (tokenized) {return;}
      let [consumedChars, token] = tokenizer_fn(input, current);
      if(consumedChars !== 0) {
        tokenized = true;
        current += consumedChars;
      }
      if(token) {
        tokens.push(token);
      }
    });
    if (!tokenized) {
      throw new TypeError("I don't know what this character is: " + input[current]);
    }
  }
  return tokens;
}

Let’s see our tokenizer in action:

tokenizer('(add 2 3)')
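
It should return a flat array of five tokens:

[
  { type: 'paren',  value: '(' },
  { type: 'name',   value: 'add' },
  { type: 'number', value: '2' },
  { type: 'number', value: '3' },
  { type: 'paren',  value: ')' }
]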

Our tokenizer doesn’t do any semantic validation. As an example, it happily reads unbalanced parentheses:

tokenizer('(add 2')

Let’s make sure we can handle nested expressions properly:

tokenizer('(add 2 (subtract "314" 2))')
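
The nesting is not reflected in the output (that will be the parser’s job); we simply get a flat array of tokens, with the string value stored without its quotes:

[
  { type: 'paren',  value: '(' },
  { type: 'name',   value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren',  value: '(' },
  { type: 'name',   value: 'subtract' },
  { type: 'string', value: '314' },
  { type: 'number', value: '2' },
  { type: 'paren',  value: ')' },
  { type: 'paren',  value: ')' }
]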

Hooray!!!

Please take a short rest before moving towards Station #2: The parser.