The plan
Our journey is made of 4 stations - each of them depending on the previous ones:
- The tokenizer (aka “Lexical Analysis”): converting an input code - in LISP syntax - into an array of tokens.
- The parser (aka “Syntactic Analysis”): transforming an array of tokens into an Abstract Syntax Tree (AST).
- The emitter (aka “Code Generation”): string-ifying an AST into C-like code.
- The compiler (aka “You made it”): combining all the pieces together.
(The interactive code snippets are powered by a tool of mine named KLIPSE.)
The tokenizer
The tokenizer receives a string of code and breaks it down into an array of tokens.
There are three kinds of tokens:
- single-character token: `(` and `)`
- multiple character token: `123` or `abcd`
- a string: something that starts with a `"` and ends with a `"` (no escaping!)
First, we are going to write a couple of tokenizers for a single token. Each tokenizer receives the code as a string and the current position and returns:
- the length of the token
- the token as an object with two keys: `type` and `value`
Single-character token
Let’s write a generic function that tokenizes a single character:
tokenizeCharacter = (type, value, input, current) =>
(value === input[current]) ? [1, {type, value}] : [0, null]
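As a quick sanity check, here is how it should behave on a matching and on a non-matching input (these two calls are my own illustration, not part of the original snippets; the expected results appear as comments):
tokenizeCharacter('paren', '(', '(add 2 3)', 0) // [1, {type: 'paren', value: '('}]
tokenizeCharacter('paren', '(', 'add 2 3)', 0)  // [0, null]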
Here is the tokenizer for `(`:
tokenizeParenOpen = (input, current) => tokenizeCharacter('paren', '(', input, current)
tokenizeParenOpen('(', 0)
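If you are reading this outside of KLIPSE, tracing the code above, this call should evaluate to:
// [1, {type: 'paren', value: '('}]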
And here is the tokenizer for `)`:
tokenizeParenClose = (input, current) => tokenizeCharacter('paren', ')', input, current)
tokenizeParenClose(')', 0)
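Which, similarly, should evaluate to:
// [1, {type: 'paren', value: ')'}]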
Multiple character tokens
We will describe our multi-character tokens by means of regular expressions.
Here is a generic regexp tokenizer:
tokenizePattern = (type, pattern, input, current) => {
  let char = input[current];
  let consumedChars = 0;
  if (pattern.test(char)) {
    let value = '';
    while (char && pattern.test(char)) {
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars, { type, value }];
  }
  return [0, null];
}
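To see the pattern tokenizer on its own, here is an extra example of mine (expected result as a comment):
tokenizePattern('number', /[0-9]/, '42 13', 0) // [2, {type: 'number', value: '42'}]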
And here is the `number` tokenizer:
tokenizeNumber = (input, current) => tokenizePattern("number", /[0-9]/, input, current)
tokenizeNumber("123aad", 0)
And the `name` tokenizer (in our language names are chains of letters):
tokenizeName = (input, current) => tokenizePattern("name", /[a-z]/i, input, current)
tokenizeName('hello world', 0)
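The space stops the match, so this should return:
// [5, {type: 'name', value: 'hello'}]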
String tokenizer
A string is something that starts with a "
and ends with a "
(no escaping in our language!):
tokenizeString = (input, current) => {
  if (input[current] === '"') {
    let value = '';
    let consumedChars = 0;
    consumedChars++;
    let char = input[current + consumedChars];
    while (char !== '"') {
      if (char === undefined) {
        throw new TypeError("unterminated string");
      }
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars + 1, { type: 'string', value }];
  }
  return [0, null];
}
tokenizeString('"Hello World"', 0)
Last thing, we want to skip whitespace:
skipWhiteSpace = (input, current) => (/\s/.test(input[current])) ? [1, null] : [0, null]
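It consumes one character and emits no token. For instance (my own illustration, results as comments):
skipWhiteSpace(' (add 2)', 0) // [1, null]
skipWhiteSpace('(add 2)', 0)  // [0, null]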
The tokenizer
Let’s put all our tokenizers into an array:
tokenizers = [skipWhiteSpace, tokenizeParenOpen, tokenizeParenClose, tokenizeString, tokenizeNumber, tokenizeName];
The code tokenizer is going to go over its input and try all the tokenizers, and when it finds a match it will:
- push the token object
- update the current position
Here is the code:
tokenizer = (input) => {
  let current = 0;
  let tokens = [];
  while (current < input.length) {
    let tokenized = false;
    tokenizers.forEach(tokenizer_fn => {
      if (tokenized) { return; }
      let [consumedChars, token] = tokenizer_fn(input, current);
      if (consumedChars !== 0) {
        tokenized = true;
        current += consumedChars;
      }
      if (token) {
        tokens.push(token);
      }
    });
    if (!tokenized) {
      throw new TypeError('I dont know what this character is: ' + input[current]);
    }
  }
  return tokens;
}
Let’s see our tokenizer in action:
tokenizer('(add 2 3)')
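If you are following along outside of KLIPSE, this should produce:
// [{type: 'paren', value: '('},
//  {type: 'name', value: 'add'},
//  {type: 'number', value: '2'},
//  {type: 'number', value: '3'},
//  {type: 'paren', value: ')'}]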
Our tokenizer doesn’t do any semantic validation. As an example, it can read unbalanced parentheses:
tokenizer('(add 2')
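No error is thrown - we simply get the three tokens that are there:
// [{type: 'paren', value: '('},
//  {type: 'name', value: 'add'},
//  {type: 'number', value: '2'}]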
Let’s make sure we can handle nested expressions properly:
tokenizer('(add 2 (subtract "314" 2))')
Hooray!!!
Please take a short rest before moving towards Station #2: The parser.