When you start out programming and look at all the possible languages before settling on your first, one idea often pops up: "Hey, I want to make my own programming language!" And so it was for me too, which started a year-long journey that I want to share. Note that I am learning by recreating, without doing research first. Or ever. So many terms are made up as I saw fit. With that being said:
"Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts."
And this describes it pretty well, however it's too broad for our purposes. For us, as programmers, parsing is more a sequence of multiple moving parts. These 5 pillars of parsing are:
1.) Lexing
2.) Analyzing
3.) Optimizing
4.) Compiling
5.) Evaluating
Not all of these steps apply to every job; sometimes, like we do here, you can leave out the optimization (or there even is none to do!). But these patterns hold true for everything from parsing JSON to any modern programming language. We will get to know each one when we get there; for now, let's start with an example and see what these look like in action. The programming language I will be using is C99, due to its simplicity and for an easy translation into any other programming language for those following along.
Code snippets will look like this, and at the end of each chapter, the whole code will be shown once more.
#include <stdio.h>
int main(void) {
puts("Hello, Parsing!");
return 0;
}
Occasionally you will find exercises for the reader.
To start off, let's have a look at the CSV file format, and write a parser for it.
amount;item;importance
5;egg;high
2;apple;low
10;milk;moderate
Now before we can finally start writing some code, we need to analyze the format we are trying to parse. Thankfully CSV is very simple: the first row defines our column names, and each line after that defines a row, filling in the data for each column. For simplicity, we will not deal with different types, and assume each value is a string.
The
This leaves us only with
Finally, we can start implementing. Our goal is to write a program
./csv FILE ROW
which takes in a file and a row number, then prints out the data of that line with the right names for each column. So, as any good C program, we start with our entry point.
int main(int argc, char** argv) {
/* our code will follow here */
return 0;
}
Now let's add what's necessary for handling the arguments.
if(argc != 3) return 1;
const char* file = argv[1];
unsigned int n = atoi(argv[2]);
Next up, we open a file handle and check whether it exists.
FILE* fh = fopen(file, "r");
if(!fh) return 1;
Okay, I promise the trivial part is over now. Let's actually start handling the data stream. Helpful here is
int fgetc(FILE*)
which just fetches us a single character. Since we do not actually care about any of the lines except the one given to us by the user (and the first one, but we will skip that for now), we can just skip forward until we get to the line we need.
Here we will start to write a function for each of these steps, to make the examples more coherent to look at.
int skip_to(FILE* fh, unsigned int line) {
int c = 1, cline = 1;
rewind(fh);
while(c != EOF && cline != line) {
c = fgetc(fh);
if(c == '\n')
++cline;
}
return cline == line;
}
This function first resets using rewind(), and then advances the current file handler until it reaches the line we actually care about, by simply counting the newlines. At the end, we simply return whether we exited because we reached the end of the file or actually got to the line we wanted. So we add this to our main.
if(!skip_to(fh, n)) return 1;
Great, we are at the start of our line, now what? Looking at the example CSV above, we see each piece of data is split by a semicolon (;). It would be good to have a function that just gives us the next piece of data, which we can call repeatedly to fill our row. So let's implement that function.
char* next(FILE* fh) {
char* buffer = calloc(256, 1);
char* ptr = buffer;
    int c; /* int, not char: fgetc() returns int so EOF can be detected */
for(;;) {
c = fgetc(fh);
if(c == '\n' || c == EOF || c == ';') break;
*ptr = c;
++ptr;
}
return buffer;
}
What this function does is allocate some memory (256 bytes for now, to keep it simple) and keep appending the latest character to it, until we either encounter the end of the line, the end of the file, or our separator character. Notice how we still take the separator off the stream, so our next call to this function does not need to clear it. A visualization of what this function does:
/* stream = "3;egg;high" */
next(stream); // "3"
next(stream); // "egg"
next(stream); // "high"
This seems to work, now we can finish the first prototype of our program. It should for now just print the whole row at the given line. So back in our main:
char* data;
for(;;) {
data = next(fh);
puts(data);
free(data);
}
We are almost done, but what's importantly missing here is the condition to exit this loop. If we reach the end of the file, we just keep returning empty strings forever. Let's fix this by changing our next() function.
char* next(FILE* fh) {
/* ... */
for(;;) {
c = fgetc(fh);
if(c == ';') break;
if(c == '\n' || c == EOF) {
if(ptr - buffer == 0) {
free(buffer);
return NULL;
}
ungetc(c, fh);
return buffer;
}
/* ... */
}
/* ... */
}
Now, whenever we encounter the end of a line or of the file, we check whether our buffer is empty, and if it is, we return NULL. Otherwise, we push the EOF or newline back onto the stream and return the buffer. Why? Next time we call next(fh), it will immediately see that character again and return NULL, causing us to break out of our loop. The expression ptr - buffer is the number of characters we have already put into our buffer. Changing our loop in the main function is now trivial:
char* data;
while((data = next(fh))) {
puts(data);
free(data);
}
If we didn't unget the newline, this would loop through the entire file. It's time to give our program a test drive, and see how it does.
$ cat test.csv
amount;item;importance
5;egg;high
2;apple;low
10;milk;moderate
$ ./csv test.csv 2
5
egg
high
$ ./csv test.csv 1
amount
item
importance
Good, this seems to be working as expected. What's left now is to integrate the header of this CSV file into our output. We already know how to extract the data from one line using next(). This time, the data has to be stored; later, when we look up a line, we can use the same index into this list of header elements to see what our current column is actually about. For practice, we will opt for a linked list over an array. So, we will need a struct that saves a header entry and points to the next one.
struct header {
char* name;
struct header* next;
};
For convenience, we will write a function that simply appends an element to our list.
struct header* h_append(struct header* list, char* name) {
if(list->name == NULL) {
list->name = name;
return list;
}
while(list->next) list = list->next;
list->next = malloc(sizeof(struct header));
list->next->name = name;
list->next->next = NULL;
return list;
}
It finds the last element in the list, creates a new entry and sets the name. So far so good. Next up is a function to get a header name by index.
char* h_get(struct header* list, unsigned int idx) {
while(idx != 0 && list->next) {
--idx;
list = list->next;
}
return list->name;
}
For ease of implementation, indices greater than the size of the list simply return the last element. Having all this, we can go back to our main() function and fill up our header list.
int main(int argc, char** argv) {
/* ... */
struct header h_list = {0};
char* h_data;
while((h_data = next(fh)))
h_append(&h_list, h_data);
if(!skip_to(fh, n)) return 1;
/* ... */
}
After we skip to the line we actually want to print, we can track the index as we go along, and fetch the header.
int main(int argc, char** argv) {
/* ... */
if(!skip_to(fh, n)) return 1;
unsigned int idx = 0;
char* data;
while((data = next(fh))) {
printf("%s: %s\n", h_get(&h_list, idx), data);
free(data);
++idx;
}
return 0;
}
To reiterate: 1.) we looked at an example of what we wanted to parse, 2.) decided which steps of parsing are necessary, and 3.) wrote it, ending up with a CSV parser that can be extended as much as you need.
To sum up this chapter, here is the final program:
#include <stdio.h>
#include <stdlib.h>

struct header {
    char* name;
    struct header* next;
};

struct header* h_append(struct header* list, char* name) {
    if(list->name == NULL) {
        list->name = name;
        return list;
    }
    while(list->next) list = list->next;
    list->next = malloc(sizeof(struct header));
    list->next->name = name;
    list->next->next = NULL;
    return list;
}

char* h_get(struct header* list, unsigned int idx) {
    while(idx != 0 && list->next) {
        --idx;
        list = list->next;
    }
    return list->name;
}

int skip_to(FILE* fh, unsigned int line) {
    int c = 1, cline = 1;
    rewind(fh);
    while(c != EOF && cline != line) {
        c = fgetc(fh);
        if(c == '\n')
            ++cline;
    }
    return cline == line;
}

char* next(FILE* fh) {
    char* buffer = calloc(256, 1);
    char* ptr = buffer;
    int c;
    for(;;) {
        c = fgetc(fh);
        if(c == ';') break;
        if(c == '\n' || c == EOF) {
            if(ptr - buffer == 0) {
                free(buffer);
                return NULL;
            }
            ungetc(c, fh);
            return buffer;
        }
        *ptr = c;
        ++ptr;
    }
    return buffer;
}

int main(int argc, char** argv) {
    if(argc != 3) return 1;
    const char* file = argv[1];
    unsigned int n = atoi(argv[2]);
    FILE* fh = fopen(file, "r");
    if(!fh) return 1;
    struct header h_list = {0};
    char* h_data;
    while((h_data = next(fh)))
        h_append(&h_list, h_data);
    if(!skip_to(fh, n)) return 1;
    unsigned int idx = 0;
    char* data;
    while((data = next(fh))) {
        printf("%s: %s\n", h_get(&h_list, idx), data);
        free(data);
        ++idx;
    }
    return 0;
}
BASIC, or "Beginner's All-purpose Symbolic Instruction Code", created in 1964, is, as the name implies, perfect for beginners. And as such, it is also perfect for diving into a possible implementation of the most fundamental language features. This is of course a lot more complex than the CSV parser from last chapter, but do not worry. First, we need to look at the BASIC syntax we want to implement. Here is an example of what I have in mind:
10 PRINT "Hello World"
20 LET VAR 0
30 IF VAR IS 10 GOTO 100
40 PRINT VAR
50 ADD VAR 1
60 GOTO 30
70 EXIT 0
Now, let's see what the
This time we will also have a look at more data structures and start separating the individual parts more clearly; for our CSV implementation it was more convenient to have everything in one place. The goal this time: write an application that can run a subset of BASIC when given a file to run.
./basic file.basic
To start off, we will need to write a lexer that takes in the file and splits it into separate lines. Before we can do that, let's have an exact look at how each BASIC line is structured; it is essential to understand what you are trying to parse before doing so:
LINE COMMAND ARG1 ARG2 ... ARGN
Our lexer could already separate them like this; however, determining what these tokens mean is more the job of the analyzing process, so we will put that into its own function. That being said, it's time to actually start writing this. We will first need to separate the file into separate lines, and then split each line into separate tokens. An adequate data structure is also required, which will be returned by our function.
struct line {
char** tokens;
size_t capacity;
size_t size;
};
struct lexed {
struct line* lines;
size_t len;
};
For those who have worked with C before, this line struct will look familiar: it's an array list. Since we don't know yet how many tokens are on a line, we need to dynamically grow our list. On the other hand, by counting the newline characters in the whole file (+ 1) we know exactly what the maximum number of lines is. Let's implement this behaviour right away.
struct lexed lexer(const char* context) {
struct lexed lexed = {0};
lexed.len = 1;
for(size_t i = 0; i < strlen(context); ++i)
if(context[i] == '\n')
++lexed.len;
lexed.lines = malloc(lexed.len * sizeof(struct line));
return lexed;
}
What's next? The algorithm we will be using to determine where the tokens are in our line is rather simple. Essentially, we swap between two modes: "in a token" and "in whitespace". To keep track of this, we need some variables. If we are in a token, we do nothing until we reach a whitespace, in which case we have reached the end of the token and can push it onto our line.
struct lexed lexer(const char* context) {
/* ... */
struct line line = {0};
size_t token_len = 0;
size_t cur_line = 0;
int in_token = 0;
/* ... */
}
Our current line, the length of the current token, the line number we are on, and a variable to store in which mode we are at the moment. We start with being in the "in whitespace" mode for now. We can start with iterating over each element, and check what it is.
struct lexed lexer(const char* context) {
/* ... */
int in_token = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if(isspace(context[i]) || context[i] == '\0') {
}
else {
}
}
/* ... */
}
The reason for iterating over not only the length of the string, but also the string terminating character is that the end of the input technically also counts as a whitespace and should terminate the current token.
Now what's the actual goal, when do we switch modes and what does that actually do? The idea is to place token_len at the beginning of a token when we encounter it, and once it ends take this slice, push it onto our line, and continue. So, in short, if we switch from whitespaces to token mode, we save the current position in token_len.
struct lexed lexer(const char* context) {
/* ... */
int in_token = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if(isspace(context[i]) || context[i] == '\0') {
}
else {
if(!in_token) {
in_token = 1;
token_len = i;
}
}
}
/* ... */
}
We implemented the switch to the token mode, now let's implement switching back, and saving the token into our line. Important to remember here is that if we encounter a whitespace, we only want to switch if we were in a token before.
struct lexed lexer(const char* context) {
/* ... */
int in_token = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if(isspace(context[i]) || context[i] == '\0') {
/* The keen eye might want to optimize this nesting */
/* However it will make sense later why we do it like this */
if(in_token) {
char* token = calloc(i - token_len + 1, 1);
memcpy(token, context + token_len, i - token_len);
                if(line.capacity == line.size) {
                    /* grow the array; any factor > 1 will work,
                       but a capacity of 0 needs a starting value */
                    line.capacity = line.capacity ? line.capacity * 2 : 4;
                    line.tokens = realloc(line.tokens,
                        line.capacity * sizeof(char*));
                }
line.tokens[line.size++] = token;
in_token = 0;
}
}
/* ... */
}
/* ... */
}
Good. This is a bigger code snippet this time, but if you take a look at it, most of it is simply growing our dynamic array when it becomes full, and copying our token over. However, what we are still missing is handling line switches. Our criterion for doing so is reaching a newline or null terminator; then we push our line into our lexed struct and increase cur_line.
struct lexed lexer(const char* context) {
/* ... */
for(size_t i = 0; i < strlen(context) + 1; ++i) {
/* ... */
if(context[i] == '\n' || context[i] == '\0') {
lexed.lines[cur_line] = line;
++cur_line;
line.size = line.capacity = 0;
line.tokens = NULL;
}
}
/* ... */
}
Thankfully we already handled terminating the current token on a newline/terminator via isspace(), so no need to worry about an empty token, unless there was no token on the line at all, which we will say is desired behaviour, as we keep the line information.
A good idea when writing parsers is to give them a try whenever you finish some part. It's almost like taking your car for a drive after giving it some upgrades.
int main(void) {
struct lexed lexed = lexer("Hello, World!\n Test 123 ");
    printf("Lines: %zu\n", lexed.len);
for(size_t line = 0; line < lexed.len; ++line) {
for(size_t i = 0; i < lexed.lines[line].size; ++i)
puts(lexed.lines[line].tokens[i]);
puts("---");
}
}
Lines: 2
Hello,
World!
---
Test
123
---
Great. There is only one issue we didn't tackle yet: strings, or rather quotes (") in general. Thankfully this behaviour is rather simple: we toggle a variable in_quote whenever we encounter one, and ignore whitespace while we are "in a quotation".
struct lexed lexer(const char* context) {
/* ... */
int in_token = 0;
int in_quote = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if((isspace(context[i]) && !in_quote) || context[i] == '\0') {
if(in_token) {
/* ... */
}
}
else {
if(context[i] == '"') {
in_quote = !in_quote;
}
/* ... */
}
if(context[i] == '\n' || context[i] == '\0') {
/* ... */
in_quote = 0;
}
}
/* ... */
}
Again, a test drive. This time with 'Hello "World !"\n "Good Job"'.
Lines: 2
Hello
"World !"
---
"Good Job"
---
This is working now. Our function has become rather complex, but it will do its job for now. So as a whole:
struct line {
char** tokens;
size_t capacity;
size_t size;
};
struct lexed {
struct line* lines;
size_t len;
};
struct lexed lexer(const char* context) {
struct lexed lexed = {0};
lexed.len = 1;
for(size_t i = 0; i < strlen(context); ++i)
if(context[i] == '\n')
++lexed.len;
lexed.lines = malloc(lexed.len * sizeof(struct line));
struct line line = {0};
size_t token_len = 0;
size_t cur_line = 0;
int in_quote = 0;
int in_token = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if((isspace(context[i]) && !in_quote) || context[i] == '\0') {
if(in_token) {
char* token = calloc(i - token_len + 1, 1);
memcpy(token, context + token_len, i - token_len);
if(line.capacity == line.size) {
                line.capacity = line.capacity ? line.capacity * 2 : 4;
line.tokens = realloc(line.tokens,
line.capacity * sizeof(char*));
}
line.tokens[line.size++] = token;
in_token = 0;
}
}
else {
if(context[i] == '"') {
in_quote = !in_quote;
}
if(!in_token) {
in_token = 1;
token_len = i;
}
}
if(context[i] == '\n' || context[i] == '\0') {
lexed.lines[cur_line] = line;
++cur_line;
line.size = line.capacity = 0;
line.tokens = NULL;
}
}
return lexed;
}
Next up,
To translate a token we just lexed into an analyzed token, we will need some structure to hold it. In C, using an enum for the type information and a union for the actual value is preferable; this is often called a "tagged union".
struct analyzed_token {
enum {
T_NUMBER,
T_STRING,
T_IDENTIFIER,
T_UNKNOWN,
} type;
union {
long number;
char* string;
char* identifier;
};
};
T_UNKNOWN is for error handling, invalid tokens will cause our program to exit later on.
Previously we split our input stream into lines and then tokens; now we take one of these tokens and ask what it actually means: what is this token? To do so, we will write a function that converts a string token into our analyzed token.
struct analyzed_token analyze_token(char* token) {
struct analyzed_token analyzed;
}
This function now does simple checks, one after the other, until it knows what type of token it was given, then constructs an analyzed_token structure and returns it. Starting out: T_STRING.
Strings start and end with quotes, and must be more than one character long of course, so the check becomes trivial.
struct analyzed_token analyze_token(char* token) {
struct analyzed_token analyzed;
if(strlen(token) > 1 && token[0] == '"' && token[strlen(token)-1] == '"') {
}
}
Due to C not allowing us to simply "cut off" the start/end, we will copy the token and leave out the two quotes.
struct analyzed_token analyze_token(char* token) {
struct analyzed_token analyzed;
if(strlen(token) > 1 && token[0] == '"' && token[strlen(token)-1] == '"') {
size_t content_len = strlen(token) - 1;
char* content = malloc(content_len);
memset(content, 0, content_len);
memcpy(content, token + 1, content_len - 1);
analyzed.type = T_STRING;
analyzed.string = content;
free(token);
return analyzed;
}
}
Good. However, the next two types we have to deal with won't be this easy, as we have to check every character of the token. For numbers, all characters have to be digits; for identifiers, alphanumeric characters. Luckily we can check both at the same time, so let's do that quickly.
struct analyzed_token analyze_token(char* token) {
/* ... */
int all_digit = 1;
int all_alnum = 1;
for(size_t i = 0; i < strlen(token); ++i) {
all_digit &= isdigit(token[i]) > 0;
all_alnum &= isalnum(token[i]) > 0;
}
if(strlen(token) != 0)
all_alnum &= isalpha(token[0]) > 0;
}
Since identifiers are not allowed to start with numbers, we add an additional check at the end. Now, using the gathered information, we can check whether it's a number or an identifier.
struct analyzed_token analyze_token(char* token) {
/* ... */
if(strlen(token) != 0 && all_digit) {
analyzed.type = T_NUMBER;
}
}
Having set the type, all that's left is to convert the token to a number, free it, and return the new analyzed token. For simplicity we do not support floats; this could however be an exercise for the reader.
struct analyzed_token analyze_token(char* token) {
/* ... */
if(strlen(token) != 0 && all_digit) {
analyzed.type = T_NUMBER;
analyzed.number = 0;
for(size_t i = 0; i < strlen(token); ++i)
analyzed.number = analyzed.number * 10 + token[i] - '0';
free(token);
return analyzed;
}
}
Identifiers are simpler: if the previous check succeeds, we hand the token over as the identifier and return, done.
struct analyzed_token analyze_token(char* token) {
/* ... */
if(strlen(token) != 0 && all_alnum) {
analyzed.type = T_IDENTIFIER;
analyzed.identifier = token;
return analyzed;
}
}
Good. But what do we do if all our checks fail? Well, it's not a number, nor a string, nor does it match the criteria for an identifier, so our program cannot handle it: it's an unknown token.
struct analyzed_token analyze_token(char* token) {
/* ... */
free(token);
analyzed.type = T_UNKNOWN;
return analyzed;
}
So what did we actually do? Let's showcase it with an example. Given an input, we should now be able to tokenize it and print the correct types. Let's write a short program that does that.
int main(void) {
const char* types[] = {"NUMBER", "STRING", "IDENTIFIER", "UNKNOWN"};
struct lexed lexed = lexer(".......");
for(size_t line = 0; line < lexed.len; ++line) {
for(size_t i = 0; i < lexed.lines[line].size; ++i) {
struct analyzed_token at = analyze_token(lexed.lines[line].tokens[i]);
switch(at.type) {
case T_NUMBER:
printf("%ld", at.number);
break;
case T_STRING:
printf("\"%s\"", at.string);
break;
case T_IDENTIFIER:
printf("%s", at.identifier);
break;
case T_UNKNOWN:
printf("??");
}
printf(" (%s) ", types[at.type]);
}
printf("\n");
}
}
Combining our lexer and analyzer, let's boot this up. Using as input:
LET var 12
LET text "example"
PRINT var
We get:
LET (IDENTIFIER) var (IDENTIFIER) 12 (NUMBER)
LET (IDENTIFIER) text (IDENTIFIER) "example" (STRING)
PRINT (IDENTIFIER) var (IDENTIFIER)
Perfect, this is working as intended. However, we still need a function that takes in our lexed lines and returns each token properly analyzed. For this, we will take the structs we already defined for our lexer and transform them into an analyzed variant. Thankfully the sizes of the lists are now known, so there is no need to grow them dynamically.
struct analyzed_line {
struct analyzed_token* tokens;
size_t size;
};
struct analyzed {
struct analyzed_line* lines;
size_t len;
};
Since analyze_token() is already done, our final analyzer function becomes rather simple. We go line by line, converting each token into an analyzed_token and appending it to the analyzed_line.
struct analyzed analyzer(struct lexed lexed) {
struct analyzed analyzed = {0};
analyzed.len = lexed.len;
analyzed.lines = calloc(analyzed.len, sizeof(struct analyzed_line));
for(size_t i = 0; i < lexed.len; ++i) {
struct analyzed_line aline = {0};
aline.size = lexed.lines[i].size;
aline.tokens = calloc(aline.size, sizeof(struct analyzed_token));
for(size_t j = 0; j < aline.size; ++j)
aline.tokens[j] = analyze_token(lexed.lines[i].tokens[j]);
free(lexed.lines[i].tokens);
analyzed.lines[i] = aline;
}
free(lexed.lines);
return analyzed;
}
Converting text to a stream of analyzed tokens is now trivial:
struct analyzed analyzed = analyzer(lexer("......."));
To summarize our analyzing step:
struct analyzed_token {
enum {
T_NUMBER,
T_STRING,
T_IDENTIFIER,
T_UNKNOWN,
} type;
union {
long number;
char* string;
char* identifier;
};
};
struct analyzed_line {
struct analyzed_token* tokens;
size_t size;
};
struct analyzed {
struct analyzed_line* lines;
size_t len;
};
struct analyzed_token analyze_token(char* token) {
struct analyzed_token analyzed;
if(strlen(token) > 1 && token[0] == '"' && token[strlen(token)-1] == '"') {
size_t content_len = strlen(token) - 1;
char* content = malloc(content_len);
memset(content, 0, content_len);
memcpy(content, token + 1, content_len - 1);
analyzed.type = T_STRING;
analyzed.string = content;
free(token);
return analyzed;
}
int all_digit = 1;
int all_alnum = 1;
for(size_t i = 0; i < strlen(token); ++i) {
all_digit &= isdigit(token[i]) > 0;
all_alnum &= isalnum(token[i]) > 0;
}
if(strlen(token) != 0)
all_alnum &= isalpha(token[0]) > 0;
if(strlen(token) != 0 && all_digit) {
analyzed.type = T_NUMBER;
analyzed.number = 0;
for(size_t i = 0; i < strlen(token); ++i)
analyzed.number = analyzed.number * 10 + token[i] - '0';
free(token);
return analyzed;
}
if(strlen(token) != 0 && all_alnum) {
analyzed.type = T_IDENTIFIER;
analyzed.identifier = token;
return analyzed;
}
free(token);
analyzed.type = T_UNKNOWN;
return analyzed;
}
struct analyzed analyzer(struct lexed lexed) {
struct analyzed analyzed = {0};
analyzed.len = lexed.len;
analyzed.lines = calloc(analyzed.len, sizeof(struct analyzed_line));
for(size_t i = 0; i < lexed.len; ++i) {
struct analyzed_line aline = {0};
aline.size = lexed.lines[i].size;
aline.tokens = calloc(aline.size, sizeof(struct analyzed_token));
for(size_t j = 0; j < aline.size; ++j)
aline.tokens[j] = analyze_token(lexed.lines[i].tokens[j]);
free(lexed.lines[i].tokens);
analyzed.lines[i] = aline;
}
free(lexed.lines);
return analyzed;
}
Since we are skipping the optimizing and compiling steps, we go straight to evaluating. In BASIC, we think of the program as a collection of lines ordered by their line number. A program pointer starts at the top and moves downwards, executing each line as it reaches it. Recalling the BASIC syntax, the line number is declared as the very first token. Finding these could be part of the analyzer; however, since I wanted a clear separation between interpreting the tokens in a context and the tokens on their own, we will move this into our evaluater() function.
This also marks the first time our program will be able to fail; to aid with cleaner syntax we will introduce two macros.
#define fail(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__), exit(1)
#define assure(expr, fmt, ...) if(!(expr)) fail(fmt, __VA_ARGS__)
To start off, we define a function that evaluates a given analyzed struct; this time, however, we will not return anything.
void evaluater(struct analyzed analyzed) {
}
Next we need to see what the lines are indexed with (the first token, which should be a number). To store this, we will simply use a static array.
void evaluater(struct analyzed analyzed) {
struct analyzed_line* indexed_lines[256] = {0};
for(size_t i = 0; i < analyzed.len; ++i) {
struct analyzed_line line = analyzed.lines[i];
if(line.size == 0)
continue;
        assure(line.tokens[0].type == T_NUMBER,
            "line(%zu): no line number\n", i);
        assure(line.tokens[0].number >= 0,
            "line(%zu): line number is negative\n", i);
        assure(line.tokens[0].number < 256,
            "line(%zu): line number is too large\n", i);
        indexed_lines[line.tokens[0].number] = &analyzed.lines[i];
}
}
With our list prepared, keeping track of the program counter and iterating over every line that is not NULL is trivial.
void evaluater(struct analyzed analyzed) {
/* ... */
for(size_t pc = 0; pc < 256; ++pc) {
if(indexed_lines[pc] == NULL) continue;
pc = evaluate_line(pc, indexed_lines[pc]);
}
}
Since the only control structure in our BASIC implementation will be gotos, our evaluate_line() function (which we will implement shortly) will take in the program counter, and return where it should jump to.
Since our
Each instruction we support will be its own function too, doing its own error handling and argument parsing.
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
}
First of all, we skip lines that are empty (only contain the line number). They are not an error per se, as we can still jump to them.
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
if(line->size == 1)
return pc;
}
Then we check the first token after the line number, which has to be an identifier. The first statement we will implement is PRINT; as the name implies, it outputs all its arguments.
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
/* ... */
struct analyzed_token statement = line->tokens[1];
assure(statement.type == T_IDENTIFIER,
        "pc(%zu): no statement\n", pc);
if(!strcmp(statement.identifier, "PRINT"))
return stm_print(pc, line);
    fail("pc(%zu): unknown statement: \"%s\"\n", pc, statement.identifier);
return pc;
}
We are very close to running our first program; only stm_print() needs to be implemented, which is rather simple. While iterating over the arguments, we check the type of each one and print the correct union member. Otherwise, we fail.
size_t stm_print(size_t pc, struct analyzed_line* line) {
for(size_t i = 2; i < line->size; ++i) {
struct analyzed_token arg = line->tokens[i];
switch(arg.type) {
case T_NUMBER:
printf("%ld", arg.number);
break;
case T_STRING:
printf("%s", arg.string);
break;
default:
                fail("pc(%zu): cannot print arg number %zu\n", pc, i);
break;
}
}
printf("\n");
return pc;
}
Finally, we can run our first program; the code for doing so is rather simple too. Let's do it.
evaluater(analyzer(lexer(code)));
In goes:
10 PRINT "Hello"
20 PRINT "World!"
5 PRINT 1 2 3
Out comes:
1 2 3
Hello
World!
Great. An integral part of programming is the ability to store values for later, known as variables. Our BASIC implementation should have them too. To achieve this, we will use a linked list, which is rather simple to implement with enough practice. It should store the name of the variable, the value, and the next element in the list (which is what creates the linked list model).
struct vardecl {
char* name;
struct analyzed_token value;
struct vardecl* next;
};
There aren't any scopes in (our version of) BASIC, so we only need one list. We will call it head.
static struct vardecl* head = NULL;
The operations required to interact with this list are getvar() and setvar(). Let's start with getvar().
To get the variable we need, we iterate over each one, check the name and if it matches, we return it.
struct vardecl* getvar(const char* name) {
struct vardecl* current = head, *last;
while(current != NULL) {
if(!strcmp(current->name, name))
return current;
last = current;
current = current->next;
}
}
If we cannot find it, we create a new variable and copy over the name.
struct vardecl* getvar(const char* name) {
/* ... */
current = malloc(sizeof(struct vardecl));
current->value.type = T_UNKNOWN;
    current->next = NULL; /* the new node is the last one in the list */
current->name = calloc(strlen(name) + 1, 1);
strcpy(current->name, name);
}
Lastly, if our list is empty, this new element is the start of the list, otherwise, attach it to the end of it.
struct vardecl* getvar(const char* name) {
/* ... */
if(head == NULL)
head = current;
else
last->next = current;
return current;
}
Wasn't so hard, now was it? Best part: implementing setvar() is now trivial, as it just calls getvar().
void setvar(const char* name, struct analyzed_token value) {
getvar(name)->value = value;
}
To actually use this functionality, though, we require a new statement: LET. It will define and reassign variables for us.
size_t stm_let(size_t pc, struct analyzed_line* line) {
return pc;
}
Let's fill it with some necessary checks, like the correct amount of arguments, the correct typing, etc.
size_t stm_let(size_t pc, struct analyzed_line* line) {
assure(line->size == 4,
        "pc(%zu): syntax is LET var value\n", pc);
struct analyzed_token var = line->tokens[2];
assure(var.type == T_IDENTIFIER,
        "pc(%zu): syntax is LET var value\n", pc);
struct analyzed_token value = line->tokens[3];
/* ... */
}
The way we set up our setvar() lets us pass the analyzed token along as the value. The only problem: if the value we want to assign is itself a variable, we first need to getvar() said value.
size_t stm_let(size_t pc, struct analyzed_line* line) {
/* ... */
if(value.type == T_IDENTIFIER)
setvar(var.identifier, getvar(value.identifier)->value);
else
setvar(var.identifier, value);
return pc;
}
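Since the value in a LET may itself be an identifier, we can now copy one variable into another. A quick example program we can feed to our interpreter:

```
10 LET x 5
20 LET y x
30 PRINT "y = " y
```

which, with the PRINT statement from before, should print y = 5.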
Quickly adding it to our list of known statements...
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
/* ... */
if(!strcmp(statement.identifier, "LET"))
return stm_let(pc, line);
/* ... */
}
We are now able to set variables, however we cannot use them yet, as our stm_print() function does not allow identifiers. If we want to print out variables as well, we need to restructure our code a bit. We start by creating a function called print_token(), which prints out exactly one token that's given to it, by moving over the code from our previous stm_print() function.
void print_token(size_t pc, struct analyzed_token arg) {
switch(arg.type) {
case T_NUMBER:
printf("%ld", arg.number);
break;
case T_STRING:
printf("%s", arg.string);
break;
case T_IDENTIFIER:
break;
default:
fail("pc(%ld): cannot print a value of type UNKNOWN\n", pc);
break;
}
}
If the token is a variable, we call the function recursively, passing it the result of the getvar() call.
void print_token(size_t pc, struct analyzed_token arg) {
switch(arg.type) {
/* ... */
case T_IDENTIFIER:
print_token(pc, getvar(arg.identifier)->value);
break;
/* ... */
}
}
To finalize our initial statement function:
size_t stm_print(size_t pc, struct analyzed_line* line) {
for(size_t i = 2; i < line->size; ++i) {
struct analyzed_token arg = line->tokens[i];
print_token(pc, arg);
}
/* ... */
}
Good. Finally, we give it a test run yet again.
10 LET x 30
20 PRINT "x = " x
30 LET x 20
40 PRINT "x = " x
Results in:
x = 30
x = 20
Lastly, we will add control flow. A simple IF should do the trick. It will take 2 values, compare them, and if they are the same, jump to a specific line. For a statement to modify the program counter, it must return its new value. Let's start by implementing the stm_if() function.
size_t stm_if(size_t pc, struct analyzed_line* line) {
}
There will be quite a few assure() calls to enforce the correct keyword usage.
size_t stm_if(size_t pc, struct analyzed_line* line) {
assure(line->size == 7,
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
assure(line->tokens[3].type == T_IDENTIFIER &&
!strcmp(line->tokens[3].identifier, "IS"),
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
assure(line->tokens[5].type == T_IDENTIFIER &&
!strcmp(line->tokens[5].identifier, "GOTO"),
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
}
The left and right hand sides, as well as the target we want to jump to, can be variables, so just in case, we check and call getvar().
size_t stm_if(size_t pc, struct analyzed_line* line) {
/* ... */
struct analyzed_token lhs = line->tokens[2];
struct analyzed_token rhs = line->tokens[4];
struct analyzed_token target = line->tokens[6];
if(lhs.type == T_IDENTIFIER)
lhs = getvar(lhs.identifier)->value;
if(rhs.type == T_IDENTIFIER)
rhs = getvar(rhs.identifier)->value;
if(target.type == T_IDENTIFIER)
target = getvar(target.identifier)->value;
assure(target.type == T_NUMBER && target.number >= 0,
"pc(%ld): can only GOTO positive numbers\n", pc);
}
Now that we have all our values sanitized and prepared, we can check whether lhs and rhs are equal. If so, we return target - 1: the evaluator's loop increments pc after every line, so returning one less lands us exactly on the target line. We can already determine that the expression is false if the types of the 2 values don't match, otherwise we use a switch statement.
size_t stm_if(size_t pc, struct analyzed_line* line) {
/* ... */
if(lhs.type != rhs.type)
return pc;
switch(lhs.type) {
case T_NUMBER:
if(lhs.number == rhs.number)
return target.number - 1;
break;
case T_STRING:
if(!strcmp(lhs.string, rhs.string))
return target.number - 1;
break;
default:
fail("pc(%ld): cannot compare 2 unknown values\n", pc);
}
return pc;
}
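Note that the switch above also covers the T_STRING case, so IF is not limited to numbers; strings are compared via strcmp() as well. A small example program for our interpreter:

```
10 LET greeting "hello"
20 IF greeting IS "hello" GOTO 40
30 PRINT "strings differ"
40 PRINT "strings match"
```

which should only print strings match, since the jump on line 20 skips line 30.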
After registering the function inside of our line evaluator:
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
/* ... */
if(!strcmp(statement.identifier, "IF"))
return stm_if(pc, line);
/* ... */
}
We can once more test our final product.
0 PRINT "Does IF work...."
10 LET x 33
20 IF x IS 33 GOTO 40
30 PRINT "No"
40 PRINT "Yes"
Evaluate:
Does IF work....
Yes
To come back to our original goal, we still need to add some code to read in an entire file. Let's not waste too much time on this and speedrun it.
char* file(const char* name) {
FILE* file = fopen(name, "r");
if(!file)
fail("unable to open %s\n", name);
fseek(file, 0L, SEEK_END);
size_t size = ftell(file);
rewind(file);
char* filecontent = malloc(size + 1);
fread(filecontent, 1, size, file);
filecontent[size] = '\0';
fclose(file);
return filecontent;
}
int main(int argc, char** argv) {
assure(argc == 2,
"%s FILE.basic\n", argv[0]);
evaluater(analyzer(lexer(file(argv[1]))));
}
Final test:
$ cat hello.basic
10 PRINT "Welcome to BASIC!"
20 LET x 10
30 IF x IS 0 GOTO 70
40 PRINT x
50 LET x 0
60 IF 0 IS 0 GOTO 30
70 PRINT "Goodbye!"
$ ./basic hello.basic
Welcome to BASIC!
10
Goodbye!
This chapter has been quite long, but we learned how to split the different components into their own functions; the last line of the main function in particular shows how independently these can work. In later chapters we will have a look at more complex concepts of parsing, but this was a good lesson on when simplicity can be more efficient and faster to write than the giant flagships we will cover later for more advanced syntax.
Obligatory end of chapter code share:
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
struct line {
char** tokens;
size_t capacity;
size_t size;
};
struct lexed {
struct line* lines;
size_t len;
};
struct lexed lexer(const char* context) {
struct lexed lexed = {0};
lexed.len = 1;
for(size_t i = 0; i < strlen(context); ++i)
if(context[i] == '\n')
++lexed.len;
lexed.lines = malloc(lexed.len * sizeof(struct line));
struct line line = {0};
size_t token_len = 0;
size_t cur_line = 0;
int in_quote = 0;
int in_token = 0;
for(size_t i = 0; i < strlen(context) + 1; ++i) {
if((isspace(context[i]) && !in_quote) || context[i] == '\0') {
if(in_token) {
char* token = calloc(i - token_len + 1, 1);
memcpy(token, context + token_len, i - token_len);
if(line.capacity <= line.size) {
if(line.capacity == 0) line.capacity = 1;
line.capacity *= 2;
line.tokens = realloc(line.tokens, line.capacity * sizeof(char*));
}
line.tokens[line.size++] = token;
in_token = 0;
}
}
else {
if(context[i] == '"') {
in_quote = !in_quote;
}
if(!in_token) {
in_token = 1;
token_len = i;
}
}
if(context[i] == '\n' || context[i] == '\0') {
lexed.lines[cur_line] = line;
++cur_line;
line.size = line.capacity = 0;
line.tokens = NULL;
}
}
return lexed;
}
struct analyzed_token {
enum {
T_NUMBER,
T_STRING,
T_IDENTIFIER,
T_UNKNOWN,
} type;
union {
long number;
char* string;
char* identifier;
};
};
struct analyzed_line {
struct analyzed_token* tokens;
size_t size;
};
struct analyzed {
struct analyzed_line* lines;
size_t len;
};
struct analyzed_token analyze_token(char* token) {
struct analyzed_token analyzed;
if(strlen(token) > 1 && token[0] == '"' && token[strlen(token)-1] == '"') {
size_t content_len = strlen(token) - 1;
char* content = malloc(content_len);
memset(content, 0, content_len);
memcpy(content, token + 1, content_len - 1);
analyzed.type = T_STRING;
analyzed.string = content;
free(token);
return analyzed;
}
int all_digit = 1;
int all_alnum = 1;
for(size_t i = 0; i < strlen(token); ++i) {
all_digit &= isdigit(token[i]) > 0;
all_alnum &= isalnum(token[i]) > 0;
}
if(strlen(token) != 0)
all_alnum &= isalpha(token[0]) > 0;
if(strlen(token) != 0 && all_digit) {
analyzed.type = T_NUMBER;
analyzed.number = 0;
for(size_t i = 0; i < strlen(token); ++i)
analyzed.number = analyzed.number * 10 + token[i] - '0';
free(token);
return analyzed;
}
if(strlen(token) != 0 && all_alnum) {
analyzed.type = T_IDENTIFIER;
analyzed.identifier = token;
return analyzed;
}
free(token);
analyzed.type = T_UNKNOWN;
return analyzed;
}
struct analyzed analyzer(struct lexed lexed) {
struct analyzed analyzed = {0};
analyzed.len = lexed.len;
analyzed.lines = calloc(analyzed.len, sizeof(struct analyzed_line));
for(size_t i = 0; i < lexed.len; ++i) {
struct analyzed_line aline = {0};
aline.size = lexed.lines[i].size;
aline.tokens = calloc(aline.size, sizeof(struct analyzed_token));
for(size_t j = 0; j < aline.size; ++j)
aline.tokens[j] = analyze_token(lexed.lines[i].tokens[j]);
free(lexed.lines[i].tokens);
analyzed.lines[i] = aline;
}
free(lexed.lines);
return analyzed;
}
#define fail(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__), exit(1)
#define assure(expr, fmt, ...) if(!(expr)) fail(fmt, __VA_ARGS__)
struct vardecl {
char* name;
struct analyzed_token value;
struct vardecl* next;
};
static struct vardecl* head = NULL;
struct vardecl* getvar(const char* name) {
struct vardecl* current = head, *last = NULL;
while(current != NULL) {
if(!strcmp(current->name, name))
return current;
last = current;
current = current->next;
}
current = malloc(sizeof(struct vardecl));
current->value.type = T_UNKNOWN;
current->next = NULL;
current->name = calloc(strlen(name) + 1, 1);
strcpy(current->name, name);
if(head == NULL)
head = current;
else
last->next = current;
return current;
}
void setvar(const char* name, struct analyzed_token value) {
getvar(name)->value = value;
}
void print_token(size_t pc, struct analyzed_token arg) {
switch(arg.type) {
case T_NUMBER:
printf("%ld", arg.number);
break;
case T_STRING:
printf("%s", arg.string);
break;
case T_IDENTIFIER:
print_token(pc, getvar(arg.identifier)->value);
break;
default:
fail("pc(%ld): cannot print a value of type UNKNOWN\n", pc);
break;
}
}
size_t stm_print(size_t pc, struct analyzed_line* line) {
for(size_t i = 2; i < line->size; ++i) {
struct analyzed_token arg = line->tokens[i];
print_token(pc, arg);
}
printf("\n");
return pc;
}
size_t stm_let(size_t pc, struct analyzed_line* line) {
assure(line->size == 4,
"pc(%ld): syntax is LET var value\n", pc);
struct analyzed_token var = line->tokens[2];
assure(var.type == T_IDENTIFIER,
"pc(%ld): syntax is LET var value\n", pc);
struct analyzed_token value = line->tokens[3];
if(value.type == T_IDENTIFIER)
setvar(var.identifier, getvar(value.identifier)->value);
else
setvar(var.identifier, value);
return pc;
}
size_t stm_if(size_t pc, struct analyzed_line* line) {
assure(line->size == 7,
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
assure(line->tokens[3].type == T_IDENTIFIER &&
!strcmp(line->tokens[3].identifier, "IS"),
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
assure(line->tokens[5].type == T_IDENTIFIER &&
!strcmp(line->tokens[5].identifier, "GOTO"),
"pc(%ld): syntax is IF value IS value GOTO number\n", pc);
struct analyzed_token lhs = line->tokens[2];
struct analyzed_token rhs = line->tokens[4];
struct analyzed_token target = line->tokens[6];
if(lhs.type == T_IDENTIFIER)
lhs = getvar(lhs.identifier)->value;
if(rhs.type == T_IDENTIFIER)
rhs = getvar(rhs.identifier)->value;
if(target.type == T_IDENTIFIER)
target = getvar(target.identifier)->value;
assure(target.type == T_NUMBER && target.number >= 0,
"pc(%ld): can only GOTO positive numbers\n", pc);
if(lhs.type != rhs.type)
return pc;
switch(lhs.type) {
case T_NUMBER:
if(lhs.number == rhs.number)
return target.number - 1;
break;
case T_STRING:
if(!strcmp(lhs.string, rhs.string))
return target.number - 1;
break;
default:
fail("pc(%ld): cannot compare 2 unknown values\n", pc);
}
return pc;
}
size_t evaluate_line(size_t pc, struct analyzed_line* line) {
if(line->size == 1)
return pc;
struct analyzed_token statement = line->tokens[1];
assure(statement.type == T_IDENTIFIER,
"pc(%ld): no statement\n", pc);
if(!strcmp(statement.identifier, "PRINT"))
return stm_print(pc, line);
if(!strcmp(statement.identifier, "LET"))
return stm_let(pc, line);
if(!strcmp(statement.identifier, "IF"))
return stm_if(pc, line);
fail("pc(%ld) unknown statement: \"%s\"\n", pc, statement.identifier);
return pc;
}
void evaluater(struct analyzed analyzed) {
struct analyzed_line* indexed_lines[256] = {0};
for(size_t i = 0; i < analyzed.len; ++i) {
struct analyzed_line line = analyzed.lines[i];
if(line.size == 0)
continue;
assure(line.tokens[0].type == T_NUMBER,
"line(%ld): no line number\n", i);
assure(line.tokens[0].number >= 0 && line.tokens[0].number < 256,
"line(%ld): line number out of range\n", i);
indexed_lines[line.tokens[0].number] = &analyzed.lines[i];
}
for(size_t pc = 0; pc < 256; ++pc) {
if(indexed_lines[pc] == NULL) continue;
pc = evaluate_line(pc, indexed_lines[pc]);
}
}
char* file(const char* name) {
FILE* file = fopen(name, "r");
if(!file)
fail("unable to open %s\n", name);
fseek(file, 0L, SEEK_END);
size_t size = ftell(file);
rewind(file);
char* filecontent = malloc(size + 1);
fread(filecontent, 1, size, file);
filecontent[size] = '\0';
fclose(file);
return filecontent;
}
int main(int argc, char** argv) {
assure(argc == 2,
"%s FILE.basic\n", argv[0]);
evaluater(analyzer(lexer(file(argv[1]))));
}