string - What is the best way to write a syntax tokenizer/parser in C? -
background information:
have desire make programming language, knowing tools so, don't have examples on how use them. not want use flex or bison doesn't teach abstractness feel needed creating compiler. have concepts of creating strings, tokenizing them, feeding them file acts grammar , parses creating actual program run language. problem is, don't know how write tokenizer or parser. have general ideas, understand better when can see examples. if post an/few example(s), great!
my question follows:
can post examples of how write syntax tokenizer/parser in c?
if want write complex syntax parser in c, without using existing pattern matching code, it's best implement state machine , process source code char char.
the output of flex+bison state machine. flex uses regular expressions tokenize string tokens passed bison state machine, processing 1 token after another, depending on current state of machine. don't need regex tokenizer, can tokenize input part of state machine processing. regex matcher can implemented state machine, token generation can directly part of state machine.
here's interesting link you; it's not c in particular, more general overview how state machines work, once got concept, it's easy transpose c code:
parsing command line arguments using finite state machine , backtracking
here's sample code of super primitive csv parser:
#include <stdlib.h> #include <stdio.h> static char currenttoken[4096]; static size_t currenttokenlength; static void addchartocurrenttoken ( char c ) { if (currenttokenlength < sizeof(currenttoken)) { currenttoken[currenttokenlength++] = c; } } static void printcurrenttoken ( ) { printf("token: >>>%.*s<<<\n", (int)currenttokenlength, currenttoken); currenttokenlength = 0; } typedef enum { state_findstartofdata, state_findstartoftoken, state_parsenumber, state_parsestring, state_checkendofstring, state_finddelimiter, state_parseerror, state_endofdata } parserstate; parserstate parserstate = state_findstartofdata; static void runthestatemachine ( ) { while (parserstate != state_parseerror && parserstate != state_endofdata ) { int c = fgetc(stdin); // end of data? if (c == -1) { switch (parserstate) { case state_parsenumber: case state_checkendofstring: printcurrenttoken(); parserstate = state_endofdata; break; case state_parsestring: // data ends in middle of token parsing? no way! fprintf(stderr, "data ended abruptly!\n"); parserstate = state_parseerror; break; case state_findstartofdata: case state_findstartoftoken: case state_finddelimiter: // okay, data stream may end while in these states parserstate = state_endofdata; break; case state_parseerror: case state_endofdata: break; } } switch (parserstate) { case state_findstartofdata: // skip blank lines if (c == '\n' || c == '\r') break; // !!!fallthrough!!! case state_findstartoftoken: // skip overe whitespace if (c == ' ' || c == '\t') break; // start of string? if (c == '"') { parserstate = state_parsestring; break; } // blank field? if (c == ',') { printcurrenttoken(); break; } // end of dataset? if (c == '\n' || c == '\r') { printf("------------------------------------------\n"); parserstate = state_findstartofdata; break; } // else can number parserstate = state_parsenumber; addchartocurrenttoken(c); break; case state_parsenumber: if (c == ' ' || c == '\t') { // numbers cannot contain spaces in middle, // must end of number. printcurrenttoken(); // still need find real delimiter, though. parserstate = state_finddelimiter; break; } if (c == ',') { // time number ends directly delimiter printcurrenttoken(); parserstate = state_findstartoftoken; break; } // end of dataset? if (c == '\n' || c == '\r') { printcurrenttoken(); printf("------------------------------------------\n"); parserstate = state_findstartofdata; break; } // otherwise keep reading number addchartocurrenttoken(c); break; case state_parsestring: if (c == '"') { // either regular end of string or // escaped quotation mark doubled ("") in cvs. parserstate = state_checkendofstring; break; } // other chars treated ordinary chars addchartocurrenttoken(c); break; case state_checkendofstring: if (c == '"') { // next char quotation mark, // not end of string. addchartocurrenttoken(c); parserstate = state_parsestring; break; } if (c == ' ' || c == '\t') { // end of string printcurrenttoken(); // still need find real delimiter, though. parserstate = state_finddelimiter; break; } if (c == ',') { // end of string printcurrenttoken(); // , found delimiter parserstate = state_findstartoftoken; break; } if (c == '\n' || c == '\r') { // end of string printcurrenttoken(); // , found end of dataset printf("------------------------------------------\n"); parserstate = state_findstartofdata; break; } // else parse error guess fprintf(stderr, "unexpected char 0x%02x after end of string!\n", c); parserstate = state_parseerror; break; case state_finddelimiter: // delemiter found? if (c == ',') { parserstate = state_findstartoftoken; break; } // skip overe whitespace if (c == ' ' || c == '\t') break; // end of dataset? if (c == '\n' || c == '\r') { // , found end of dataset printf("------------------------------------------\n"); parserstate = state_findstartofdata; break; } // else pare error guess fprintf(stderr, "unexpected char 0x%02x after end of token!\n", c); parserstate = state_parseerror; break; case state_parseerror: // nothing break; case state_endofdata: // nothing break; } } } int main ( ) { runthestatemachine(); return (parserstate == state_endofdata ? 0 : 1); } the code makes following assumptions:
- tokens never bigger 4096 chars.
- the separator character comma
(that's cvs implies not cvs files use comma purpose) - strings always quoted
(usually optional unless contain spaces or quotation marks) - there no line breaks inside quoted strings
(this allowed) - the code assumes not quoted number won't verify format of number correct.
this code can not parse csv data feed it, when feed file:
"year","make","model" ,"description", "price" 1997,"ford", "e350","ac, abs, moon", 3000.00 1999,"chevy","venture ""extended edition""",,4900.00 1999,"chevy", "venture ""extended edition, large""" , , 5000.00 1996,"jeep", "grand cherokee","must sell!" it produce output:
token: >>>year<<< token: >>>make<<< token: >>>model<<< token: >>>description<<< token: >>>price<<< ------------------------------------------ token: >>>1997<<< token: >>>ford<<< token: >>>e350<<< token: >>>ac, abs, moon<<< token: >>>3000.00<<< ------------------------------------------ token: >>>1999<<< token: >>>chevy<<< token: >>>venture "extended edition"<<< token: >>><<< token: >>>4900.00<<< ------------------------------------------ token: >>>1999<<< token: >>>chevy<<< token: >>>venture "extended edition, large"<<< token: >>><<< token: >>>5000.00<<< ------------------------------------------ token: >>>1996<<< token: >>>jeep<<< token: >>>grand cherokee<<< token: >>>must sell!<<< ------------------------------------------ and it's supposed give idea how parse complex syntax state machine. code far production quality , can see, such switch grows huge, i'd @ least put state code functions or turn every state struct or object data encapsulation, otherwise whole thing becomes unmanageable.
Comments
Post a Comment