string - What is the best way to write a syntax tokenizer/parser in C? -

background information:
have desire make programming language, knowing tools so, don't have examples on how use them. not want use flex or bison doesn't teach abstractness feel needed creating compiler. have concepts of creating strings, tokenizing them, feeding them file acts grammar , parses creating actual program run language. problem is, don't know how write tokenizer or parser. have general ideas, understand better when can see examples. if post an/few example(s), great!

my question follows:
can post examples of how write syntax tokenizer/parser in c?

if want write complex syntax parser in c, without using existing pattern matching code, it's best implement state machine , process source code char char.

the output of flex+bison state machine. flex uses regular expressions tokenize string tokens passed bison state machine, processing 1 token after another, depending on current state of machine. don't need regex tokenizer, can tokenize input part of state machine processing. regex matcher can implemented state machine, token generation can directly part of state machine.

here's interesting link you; it's not c in particular, more general overview how state machines work, once got concept, it's easy transpose c code:

parsing command line arguments using finite state machine , backtracking

here's sample code of super primitive csv parser:

#include <stdlib.h> #include <stdio.h>  static char currenttoken[4096]; static size_t currenttokenlength;  static void addchartocurrenttoken ( char c ) {     if (currenttokenlength < sizeof(currenttoken)) {         currenttoken[currenttokenlength++] = c;     } }  static void printcurrenttoken ( ) {     printf("token: >>>%.*s<<<\n", (int)currenttokenlength, currenttoken);     currenttokenlength = 0; }   typedef enum {     state_findstartofdata,     state_findstartoftoken,     state_parsenumber,     state_parsestring,     state_checkendofstring,     state_finddelimiter,     state_parseerror,     state_endofdata } parserstate;   parserstate parserstate = state_findstartofdata;   static void runthestatemachine ( ) {     while (parserstate != state_parseerror             && parserstate != state_endofdata     ) {         int c = fgetc(stdin);         // end of data?         if (c == -1) {             switch (parserstate) {                 case state_parsenumber:                 case state_checkendofstring:                     printcurrenttoken();                     parserstate = state_endofdata;                     break;                  case state_parsestring:                     // data ends in middle of token parsing? no way!                     fprintf(stderr, "data ended abruptly!\n");                     parserstate = state_parseerror;                     break;                  case state_findstartofdata:                 case state_findstartoftoken:                 case state_finddelimiter:                     // okay, data stream may end while in these states                     parserstate = state_endofdata;                     break;                  case state_parseerror:                 case state_endofdata:                     break;             }         }          switch (parserstate) {                 case state_findstartofdata:                     // skip blank lines                     if (c == '\n' || c == '\r') break;                     // !!!fallthrough!!!                  case state_findstartoftoken:                     // skip overe whitespace                     if (c == ' ' || c == '\t') break;                     // start of string?                     if (c == '"') {                         parserstate = state_parsestring;                         break;                     }                     // blank field?                     if (c == ',') {                         printcurrenttoken();                         break;                     }                     // end of dataset?                     if (c == '\n' || c == '\r') {                         printf("------------------------------------------\n");                         parserstate = state_findstartofdata;                         break;                     }                     // else can number                     parserstate = state_parsenumber;                     addchartocurrenttoken(c);                     break;                  case state_parsenumber:                     if (c == ' ' || c == '\t') {                         // numbers cannot contain spaces in middle,                         // must end of number.                         printcurrenttoken();                         // still need find real delimiter, though.                         parserstate = state_finddelimiter;                         break;                     }                     if (c == ',') {                         // time number ends directly delimiter                         printcurrenttoken();                         parserstate = state_findstartoftoken;                         break;                     }                     // end of dataset?                     if (c == '\n' || c == '\r') {                         printcurrenttoken();                         printf("------------------------------------------\n");                         parserstate = state_findstartofdata;                         break;                     }                     // otherwise keep reading number                     addchartocurrenttoken(c);                     break;                  case state_parsestring:                     if (c == '"') {                         // either regular end of string or                         // escaped quotation mark doubled ("") in cvs.                         parserstate = state_checkendofstring;                         break;                     }                     // other chars treated ordinary chars                     addchartocurrenttoken(c);                     break;                  case state_checkendofstring:                     if (c == '"') {                         // next char quotation mark,                         // not end of string.                         addchartocurrenttoken(c);                         parserstate = state_parsestring;                         break;                     }                     if (c == ' ' || c == '\t') {                         // end of string                         printcurrenttoken();                         // still need find real delimiter, though.                         parserstate = state_finddelimiter;                         break;                     }                     if (c == ',') {                         // end of string                         printcurrenttoken();                         // , found delimiter                         parserstate = state_findstartoftoken;                         break;                     }                     if (c == '\n' || c == '\r') {                         // end of string                         printcurrenttoken();                         // , found end of dataset                         printf("------------------------------------------\n");                         parserstate = state_findstartofdata;                         break;                     }                     // else parse error guess                     fprintf(stderr, "unexpected char 0x%02x after end of string!\n", c);                     parserstate = state_parseerror;                     break;                  case state_finddelimiter:                     // delemiter found?                     if (c == ',') {                         parserstate = state_findstartoftoken;                         break;                     }                     // skip overe whitespace                     if (c == ' ' || c == '\t') break;                     // end of dataset?                     if (c == '\n' || c == '\r') {                         // , found end of dataset                         printf("------------------------------------------\n");                         parserstate = state_findstartofdata;                         break;                     }                     // else pare error guess                     fprintf(stderr, "unexpected char 0x%02x after end of token!\n", c);                     parserstate = state_parseerror;                     break;                  case state_parseerror:                     // nothing                     break;                  case state_endofdata:                     // nothing                     break;         }     } }  int main ( ) {     runthestatemachine();     return (parserstate == state_endofdata ? 0 : 1); }

the code makes following assumptions:

tokens never bigger 4096 chars.
the separator character comma
(that's cvs implies not cvs files use comma purpose)
strings always quoted
(usually optional unless contain spaces or quotation marks)
there no line breaks inside quoted strings
(this allowed)
the code assumes not quoted number won't verify format of number correct.

this code can not parse csv data feed it, when feed file:

"year","make","model"   ,"description",  "price"     1997,"ford", "e350","ac, abs, moon", 3000.00 1999,"chevy","venture ""extended edition""",,4900.00  1999,"chevy",     "venture ""extended edition, large"""  ,  , 5000.00 1996,"jeep", "grand cherokee","must sell!"

it produce output:

token: >>>year<<< token: >>>make<<< token: >>>model<<< token: >>>description<<< token: >>>price<<< ------------------------------------------ token: >>>1997<<< token: >>>ford<<< token: >>>e350<<< token: >>>ac, abs, moon<<< token: >>>3000.00<<< ------------------------------------------ token: >>>1999<<< token: >>>chevy<<< token: >>>venture "extended edition"<<< token: >>><<< token: >>>4900.00<<< ------------------------------------------ token: >>>1999<<< token: >>>chevy<<< token: >>>venture "extended edition, large"<<< token: >>><<< token: >>>5000.00<<< ------------------------------------------ token: >>>1996<<< token: >>>jeep<<< token: >>>grand cherokee<<< token: >>>must sell!<<< ------------------------------------------

and it's supposed give idea how parse complex syntax state machine. code far production quality , can see, such switch grows huge, i'd @ least put state code functions or turn every state struct or object data encapsulation, otherwise whole thing becomes unmanageable.

Search This Blog

CSS

string - What is the best way to write a syntax tokenizer/parser in C? -

Comments

Post a Comment

Popular posts from this blog

sql server - Cannot query correctly (MSSQL - PHP - JSON) -

php - trouble displaying mysqli database results in correct order -

C++ Linked List -