Monday, August 3, 2015

Gem of the Once-In-A-While: Ragel FSM generator

Yukari 3 ran a hand-built state machine to parse NMEA sentences coming from the GPS. It worked, but it was hard to maintain. I went looking for a tool that was a better solution. I was looking for something like a C++ template library that implements a state machine by abusing the template processor. What I found was Ragel.

This a nice, simple, clean program for generating finite state machines and weaving them into your programs. It does so in a way which I vastly prefer over Lex. It's main drawback is that since it is not a template library, it does require an external program to handle. Well, so do several other parts of my system. I use Perl without embarassment, so I will use Ragel similarly.

Ragel is structured very similarly to Yacc, in that it creates a program which executes user-chosen actions when a certain sentence form is encountered. However, it only handles part 1 languages, IE regular languages.

Ragel takes as input a source file written primarily in C/C++ (other languages are supported, but I don't use them) but which is marked with certain Ragel blocks. These blocks are used to define the machine and actions. The actions are written in the main language (C/C++ as I use it) and Ragel builds code which executes the state machine on an input stream and executes your actions as they come up.

It demonstrates a couple of interesting ideas, one of which is that you can just make up a new language and mix it into your existing source code.

Certain of the Ragel commands indicate where the state machine's tables should be written, where the state machine code should go, etc. Interfacing to the scanner is pretty simple.

A couple of examples below the fold:

This code prints 1 if the command-line argument is either foo or bar:

#include <string.h>
#include <stdio.h>

%%{
  machine foo;
  main := 
    ( 'foo' | 'bar' )
    0 @{res = 1;};
}%%

%% write data;

int main(int argc, char **argv) {
  int cs, res=0;
  if(argc > 1) {
    char *p=argv[1];
    char *pe=p+strlen(p)+1;
    %%write init;
    %%write exec;
  }
  printf("result=%i\n",res);
  return 0;
}

This is my NMEA logic. It implements the GGA, RMC, VTG, and ZDA sentences.

//Since we have a parser right here, handle conversion from decimal to integer here as
//well. Don't call stoi or stof, which require another pass over the data, and extra storage.
%%{
  machine nmea;

  action handleDollar {printAction("handleDollar");checksum=0;countChecksum=true;}
  action startInt {printAction("startInt");buildInt=0;}
  action handleIntDigit {printAction("handleIntDigit");buildInt=buildInt*10+(fc-'0');}
  action startFrac {printAction("startFrac");buildFrac=0;fracDigits=1;}
  action handleFracDigit {printAction("handleFracDigit");buildFrac=buildFrac*10+(fc-'0');fracDigits*=10;}
  action finishFrac {printAction("finishFrac");buildFrac/=fracDigits;buildFloat=buildInt+buildFrac;}
  action finishIntFloat {printAction("finishIntFloat");buildFrac=0;buildFloat=buildInt;}
  action handleHH {printAction("handleHH");shadowHH=buildInt;}
  action handleNN {printAction("handleNN");shadowNN=buildInt;}
  action handleSS {printAction("handleSS");shadowSS=buildFloat;}
  action handleDD {printAction("handleDD");shadowDD=buildInt;}
  action handleMM {printAction("handleMM");shadowMM=buildInt;}
  action handleYY {printAction("handleYYYY");shadowYYYY=buildInt+1900;if(shadowYYYY<1950)shadowYYYY+=100;}
  action handleYYYY {printAction("handleYYYY");shadowYYYY=buildInt;}
  action handleStar {printAction("handleStar");checksum^='*';countChecksum=false;}
  action checkChecksum1 {printAction("handleChecksum1");
    if((fc<'0' || fc>'9') && (fc<'A' || fc>'F')) {fgoto main;}
    int digit=(fc<='9')?fc-'0':fc-'A'+10;
    if((checksum & 0xF0)!=(digit<<4)) {fgoto main;}
  }
  action checkChecksum2 {printAction("handleChecksum2");
    if((fc<'0' || fc>'9') && (fc<'A' || fc>'F')) {fgoto main;}
    int digit=(fc<='9')?fc-'0':fc-'A'+10;
    if((checksum & 0x0F)!=(digit)) {fgoto main;}
  }
  action finishZDA {printAction("finishZDA");finishZDA();fgoto main;}
  action finishRMC {printAction("finishRMC");finishRMC();fgoto main;}
  action finishGGA {printAction("finishGGA");finishGGA();fgoto main;}
  action finishVTG {printAction("finishVTG");finishVTG();fgoto main;}
  action handleDeg {printAction("handleDeg");shadowDeg=buildInt;}
  action handleLat {printAction("handleLat");shadowLat=shadowDeg*M10+(buildInt*M10+int(buildFrac*1e7))/60;}
  action invertLat {printAction("invertLat");shadowLat=-shadowLat;}
  action handleLon {printAction("handleLon");shadowLon=shadowDeg*M10+(buildInt*M10+int(buildFrac*1e7))/60;}
  action invertLon {printAction("invertLon");shadowLon=-shadowLon;}
  action handleSpd {printAction("handleSpd");shadowSpd=buildFloat;}
  action handleHdg {printAction("handleHdg");shadowHdg=buildFloat;}
  action handleAlt {printAction("handleAlt");shadowAlt=buildFloat;}

  int = ([0-9]+)                 $ handleIntDigit > startInt;  #Integer of any length, pick it up at buildInt
  int2 = ([0-9][0-9])            $ handleIntDigit > startInt;  #Integer specifically with two characters
  int3 = ([0-9][0-9][0-9])       $ handleIntDigit > startInt;  #Integer specifically with three characters
  int4 = ([0-9][0-9][0-9][0-9])  $ handleIntDigit > startInt;  #Integer specifically with four characters
  intIgn = ([0-9]+);                                           #Integer of any length, ignored to save on performing actions
  int2Ign = ([0-9][0-9]);                                      #Ignored integer specifically with two characters

  int3Ign = ([0-9][0-9][0-9])   ;                              #Ignored integer specifically with three characters
  int4Ign = ([0-9][0-9][0-9][0-9]) ;                           #Ignored integer specifically with four characters
  frac = (('.')>startFrac)(([0-9]+) $ handleFracDigit % finishFrac) ; #Fractional part of floating point number, pick it up in buildFrac
  float = (int)((frac)?|('' % finishIntFloat)) ;                      #Float, pick it up in buildFloat, parts are still in buildInt and buildFrac
  floatIgn = (int)(('.'[0-9]+)?) ;                                         #Ignored float
  lat=(((int2 % handleDeg)(float)) % handleLat)(',N'|(',S' % invertLat)) ; #Latitude  in  DDMM.MMMMM,[NS], pick up degrees in shadowDeg, minutes in buildInt+buildFrac
  lon=(((int3 % handleDeg)(float)) % handleLon)(',E'|(',W' % invertLon)) ; #Longitude in DDDMM.MMMMM,[EW], pick up degrees in shadowDeg, minutes in buildInt+buildFrac
  checksum = [^*]* ( '*' $ handleStar ) (([A-F0-9] $ checkChecksum1)([A-F0-9] $ checkChecksum2 ))? ; #Handle (optional) checksum by skipping characters until the star, then skipping the star.
                                                                                                     #If checksum is present, check it one digit at a time.

  main := ('$' $ handleDollar (
         (('GPZDA,'
                    (int2 % handleHH) (int2 % handleNN) (float % handleSS) ','  #UTC time of last PPS in HHMMSS.SSS
                    (int2 % handleDD)','   #Day of month of last PPS
                    (int2 % handleMM)','   #Month of last PPS
                    (int4 % handleYYYY)',' #Year of last PPS
                    checksum ) %finishZDA)
       |
         (('GPRMC,'
                    (int2 % handleHH) (int2 % handleNN) (float % handleSS) ','  #UTC time of fix in HHMMSS.SSS
                    'A,' #Fix valid flag
                    (lat)',' #Latitude in   DDMM.MMMMM
                    (lon)',' #Longitude in DDDMM.MMMMM
                    (float % handleSpd)',' #Speed in knots
                    (float % handleHdg)',' #Heading in degrees East of North
                    (int2 % handleDD) (int2 % handleMM) (int2 % handleYY) ','#Date of fix in DDMMYY
                    checksum ) %finishRMC)
       |
         (('GPGGA,'
                    (int2 % handleHH) (int2 % handleNN) (float % handleSS) ','  #UTC time of fix in HHMMSS.SSS
                    (lat)',' #Latitude in   DDMM.MMMMM
                    (lon)',' #Longitude in DDDMM.MMMMM
                    '1,'                              #Fix valid flag

                    (intIgn)','                       #Number of satellites in fix (not used)
                    (floatIgn)','                     #HDOP (not used)
                    (float % handleAlt)',M'           #Altitude above MSL in m
                    checksum ) %finishGGA)
       |
         (('GPVTG,' (float % handleHdg)',T,'          #True heading in degrees East of North
                    (floatIgn)',M,'                   #Magnetic heading in degrees East of North
                    (float % handleSpd)',N,'          #Ground speed in knots
                    (floatIgn)',K'                    #Ground speed in km/hr
                    checksum ) %finishVTG)
      )[\r\n]+)+ $err{fgoto main;};

}%%

%% write data;

NMEA::NMEA() :checksum(0),writeZDA(false),writeGGA(false),writeVTG(false),writeRMC(false) {
  %% write init;
};

void NMEA::process(const char in) {
#ifndef LPC2148
  printf("%c %d %02x %d\n",in,countChecksum,checksum,cs);
#endif
  if(countChecksum) checksum ^= in;
  const char *p=&in;
  const char *pe=p+1;
  const char *eof=NULL;
  %% write exec;
  if(cs==0) {
    %% write init;
  }

}

We see a couple of things:
  1. The actions are broken out and named. This is just for readability. The program generates a graphviz representation of the state machine, and it helps to be able to see the names of the actions in that graph.
  2. The state machine is defined in pieces, which can be re-used. The definition of GGA and RMC both have latitude and longitude in the same form, so they use the same submachines, which in turn uses the submachine for numbers with decimal points, which in turn uses the submachine for integer numbers.
  3. Actions are attached, some to each character scanned in a submachine, some only when a submachine is finished. Ragel may end up building a machine that after the epsilon closures finish, end up with multiple actions on some transitions. These are executed in order, as indicated by the machine.
  4. Interfacing is pretty simple. You point at the beginning and (one past the) ending of the string to parse, then run the code Ragel generated for you at %%write exec; .You can handle as much or as little of the stream as you want. In this case we are only handling one character at a time.
  5. You can control the style of code generated, from a perfectly regular table-driven solution to a random spaghetti of goto and case statements. It doesn't matter how complicated or ugly the generated code is, because it is not source code. Only the Ragel blocks are source. The generated code is object code in a sense.
The drawbacks are as follows:
  1. I originally wished that the actions would be function pointers, so that everything could be done from tables and that the driver was constant. It turns out not to make sense, because of variable scope. It might make sense to do so in a class, but method pointers have their own issues. This means that even the table-driven scanner needs a case statement for the actions.
  2. The state machine is quite sensitive to parentheses and order of operations. Since none of the operators are conventional math, you have no intuition on the precedence. It makes it hard to read. This is counteracted by the fact that you can break down machines into named pieces and re-use them. However, it is still pretty much impossible to generate valid correct instructions the first time. It helps to have the state machine diagram. It also helps to have some test statements and a good tracer to see the transitions as they happen.
  3. The scanner itself is designed to parse more than one character at a time. When it is called multiple times, the goto-version (which is otherwise the most efficient code) needs quite a bit of code to re-find itself. The table version doesn't have this drawback.


No comments:

Post a Comment