The snow is falling over Saint-Germain-en-Laye and it makes everything look really pretty, but I have absolutely no wish to go outside, so I sit in the flat and watch it fall and write this blog entry instead.


I’ve been following Rico Mariani’s performance quizzes for a while now and have always found them interesting. The latest quiz, on the performance of a parser, reminded me of a famous CS quote that every sufficiently large software project has a badly implemented Lisp engine in it. However, I know little about Lisp, so I thought I’d have a stab at reimplementing it in F#. OCaml, the language that inspired F#, has a reputation for being as fast as C++, so it would be great if F# could gain a reputation for being as fast as C#. With that in mind, I set out with the ambition of beating Rico’s implementation.


After puzzling over the code for a short while I realised that while I could copy it in F# fairly easily, since it is easy to write imperative-style code in F#, it would be difficult to translate it and keep a good functional style. So I used fslex and fsyacc, F#’s lexer and parser generator tools, and quickly realised the language was so simple I could implement it almost entirely inside fslex and fsyacc.


Here’s my lexer:


let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')

rule token = parse
| whitespace   { token lexbuf }
| newline      { newline lexbuf; token lexbuf }
| "|"          { OR }
| "&"          { AND }
| "!"          { NOT }
| "("          { OPENBRACKET }
| ")"          { CLOSEBRACKET }
| ['a'-'z']['a'-'z' '0'-'9']*   { ID(lexeme lexbuf) }
| eof          { EOF }


And here’s the parser:


start: expr EOF { $1 }

expr:
| expr OR expr { $1 or $3 }
| expr AND expr { $1 & $3 }
| NOT expr { not $2 }
| OPENBRACKET expr CLOSEBRACKET { $2 }
| ID { is_true2 $1 }


And that’s basically it. There’s a small F# module, which I use to pass in the predicates, but otherwise it’s all in the fslex and fsyacc modules.
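
The post doesn’t show that module, so here’s a guess at its shape; the Lex and Pars module names and the fact table are my own invention, and the exact Lexing call has varied between F# releases:

(* Hypothetical glue module: the fact table and module names are guesses. *)
let trueFacts = [ "fact1"; "fact3"; "fact5"; "fact7"; "fact9"; "fact11"; "fact13"; "fact15" ]

(* The predicate the parser's ID rule calls for each identifier. *)
let is_true2 (s : string) = List.exists (fun f -> f = s) trueFacts

(* Run the generated lexer and parser over one predicate string. *)
let evalPredicate (p : string) =
    Pars.start Lex.token (Lexing.from_string p)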


Unfortunately the resulting times showed it clocking in at 100 000 ms, about a factor of 10 slower than Rico’s implementation, which clocks in at about 9 000 ms on my laptop.


F… Fiddle sticks.


Anyway, it’s obviously going to be difficult to micro-tune the generated code, but with a factor of 10 to remove we need a more radical approach than micro-tuning anyway. Two things occur to me. First, I vaguely remember from my CS classes that code-generated lexers are generally much faster than hand-coded lexers, but code-generated parsers are generally slower than hand-coded ones. Second, I notice that Rico strictly evaluates his predicates, but if they truly are predicates then they can be lazily evaluated, short-circuiting as soon as the result is known. With that in mind I rip out the parser and replace it with a hand-coded recursive descent parser.
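
To make the laziness point concrete, here’s a trivial made-up illustration: with short-circuit evaluation, the right-hand side of an AND is never computed once the left-hand side is known to be false.

(* Made-up illustration: once the left of an && is false,
   the right-hand side is never evaluated at all. *)
let expensiveLookup name = printf "looked up %s\n" name; true
let result = false && expensiveLookup "fact2"
(* prints nothing; result is false without ever touching fact2 *)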


Here is the parser definition:


type token =
    | Or_token
    | And_token
    | Not_token
    | Open_bracket
    | Close_bracket
    | Id of string
    | Eof


exception SyntaxError


let rec eval (currentState : bool option) (t1 : token) (nextToken : unit -> token) =
    match t1 with
    | And_token ->
        (match currentState with
         | Some(true) -> let next = nextToken() in eval currentState next nextToken
         | Some(false) -> false   (* short-circuit: false & anything is false *)
         | None -> raise SyntaxError)
    | Or_token ->
        (match currentState with
         | Some(true) -> true     (* short-circuit: true | anything is true *)
         | Some(false) -> let next = nextToken() in eval currentState next nextToken
         | None -> raise SyntaxError)
    | Not_token -> not (let next = nextToken() in eval currentState next nextToken)
    | Open_bracket -> let next = nextToken() in eval None next nextToken
    | Close_bracket -> (match currentState with | Some(x) -> x | None -> raise SyntaxError)
    | Eof ->
        (match currentState with
         | Some(x) -> x
         | None -> raise SyntaxError)
    | Id(s1) -> let next = nextToken() in eval (Some(is_true2 s1)) next nextToken
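
To call eval you need a nextToken closure that yields successive tokens. Here’s a minimal driver sketch; the list-based token source is my own stand-in for the real fslex-generated lexer:

(* Minimal driver sketch: wrap a token list in a stateful closure
   and hand it to eval. A stand-in for the fslex-generated lexer. *)
let evalTokens (tokens : token list) =
    let remaining = ref tokens
    let nextToken () =
        match !remaining with
        | [] -> Eof
        | t :: rest -> remaining := rest; t
    eval None (nextToken()) nextToken

(* For example: evalTokens [Id "fact1"; And_token; Id "fact2"; Eof] *)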


The resulting program had an execution time of around 50 000 ms. Close, but no cigar. However, now that I lazily evaluate the predicates, I know the results are heavily dependent on the test data: complex predicates will favour my parser, while simpler ones will favour Rico’s, and our current data set is pretty simple. I retest with various data sets and find that as the complexity increases, the times for my parser rise much more slowly.


I eventually settle on the following data set:


let predicates = [|
    "((fact1 & fact2) & (fact3 & fact4)) | ((fact5 & fact6) & (fact7 & fact8))";
    "((fact8 & fact7) & (fact6 & fact5)) | ((fact4 & fact3) & (fact2 & fact1))";
    "(fact2 & fact1) & (fact2 & fact1)";
    "!((fact5 & fact6) & (fact7 & fact8)) & ((fact9 & fact10) & (fact11 & fact12))";
    "(fact2 & fact1) | (fact2 & fact1)";
    "(fact3 & fact2) | ((fact3 & fact2) & (fact3 & fact2))";
    "(fact9 | fact10) | (fact11 | fact12)";
    "(!fact4 & fact5) & (((!fact6 & fact7) & (!fact8 & fact9)) & ((!fact10 & fact11) & (!fact12 & fact13)))";
    "(!(fact1 & fact2) | !fact3) | (((!(fact4 & fact5) | !fact6) | (!(fact7 & fact8) | !fact9)) | ((!(fact10 & fact11) | !fact12) | (!(fact13 & fact14) | !fact15)))"
|]


This data set takes Rico’s parser 72 398 ms and mine 64 407 ms. I win. Well, sort of.


In all seriousness, a nicer test would be to write a small program that can randomly generate predicates of varying complexity and measure the results as the complexity increases.
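
Something along these lines would do; this is only a sketch, with arbitrary fact names and operator mix, not a program I’ve measured anything with:

(* Sketch of a random predicate generator: builds a predicate string
   of the given depth. Fact names and distribution are arbitrary. *)
let rand = new System.Random()
let rec genPred depth =
    if depth <= 0 then sprintf "fact%d" (rand.Next(1, 16))
    else
        match rand.Next(3) with
        | 0 -> sprintf "(%s & %s)" (genPred (depth - 1)) (genPred (depth - 1))
        | 1 -> sprintf "(%s | %s)" (genPred (depth - 1)) (genPred (depth - 1))
        | _ -> sprintf "!%s" (genPred (depth - 1))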


The full source is available here.


Well, maybe I’ll do that later… but now it’s looking like there may be enough snow to build a snowman, so I might venture outside after all.


Update: I’ve added to the download a version of the parser rewritten by Don Syme, which was originally posted to the F# mailing list. It’s now just 78 lines of code versus 259 for the original C# version, a pretty impressive line-count saving. This is especially true when you consider that the code is still very readable; in fact, for my money the F# version is much easier to understand, as it’s much clearer what each component of the code is doing.

Feedback:

Feedback was imported from my old blog engine; it’s no longer possible to post feedback here.

re: Rico Mariani’s performance quiz 8 - Parsers - Robert Pickering

If you’re reading this post you probably should take a look at this too:
http://strangelights.com/blog/archive/2005/11/29/1269.aspx

More to follow soon.

re: Rico Mariani’s performance quiz 8 - Parsers - Paul Hatcher

There were some papers published in Software: Practice & Experience around 1985 by the University of Karlsruhe regarding the cost of generated lexers/parsers.

The trick in the lexer is to touch the characters as little as possible, and to this end you switch on the first character of the lexeme. This recognises a large number of the tokens that you typically get in programming languages; you then have a second switch if it’s trying to recognise identifiers/numbers etc.

Also important is to use the symbol table as the means of recognising keywords rather than explicitly recognising them in the lexer.

This gives you a core lexer that is better than most hand-cranked ones and is configurable.
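
For what it’s worth, here’s a tiny hypothetical F# sketch of the dispatch Paul describes; it illustrates the idea only, and is not code from the papers he cites:

(* Hypothetical sketch: a first switch on the leading character handles
   the fixed tokens; identifiers fall through to a keyword-table lookup
   instead of being recognised by dedicated lexer rules. *)
let keywords = ["and"; "or"; "not"]
let classify (lexeme : string) =
    match lexeme.[0] with
    | '(' -> "OPENBRACKET"
    | ')' -> "CLOSEBRACKET"
    | '&' -> "AND"
    | '|' -> "OR"
    | '!' -> "NOT"
    | _ when List.exists (fun k -> k = lexeme) keywords -> "KEYWORD"
    | _ -> "ID"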