Thursday, December 18, 2014

Compiling regular expressions (I)

This post picks up from here which was concerned with parsing - obtaining representations of regular expressions as abstract syntax trees. The ultimate goal is, given a string representation of a regular expression $e$ , produce a 'recognizer' for the expression (referred to as compiling a regular expression). That is, a function string -> bool that can be used to categorize strings as either belonging to the language $\mathcal{L_{e}}$ or not.

Having produced an abstract syntax tree for a regular expression $e$, the first step in compiling the expression is to compute an abstract syntax tree of the corresponding augmented regular expression $(e)\#$. This augmented regular expression is the original expression $e$ concatenated with a unique end-marker $\#$. For the given expression $e$, the accepting state for $e$ is given a transition on $\#$. This is a device that allows us to "forget" about accepting states as the computation of a recognizer proceeds; when the construction is complete, any state with a transition on $\#$ must be an accepting state.

Leaves in the abstract syntax tree of the augmented regular expression $(e)\#$ are labeled by $\epsilon$ or a symbol from from $\mathcal{A}$. For those non-$\epsilon$ leaves we attach a unique integer. Accordingly, we will need functions to generate unique integers (positions) that we will employ as we transform the AST of $e$ into the AST of the augmented expression $(e)\#$ leading to our first code example.

let reset_label, generate_label =
 let r = ref (-1) in
 ((fun () -> r := (-1)), (fun () -> r := !r + 1; !r))

As we construct the syntax tree of the $(e)\#$ we compute four functions : null_pos, first_pos, last_pos and following:

  1. null_pos is $true$ for a syntax-tree node $n$ if and only if the sub-expression represented by $n$ has $\epsilon$ in its language. That is, $true$ if the regular sub-expression recognizes the empty string and $false$ otherwise;
  2. first_pos is the set of positions in the sub-tree rooted at $n$ that correspond to the first symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can begin a string recognized by the regular sub-expression;
  3. last_pos is the set of positions in the sub-tree rooted at the syntax-tree node $n$ that corresponds to the last symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can terminate a string recognized by the regular sub-expression;
  4. following, for a position $p$ is the set of positions $q$ in the entire syntax-tree such that there is some string $x = a_{1}a_{2} \cdots a_{n}$ in $\mathcal{L_{(e)\#}}$ such that for some $i$, there is some way to explain the membership of $x$ in $\mathcal{L_{(e)\#}}$ by matching $a_{i}$ to $p$ and $a_{i + 1}$ to a position in $q$.
Of these, following is the last to compute as it relies upon the values of first_pos and last_pos. If the definition is confusing for now, don't worry about it. The rules for computing following will come later and will be obvious at that point. We'll focus for now on null_pos, first_pos and last_pos.

The results of first_pos, last_pos and follow are sets of integers. Accordingly, we are going to need a type to represent these.

module Int_set : Set.S with type elt = int = Set.Make (
    let compare =
    type t = int
With this we can present the type of ASTs for augmented regular expressions.
type augmented_regexp =
  | Epsilon
  | Character of char * int
  | Sequence of augmented_regexp * augmented_regexp * pos
  | Alternative of augmented_regexp * augmented_regexp * pos
  | Repetition of augmented_regexp * pos
  | Accept of int
and pos = {

For a given node $n$, the values of its pos record depend only on the sub-expressions of that node. Assuming constructed augmented regular expression syntax trees, we can write null_pos, first_pos and last_pos like this.

let (null_pos : augmented_regexp -> bool)  =
  fun x ->
    match x with
    | Epsilon -> true
    | Character (_, i) -> false
    | Sequence (_, _, p) -> p.null
    | Alternative (_, _, p) -> p.null
    | Repetition (_, p) -> p.null
    | Accept _ -> false

let (first_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.first
    | Repetition (_, p) -> p.first
    | Sequence (_, _, p) -> p.first
    | Accept i -> Int_set.add i (Int_set.empty)

let (last_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.last
    | Repetition (_, p) -> p.last
    | Sequence (_, _, p) -> p.last
    | Accept i -> Int_set.add i (Int_set.empty)

Our strategy in building the syntax-tree of $(e)\#$ from the syntax tree of $e$ will be to visit each node of $e$ and invoke a function to construct the corresponding node of $(e)\#$ inductively. These functions will include the generation of unique integers for the non-$\epsilon$ leaves and encode the rules for building the pos records:

  • null
    • Sequence $(e_{1}, e_{2})$ : null_pos $e_{1}$ and null_pos $e_{2}$
    • Alternative $(e_{1}, e_{2})$ : null_pos $e_{1}$ or null_pos $e_{2}$
    • Repetition : $true$
  • first
    • Alternative $(e_{1}, e_{2})$ : first_pos $e_{1} \cup$ first_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{1}$ then first_pos $e_{1} \cup$ first_pos $e_{2}$ else first_pos $e_{1}$
    • Repetition $e$ : first_pos $e$
  • last
    • Alternative $(e_{1}, e_{2})$ : last_pos $e_{1} \cup$ last_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{2}$ then last_pos $e_{1} \cup$ last_pos $e_{2}$ else last_pos $e_{2}$
    • Repetition $e$ : last_pos $e$

Here then are the augmented regular expression syntax-tree node constructor functions.

let (epsilon : unit -> augmented_regexp) = 
  fun () -> 

and (character : char -> augmented_regexp) = 
  fun c ->
    Character (c, generate_label ())

and (repetition : augmented_regexp -> augmented_regexp) = 
  fun e -> 
    Repetition (e, {null=true;first=first_pos e; last=last_pos e})

and (alternative : augmented_regexp -> augmented_regexp -> augmented_regexp)  = 
  fun e1 e2 ->
    Alternative (e1, e2, 
                 {null=null_pos e1 || null_pos e2;
                  first=Int_set.union (first_pos e1)(first_pos e2); 
                  last=Int_set.union (last_pos e1) (last_pos e2)})

and (sequence : augmented_regexp -> augmented_regexp -> augmented_regexp) = 
  fun e1 e2 ->
    let b1 = null_pos e1 
    and b2 = null_pos e2 in
    Sequence (e1, e2, 
              {null=b1 && b2;
                  if b1 then
                    Int_set.union (first_pos e1)(first_pos e2)
                    first_pos e1; 
                  if b2 then 
                    Int_set.union (last_pos e1) (last_pos e2)
                    last_pos e2})

let (accept : augmented_regexp -> augmented_regexp) = 
  fun e ->
    sequence e (Accept (generate_label ()))

We are now in a position to write the function that transforms a syntax tree of the regular expression $e$ into the syntax tree of the augmented regular expression $(e)\#$.

let rec (augmented_regexp : Syntax.regular_expression -> augmented_regexp) =
  fun x ->
    match x with
    | Syntax.Epsilon -> epsilon ()
    | Syntax.Character i ->  character (Char.chr i)
    | Syntax.Sequence (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      sequence x' y'
    | Syntax.Alternative (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      alternative x' y'
    | Syntax.Repetition x -> repetition (augmented_regexp x)

We can wrap all of the above up in a convenience function parse_augmented_regexp which first parses a string to build the syntax tree of the regular expression it represents and then transforms the result into the syntax tree of the corresponding augmented regular expression.

let (parse_augmented_regexp : string-> augmented_regexp * int)  =
  fun s ->
    let () = reset_label () in
    let ast = regexp_of_string s in
    let re1 = augmented_regexp ast in
    let re2 = accept re1 in
    let count = generate_label () in
    (re2, count)
Notice that this function returns a pair of the syntax-tree and the number of positions it contains.

The next step in compiling a recognizer from the expression $(e)\#$ is to compute the follow function. To do this we "unite" the information encoded by the first_pos and last_pos functions. Put plainly, follow is a function that takes each symbol (position) in the regular expression to the (set of) symbols (positions) that can follow it. The information is stored in an array of length equal to the number of symbols appearing in the regular expression. There are only two ways a position in a regular expression can follow another:

  • If $n$ is a Sequence node with left child $c_{1}$ and right child $c_{2}$, then for every position $i$ in lastpos $c_{1}$, all positions in firstpos $c_{2}$ are in follow_pos $c_{i}$
  • If $n$ is a Repition and $i$ a position in lastpos $n$, then all positions in first_pos $n$ are in follow_pos $i$
In addition to computing follow the code below also stores the association between positions and characters of the regular expression. That information goes into an array. The elements of the array have type char option since the Accept symbol has a position but no character associated with it.
let (compute_follow : Int_set.t array -> char option array -> augmented_regexp -> unit) =
  fun follow chars x ->
    let rec compute x = 
      match x with
      | Sequence (e1, e2, p) ->
        compute e1; compute e2;
        let first2 = first_pos e2 in
        let f i =
          follow.(i) <- Int_set.union first2 (follow.(i)) in
        Int_set.iter f (last_pos e1)
      | Repetition (e, p) ->
        compute e;
        let f i =
          follow.(i) <- Int_set.union (p.first) (follow.(i)) in
        Int_set.iter f (p.last)
      | Alternative (e1, e2, p) -> compute e1; compute e2
      | Epsilon -> ()
      | Accept i -> chars.(i) <- None
      | Character (c, i) -> chars.(i) <- Some c in
    compute x

Now the computation of the augmented regular expression syntax-tree and all four of the associated functions together with the mapping from positions to symbols of $\mathcal{A}$ can be wrapped up in another "high-level" convenience function.

let (regexp_follow : string -> augmented_regexp * Int_set.t array * char option array) =
  fun s ->
    let re, n = parse_augmented_regexp s in
    let follow = Array.make n (Int_set.empty) in
    let chars = Array.make n None in
    compute_follow follow chars re;
    (re, follow, chars)

We're in good shape - but a hop, skip and a jump to computing a recognizer from a regular expression. We'll leave off here on this local maxima for today and finish off the job in the next post!

Saturday, December 13, 2014


In my last post, I gave an OCaml program to parse regular expressions. I intend however to show you, over the coming weeks, not just how to parse them, but how to compile them to recognizers too. Doing so will require me to share quite a lot of code that I gleaned from the book The Functional Approach to Programming by Guy Cousineau and Michel Mauny from which I learned how to do this. In doing so, I will along the way, provide updated code in modern OCaml as the book is presented in the Caml language a predecessor of today's OCaml, and not everywhere equivalent. Additionally, there will be functions of my own "filling in" gaps left as exercises in the book. That said, I don't think there's enough effort of my own to justify "plagiarizing" the material so blatantly so am determined to add at least a little value, where my limited skills allow, in order that I might redeem myself.

So today, let us start with a minimal combinator library implementing "recognizers" for recursive descent parsers, based on Cousineau and Mauny transliterated to the C++ programming language. I hope on reading the following, that you will agree, the multi-paradigm language C++-11 admits "The Functional Approach to Programming" as a "first class citizen" and the result here might be viewed as the beginning of a mini Boost.Spirit! Let's begin.

We are concerned with the problem of "recognizing" phrases belonging to a given language. We will produce programs (functions) that accept character strings as input which decide to accept or reject them. First some headers.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/utility.hpp>
#include <boost/range.hpp>
#include <boost/detail/lightweight_test.hpp>

#include <list>
#include <functional>
#include <numeric>
#include <string>

Recognizers work on lists. When recognition succeeds, part of the list is consumed and the part of the list remaining is returned or, recognition fails. So the type remaining<A> is a sum type with two cases, where A is the type of "tokens" (symbols) contained by the list (typically but not necessarily char).

//Recognition succeeded
template <class A> 
struct remains{ 
  std::list<A> left;
  template <class ItT>
    remains (ItT begin, ItT end) 
      : left (begin, end) 

//Recognition failed
template <class A>
struct recognition_fails {};

//Result of a recognizer. Indicates the result of attempting to
//recognize something from a list
template <class A> using remaining = 
  boost::variant<remains <A>, recognition_fails<A> >;
Recognizers then are but functions from lists of tokens to values of remaining<A>.
//A 'recognizer' is a function from a list to a value of remaining<A>
template <class A> using recognizer = 
  std::function<remaining<A>(std::list<A> const&)>;
Here's an example. This function (constant) produces a recognizer that matches the empty string. The recognizer it produces always succeeds and never consumes any input.
//A recognizer that recognizes the empty string. It always succeeds and
//no input is ever consumed
template <class A>
recognizer<A> epsilon ()
  return [](std::list<A> const& cs) -> remaining<A>
      return remains<A> (cs.begin (), cs.end ());
This next factory function produces recognizers that recognize elements that satisfy a user provided predicate.
//Given a predicate, 'recognizer_of_token' produces the recognizer
//associated with the elements that satisfy this predicate
template <class A, class F/*bool(A)*/>
recognizer<A> recognizer_of_token (F test)
    [=] (std::list<A> const& cs) -> remaining<A>
      if (cs.empty ())
        return recognition_fails<A> ();

      if (test (cs.front ()))
        return remains<A> (boost::next (cs.begin()), cs.end());

      return recognition_fails<A>();
Recognizers can be composed. This function takes two recognizers and returns a recognizer that will, when presented with a list of tokens, attempt recognition via the first and if that doesn't work out, backtrack and attempt to recognize via the second. It's the "if then else" of recognizers.
//Recognizer disjunction
template <class A>
recognizer<A> compose_or (recognizer<A> p, recognizer<A> q)
  return [=](std::list<A> const& toks) -> remaining<A>
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) return *rem;

      return q (toks);
This next function is the builder of the corresponding "and then" sort of recognizer.
//Recognizer conjunction
template <class A>
recognizer<A> compose_and (recognizer<A> p, recognizer<A> q)
  return [=](std::list<A> const& toks) -> remaining<A>
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) 
        return q (rem->left);

      return recognition_fails<A> ();
With disjunction and conjunction of recognizers, we can implement a function to compute, given a recognizer, the corresponding Kleene star recognizer. That is, zero or more occurrences of a given phrase.
//The '*' iterator (Kleene star)
template <class A>
recognizer<A> zero_or_more (recognizer<A> p)
  return [=] (std::list<A> const& toks) -> remaining <A>
      return compose_or (
         compose_and (p, zero_or_more<A> (p)) , epsilon<A> ()) (toks);
Well, let's now get concrete. Here's a function to produce a recognizer of a specific token.
//A function to produce a recognizer that recognizes only the
//given character
template <A>
recognizer<A> char_ (A c)
  return recognizer_of_token<A>([=](A x) -> bool { return c == x; });
The composite recognizer for a given "string" of tokens can be constructed then like this.
//A function to produce a recognizer of a specific string
template <class C>
recognizer<typename boost::range_value<C>::type> string_ (C s)
  typedef typename boost::range_value<C>::type value;

  std::list <recognizer<value> > rs;

  typedef std::back_insert_iterator<std::list<recognizer<value> > > it_t;

  std::accumulate (
      boost::begin (s)
    , boost::end (s)
    , std::back_inserter (rs)
    , [](it_t dst, value c) -> it_t 
      { *dst++ = char_ (c); return dst; });

  return std::accumulate (
      boost::next (rs.begin ())
    , rs.end ()
    , rs.front ()
    , [](recognizer<value> acc, recognizer<value> r) -> recognizer<value>
      { return compose_and (acc, r); });
That will do for our mini-recognizer library for today. Let's turn attention to testing. First some utilities.
//Match on a remaining<A> returns 'true' if it contains a 'remains<A>'
//value with all input consumed, 'false' if it contains a 'recognition_fails<A>' value
template <class A>
struct accept_visitor
  typedef bool result_type;
  bool operator () (remains<A> const& r) const { return r.left.empty (); }
  bool operator () (recognition_fails<A> const& r) const { return false; }

//Function to determine if recognition was achieved
template <class A>
bool accept (remaining<A> const& r)
  return boost::apply_visitor( accept_visitor<A> (), r);

//Test if the provided recognizer can recognize the given string
bool parse (recognizer<char> parser, std::string const& s)
  return accept (parser (std::list<char>(s.begin (), s.end())));
Now, a little test.
int main ()
  BOOST_TEST( parse (char_ ('a'), "a") );
  BOOST_TEST( !parse (char_ ('b'), "a") );
  BOOST_TEST( parse (char_ ('b'), "b") );

  BOOST_TEST( parse (zero_or_more (char_ ('a')), "aa") );
  BOOST_TEST( parse (zero_or_more (char_ ('a')), "") );
  BOOST_TEST( !parse (zero_or_more (char_ ('a')), "ab") );

  BOOST_TEST (parse (string_ (std::string ("foo")), "foo") );
  BOOST_TEST (!parse (string_ (std::string ("foo")), "bar") );
  return boost::report_errors ();
Et voilĂ . L' approche fonctionnelle de C++! See you soon for more about regular expressions. Happy holidays!

Sunday, December 7, 2014

Parsing regular expressions

Here's a brief refresher on regular expressions derived from Cosineau and Mauny.

Given an alphabet $\mathcal{A}$, a regular expression $e$ over $\mathcal{A}$ is a concise way to describe a set $\mathcal{L}$ of strings of elements of $\mathcal{A}$. The set $\mathcal{L}_{e}$ denoted by an expression $e$ is termed the regular language denoted by $e$.

In their most basic form, regular expressions can be defined like this:

  • $\epsilon$ represents the set $\mathcal{L}_{\epsilon} = \left\{\epsilon\right\}$ containing only the empty string
  • A symbol $c$ of $\mathcal{A}$ denotes $\mathcal{L}_{c} = \left\{c\right\}$
  • $e_{1}e_{2}$ is the set $\mathcal{L}_{e_{1}}\cdot\mathcal{L}_{e_{2}} = \left\{s_{1}s_{2} | s_{1} \in \mathcal{L}_{e_{1}}, s_{2} \in \mathcal{L}_{e_{2}}\right\}$
  • $e_{1}|e_{2}$ is the set $\mathcal{L}_{e_{1}}\bigcup \mathcal{L}_{e_{2}}$
  • $e^{*}$ is the union $\bigcup_{i=0}^{\infty}\mathcal{L}_{e}^{i}$ where $\mathcal{L}_{e}^{0} = \mathcal{L}_{\epsilon}$ and $\mathcal{L}_{e}^{i} = \mathcal{L}_{e}^{i-1}\cdot\mathcal{L}_{e}$
This last operator $*$ is called the Kleene closure operator and $e^{*}$ equivalent to the infinite regular expression $\epsilon|e|ee|eee|\dots$

Regular expressions provide a convenient syntax for searching plain text data for lines matching specific patterns as demonstrated by the venerable Unix function 'grep' (I was taught grep is an acronym for generalized regular expression pattern-matching). To produce a function like grep requires a program for analyzing a regular expression $e$ and computing a recognizer for $\mathcal{L}_{e}$, that is a function string -> bool, that can be used to classify a string $s$ (say) in terms of whether or not $s \in \mathcal{L}_{e}$. This procedure of computing a recognizer for $\mathcal{L}_{e}$ from $e$ is called compiling a regular expression.

Writing a regular expression compiler is not for the faint-of-heart! Doing so requires more work than can be reasonably described in one blog-post so here, I'll focus only on the first part - parsing a regular expression. That is, given a string $s$ a program to produce an abstract syntax tree describing the regular expression $e$ represented by $s$.

To begin, we define the type of regular expressions.

type regular_expression =
  | Epsilon
  | Character of int
  | Sequence of regular_expression * regular_expression
  | Alternative of regular_expression * regular_expression
  | Repetition of regular_expression
We could proceed from here by hand-crafting a recursive descent parser or we might take advantage of a parser generator system like ocamllex/ocamlyacc which is the choice I favored. On reaching that decision it occurred to me that, since ocamllex itself treats regular expressions as inputs, the source code for ocamllex itself might contain a suitable grammar for the expression type above and indeed it does! Things at this point become marvelously recursive - to build a regular expression compiler we can bootstrap (at both the source and binary levels) from an existing regular expression compiler (which itself is a bootstrap by the way)!

The code that follows is an adaptation of the parser and lexer input files for ocamllex (the copyright of Xavier Leroy).


open Syntax

%token <int> Tchar

%token Tend Tor Tlbracket Trbracket
%token Tstar Tmaybe Tplus Tlparen Trparen

%left Tor
%nonassoc CONCAT
%nonassoc Tmaybe Tstar Tplus
%nonassoc Tchar Tlbracket Tlparen

%start main
%type <Syntax.regular_expression> main

 | regexp Tend
     { $1 }
 | Tchar
     { Character $1 }
 | regexp Tstar
     { Repetition $1 }
 | regexp Tmaybe
     { Alternative (Epsilon, $1) }
 | regexp Tplus
     { Sequence (Repetition ($1), $1)}
 | regexp Tor regexp
     { Alternative ($1, $3) }
 | regexp regexp %prec CONCAT
     { Sequence ($1, $2) }
 | Tlparen regexp Trparen
     { $2 }
Here is the corresponding lexer input file.
open Parser

let incr_loc lexbuf delta =
  let pos = lexbuf.Lexing.lex_curr_p in
    lexbuf.Lexing.lex_curr_p <- { pos with
    Lexing.pos_lnum = pos.Lexing.pos_lnum + 1;
    Lexing.pos_bol = pos.Lexing.pos_cnum - delta;

let char_for_backslash = function
  | 'n' -> '\010'
  | 'r' -> '\013'
  | 'b' -> '\008'
  | 't' -> '\009'
  | c   -> c


let special_chars = 
  ['|' '*' '?' '+' '(' ')']

let backslash_escapes =
  ['\\' '\'' '"' 'n' 't' 'b' 'r' ' ']

rule main = parse
    [' ' '\013' '\009' '\012' ] +
    { main lexbuf }
  | '\010'
    { incr_loc lexbuf 0;
      main lexbuf }
  | [^ '\\' '|' '*' '?' '+' '(' ')']
    { Tchar(Char.code (Lexing.lexeme_char lexbuf 0)) }
  | '\\' backslash_escapes
    { Tchar (Char.code (char_for_backslash (Lexing.lexeme_char lexbuf 1))) }
  | '\\' special_chars
    { Tchar (Char.code (Lexing.lexeme_char lexbuf 1)) }
  | '|'  { Tor }
  | '*'  { Tstar }
  | '?'  { Tmaybe }
  | '+'  { Tplus }
  | '('  { Tlparen }
  | ')'  { Trparen }
  | eof  { Tend }

The function to produce an abstract syntax tree from a string then goes like this.
let (regexp_of_string : string -> Syntax.regular_expression) =
  fun s ->
    let parse_buf lexbuf =
        Parser.main Lexer.main lexbuf
      | Parsing.Parse_error ->
          let curr = lexbuf.Lexing.lex_curr_p in
          let line = curr.Lexing.pos_lnum in
          let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in
          let tok = Lexing.lexeme lexbuf in
          raise (Failure
                    "file \"\", line %d, character %d\n\
                     Error : Syntax error \"%s\"" line cnum tok
    in parse_buf (Lexing.from_string s)

Now, the obvious definition for the symmetric string_of_regexp function.

let rec (string_of_regexp : Syntax.regular_expression -> string) = 
  fun re ->
    match re with
    | Syntax.Epsilon -> 
    | Syntax.Character c -> 
      Printf.sprintf "Character '%c'" (Char.chr c)
    | Syntax.Sequence (p, q) -> 
      Printf.sprintf "Sequence (%s, %s)"
         (string_of_regexp p) (string_of_regexp q)
    | Syntax.Alternative (p, q) -> 
      Printf.sprintf "Alternative (%s, %s)" 
         (string_of_regexp p) (string_of_regexp q)
    | Syntax.Repetition r -> 
      Printf.sprintf "Repetition (%s)" (string_of_regexp r)

With these, a quick little program to glimpse what the AST for an expression like $\left(a|b\right)^*abb$ looks like is easy.
Sequence (
  Repetition (
    Alternative (
      Character 'a',
      Character 'b'
  Sequence (
    Character 'a', 
    Sequence (
      Character 'b', 
      Character 'b'

Thursday, November 13, 2014

Dimensional Analysis in OCaml

Dimensional analysis in OCaml

In 1994, Barton and Nackman in their book 'Scientific Engineering in C++' [1] demonstrated how one could encode the rules of dimensional analysis into the C++ type system enabling compile-time checking (no run-time cost) of the plausibility (at least up to the dimensional correctness) of computations.

In 2004, Abrahams & Gurtovy in 'C++ Template Metaprogramming' [2] showed the Barton Nackman technique to be elegantly implementable using compile time type sequences encoding integer constants. The key properties of the technique are:

  • Encoding of integers as types;
  • Compile time manipulation of sequences of these integer encodings to deduce/produce new derived types.

For a good while there it escaped me how to approach this problem in OCaml and it bothered me somewhat. I turned to the caml-list for guidance and I'm happy to say some notable 'Camlers' there helped me out (thank-you Octachron, Mario Alvarez Picallo, Thomas Gazagnaire, Roberto Di Cosmo and David Mentre)!

The key idea in the solution to follow is the encoding of integers into the type-system as differences between two Peano numbers. The details of the approach are presented in the excellent paper "Many Holes in Hindley-Milner" by Sam Lindley of the University of Edinburgh.

Credit for the code that follows goes to Mario Alvarez Picallo. I generalized Mario's program to the extent that I could do the "force on a laptop" exercise (as presented in the online Boost.MPL tutorial).

The module interface is where all the work is - getting the "type-math" correct.

module type S = sig

  type +'a s = 'a * 'a
  type (+'a,+'b,+'c,+'d,+'e,+'f,+'g,+'h,+'i,+'j,+'k,+'l,+'m,+'n) t

  (*Base dimensions*)

  val mass : 
    float -> ('a,'a s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val length : 
    float -> ('a,'a,'b,'b s,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val time : 
    float -> ('a,'a,'b,'b,'c,'c s,'d,'d,'e,'e,'f,'f,'g,'g) t
  val charge : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d s,'e,'e,'f,'f,'g,'g) t
  val temperature : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e s,'f,'f,'g,'g) t
  val intensity : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f s,'g,'g) t
  val angle :
     float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g s) t

  (*Composite dimensions*)

  val velocity :
    float -> ('a,'a,'b,'b s,'c s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val acceleration :
    float -> ('a,'a,'b,'b s,'c s s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val momentum :
    float -> ('a,'a s,'b,'b s,'c s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val force :
    float -> ('a,'a s,'b,'b s,'c s s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t


  val ( + ) : 
    ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
      ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
        ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t
  val ( - ) :
    ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
      ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
        ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t
  val ( * ) :
    ('a0,'b0,'c0,'d0,'e0,'f0,'g0,'h0,'i0,'j0,'k0,'l0,'m0,'n0) t -> 
      ('b0,'b1,'d0,'d1,'f0,'f1,'g0,'h1,'i0,'j1,'k0,'l1,'m0,'n1) t -> 
        ('a0,'b1,'c0,'d1,'e0,'f1,'g0,'h1,'i0,'j1,'k0,'l1,'m0,'n1) t
  val ( / ) :
    ('a0,'b0,'c0,'d0,'e0,'f0,'g0,'h0,'i0,'j0,'k0,'l0,'m0,'n0) t -> 
      ('a1,'b0,'c1,'d0,'e1,'f0,'g1,'h0,'i1,'j0,'k1,'l0,'m1,'n0) t -> 
        ('a0,'a1,'c0,'c1,'e0,'e1,'g0,'g1,'i0,'i1,'k0,'k1,'m0,'m1) t

  (*Conversion to float*)

  val value : ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> float

That's the hard part, the module implementation itself is trivial.

module Dim : S = struct

  type +'a s = 'a * 'a
  type (+'a,+'b,+'c,+'d,+'e,+'f,+'g,+'h,+'i,+'j,+'k,+'l,+'m,+'n) t = float

  let mass x = x
  let length x = x
  let time x = x
  let charge x = x
  let temperature x = x
  let intensity x = x
  let angle x = x

  let velocity x = x
  let acceleration x = x
  let momentum x = x
  let force x = x

  let ( + ) = ( +. )
  let ( - ) = ( -. )
  let ( * ) = ( *. )
  let ( / ) = ( /. )

  let value x = x


And the motivating "force on a laptop" calculation? Well in the top-level it proceeds like this.

# open Dim ;;
# let m = mass 5.0 ;;
val m : ('a,'a Dim.s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) Dim.t =
# let a = acceleration 9.8 ;;
val a :
  ('a,'a,'b,'b Dim.s,'c Dim.s Dim.s,'c,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
# let f = m * a ;;
val f :
  ('a,'a Dim.s,'b,'b Dim.s,'c Dim.s Dim.s,'c,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
Now to verify the result.
# let m2 = f / a ;;
val m2 :
  ('a,'a Dim.s,'b,'b,'c Dim.s Dim.s,'c Dim.s Dim.s,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
If we got things right, then we'd expect that the difference m2 - m be close to zero (within rounding error).
# value (m2 - m) ;;
- : float = 0.
Indeed it is as we hoped.

The key test though is this, if we had written a/f instead of f/a we want that there be type-check failure preventing the mistake from propagating through the program.

# let m2 = a / f (*oops*) ;;
val m2 :
  ('a Dim.s,'a,'b,'b,'c Dim.s Dim.s,'c Dim.s Dim.s,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = 
# m2 - m ;;
Characters 5-6:
  m2 - m ;;
  This expression has type
   ('a Dim.s,'a Dim.s Dim.s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) Dim.t
  but an expression was expected of type
   ('a Dim.s,'a,'h,'h,'i Dim.s Dim.s,'i Dim.s Dim.s,'j,'j,'k,'k,'l,'l,'m,'m) Dim.t
  The type variable 'a occurs inside 'a Dim.s * 'a Dim.s
And there it is. Happy days!

[1] John J. Barton and Lee R. Nackman. Scientific and Engineering C++: an Introduction with Advanced Techniques and Examples. Reading, MA: Addison Wesley. ISBN 0-201-53393-6. 1994.

[2] David Abrahams and Aleksey Gurtovy C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond (C++ in Depth Series), Addison-Wesley Professional. ISBN:0321227255. 2004.

Sunday, November 9, 2014



When you begin to program in OCaml, one of the first pieces of advice you encounter is to prefer the '::' operator over the '@' operator for list construction. The question is, does it matter? Even knowing that '@' exhibits complexity $O(N)$ as opposed to $O(1)$ for '::' should you care? I mean, how much of a difference can it make really?

The answer of course is yes. In practical terms, it matters. No, it really, really matters. Let's quantify it. Here's a version of the range function that uses '@'.

let range s e =
  let rec loop acc s e =
    if s >= e then acc
    else loop (acc @ [s]) (s + 1) e 
  in loop [] s e
As an aside, you'll note that it's been written to be tail-recursive (so as to avoid inducing stack overflow).

Timing this function for building lists of $2^{10} = 1,024$ elements to $2^{16} = 65,536$ elements produces the following table:

ntime (s)

Ouch! To build a list of just $65,536$ elements, the program takes over 2 minutes!?!

Ok, here's the version of range that uses '::' to build the list (and necessarily does a List.rev on the result in order that the elements don't come out "back to front"):

let range s e =
  let rec loop acc s e =
    if s >= e then acc
    else loop (s :: acc) (s + 1) e 
  in List.rev (loop [] s e)

And the timings for this one?

ntime (s)

That is, this one builds the list of $65,536$ elements in just $5$ milliseconds!


Sunday, October 19, 2014



Stack overflow refers to a condition in the execution of a computer program whereby the stack pointer exceeds the address space allocated for the stack. The usual result of "blowing the stack" is swift and brutal abnormal termination of the program.

The amount of memory allocated by the operating system for a given program's stack is finite and generally little can be done by the programmer to influence the amount that will be made available. The best the programmer can really do is to use what's given wisely.

We can get a sense of the limits of the stack in practical terms with a program like the following.

let rec range s e = 
  if s >= e then [] 
  else s :: range (s + 1) e

let rec loop i =
  let n = 2.0 ** (i |> float_of_int) |> int_of_float in
    let _ = range 0 n in
    loop (i + 1)
  | Stack_overflow -> 
    Printf.printf "Stack overflow at i = %d, n = %d\n" i n
let () = loop 0
range is a function that produces the half-open range $\left[s, e\right)$ - the ordered sequence $\left\{s, s + 1, s + 2, \dots, e - 2, e - 1\right\}$. Note that range is defined in terms of itself, that is, it is a recursive function. The idea is to use it in an unbounded loop to build sequences of increasing lengths of powers of $2$ : ${2^0, 2^1, 2^2, \dots}$. We set it off and when we encounter stack overflow, terminate the program gracefully reporting on the power of $2$ found to give rise to the condition. In my case I found that I was able to make $\approx 2^{19} = 524,288$ recursive calls to range before the stack limit was breached. That's very limiting. For realistic programs, one would hope to be able to produce sequences of lengths much greater than that!

What can be done? The answer lies in the definition of range and that thing called tail-recursion. Specifically, range is not tail-recursive. To be a tail-recursive function, the last thing the function needs do is to make a recursive call to itself. That's not the case for range as written as the recursive call to itself is the second to last thing it does before it returns (the last thing it does is to 'cons' a value onto the list returned by the recursive call).

Why being tail-recursive is helpful is that tail-calls can be implemented by the compiler without requiring the addition of a new "stack frame" to the stack. Instead, the current frame can be replaced in setting up the tail-call being modified as necessary and effectively the recursive call is made to be a simple jump. This is called tail-call elimination and its effect is to allow tail-recursive functions to circumvent stack overflow conditions.

Here's a new definition for range, this time implemented with tail-calls.
let range s e = 
  let rec aux acc s e = 
    if s >= e then acc
  else aux (s :: acc) (s + 1) e
  in List.rev (aux [] s e)
With this definition for range I find I can build sequences of length up to around $\approx 2^{26} = 67,108,864$ elements long without any sign of stack overflow which is a huge improvement! At around this point though, my sequence building capabilities start to be limited by the amount of physical memory present on my PC but that's a different story entirely.

Sunday, October 12, 2014

Dimensional analysis (and the units of the universal gas constant 'R')

Dimensional analysis

The problem at hand is to find by dimensional analysis, the SI units of the universal gas constant $R$ (forgive me - whilst this entry is not explicitly about computer programming - it is in fact one of my daughter's homework problems - the obvious relationship to type systems makes it seem to me at least tangentially relevant).

$R$ is defined by the Ideal Gas Law: $PV = nRT$ were $P$ is the absolute pressure of the gas, $V$ is the volume of the gas, $n$ is the amount of substance of gas (measured in moles), and $T$ is the absolute temperature of the gas.

The obvious dimensions are as follows :

$\left[P\right]$: $M \cdot L^{-1}\cdot T^{-2}$, $\left[V\right]$: $L^3$ and $\left[T\right]$: $\Theta$.

Now, one mole of a substance is defined to be $6.0221367\times 10^{23}$ atoms of that substance (Avogardro's number) but even dimensionless numbers can be part of a dimensioned system. The trick is to realize that if one quantity in an equation is "per mole" then so too must be any quantity added to it. Accordingly, if we define a (pseudo) dimension $\Lambda$ for the amount $n$ we can reason that $\left[R\right]$: $M \cdot L^{2} \cdot T^{-2} \cdot \Theta^{-1} \cdot \Lambda^{-1}$. This is enough for us to say the fundamental units for $R$ are

$kg \cdot m^{2} \cdot s^{-2} \cdot K^{-1} \cdot mol^{-1}$.

We can go a little further though. Since $work = force \times length$ we see that $M \cdot L \cdot T^{-2}$ can be expressed in units of energy and indeed $1J = kg \cdot m^{2} \cdot s^{-2}$. Thus we arrive at our final conclusion. $R$ can be written with units

$J \cdot K^{-1} \cdot mol^{-1}$.

The beautiful thing though is this. The physical interpretation of the ideal gas law is saying that, for an ideal gas, of any kind, "the energy per degree per mole" is a constant (that constant being $\approx 8.3144 \frac{^J/_K}{mol}$)!