regexps

Stephen Weeks MLton@sourcelight.com
Tue, 12 Jun 2001 12:42:33 -0700


> Long long ago Rico had a fancy version of egrep he wrote which did on-the-fly
> DFA construction.  One tweak which I told him to do and which saved a ton
> was to first divide all characters into equivalence classes where 2 chars are
> equivalent iff all NFA states do the same thing on them.  This pretty much
> makes singleton classes for the chars that appear.  Then the constructed DFA
> ran by taking the next character and mapping it to its equivalence class
> (via a single array lookup in a 256 byte array).  This shrunk things by a lot.

An extremely cool hack.  I have added it to the implementation.  There are only 7
classes for the regexmatch benchmark.  I'll have to send in an updated one
sometime (something of a pain since I have to extricate the code from
dependences on my libraries).

Now that all our benchmarks are there, check out the shootout
	http://www.bagley.org/~doug/shootout/
and the fine scorecard
	http://www.bagley.org/~doug/shootout/craps.shtml
MLton is faring quite well.

> The fancy incremental part kept all the sets of NFA states seen so far.  No
> state minimization.

I think I'll hold off on this for now.  One should use the DFA if it's small
enough, and if it's not, fall back to the NFA.  Hopefully the NFA isn't too much
slower than the incremental approach (although now that I think of it, there
probably is a factor of 5 or so).

> The canonical bad case for the DFA is, using egrep notation,
> 	a.......$
> The DFA must remember for the last n (number of dots) characters if it was an
> a or not: size 2^n, and this is clearly minimal.

Agreed.