new regexp library

Stephen Weeks MLton@sourcelight.com
Thu, 21 Jun 2001 09:04:06 -0700


--VJ5SYgeoVM
Content-Type: text/plain; charset=us-ascii
Content-Description: message body text
Content-Transfer-Encoding: 7bit


> Dot  is  certainly  a fine program, but this thing that it produced is pretty
> bad.  For one thing, it is horizontal  instead  of  vertical.   

I chose that on purpose (randir = "LR") because my screen has more width than
height.

> This  is  bad
> beause  you  a  page  has more space vertically than horizontally.  Also, the
> size is 9221 accross, which is just under 128 inches.  A bit wide compared to
> any paper I have around here.

There are other dot options to tell it to split the graph onto multiple pages,
to shrink the separation between columns of nodes, etc.  Usually, I find it
easier to scroll around with gv.

> In  the  graph,  I  don't  understand  DFA transitions labelled with an empty
> character class (e.g., n39 -> n38). 

Sorry, that is actually a character class with one element, the space character.

Anyways, I've appended the dot source to this message so you can reformat if you
want.

> Yes, look at state n4: It can go to state N3 on a tab character, but it looks
> like it can also go to state n1 on a  tab  (assuming  that  the  `^'  in  the
> n4 -> n1   character   class  really  means  the  complement  of  the  listed
> characters.

Huh?  n4 -> n3 does include tab, but n4 -> n1 is a complement that lists tab and
therefore does not include tab.

> I  don't think I would bother with returning the list of matches yet.  What I
> usually do is have a regexp that just handles  one  initial  element  of  the
> list,  and then, using that, I extract the element and the rest of the string
> and loop.  This really pushes you to have some kind of  match-regexp-against-
> substring kind of thing, but that should be no problem.

If I understand you correctly, that's what mlprof does now.  There is one
toplevel regexp that matches the line and another that is used to loop and
extract the labels.  I now notice that the later could be done with
String.fields and friends, so I might change it to that.

> This division can often be more efficient than matching everything in one big
> regular expression any way.  If you think about it, it is very  analogous  to
> Prolog's  cut  (`!').  The notion is that once you see the start of the list,
> you are NOT willing to backtrack around this choice.  If you write it as  one
> big regular expression, you can't express that, and so the matching can often
> take longer because it is going to keep these other possibilities alive.   In
> order  to  avoid any order n^2 thing, you have to stop scanning when you come
> to a state which can't possibly succeed.  This shouldn't be hard even with an
> NFA, let alone with a DFA.

Makes sense.  For this particular regexp, we're OK.  The nodes that handle the
sequence of labels are n6, n8, n9, n10, n11, n12, and n19.



--VJ5SYgeoVM
Content-Type: application/octet-stream
Content-Disposition: attachment;
	filename="symbol.dot"
Content-Transfer-Encoding: base64

ZGlncmFwaCAiZGZhIiB7CmxhYmVsID0gImRmYSI7IHJhbmtkaXIgPSAiTFIiOyB7IHJhbmsgID0g
Im1pbiI7IG4wIH0KbjEgW3NoYXBlID0gImVsbGlwc2UiXQpuMSAtPiBuMSBbbGFiZWwgPSAiW15c
XFxcdFxcXFxuICFcIiMkJigpLS4vOzw9Pj9AW11dIC0tICgoMCwgMCwgKCkpKSJdCm4xIC0+IG4y
IFtsYWJlbCA9ICJcXFxcbiAtLSAoKDAsIDAsIChGaW5pc2ggU2F2ZSB+MSkpKSJdCm4xIC0+IG4z
IFtsYWJlbCA9ICJbXFxcXHQgXSAtLSAoKDAsIDAsIChGaW5pc2ggU2F2ZSB+MSkpKSJdCm4yIFtz
aGFwZSA9ICJib3giXQpuMyBbc2hhcGUgPSAiZWxsaXBzZSJdCm4zIC0+IG4yIFtsYWJlbCA9ICJc
XFxcbiAtLSAoKDAsIDAsICgpKSkiXQpuMyAtPiBuMyBbbGFiZWwgPSAiW1xcXFx0IF0gLS0gKCgw
LCAwLCAoKSkpIl0KbjQgW3NoYXBlID0gImVsbGlwc2UiXQpuNCAtPiBuMSBbbGFiZWwgPSAiW15c
XFxcdFxcXFxuICFcIiMkJigpLS4vOzw9Pj9AW11dIC0tICgoMCwgMCwgKCkpKSJdCm40IC0+IG4y
IFtsYWJlbCA9ICJcXFxcbiAtLSAoKDAsIDAsIChGaW5pc2ggU2F2ZSB+MSkpKSJdCm40IC0+IG4z
IFtsYWJlbCA9ICJbXFxcXHQgXSAtLSAoKDAsIDAsIChGaW5pc2ggU2F2ZSB+MSkpKSJdCm41IFtz
aGFwZSA9ICJlbGxpcHNlIl0KbjUgLT4gbjQgW2xhYmVsID0gIlteXFxcXHRcXFxcbiAhXCIjJCYn
KCktLi8wMTIzNDU2Nzg5Ozw9Pj9AW11dIC0tICgoMCwgMCwgKFN0YXJ0IFNhdmUgfjEpKSkiXQpu
NiBbc2hhcGUgPSAiZWxsaXBzZSJdCm42IC0+IG43IFtsYWJlbCA9ICJCIC0tICgoMCwgMCwgKEZp
bmlzaCBTYXZlIH4xKSkpIl0KbjYgLT4gbjggW2xhYmVsID0gIlswMTIzNDU2Nzg5XSAtLSAoKDAs
IDAsICgpKSkiXQpuNiAtPiBuOSBbbGFiZWwgPSAiWy5dIC0tICgoMCwgMCwgKCkpKSJdCm4xMCBb
c2hhcGUgPSAiZWxsaXBzZSJdCm4xMCAtPiBuMTAgW2xhYmVsID0gIlteXFxcXHRcXFxcbiAhXCIj
JCYoKS0uLzs8PT4/QFtdXSAtLSAoKDAsIDAsICgpKSkiXQpuMTAgLT4gbjExIFtsYWJlbCA9ICIk
IC0tICgoMCwgMCwgKCkpKSJdCm4xMSBbc2hhcGUgPSAiZWxsaXBzZSJdCm4xMSAtPiBuNiBbbGFi
ZWwgPSAiJCAtLSAoKDAsIDAsICgpKSkiXQpuMTIgW3NoYXBlID0gImVsbGlwc2UiXQpuMTIgLT4g
bjEwIFtsYWJlbCA9ICJbXlxcXFx0XFxcXG4gIVwiIyQmKCktLi87PD0+P0BbXV0gLS0gKCgwLCAw
LCAoKSkpIl0KbjEyIC0+IG4xMSBbbGFiZWwgPSAiJCAtLSAoKDAsIDAsICgpKSkiXQpuMTMgW3No
YXBlID0gImJveCJdCm4xNCBbc2hhcGUgPSAiZWxsaXBzZSJdCm4xNCAtPiBuMTMgW2xhYmVsID0g
IlxcXFxuIC0tICgoMCwgMCwgKCkpKSJdCm4xNCAtPiBuMTQgW2xhYmVsID0gIltcXFxcdCBdIC0t
ICgoMCwgMCwgKCkpKSJdCm4xNSBbc2hhcGUgPSAiZWxsaXBzZSJdCm4xNSAtPiBuMTMgW2xhYmVs
ID0gIlxcXFxuIC0tICgoMCwgMCwgKCkpKSJdCm4xNSAtPiBuMTQgW2xhYmVsID0gIltcXFxcdCBd
IC0tICgoMCwgMCwgKCkpKSJdCm4xNiBbc2hhcGUgPSAiZWxsaXBzZSJdCm4xNiAtPiBuMTUgW2xh
YmVsID0gIm4gLS0gKCgwLCAwLCAoKSkpIl0KbjE3IFtzaGFwZSA9ICJlbGxpcHNlIl0KbjE3IC0+
IG4xNiBbbGFiZWwgPSAiaSAtLSAoKDAsIDAsICgpKSkiXQpuMTggW3NoYXBlID0gImVsbGlwc2Ui
XQpuMTggLT4gbjE3IFtsYWJlbCA9ICJnIC0tICgoMCwgMCwgKCkpKSJdCm43IFtzaGFwZSA9ICJl
bGxpcHNlIl0KbjcgLT4gbjE4IFtsYWJlbCA9ICJlIC0tICgoMCwgMCwgKCkpKSJdCm44IFtzaGFw
ZSA9ICJlbGxpcHNlIl0KbjggLT4gbjggW2xhYmVsID0gIlswMTIzNDU2Nzg5XSAtLSAoKDAsIDAs
ICgpKSkiXQpuOCAtPiBuOSBbbGFiZWwgPSAiWy5dIC0tICgoMCwgMCwgKCkpKSJdCm45IFtzaGFw
ZSA9ICJlbGxpcHNlIl0KbjkgLT4gbjEyIFtsYWJlbCA9ICJbXlxcXFx0XFxcXG4gIVwiIyQmJygp
LS4vMDEyMzQ1Njc4OTs8PT4/QFtdXSAtLSAoKDAsIDAsICgpKSkiXQpuMTkgW3NoYXBlID0gImVs
bGlwc2UiXQpuMTkgLT4gbjcgW2xhYmVsID0gIkIgLS0gKCgwLCAwLCAoU3RhcnQgU2F2ZSB+MSxG
aW5pc2ggU2F2ZSB+MSkpKSJdCm4xOSAtPiBuOCBbbGFiZWwgPSAiWzAxMjM0NTY3ODldIC0tICgo
MCwgMCwgKFN0YXJ0IFNhdmUgfjEpKSkiXQpuMTkgLT4gbjkgW2xhYmVsID0gIlsuXSAtLSAoKDAs
IDAsIChTdGFydCBTYXZlIH4xKSkpIl0KbjIwIFtzaGFwZSA9ICJlbGxpcHNlIl0KbjIwIC0+IG4y
MCBbbGFiZWwgPSAiWzAxMjM0NTY3ODldIC0tICgoMCwgMCwgKCkpKSJdCm4yMCAtPiBuMjEgW2xh
YmVsID0gIiQgLS0gKCgwLCAwLCAoKSkpIl0KbjIxIFtzaGFwZSA9ICJlbGxpcHNlIl0KbjIxIC0+
IG4xOSBbbGFiZWwgPSAiJCAtLSAoKDAsIDAsICgpKSkiXQpuMjIgW3NoYXBlID0gImVsbGlwc2Ui
XQpuMjIgLT4gbjIwIFtsYWJlbCA9ICJbMDEyMzQ1Njc4OV0gLS0gKCgwLCAwLCAoKSkpIl0KbjIy
IC0+IG4yMSBbbGFiZWwgPSAiJCAtLSAoKDAsIDAsICgpKSkiXQpuMjMgW3NoYXBlID0gImVsbGlw
c2UiXQpuMjMgLT4gbjIyIFtsYWJlbCA9ICJlIC0tICgoMCwgMCwgKCkpKSJdCm4yNCBbc2hhcGUg
PSAiZWxsaXBzZSJdCm4yNCAtPiBuMjMgW2xhYmVsID0gImwgLS0gKCgwLCAwLCAoKSkpIl0KbjI1
IFtzaGFwZSA9ICJlbGxpcHNlIl0KbjI1IC0+IG4yNCBbbGFiZWwgPSAiaSAtLSAoKDAsIDAsICgp
KSkiXQpuMjYgW3NoYXBlID0gImVsbGlwc2UiXQpuMjYgLT4gbjI1IFtsYWJlbCA9ICJmIC0tICgo
MCwgMCwgKCkpKSJdCm4yNyBbc2hhcGUgPSAiZWxsaXBzZSJdCm4yNyAtPiBuMjYgW2xhYmVsID0g
Im8gLS0gKCgwLCAwLCAoKSkpIl0KbjI4IFtzaGFwZSA9ICJlbGxpcHNlIl0KbjI4IC0+IG4yNyBb
bGFiZWwgPSAiciAtLSAoKDAsIDAsICgpKSkiXQpuMjkgW3NoYXBlID0gImVsbGlwc2UiXQpuMjkg
LT4gbjI4IFtsYWJlbCA9ICJQIC0tICgoMCwgMCwgKCkpKSJdCm4zMCBbc2hhcGUgPSAiZWxsaXBz
ZSJdCm4zMCAtPiBuMjkgW2xhYmVsID0gIm4gLS0gKCgwLCAwLCAoKSkpIl0KbjMxIFtzaGFwZSA9
ICJlbGxpcHNlIl0KbjMxIC0+IG4zMCBbbGFiZWwgPSAibyAtLSAoKDAsIDAsICgpKSkiXQpuMzIg
W3NoYXBlID0gImVsbGlwc2UiXQpuMzIgLT4gbjMxIFtsYWJlbCA9ICJ0IC0tICgoMCwgMCwgKCkp
KSJdCm4zMyBbc2hhcGUgPSAiZWxsaXBzZSJdCm4zMyAtPiBuMzIgW2xhYmVsID0gIkwgLS0gKCgw
LCAwLCAoKSkpIl0KbjM0IFtzaGFwZSA9ICJlbGxpcHNlIl0KbjM0IC0+IG4zMyBbbGFiZWwgPSAi
TSAtLSAoKDAsIDAsICgpKSkiXQpuMzUgW3NoYXBlID0gImVsbGlwc2UiXQpuMzUgLT4gbjM0IFts
YWJlbCA9ICJbIF0gLS0gKCgwLCAwLCAoRmluaXNoIFNhdmUgfjEpKSkiXQpuMzYgW3NoYXBlID0g
ImVsbGlwc2UiXQpuMzYgLT4gbjUgW2xhYmVsID0gIlsgXSAtLSAoKDAsIDAsIChGaW5pc2ggU2F2
ZSB+MSkpKSJdCm4zNyBbc2hhcGUgPSAiZWxsaXBzZSJdCm4zNyAtPiBuMzcgW2xhYmVsID0gIlsw
MTIzNDU2Nzg5QUJDREVGYWJjZGVmXSAtLSAoKDEsIDEsICgpKSwoMCwgMCwgKCkpKSJdCm4zNyAt
PiBuMzggW2xhYmVsID0gIlsgXSAtLSAoKDEsIDEsIChGaW5pc2ggU2F2ZSB+MSkpLCgwLCAwLCAo
RmluaXNoIFNhdmUgfjEpKSkiXQpuMzggW3NoYXBlID0gImVsbGlwc2UiXQpuMzggLT4gbjM1IFts
YWJlbCA9ICJ0IC0tICgoMCwgMCwgKFN0YXJ0IFNhdmUgfjEpKSkiXQpuMzggLT4gbjM2IFtsYWJl
bCA9ICJUIC0tICgoMSwgMCwgKFN0YXJ0IFNhdmUgfjEpKSkiXQpuMzkgW3NoYXBlID0gImVsbGlw
c2UiXQpuMzkgLT4gbjM3IFtsYWJlbCA9ICJbMDEyMzQ1Njc4OUFCQ0RFRmFiY2RlZl0gLS0gKCgx
LCAxLCAoKSksKDAsIDAsICgpKSkiXQpuMzkgLT4gbjM4IFtsYWJlbCA9ICJbIF0gLS0gKCgxLCAx
LCAoRmluaXNoIFNhdmUgfjEpKSwoMCwgMCwgKEZpbmlzaCBTYXZlIH4xKSkpIl0KbjQwIFtzaGFw
ZSA9ICJlbGxpcHNlIl0KbjAgW3NoYXBlID0gImRpYW1vbmQiLCBsYWJlbCA9ICJzdGFydCJdCm4w
IC0+IG4zOSBbbGFiZWwgPSAiWzAxMjM0NTY3ODlBQkNERUZhYmNkZWZdIC0tICgoMCwgMSwgKFN0
YXJ0IFNhdmUgfjEpKSwoMCwgMCwgKFN0YXJ0IFNhdmUgfjEpKSkiXQp9

--VJ5SYgeoVM--