[TRE-general] Matching (^) question

Chris Kuklewicz tre-general at list.mightyreason.com
Fri Jan 5 23:56:39 EET 2007


Hello,

I am tinkering more with the Haskell regex-* packages.  In particular I have
been considering implementing a tagged DFA in pure Haskell.

In doing so there are some regex corner cases, independent of implementation,
that you have already chosen behavior for in libtre.

Thus my hard question is: What was your rationale for handling patterns with (^)
alternatives in them?

I can see examples by running libtre (version 0.7.4 via my Haskell interface):

>> "searchme" =~ "((s^)|(s)|(^)|($)|(^.))*" :: Array Int (MatchOffset,MatchLength)
> array (0,6) [(0,(0,1)),(1,(0,1)),(2,(-1,0)),(3,(0,1)),(4,(1,0)),(5,(1,0)),(6,(-1,0))]

In the above, the (^) and ($) capturing sub-expressions (the 4th and 5th)
succeed in matching after the first 's' has been consumed. This is the
"(4,(1,0)),(5,(1,0))" part of the (MatchOffset,MatchLength).

Both (s^) and (^.) fail as expected, but it looks like (s) then (^) matched.
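
To make the result array above easier to read, here is a minimal sketch that lists the groups that were set and picks out those that matched the empty string away from offset 0. The names setGroups and emptyAwayFromStart are hypothetical helpers, not part of any regex-* package, and the sketch assumes that an offset of -1 marks a group that did not take part in the match.

```haskell
import Data.Array (Array, assocs, listArray)

-- Type synonyms mirroring the names used in the output above.
type MatchOffset = Int
type MatchLength = Int

-- The result array of the first example, copied from the output above.
firstExample :: Array Int (MatchOffset, MatchLength)
firstExample = listArray (0, 6)
  [(0,1), (0,1), (-1,0), (0,1), (1,0), (1,0), (-1,0)]

-- Assumption: an offset of -1 means the group did not take part in the match.
setGroups :: Array Int (MatchOffset, MatchLength) -> [(Int, MatchOffset, MatchLength)]
setGroups arr = [ (i, off, len) | (i, (off, len)) <- assocs arr, off /= -1 ]

-- Indices of groups that matched zero characters at an offset after the start.
emptyAwayFromStart :: Array Int (MatchOffset, MatchLength) -> [Int]
emptyAwayFromStart arr = [ i | (i, off, len) <- setGroups arr, len == 0, off > 0 ]
```

Evaluating emptyAwayFromStart firstExample gives [4,5], which are exactly the (^) and ($) groups reported at offset 1.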

There is also strange behavior for "(^|())":

>> "searchme" =~ "s(()|^)e" :: Array Int (MatchOffset,MatchLength)
> array (0,2) [(0,(0,2)),(1,(1,0)),(2,(1,0))]

The above looks sane but re-ordering the alternative causes the match to fail:

>> "searchme" =~ "s(^|())e" :: Array Int (MatchOffset,MatchLength)
> array (1,0) []

Writing "(^)?" also seems to have strange or wrong behavior:

>> "searchme" =~ "s()?e" :: Array Int (MatchOffset,MatchLength)
> array (0,1) [(0,(0,2)),(1,(1,0))]

The above is correct, but this confuses me:

>> "searchme" =~ "s(^)?e" :: Array Int (MatchOffset,MatchLength)
> array (1,0) []
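
For comparison, here is a minimal sketch of the naive reading I would have expected, in which (^) is a zero-width assertion that holds only at offset 0 and alternatives are tried left to right with backtracking. This is not how libtre works internally; the type Re and the functions ends and search are hypothetical names made up for this sketch, and captures are not tracked at all.

```haskell
-- A tiny regex AST, just large enough for the patterns above.
data Re = Chr Char | Caret | Eps | Seq Re Re | Alt Re Re | Opt Re

-- All offsets at which the pattern can end when started at offset i,
-- listed in backtracking preference order.
ends :: String -> Re -> Int -> [Int]
ends s (Chr c)   i = [ i + 1 | i < length s, s !! i == c ]
ends _ Caret     i = [ i | i == 0 ]   -- assumption: ^ holds only at offset 0
ends _ Eps       i = [i]
ends s (Seq a b) i = concatMap (ends s b) (ends s a i)
ends s (Alt a b) i = ends s a i ++ ends s b i
ends s (Opt a)   i = ends s a i ++ [i]

-- Leftmost match as (offset, length), taking the first alternative that works.
search :: Re -> String -> Maybe (Int, Int)
search r s =
  case [ (i, j - i) | i <- [0 .. length s], j <- take 1 (ends s r i) ] of
    (m : _) -> Just m
    []      -> Nothing

-- "s(()|^)e", "s(^|())e" and "s(^)?e" written in the AST.
emptyThenCaret, caretThenEmpty, optionalCaret :: Re
emptyThenCaret = Seq (Chr 's') (Seq (Alt Eps Caret) (Chr 'e'))
caretThenEmpty = Seq (Chr 's') (Seq (Alt Caret Eps) (Chr 'e'))
optionalCaret  = Seq (Chr 's') (Seq (Opt Caret) (Chr 'e'))
```

Under this reading all three patterns give Just (0,2) on "searchme", since the failing (^) branch is simply skipped. That is why the two failures reported by libtre confuse me.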

The success of a later iteration also does not cancel the capture of the
previous iteration when the last iteration matches 0 characters:

>> "searchme" =~ "((s)|(e)|(a))*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,3)),(1,(2,1)),(2,(-1,0)),(3,(-1,0)),(4,(2,1))]

The above is correct, but this confuses me:

>> "searchme" =~ "((s)|(e)|())*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1)),(4,(2,0))]

Here it looks like the empty () alternative was matched after (s) and (e).  The
(s) has been unset as expected.  The 1st capturing group matched the same as (e).

Do you have any insight into how to think (or not to think) about these examples?

Happy New Year,
  Chris Kuklewicz

