[TRE-general] libtre capture nonsense

Chris Kuklewicz tre-general at list.mightyreason.com
Sun Jan 28 13:59:06 EET 2007


Hi Ville,

  Thanks to your Master thesis I seem to have a tagged DFA that passes all the
POSIX tests.  Concentrating on correctness I have not made it very efficient
yet, so it is 4 times slower that TRE for the one speed benchmark I have.

Ville Laurikari wrote:
> On Mon, Jan 15, 2007 at 08:45:33PM +0000, Chris Kuklewicz wrote:
>> Specifically, this behavior from libtre seems like nonsense:
>>
>>> let r = makeRegex "(b*|c(c*))*" :: Text.Regex.TRE.Regex
>>   in match r "cbb" :: MatchArray
>>
>> array (0,2) [(0,(0,3)),(1,(1,2)),(2,(1,2))]
>> But \2 is (c*) so this is nonsense.
> 
> This is definitely a bug in TRE.  I added this test case in the
> regression tests and did some debugging.  The bug seems to be in the
> part which adds tags in the syntax tree.  It's probably easy enough to
> fix, I'll see what I can do about it.

I can give you more test cases.  Since I have Haskell's QuickCheck generating
random tests I can trigged such bugs in new ways.

I did not bother posting them all to the the list, but since you are back to
fixing bugs, I will post a selection below.  Glen Fowler <gsf at research.att.com>
has asked me for some test cases to add to the
http://www.research.att.com/~gsf/testregex/ site.

This is probably the same bug:

"yyyyyy" TRE.=~ "(yyy|(x?)){2,4}"

array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("yyy",(3,3)))],"")

The array reports (group #, (text captured, (start offset, length)))

Here TRE reports that the second group (x?) which should only match "" or "x"
somehow matched "yyy".  This is even more wrong than the regex-posix bug above.

The correct answer, where the second group does not match:
array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("",(-1,0)))],"")

This may be that other bug instead:

> *Main> matchit' "xxx" "($)|()"
> ("Posix  ",("",array (0,2) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0)))],"xxx"))
> ("TDFA   ",("",array (0,2) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0)))],"xxx"))
> ("TRE    ",("xxx",array (0,2) [(0,("",(3,0))),(1,("",(3,0))),(2,("",(3,0)))],""))

Here matchit' also reports on
(string before match, array as above, string after match)
The "Posix" engine is from using regex.h on Mac OS X 10.4 (PPC).
The "TDFA" engine is my Haskell code.

And in the above, TRE does not succeed in matching the empty () at the front of
the string and stop, but rather keeps looking until it hits the end of the input
(which could be quite long in practice).  And at the end it claims to have
matched both alternatives at the same time, which is also wrong.

This is also seen in:

> *Main> matchit' "ac\n" "$()|^()"
> ("Posix  ",("",array (0,2) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0)))],"ac\n"))
> ("TDFA   ",("",array (0,2) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0)))],"ac\n"))
> ("TRE    ",("ac",array (0,2) [(0,("",(2,0))),(1,("",(2,0))),(2,("",(2,0)))],"\n"))
and its twin:
> *Main> matchit' "ac\n" "^()|$()"
> ("Posix  ",("",array (0,2) [(0,("",(0,0))),(1,("",(0,0))),(2,("",(-1,0)))],"ac\n"))
> ("TDFA   ",("",array (0,2) [(0,("",(0,0))),(1,("",(0,0))),(2,("",(-1,0)))],"ac\n"))
> ("TRE    ",("",array (0,2) [(0,("",(0,0))),(1,("",(0,0))),(2,("",(0,0)))],"ac\n"))

This behavior can alter the whole match as well:

> *Main> matchit'  "__" "($)?(.)"
> ("Posix  ",("",array (0,2) [(0,("_",(0,1))),(1,("",(-1,0))),(2,("_",(0,1)))],"_"))
> ("TDFA   ",("",array (0,2) [(0,("_",(0,1))),(1,("",(-1,0))),(2,("_",(0,1)))],"_"))
> ("TRE    ",("__",array (1,0) [],""))

The above should have skipped the $ and then matched.  Various other examples:

> *Main> matchit' "c"  "(.|()|())*"
> ("Posix  ",("",array (0,3) [(0,("c",(0,1))),(1,("",(1,0))),(2,("",(1,0))),(3,("",(-1,0)))],""))
> ("TDFA   ",("",array (0,3) [(0,("c",(0,1))),(1,("c",(0,1))),(2,("",(-1,0))),(3,("",(-1,0)))],""))
> ("TRE    ",("",array (0,3) [(0,("c",(0,1))),(1,("c",(0,1))),(2,("",(1,0))),(3,("",(1,0)))],""))

The above shows that TRE matched both group 2 and group 3 which is inconsistent.
  My code uses a Tagged DFA as well, and this looks like a similar or identical
bug in assigning tag in the syntax tree.  And the regex.h "Posix" engine
annoyingly decides to look and match 0 characters if it can at the end of each
*, making it non-Posix. This regex.h "Posix" library has problems of a more
serious nature as well:

> *Main> matchit' "ab"  "((a)|(b)){2,}"
> ("Posix  ",("",array (0,3) [(0,("ab",(0,2))),(1,("b",(1,1))),(2,("a",(0,1))),(3,("b",(1,1)))],""))
> ("TDFA   ",("",array (0,3) [(0,("ab",(0,2))),(1,("b",(1,1))),(2,("",(-1,0))),(3,("b",(1,1)))],""))
> ("TRE    ",("",array (0,3) [(0,("ab",(0,2))),(1,("b",(1,1))),(2,("",(-1,0))),(3,("b",(1,1)))],""))

There is no way to justify the regex.h "Posix" returning a match for group 2 and
for group 3, since group 2 is not nested inside group 1 as it should be.  I can
see the identical bug in "sed" since I uses the same regex.h library.

And another test case:

> *Main> matchit' "" ".()|((.)?)"
> ("Posix  ",("",array (0,3) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0))),(3,("",(-1,0)))],""))
> ("TDFA   ",("",array (0,3) [(0,("",(0,0))),(1,("",(-1,0))),(2,("",(0,0))),(3,("",(-1,0)))],""))
> ("TRE    ",("",array (0,3) [(0,("",(0,0))),(1,("",(0,0))),(2,("",(0,0))),(3,("",(-1,0)))],""))

Which looks like the same tag assignment bug.

Good Luck,
  Chris Kuklewicz




More information about the TRE-general mailing list