[TRE-general] Repeat histories

Rici Lake rici at ricilake.net
Thu Feb 22 01:37:38 EET 2007


On 21-Feb-07, at 6:16 PM, Chris Kuklewicz wrote:

>> So in the automaton for (a|b|ab)*, the transitions are (using ^ as the
>> position marker because centre-dot is too hard to type):
>>
>> State     input     Goto(s)
>> ^a|b|ab     a       ^a|b|ab  a|^b|ab  a|b|^ab  +Follow set
>> a|^b|ab     b       ^a|b|ab  a|^b|ab  a|b|^ab  +Follow set
>> a|b|a^b     b       ^a|b|ab  a|^b|ab  a|b|^ab  +Follow set
>> a|b|^ab     a       a|b|a^b
>
> I think I understand the notation.  The ^ indicate "about to accept the
> character to the right of the ^".

Yes, that's what I meant. In the Glushkov automaton, the states would
have subscripts indicating their position, but the centre-dot notation
is more like descriptions of LL or LR parsing. However, centre-dots
are not on my keyboard :)

> I do no understand "+Follow set"...it seems
> to mean that the * has iterated.

Umm, that was sloppy. I was talking about (a|b|ab)* in the middle
of some larger regex, so there is an outgoing transition to the
follow set of the enclosing regex.
>
>> Consequently, the parse of "ab" as (a)(b) is not possible.
>
> Unfortunately, I do not understand how to draw that conclusion from 
> your
> transition table.  But if you are minimizing the iteration count then 
> this may
> be correct.

Yes, quite right. The code is actually quite a bit simpler than
my description of the code, which is probably wrong.

The (a)(b) parse gets eliminated because (ab) is better; once
it has been eliminated, (ab) is the last available match, so
it's what you get. That's not exactly counting iterations, but
it is similar.

The code I have is written in Lua, and I use it to parse sequences
which are not, strictly speaking, strings (the atoms are actually
predicates on elements of the sequence). I'll clean it up a bit
and post it, possibly for further public humiliation. :) About
all I can say it its defense at this point is that it has actually
been useful for me, but I've never been happy enough with it to
make it public.

You're right that R'* (R) is not the Posix equivalent to (R)+; it
would have been better to write it as R'*? (R), which is, I think,
the end result of the transformation. So I was completely out to
lunch with the description, and I'll give it another try when
my brain is more fully in gear.



More information about the TRE-general mailing list