[TRE-general] Repeat histories
Rici Lake
rici at ricilake.net
Thu Feb 22 00:19:46 EET 2007
On 21-Feb-07, at 4:16 PM, Chris Kuklewicz wrote:
> Rici Lake wrote:
>>
>> If I get really enthusiastic, I might even code that to see if
>> it works :)
>>
>
> Thanks from bringing more C.S. knowledge into this. As a physicist I
> bring very
> little formal C.S. to the table.
>
> And Rici Lake wrote:
>>> a bunch of handwaving
>>
>> I forgot to mention, although I suppose it's obvious:
>
> :) "Proof by intimidation"
> http://www.everything2.com/index.pl?node_id=961354&lastnode_id=0
:)
>>
>> If we establish that a given sub-expression (R)* is
>> right-matchable, then we do not need ordinals at all;
>> we can get the Posix semantics by using the shortest-
>> non-empty-match rule for the last capture.
>
> I am unlikely to believe that: (right-matchable property) implies (the
> shortest
> final iteration implies Posix semantics).
Sorry, you're quite right. I was, as usual, being imprecise. I left
out one transformation, which is below.
It's not quite sufficient to just say shortest-last-match (and I
usually try to say shortest-non-empty-match, although that doesn't
quite capture the meaning either.) It's also necessary to prefer
not repeating. The algorithm I presented originally in my rough
outline of Posix actually captures the semantics correctly, and
is implementable without maintaining repeat histories.
Specifically, I decompose (R)+ into R'* (R)
where R' is R with all captures eliminated.
(R)* is decomposed into (?:R'* (R))? where the
outer parentheses are not captures. If R is right-matchable,
then this decomposition generates the correct capture with
standard Posix concatenation semantics; that capture is the
one for which R'* is the longest, and consequently (R) is
the shortest.
In addition, when I construct the NFA, I eliminate any transition
which involves an "unnecessary" repeat. That's actually a natural
result of the Glushkov construction, but I don't quite use a pure
Glushkov automaton because I do need to notice repeats. I keep
meaning to write up the actual algorithm, which is somewhat similar
to the one in the Frisch paper, I think. So in the automaton for
(a|b|ab)*, the transitions are (using ^ as the position marker
because centre-dot is too hard to type):
State input Goto(s)
^a|b|ab a ^a|b|ab a|^b|ab a|b|^ab +Follow set
a|^b|ab b ^a|b|ab a|^b|ab a|b|^ab +Follow set
a|b|a^b b ^a|b|ab a|^b|ab a|b|^ab +Follow set
a|b|^ab a a|b|a^b
Consequently, the parse of "ab" as (a)(b) is not possible. That
doesn't affect the language recognized, and I don't believe it
affects Posix semantics either. (Finite repetitions require
different handling.)
With that automaton, shortest-final-match is correct and, I think,
intuitive.
It's interesting to compare the regular expressions:
(a*b*)* and (a|b)*
which often show up as an example. They recognize the same language,
and the first one is not in star-normal form so many people would
like to reduce it to the second one. However, the (Posix-semantic)
captures are different: the first one should capture the last complete
phrase consisting of some number of a's then some number (possibly 0)
b's;
it will capture (b*) only if the target sequence had no a's in it.
The second one simply captures the last a or b in the target sequence.
Those are both interesting but distinct behaviours, and in my opinion
the
semantics are intuitive in both cases. The shortest-non-empty match
algorithm, using the NFA as outlined above, produces the correct result
in both cases. (Another distinct possibile regex for the same language
is (a*|b*)* which produces yet a different capture.)
I regard that as interesting rather than useful because in almost
any practical application, you would want the complete set of
captures; not just the last one. However, these problems are clearly
related.
I should note that my implementation does not (as yet) correctly
handle non-right-matchable repetitions, which I've always regarded
as a defect and is the main reason I've never made it available,
although I use it a lot. However, thanks to you I'm now somewhat
more motivated to fix it if I can find the time. :)
> Further comment: The longest-final-match and shortest-final-match
> often imply perverse things about the previous sequence of matches.
I think shortest-non-empty is the intuitive behaviour for greedy
matches (with the proviso that empty matches are
More information about the TRE-general
mailing list