[TRE-general] Repeat histories
Rici Lake
rici at ricilake.net
Wed Feb 21 22:07:39 EET 2007
On 21-Feb-07, at 10:15 AM, Chris Kuklewicz wrote:
>
> What this proposal does is compress the Ordinal and list of positions
> to a list of bits.
Yep. It has a nice minimalism to it :)
>
> But what if there are 32 states in the NFA? then you have 5 fewer bits
> to go before there is a need to resort. So the best thing to do is
> have
> two fields:
> An ordinal and a 64-bit integer as the list of bits. That takes
> not much more storage and decouples the number of NFA states from the
> bit array.
Perhaps, but I'm skeptical. Keeping the single ordinal tag as an
ordinary tag value (i.e. the same type, presumably uint32_t, as any
other tag) simplifies a lot of the code, particularly storage
management; the only thing that is special about ordinal tags is the
update function (and the subsequent resort). The benefit of a
specialized representation is at most a constant factor reduction,
between 0.3 and 0.5 in typical regexes, in the number of resorts.
Since the common case (should not) involve ordinal tags at all,
I'd be inclined to go with code simplicity. The key, as you point
out, is minimizing the need for ordinal tag updates (and, if possible,
for ordinal tags.)
>
> There is two cases that creates a need to renumber the ordinals:
> (1) an NFA state splits and some children loop back while others do
> not
> (2) some tied NFA states loop back while others do not
>
> So there is no need to do any doubling unless at least one state loops
> back.
Sippu & Soisalon-Soininen discuss this issue very briefly in Parsing
Theory, Vol. I, p.100, where they talk about lexical scanners. I'm not
quite following their definitions, but this is my interpretation:
A "right-biased" match of a regular expression R* consists of
repeatedly matching the longest possible R. In general, the
language recognized by a right-biased matcher, Lm(R*), is a subset of
the language L(R), but in many cases it is the same language.
I'll call R "right-matchable" if Lm(R*) = L(R*). (S&S-S call it
"well-formed" but there are a lot of uses of that phrase.)
Unfortunately, S&S-S leave their theorem 3.50 as an exercise
for the reader: that theorem asserts that the property which
I call "right-matchable" is decidable. (It's not quite the same,
but if they'd provided a construction, it would have been
adaptable to this case.) So we have to do the exercise, which
is good for the soul and bad for effective use of spare time. :)
My examples here, by the way, are the regular expressions:
ab|cd|abcd ab|cde|abcd ab|cd|abcde
The first one is ambiguous but right-matchable; the second one
is unambiguous but not right-matchable; the third one is
unambiguous and right-matchable.
Here's my proposed solution, whose formal proof is left as an
exercise for the reader :) (and I hope it exists):
Construct the Glushkov automaton G for R (not R*), and then
construct the DFA G' by applying the subset construction to G.
Now, if there is some non-accepting state q in G, and some
accepting state r in G' for which q is an element of r, then
R is potentially not right-matchable; that is, it's a necessary
but not sufficient condition. (Unproved lemma 1)
Referring to the above examples, this test identifies the
states b3 in all three cases as potentially not right-matchable,
as well as the state d2 in the third example.
Now, take the derivative of G from state q (i.e. the same
machine but with q as the starting state, eliminating
unreachable states and corresponding transitions). Call that dG(q).
Construct the union of dG(q) with G'; if that union contains a
state which includes an accepting state in dG(q) and a
non-accepting state in G', then R was not right-matchable, and
any transition from q to First(R) requires renumbering ordinals.
In practice, we don't really need to construct the derivative
and the union. If we actually construct the DFA for R* instead of
R, then we can do both tests based on it. The accepting states
in R* are the same as the accepting states in R (and those
states also have backwards transitions). The second test consists
of finding the accepting states in G which are reachable from q in G,
and then searching the states in G*' for a non-accepting state
which includes one of those states.
If I get really enthusiastic, I might even code that to see if
it works :)
R.
More information about the TRE-general
mailing list