[TRE-general] Repeat histories
Rici Lake
rici at ricilake.net
Tue Feb 20 22:12:58 EET 2007
Chris Kuklewicz wrote:
> Further optimizations I have yet to implement:
>
> () When there is only one NFA state that has an orbit record for a
> *-operator
> that starts at position x0 then there is not yet any ambiguity, and
> even if this
> splits into different NFA states in the future their shared history
> will not
> disambiguate them. Thus the orbit record can be compressed to just
> the most
> recent iteration's offset as long as there is only one NFA state for
> this
> *-operator with the original starting position x0.
That's very clever. Congrats.
Here's a possible improvement:
At any moment in the scan, there are at most M active histories, where
M is the number of states in the machine. Each repetition operator can
appear at most in each history, so there are at most M*R active
repetition operators.
Suppose at any moment, these are correctly sorted by ordinal. Not all
the histories are comparable, of course (they might start at different
points), but we're only interested in ensuring that the comparable ones
are in order, so "correctly sorted" only needs to be true for
comparable histories.
Now, consider a single history, as we compute the state transitions for
an incoming character. There are four possible transitions:
1) Loop back to another repeat of a repetition
2) Continue in the same repetition
3) Finish a repetition
4) Start a new repetition
The interesting cases are (1) and (2). Suppose both are possible; then
the history corresponding to (1) is certainly worse than the repetition
corresponding to (2). We only need to ensure that the histories are
correctly ordered with respect to this pair of transitions.
So we compute the new ordinal as twice the old ordinal in case (1) and
as twice the old ordinal plus one in case (2). That will correctly
maintain the ordering relationship, but it doubles the range of the
ordinals so we cannot do it indefinitely. Consequently, when some
transformed ordinal gets close to overflow, say 2^30, we need to
renumber all of the ordinals for that repetition back to the minimum
range.
Note that in case (4) we can start the new history with an arbitrary
ordinal, say 1, since all repetitions created at the same position are
equal.
The cost will be the cost of the periodic sorts, which in the worst
case is O(R*M*log M). But we can reasonably hope that these will be
relatively rare, and that there will be considerably fewer than M
active histories for a particular repetition at a particular moment. In
particular, in well-behaved regular expressions, there will only be one
active history, and the sort becomes trivial.
More information about the TRE-general
mailing list