[TRE-general] Repeat histories
Chris Kuklewicz
tre-general at list.mightyreason.com
Wed Feb 21 17:15:02 EET 2007
Rici Lake wrote:
> Chris Kuklewicz wrote:
>
>> Further optimizations I have yet to implement:
>>
>> () When there is only one NFA state that has an orbit record for a
>> *-operator
>> that starts at position x0 then there is not yet any ambiguity, and
>> even if this
>> splits into different NFA states in the future their shared history
>> will not
>> disambiguate them. Thus the orbit record can be compressed to just
>> the most
>> recent iteration's offset as long as there is only one NFA state for
>> this
>> *-operator with the original starting position x0.
>
>
> That's very clever. Congrats.
>
> Here's a possible improvement:
>
> At any moment in the scan, there are at most M active histories, where
> M is the number of states in the machine. Each repetition operator can
> appear at most in each history, so there are at most M*R active
> repetition operators.
>
> Suppose at any moment, these are correctly sorted by ordinal. Not all
> the histories are comparable, of course (they might start at different
> points), but we're only interested in ensuring that the comparable ones
> are in order, so "correctly sorted" only needs to be true for
> comparable histories.
>
> Now, consider a single history, as we compute the state transitions for
> an incoming character. There are four possible transitions:
>
> 1) Loop back to another repeat of a repetition
> 2) Continue in the same repetition
> 3) Finish a repetition
> 4) Start a new repetition
>
> The interesting cases are (1) and (2). Suppose both are possible; then
> the history corresponding to (1) is certainly worse than the repetition
> corresponding to (2). We only need to ensure that the histories are
> correctly ordered with respect to this pair of transitions.
>
> So we compute the new ordinal as twice the old ordinal in case (1) and
> as twice the old ordinal plus one in case (2). That will correctly
> maintain the ordering relationship, but it doubles the range of the
> ordinals so we cannot do it indefinitely. Consequently, when some
> transformed ordinal gets close to overflow, say 2^30, we need to
> renumber all of the ordinals for that repetition back to the minimum
> range.
>
> Note that in case (4) we can start the new history with an arbitrary
> ordinal, say 1, since all repetitions created at the same position are
> equal.
>
> The cost will be the cost of the periodic sorts, which in the worst
> case is O(R*M*log M). But we can reasonably hope that these will be
> relatively rare, and that there will be considerably fewer than M
> active histories for a particular repetition at a particular moment. In
> particular, in well-behaved regular expressions, there will only be one
> active history, and the sort becomes trivial.
>
What this proposal does is compress the Ordinal and list of positions to a list
of bits.
There is two cases that creates a need to renumber the ordinals:
(1) an NFA state splits and some children loop back while others do not
(2) some tied NFA states loop back while others do not
So there is no need to do any doubling unless at least one state loops back.
One could detect any loop back for a given (*-operator,start position) and only
in this case (double) or (double+1) the ordinal.
But what if there are 32 states in the NFA? then you have 5 fewer bits to go
before there is a need to resort. So the best thing to do is have two fields:
An ordinal and a 64-bit integer as the list of bits. That takes not much more
storage and decouples the number of NFA states from the bit array.
But I think the design of this list-of-positions or list-of-bits storage needs
to be optimized alongside the optimization of the compression operation.
If we do enough bookkeeping to make the set of NFA states with the same
(*-operator,start pos) easy to traverse then it may be easy to detect (1) and
(2) directly.
But I must get back to work...
--
Chris
More information about the TRE-general
mailing list