[TRE-general] Repeat histories

Rici Lake rici at ricilake.net
Tue Feb 20 22:12:58 EET 2007


Chris Kuklewicz wrote:

> Further optimizations I have yet to implement:
>
> () When there is only one NFA state that has an orbit record for a 
> *-operator
> that starts at position x0 then there is not yet any ambiguity, and 
> even if this
> splits into different NFA states in the future their shared history 
> will not
> disambiguate them.  Thus the orbit record can be compressed to just 
> the most
> recent iteration's offset as long as there is only one NFA state for 
> this
> *-operator with the original starting position x0.


That's very clever. Congrats.

Here's a possible improvement:

At any moment in the scan, there are at most M active histories, where 
M is the number of states in the machine. Each repetition operator can 
appear at most in each history, so there are at most M*R active 
repetition operators.

Suppose at any moment, these are correctly sorted by ordinal. Not all 
the histories are comparable, of course (they might start at different 
points), but we're only interested in ensuring that the comparable ones 
are in order, so "correctly sorted" only needs to be true for 
comparable histories.

Now, consider a single history, as we compute the state transitions for 
an incoming character. There are four possible transitions:

1) Loop back to another repeat of a repetition
2) Continue in the same repetition
3) Finish a repetition
4) Start a new repetition

The interesting cases are (1) and (2). Suppose both are possible; then 
the history corresponding to (1) is certainly worse than the repetition 
corresponding to (2). We only need to ensure that the histories are 
correctly ordered with respect to this pair of transitions.

So we compute the new ordinal as twice the old ordinal in case (1) and 
as twice the old ordinal plus one in case (2). That will correctly 
maintain the ordering relationship, but it doubles the range of the 
ordinals so we cannot do it indefinitely. Consequently, when some 
transformed ordinal gets close to overflow, say 2^30, we need to 
renumber all of the ordinals for that repetition back to the minimum 
range.

Note that in case (4) we can start the new history with an arbitrary 
ordinal, say 1, since all repetitions created at the same position are 
equal.

The cost will be the cost of the periodic sorts, which in the worst 
case is O(R*M*log M). But we can reasonably hope that these will be 
relatively rare, and that there will be considerably fewer than M 
active histories for a particular repetition at a particular moment. In 
particular, in well-behaved regular expressions, there will only be one 
active history, and the sort becomes trivial.



More information about the TRE-general mailing list