[TRE-general] libtre capture nonsense

Chris Kuklewicz tre-general at list.mightyreason.com
Mon Feb 19 12:53:25 EET 2007


> I'm interested, how did you make it handle cases such as this
> correctly:
> 
> regex: (a|ab|aba|baab)*
> 
> The string "abaab" can be matched in different ways:
>   1.  match "aba", then "ab"
> or
>   2.  match "ab", then "a", then "ab"
> or
>   3.  match "a", then "baab"
> 
> Of these, the correct match according to POSIX is number 1.  This is
> because repetitions should be treated so that each iteration matches
> as many characters as possible and earlier repetitions take precedence
> over later repetitions.
> 
> When your tagged DFA has consumed the last character, how does it
> determine that the submatch to return is "ab", and not "baab"?
> 

I spent a bunch of time at a whiteboard and decided that I was not clever enough
to make a tagged DFA to handle the specification.  So I backed up and figure out
what I could do that would definitely work, and optimized from there.

For each NFA state, if you keep track of every offset when it
enters/re-enters/leaves a * operator then you can compare histories, much like
with your tags.  I call these loops around the *-operator "orbits".  Let tag 5
measures when the *-operator is encountered and 6 measures when it is finally
left behind. Tag 5's offset is minimized and tag 6's offset is maximized.  I
made a new tag 7 associated with this specific *-operator that is neither
maximized not minimized but "orbit" to indicate than this is the precedence for
comparing the orbit history: after the whole set of iterations and before the
details of the sub-patterns.

Since this is after tag 6, the entire collection of orbits is maximized before
comparing the orbit histories.  Each * operator is assigned a special tag slot
labeled "orbit", the orbit record is stored in an auxiliary data structure.
These orbit histories are simple to compare, since you want to make each orbit
as long as possible.

Keeping offsets for each iteration might mean keeping data that scales linearly
with the length of the string being matched.  This is not necessary -- it can be
compressed because the orbit data only disambiguates for submatch capture:

() If the submatch data has be turned off via options then this will also turn
off the orbit record keeping.

() When the *-operator contains no parenthesized submatches then there is no
need to keep a record of the orbits since it will not affect the output.

() When the *-operator's subpattern can only match a specific finite number of
characters per iteration, like "(abc|def)*", then there can be no ambiguity in
the history and there is no need to keep orbit data.

() When a *-operator is entered or iterated during matching, all the contained
*-operator's in its subpattern may have their data erased, since any data they
contain is for an old iteration which will not be visible in the submatches.

Further optimizations I have yet to implement:

() When there is only one NFA state that has an orbit record for a *-operator
that starts at position x0 then there is not yet any ambiguity, and even if this
splits into different NFA states in the future their shared history will not
disambiguate them.  Thus the orbit record can be compressed to just the most
recent iteration's offset as long as there is only one NFA state for this
*-operator with the original starting position x0.

() When there are multiple NFA states that have an orbit record for a *-operator
that all start at the same position x0 there is another optimization.  The
program can pre-compare the orbit histories for this *-operator and starting
position x0 and compress the orbit logs into an ordinal representing their
sorted order.  This could be done continuously or periodically.

I am currently optimizing the performance of the regular tag data manipulation,
after which I will try and optimize the orbit data.


More information about the TRE-general mailing list