[TRE-general] Feature-wish: named subexpressions and/or return "token-id" for top-level-union

skaller skaller at users.sourceforge.net
Fri Apr 27 01:52:17 EEST 2007


On Thu, 2007-04-26 at 20:28 +0200, Setzer, Sebastian (ext) wrote:
> Hi,
> I'd like to use tre similar to lex (or flex...): Given a set of regular
> expressions and a text, tell me which of the expressions matches first.
> 
> I can do this like this:
> - build a big expression out of the given set of expressions: put
> brackets around each of them and join them with "|".
> - use regex.re_nsub to determine how many subexpresions there are in
> every one of them in order to get the index at which they start in the
> big union-expression.
> - look at every of these indexes, test if the subexpression matched.
> 
> Test-code which does this is appended (I hope the mailing list doesn't
> filter this out).
> Please note that it contains a memory leak and probably other bugs...
> 
> If the tre-API had a function which does this, you could probably do it
> better because
> - you need only tags for the top-level-subexpression (the ones for which
> I collected the indexes) if the user isn't interested in the
> sub-subexpessions (...REG_NOSUB)
> - if you precompute a DFA, you can return the "token-id" in
> O(text-length) without needing to iterate through the expression-set
> 
> A related feature are named subexpressions (for example with the syntax
> of python: "(?P<name>...)").
> If you have them, you don't need to worry about subexpression-indexes.
> 
> Sebastian Setzer


More generally, I would be happy to see a proper combinatorial
API which treats regexps abstractly.

You would then use constructors to create new regexps out of old
ones, for example:

	re = alt ( star (x), alt ( y, re_of_string ("hello")))

Perl can almost do this due to string interpolation .. regexps
in other languages without interpolation suck.

The basic combinators are  of course:

	alt, seq, star, epsilon

although one might like

	plus, opt

and we need ground constructors:

	re_of_string s   // as sequence of chars
	re_one_of s      // a set of chars
	re_of_char i     // one char

It may be useful to also provide a 'charset' type,
so you can do unions, intersections, and negations,
similar to the regexp syntax [a-d,kv].

Combinatorial regular processing allows regular definitions
as in a lex like tool, which is *mandatory* for complex
regexps. By that I mean: it is utterly impossible for anyone
to read, write, or comprehend string based regular expressions.

The classic example is one of the earlier Shootout tests,
which was actually quite a simple match, and EVERYONE got
it wrong except me -- because I used Felix which has
built-in support for lex style regular definitions it
was immediately obvious I got it right and the Perl like
regexps everyone else used were actually wrong (Felix
failed the test because the expected result file was
actually incorrect).

As Sebastian notes -- the hard bit here is how to handle
captures. Creating them is easy:

	capture (x)

and you may want

	uncapture(x)

to remove any nested captures, but in general it is
hard to manage them because the linear encoding
scheme used is simply wrong. A scheme like:

	x.y.z

where x is the top level nesting, y the second level, etc
seems appealing but is also probably wrong.

Named captures are wrong too -- unless you provide
a namespace control mechanism so you can literally
write something like

	x.x

as the name of the capture x nested inside another
capture also named x.

None of these problems exist for regexps -- they only
arise with regular definitions.

The "proper" solution to this is to recognize regexps
with captures are rubbish mathematically -- the complexity
of Ville's mathematics -- required to even specify *some*
kind of semantics -- really highlights this.

Alain Frisch tree automata are a bit better because they
properly represent the destructors (captures) as actual
parse trees (but that is still an extension of the Laurikari
algorithm)

In the long run the real solution is to use a proper
parser: existing tools seem mainly to lack sufficient
sugar to be convenient.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


More information about the TRE-general mailing list