Regex Syntax

This document describes the POSIX 1003.2 extended RE (ERE) syntax and the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style notation is used to describe the grammar.

ERE Syntax

Alternation operator

extended-regexp ::= branch
                |   extended-regexp "|" branch

An extended regexp (ERE) is one or more branches, separated by |. An ERE matches anything that matches one or more of the branches.

Catenation of REs

branch ::= piece
       |   branch piece

A branch is one or more pieces concatenated. It matches a match for the first piece, followed by a match for the second piece, and so on.

piece ::= atom
      |   atom repeat-operator
      |   atom approx-settings

A piece is an atom possibly followed by a repeat operator or an expression controlling approximate matching parameters for the atom.

atom ::= "(" extended-regexp ")"
     |   bracket-expression
     |   "."
     |   assertion
     |   literal
     |   back-reference
     |   "(?#" comment-text ")"
     |   "(?" options ")" extended-regexp
     |   "(?" options ":" extended-regexp ")"

An atom is either an ERE enclosed in parenthesis, a bracket expression, a . (period), an assertion, or a literal.

The dot (.) matches any single character. If the REG_NEWLINE compilation flag (see API manual) is specified, the newline character is not matched.

Comment-text can contain any characters except for a closing parenthesis ). The text in the comment is completely ignored by the regex parser and it used solely for readability purposes.

Repeat operators

repeat-operator ::= "*"
                |   "+"
                |   "?"
                |   bound
                |   "*?"
                |   "+?"
                |   "??"
                |   bound ?

An atom followed by * matches a sequence of 0 or more matches of the atom. + is similar to *, matching a sequence of 1 or more matches of the atom. An atom followed by ? matches a sequence of 0 or 1 matches of the atom.

A bound is one of the following, where m and m are unsigned decimal integers between 0 and RE_DUP_MAX:

{m,n}
{m,}
{m}

An atom followed by [1] matches a sequence of m through n (inclusive) matches of the atom. An atom followed by [2] matches a sequence of m or more matches of the atom. An atom followed by [3] matches a sequence of exactly m matches of the atom.

Adding a ? to a repeat operator makes the subexpression minimal, or non-greedy. Normally a repeated expression is greedy, that is, it matches as many characters as possible. A non-greedy subexpression matches as few characters as possible. Note that this does not (always) mean the same thing as matching as many or few repetitions as possible. Also note that minimal repetitions are not currently supported for approximate matching.

Approximate matching settings

approx-settings ::= "{" count-limits* ","? cost-equation? "}"

count-limits ::= "+" number?
             |   "-" number?
             |   "#" number?
             |   "~" number?

cost-equation ::= ( cost-term "+"? " "? )+ "<" number

cost-term ::= number "i"
          |   number "d"
          |   number "s"

The approximate matching settings for a subpattern can be changed by appending approx-settings to the subpattern. Limits for the number of errors can be set and an expression for specifying and limiting the costs can be given.

The count-limits can be used to set limits for the number of insertions (+), deletions (-), substitutions (#), and total number of errors (~). If the number part is omitted, the specified error count will be unlimited.

The cost-equation can be thought of as a mathematical equation, where i, d, and s stand for the number of insertions, deletions, and substitutions, respectively. The equation can have a multiplier for each of i, d, and s. The multiplier is the cost of the error, and the number after < is the maximum allowed cost of a match. Spaces and pluses can be inserted to make the equation readable. In fact, when specifying only a cost equation, adding a space after the opening { is required.

Examples:

{~}: Sets the maximum number of errors to unlimited.
{~3}: Sets the maximum number of errors to three.
{+2~5}: Sets the maximum number of errors to five, and the maximum number of insertions to two.
{<3}: Sets the maximum cost to three.
{ 2i + 1d + 2s < 5 }: Sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five.

Bracket expressions

bracket-expression ::= "[" item+ "]"
                   |   "[^" item+ "]"

A bracket expression specifies a set of characters by enclosing a nonempty list of items in brackets. Normally anything matching any item in the list is matched. If the list begins with ^ the meaning is negated; any character matching no item in the list is matched.

An item is any of the following:

A single character, matching that character.
Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence. For example, [0-9] in ASCII matches any decimal digit.
A collating element enclosed in [. and .], matching the collating element. This can be used to include a literal - or a multi-character collating element in the list.
A collating element enclosed in [= and =] (an equivalence class), matching all collating elements with the same primary collation weight as that element, including the element itself.
The name of a character class enclosed in [: and :], matching any character belonging to the class. The set of valid names depends on the LC_CTYPE category of the current locale, but the following names are valid in all locales:
- alnum – alphanumeric characters
- alpha – alphabetic characters
- blank – blank characters
- cntrl – control characters
- digit – decimal digits (0 through 9)
- graph – all printable characters except space
- lower – lower-case letters
- print – printable characters including space
- punct – printable characters not space or alphanumeric
- space – white-space characters
- upper – upper case letters
- xdigit – hexadecimal digits

To include a literal - in the list, make it either the first or last item, the second endpoint of a range, or enclose it in [. and .] to make it a collating element. To include a literal ] in the list, make it either the first item, the second endpoint of a range, or enclose it in [. and .]. To use a literal - as the first endpoint of a range, enclose it in [. and .].

Assertions

assertion ::= "^"
          |   "$"
          |   "\" assertion-character

The expressions ^ and $ are called “left anchor” and “right anchor”, respectively. The left anchor matches the empty string at the beginning of the string. The right anchor matches the empty string at the end of the string. The behaviour of both anchors can be varied by specifying certain execution and compilation flags.

An assertion-character can be any of the following:

< – Beginning of word
> – End of word
b – Word boundary
B – Non-word boundary
d – Digit character (equivalent to [[:digit:]])
D – Non-digit character (equivalent to [^[:digit:]])
s – Space character (equivalent to [[:space:]])
S – Non-space character (equivalent to [^[:space:]])
w – Word character (equivalent to [[:alnum:]_])
W – Non-word character (equivalent to [^[:alnum:]_])

Literals

literal ::= ordinary-character
        |   "\x" ["1"-"9" "a"-"f" "A"-"F"]{0,2}
        |   "\x{" ["1"-"9" "a"-"f" "A"-"F"]* "}"
        |  "\" character

A literal is either an ordinary character (a character that has no other significance in the context), an 8 bit hexadecimal encoded character (e.g. \x1B), a wide hexadecimal encoded character (e.g. \x{263a}), or an escaped character. An escaped character is a \ followed by any character, and matches that character. Escaping can be used to match characters which have a special meaning in regexp syntax. A \ cannot be the last character of an ERE. Escaping also allows you to include a few non-printable characters in the regular expression. These special escape sequences include:

\a – Bell character (ASCII code 7)
\e – Escape character (ASCII code 27)
\f – Form-feed character (ASCII code 12)
\n – New-line/line-feed character (ASCII code 10)
\r – Carriage return character (ASCII code 13)
\t – Horizontal tab character (ASCII code 9)

An ordinary character is just a single character with no other significance, and matches that character. A { followed by something else than a digit is considered an ordinary character.

Back references

back-reference ::= "\" ["1"-"9"]

A back reference is a backslash followed by a single non-zero decimal digit d. It matches the same sequence of characters matched by the dth parenthesized subexpression.

Back references are not defined for POSIX EREs (for BREs they are), but many matchers, including TRE, implement back references for both EREs and BREs.

Options

options ::= ["i" "n" "r" "U"]* ("-" ["i" "n" "r" "U"]*)?

Options allow compile time options to be turned on/off for particular parts of the regular expression. The options equate to several compile time options specified to the regcomp API function. If the option is specified in the first section, it is turned on. If it is specified in the second section (after the -), it is turned off.

i – Case insensitive.
n – Forces special handling of the new line character. See the REG_NEWLINE flag in the API Manual.
r – Causes the regex to be matched in a right associative manner rather than the normal left associative manner.
U – Forces repetition operators to be non-greedy unless a ? is appended.

BRE Syntax

The obsolete basic regexp (BRE) syntax differs from the ERE syntax as follows:

| is an ordinary character, and there is no equivalent for its functionality. +, and ? are ordinary characters.
The delimiters for bounds are \{ and \}, with { and } by themselves ordinary characters.
The parentheses for nested subexpressions are $ and $, with ( and ) by themselves ordinary characters.
^ is an ordinary character except at the beginning of the RE or the beginning of a parenthesized subexpression. Similarly, $ is an ordinary character except at the end of the RE or the end of a parenthesized subexpression.