TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter.
Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
TRE includes a version of the agrep (approximate grep) command line tool for approximate regexp matching in the style of grep. Unlike other agrep implementations (like the one by Sun Wu and Udi Manber from University of Arizona available here) TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion and substitution.
Strict standard conformance
POSIX defines the behaviour of regexp functions precisely. TRE attempts to conform to these specifications as strictly as possible. TRE always returns the correct matches for subpatterns, for example. Few other implementations do this correctly.
The standard TRE tries to conform to is the IEEE Std 1003.1-2001, or Open Group Base Specifications Issue 6, commonly referred to as “POSIX”. It can be found online here. The relevant parts are the base specifications on regular expressions (and the rationale) and the description of the
For an excellent survey on POSIX regexp matchers, see the testregex pages by Glenn Fowler of AT&T Labs Research.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any
regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
Predictable and modest memory consumption
regexec() call never allocates memory from the heap. TRE allocates all the memory it needs during a
regcomp() call, and some temporary working space from the stack frame for the duration of the
regexec() call. The amount of temporary space needed is constant during matching and does not depend on the searched string. For regexps of reasonable size TRE needs less than 50K of dynamically allocated memory during the
regcomp() call, less than 20K for the compiled pattern buffer, and less than two kilobytes of temporary working space from the stack frame during a
regexec() call. There is no time/memory tradeoff. TRE is also small in code size; statically linking with TRE increases the executable size less than 30K (gcc-3.2, x86, GNU/Linux).
Wide character and multibyte character set support
TRE supports multibyte character sets. This makes it possible to use regexps seamlessly with, for example, Japanese locales. TRE also provides a wide character API.
Binary pattern and data support
TRE provides APIs which allow binary zero characters both in regexps and searched strings. The standard API cannot be easily used to, for example, search for printable words from binary data (although it is possible with some hacking). Searching for patterns which contain binary zeroes embedded is not possible at all with the standard API.
Completely thread safe
TRE is completely thread safe. All the exported functions are re-entrant, and a single compiled regexp object can be used simultaneously in multiple contexts; e.g. in
main() and a signal handler, or in many threads of a multithreaded application.
TRE is portable across multiple platforms. Here’s a table of platforms and compilers that have been successfully used to compile and run TRE:
|AIX 4.3.2 – 5.3.0||GCC, IBM xlc compilers|
|Compaq Tru64 UNIX V5.1A/B||Compaq C V6.4-014 – V6.5-011|
|Cygwin 1.3 – 1.5||GCC|
|Digital UNIX V4.0||DEC C V5.9-005|
|FreeBSD 4 and above||GCC|
|GNU/Linux systems on x86, x86_64, ppc64, s390||GCC|
|HP-UX 10.20 – 11.31 hppa/ia64||GCC, HP C Compilers|
|IRIX 6.5||GCC, MIPSpro Compilers 22.214.171.124m|
|Max OS X||GCC (XCode)|
|iPhone, iPod Touch||GCC (XCode)|
|NetBSD 1.5 and above||GCC, egcs|
|OpenBSD 3.3 and above||GCC|
|Solaris 2.7-10 sparc/x86||GCC, SUNWspro cc|
|Windows 98 up to Windows 7||Visual C++ 6.0, Visual Studio 2005/2008|
TRE 0.7.5 should compile without changes on all of the above platforms. Tell me if you are using TRE on a platform that is not listed above, and I’ll add it to the list. Also let me know if TRE does not work on a listed platform.
Depending on the platform, you may need to install libutf8 to get wide character and multibyte character set support.
TRE is released under a license which is essentially the same as the “2 clause” BSD-style license used in NetBSD. See the file LICENSE, included in the source code package, for details.
WARNING: The mailing list and archives are down at the moment. Apologies for the inconvenience.
- tre-general (archive)
- This list is for any discussion on the TRE software, including reporting bugs, feature requests, requests for help, and other things.
- tre-announce (archive)
- Subscribe to this list to get announcements of new releases of TRE. Alternatively, you can subscribe to the freshmeat.net project and get similar announcements that way, if you prefer.