`Text.Regex.Lazy`

Version 0.70 (2006-08-10)

By Chris Kuklewicz (TextRegexLazy (at) personal (dot) mightyreason (dot) com)

Changes from 0.66 to 0.70

regex-tre added for libtre backend (Text.Regex.TRE), see http://laurikari.net/tre/
regex-devel added for tests and benchmarks
Text.Regex.*.Wrap APIs improved: the exported wrap* functions never call fail or error under normal circumstances, and use Either types to report errors. Allocation failures are reported with fail.
Text.Regex.*.(ByteString|String) all should export compile/execute/regexec functions which report errors using Either.

Changes from 0.55 to 0.66

I broke this into many packages: regex-base for the interface; regex-pcre, regex-posix, regex-parsec and regex-dfa for the four backends; and regex-compat to replace Text.Regex(.New).
The top level Makefile now can drive setup and installation of all the packages at once.

Changes from 0.44 to 0.55

JRegex has been assimilated: PCRE and PosixRE are here. The JRegex-style API rocks, see below and Context.hs and Example.hs
Haddock seems to run via ./setup haddock, but the documentation is very thin
./setup test runs TestTestRegexLazy binary if uncommented in cabal file
default is now to compile with -Wall -Werror -O2
You may need to set the cabal file's "Extra-Lib-Dirs" to point to pcre.
You may or may not need a "-lpcre" option to ghc when building projects that depend on Text.Regex.Lazy now.

Changes from 0.33 to 0.44

Cabal
Compile with -Wall -Werror
Change DFAEngineFPS from Data.FastPackedString to Data.ByteString

See the LICENSE file for details on copyright. See README for building instructions.
The new API is very close to JRegex and supports 4 backends:

Posix, the standard c regex library
PCRE, the Perl Compatible Regular Expressions c library
Full, the lazy Parsec based library (see old api below)
DFA, the fast lazy matching library (see old api below)

For all backends, two types can serve as the source of a regular expression or as the text to match against: String and ByteString. The ByteString library will be in the next GHC and is available from http://www.cse.unsw.edu.au/~dons/fps.html.

For simplest use of the new API: import Text.Regex.Lazy and one of

import Text.Regex.PCRE((=~),(=~~))
import Text.Regex.Parsec((=~),(=~~))
import Text.Regex.DFA((=~),(=~~))
import Text.Regex.PosixRE((=~),(=~~))
import Text.Regex.TRE((=~),(=~~))

The result types you can demand of (=~) and (=~~) are all defined as instances in Text.Regex.Impl.Context, and Example.hs shows them in use.
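As a minimal sketch of demanding different result types from the same operators, the block below assumes the Posix backend as currently published in the regex-posix package, where the top-level module is named Text.Regex.Posix (the import list above calls it Text.Regex.PosixRE). The particular result types shown (Bool, Int, a three-way split, and Maybe String via (=~~)) are the ones that package provides today; check Example.hs for the set available in your version.

```haskell
-- Assumption: the regex-posix package, imported as Text.Regex.Posix.
module Main where

import Text.Regex.Posix ((=~), (=~~))

-- Bool: does the pattern match anywhere in the text?
hasDigits :: Bool
hasDigits = "room 42, floor 7" =~ "[0-9]+"

-- Int: how many non-overlapping matches are there?
digitRuns :: Int
digitRuns = "room 42, floor 7" =~ "[0-9]+"

-- (before, matched, after) around the first match.
parts :: (String, String, String)
parts = "room 42, floor 7" =~ "[0-9]+"

-- (=~~) reports "no match" through the monad; with Maybe that is Nothing.
firstNumber :: Maybe String
firstNumber = "room 42, floor 7" =~~ "[0-9]+"

noNumber :: Maybe String
noNumber = "no digits here" =~~ "[0-9]+"

main :: IO ()
main = do
  print hasDigits
  print digitRuns
  print parts
  print firstNumber
  print noNumber
```

The type annotation on each binding is what selects the behaviour; the expression on the right-hand side is identical in the first four cases.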

You can redefine (=~) and (=~~) to use different options by using makeRegexOpts:

(=~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex source1 target) => source1 -> source -> target
(=~) x r = let q :: Regex
               q = makeRegexOpts (some compoption) (some execoption) r
           in match q x

(=~~) ::(RegexMaker Regex CompOption ExecOption source,RegexContext Regex source1 target,Monad m) => source1 -> source -> m target
(=~~) x r = let q :: Regex
                q = makeRegexOpts (some compoption) (some execoption) r
            in matchM q x
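To make the "(some compoption)" and "(some execoption)" placeholders concrete, here is a hedged sketch of a case-insensitive match operator. It assumes the regex-posix package under its current module name Text.Regex.Posix, and that CompOption values combine with (.|.) from Data.Bits, as they do in that package. The operator name (=~*) is made up for this example, and its type is fixed to String and Bool so that no language extensions are needed.

```haskell
-- Assumption: regex-posix, imported as Text.Regex.Posix.
module Main where

import Data.Bits ((.|.))
import Text.Regex.Posix
  ( Regex, compIgnoreCase, defaultCompOpt, defaultExecOpt
  , makeRegexOpts, match )

-- Hypothetical operator: like (=~), but ignoring case.
(=~*) :: String -> String -> Bool
x =~* r =
  let q :: Regex
      -- Start from the backend's default compile options and add
      -- the ignore-case flag; execution options stay at their default.
      q = makeRegexOpts (defaultCompOpt .|. compIgnoreCase) defaultExecOpt r
  in match q x

main :: IO ()
main = do
  print ("Hello World" =~* "hello")
  print ("Hello World" =~* "goodbye")
```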

There is a medium level API with functions compile/execute/regexec in all the Text.Regex.*.(String|ByteString) modules. These allow for errors to be reported as Either types when compiling or running.
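A sketch of the medium level API with the Posix backend follows. It assumes the module names and signatures of the current regex-posix package (Text.Regex.Posix.String, with errors reported as a pair of return code and message); the helper name firstMatch is invented for the example.

```haskell
-- Assumption: regex-posix, with compile/regexec in Text.Regex.Posix.String.
module Main where

import Text.Regex.Posix (compExtended, execBlank)
import Text.Regex.Posix.String (compile, regexec)

-- Hypothetical helper: compile a pattern and return the first matched
-- substring. Left carries an error message; Right Nothing means the
-- pattern compiled and ran but did not match.
firstMatch :: String -> String -> IO (Either String (Maybe String))
firstMatch pat input = do
  compiled <- compile compExtended execBlank pat
  case compiled of
    Left (_, msg) -> return (Left ("compile failed: " ++ msg))
    Right regex -> do
      result <- regexec regex input
      return $ case result of
        Left (_, msg)               -> Left ("exec failed: " ++ msg)
        Right Nothing               -> Right Nothing
        Right (Just (_, hit, _, _)) -> Right (Just hit)

main :: IO ()
main = do
  firstMatch "[0-9]+" "room 42" >>= print
  firstMatch "[0-9]+" "no digits" >>= print
  -- An unbalanced parenthesis is reported as a Left, not an exception.
  firstMatch "a(b" "ab" >>= print
```

Compared with (=~), a malformed pattern surfaces here as an ordinary value that the caller can inspect, rather than as a runtime error.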

The low level APIs are in the Text.Regex.*.Wrap modules. For the c-library backends these expose most of the c-api through wrap* functions that make the types more Haskell-like: CString and CStringLen, plus newtypes to specify compile and execute options. The raw foreign calls themselves are not exported.

Also, Text.Regex.PCRE.Wrap lets you query whether it was compiled with UTF8 support: configUTF8 :: Bool. But I do not provide a way to marshal to or from UTF8. (If you have a UTF8 ByteString then you would probably be able to make it work, assuming the indices PCRE uses are in bytes; otherwise look at the wrap* functions, which are a thin layer over the pcreapi.)

The old Text.Regex API can be replaced. If you need to be drop-in compatible with Text.Regex then you can import Text.Regex.New and report any infidelities as bugs. Some advantages of Text.Regex.Parsec over Text.Regex:

It does not marshal to and from c-code arrays, so it is much faster on large input strings.
It consumes the input String in a mostly lazy manner. This makes streaming from input to output possible.
It performs sanity checks so that subRegex and splitRegex don't loop or go crazy if the pattern matches an empty string -- it will just return the input.
If the String regex does not parse then you get a nicer error message.

Internally it uses Parsec to turn the string regex into a Pattern data type, simplify the Pattern, then transform the Pattern into a Parsec parser that accepts matching strings and stores the sub-strings of parenthesized groups.
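The shape of that pipeline can be illustrated with a toy model. Everything in the block below is hypothetical: the Pattern constructors, simplify, and toMatcher are invented for illustration and are not the library's real types or functions. The string-parsing stage is omitted (the pattern is built by hand), and a plain list-of-successes matcher stands in for the generated Parsec parser. Only the base library is used.

```haskell
-- Toy model only; not the library's actual Pattern type or matcher.
module Main where

data Pattern
  = PEmpty            -- matches the empty string
  | PChar Char        -- matches one specific character
  | PDot              -- matches any single character
  | PSeq [Pattern]    -- patterns in sequence
  | POr [Pattern]     -- alternatives
  | PStar Pattern     -- zero or more repetitions
  deriving (Eq, Show)

-- Simplify: flatten nested sequences, drop empties, collapse singletons.
simplify :: Pattern -> Pattern
simplify (PSeq ps) =
  case concatMap flatten (map simplify ps) of
    []  -> PEmpty
    [p] -> p
    qs  -> PSeq qs
  where
    flatten PEmpty    = []
    flatten (PSeq qs) = qs
    flatten p         = [p]
simplify (POr ps) =
  case map simplify ps of
    [p] -> p
    qs  -> POr qs
simplify (PStar p) =
  case simplify p of
    PEmpty  -> PEmpty
    PStar q -> PStar q
    q       -> PStar q
simplify p = p

-- Turn a Pattern into a matcher. The result lists the unconsumed input
-- for every way the pattern can match a prefix of the given string.
toMatcher :: Pattern -> String -> [String]
toMatcher PEmpty    s = [s]
toMatcher (PChar c) s = case s of
  (x:xs) | x == c -> [xs]
  _               -> []
toMatcher PDot      s = case s of
  (_:xs) -> [xs]
  []     -> []
toMatcher (PSeq ps) s = foldl (\rests q -> concatMap (toMatcher q) rests) [s] ps
toMatcher (POr ps)  s = concatMap (\q -> toMatcher q s) ps
-- The length guard stops the repetition when the body consumed nothing,
-- so a body that matches the empty string cannot loop forever.
toMatcher (PStar p) s =
  s : [ r' | r <- toMatcher p s, length r < length s, r' <- toMatcher (PStar p) r ]

-- True when the whole input is matched.
matchesAll :: Pattern -> String -> Bool
matchesAll p s = any null (toMatcher (simplify p) s)

-- Hand-built tree for a(b|c)*d, with a redundant nested sequence.
example :: Pattern
example = PSeq [PChar 'a', PSeq [PStar (POr [PChar 'b', PChar 'c'])], PChar 'd']

main :: IO ()
main = do
  print (simplify example)
  print (map (matchesAll example) ["abcbd", "ad", "abx"])
```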

All of this was motivated by the inability to use Text.Regex to complete the regex-dna benchmark on The Computer Language Shootout. The current entry there, by Don Stewart and Alson Kemp and Chris Kuklewicz, does not use this Parsec solution, but rather a custom DFA lexer from the CTK library.