[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular expression libraries is the ability to author a regular expression using C++ expressions. xpressive achieves this through operator overloading, using a technique called ['expression templates] to embed a mini-language dedicated to pattern matching within C++. These "static regexes" have many advantages over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to a syntax error.
* can naturally refer to other C++ data and code, including other regexes, making it possible to build grammars out of regular expressions and bind user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes require no state tables, virtual functions, byte-code or calls through function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by the rules for legal C++ expressions. Unfortunately, that means that "classic" regular expression syntax cannot always be mapped cleanly into C++. Rather, we map the regex ['constructs], picking new syntax that is legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_. For instance, the following defines a regex that can be used to find patterns in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and `'.'` respectively. Don't be confused by the fact that [^$] and [^.]
are meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one operand is not a literal. For instance, the following are ['not] valid regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the operand of the unary `+` operator is also a literal, so these statements will call the native C++ binary right-shift and unary plus operators, respectively. That's not what we want. To get operator overloading to kick in, at least one operand must be a user-defined type. We can use xpressive's `as_xpr()` helper function to "taint" an expression with regex-ness, forcing operator overloading to find the correct operators. The two regexes above should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must be separated by the sequencing operator, `>>`. You can read this operator as "followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a side-effect they also create back\-references like [^$1] and [^$2]. In C++, parentheses only group \-\- there is no way to give them side\-effects. To get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning to one creates a back-reference. You can then use the back-reference later in your expression, like using [^\1] and [^\2] in Perl. For example, consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"
In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In xpressive, if you just want grouping without capturing a back-reference, you can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)] non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"` will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of internationalization: how should case-insensitive character comparisons be evaluated? Also, many character classes are locale-specific. Which characters are matched by `digit` and which are matched by `alpha`? The answer depends on the `std::locale` object the regular expression object is using. By default, all regular expression objects use the global locale. You can override the default by using the `imbue()` pattern modifier, as follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to `my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits Localization and Regex Traits] for more information about how to customize the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in static xpressive.

[table Perl syntax vs.
Static xpressive syntax
    [[Perl] [Static xpressive] [Meaning]]
    [[[^.]] [`_`] [any character (assuming Perl's /s modifier).]]
    [[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]] [`(s1= a)`] [group and capture a back-reference.]]
    [[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
    [[[^\1]] [`s1`] [a previously captured back-reference.]]
    [[[^a*]] [`*a`] [zero or more times, greedy.]]
    [[[^a+]] [`+a`] [one or more times, greedy.]]
    [[[^a?]] [`!a`] [zero or one time, greedy.]]
    [[[^a{n,m}]] [`repeat<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
    [[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
    [[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
    [[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
    [[[^a{n,m}?]] [`-repeat<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
    [[[^^]] [`bos`] [beginning of sequence assertion.]]
    [[[^$]] [`eos`] [end of sequence assertion.]]
    [[[^\b]] [`_b`] [word boundary assertion.]]
    [[[^\B]] [`~_b`] [not word boundary assertion.]]
    [[[^\\n]] [`_n`] [literal newline.]]
    [[[^.]] [`~_n`] [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]] [`_ln`] [logical newline.]]
    [[[^\[^\\r\\n\]]] [`~_ln`] [any single character not a logical newline.]]
    [[[^\w]] [`_w`] [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]] [`~_w`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]] [`_d`] [a digit character.]]
    [[[^\D]] [`~_d`] [not a digit character.]]
    [[[^\s]] [`_s`] [a space character.]]
    [[[^\S]] [`~_s`] [not a space character.]]
    [[[^\[:alnum:\]]] [`alnum`] [an alpha-numeric character.]]
    [[[^\[:alpha:\]]] [`alpha`] [an alphabetic character.]]
    [[[^\[:blank:\]]] [`blank`] [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]] [`cntrl`] [a control character.]]
    [[[^\[:digit:\]]] [`digit`] [a digit character.]]
    [[[^\[:graph:\]]] [`graph`] [a graphable character.]]
    [[[^\[:lower:\]]] [`lower`] [a
lower-case character.]]
    [[[^\[:print:\]]] [`print`] [a printing character.]]
    [[[^\[:punct:\]]] [`punct`] [a punctuation character.]]
    [[[^\[:space:\]]] [`space`] [a white-space character.]]
    [[[^\[:upper:\]]] [`upper`] [an upper-case character.]]
    [[[^\[:xdigit:\]]] [`xdigit`] [a hexadecimal digit character.]]
    [[[^\[0-9\]]] [`range('0','9')`] [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]] [`as_xpr('a') | 'b' | 'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]] [`(set= 'a','b','c')`] [['same as above]]]
    [[[^\[0-9abc\]]] [`set[ range('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]] [`set[ range('0','9') | (set= 'a','b','c') ]`] [['same as above]]]
    [[[^\[^abc\]]] [`~(set= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]] [`icase(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]] [`keep(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]] [`before(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]] [`~before(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]] [`after(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?