Regular Expressions

Most Regular Expression (Regex) implementations are based on, or bear similarities to, the PCRE library. The purpose of a Regular Expression is normally to replace or extract bits of data from a larger string. The way you invoke Regex varies between languages (e.g. the functions or constructs that let you use Regex in Perl, PHP, PowerShell, etc.), but the syntax of the Regex itself is generally consistent. The secret to creating good Regex is thorough testing, checking strings that shouldn't match as much as strings that should.
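
As a rough illustration of the extract, replace and test ideas above, here is a minimal sketch in Python (chosen only as an example language; its standard `re` module uses broadly Perl-style syntax). The log line, the patterns and the variable names are all made up for the example.

```python
import re

# Hypothetical input string, invented for this example.
log_line = "2024-01-15 ERROR disk /dev/sda1 at 91% capacity"

# Extract: pull the percentage figure out of the larger string.
match = re.search(r"(\d+)%", log_line)
percent = int(match.group(1)) if match else None

# Replace: mask the device name.
masked = re.sub(r"/dev/\w+", "/dev/XXX", log_line)

# Test both sides: strings that should match and strings that should not.
date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
should_match = ["2024-01-15", "1999-12-31"]
should_not_match = ["2024-1-15", "15-01-2024", "2024-01-15 ", "x2024-01-15"]

matched_ok = all(date_pattern.search(s) for s in should_match)
rejected_ok = not any(date_pattern.search(s) for s in should_not_match)

print(percent, masked, matched_ok, rejected_ok)
```

Keeping the "should not match" list next to the pattern makes it easy to add a new counter-example whenever a false match turns up.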

The following provides a run-through of the general syntax (it's not exhaustive and is a bit of a work in progress, but it contains most of the common stuff).

See the Regex Examples page for some real world examples.

Metacharacters

Metacharacters define how an item should match (or not)

Character Meaning Example
\ Escape, treat the next metacharacter as a literal do\. matches do. but not dog
^ Match the beginning of the line ^dog matches dog but not big dog
$ Match the end of the line dog$ matches dog but not dog run
. Match any single character do. matches dog or dot but not do
(...) Grouping, match entire set ...
[...] Bracketed Character class, any character contained in set ... [fld]og matches fog, log, or dog only
| Or, alternation (d|f)og matches dog or fog (without the grouping, d|fog matches d or fog)
?= Lookahead dog(?=,) matches the dog in "dog," but won't find a match in "dog"
?<= Look behind (?<=big )dog matches the dog in "big dog" but won't find a match in "dog"
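
The metacharacters above can be tried out with a short sketch like the following, written in Python as an assumed example language. The helper name `found` is hypothetical; it simply returns the text a pattern matched, or None when nothing is found.

```python
import re

def found(pattern, text):
    """Return the matched text, or None if the pattern is not found."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = {
    "start_hit": found(r"^dog", "dog run"),            # anchored at line start
    "start_miss": found(r"^dog", "big dog"),           # dog is not at the start
    "end_miss": found(r"dog$", "dog run"),             # dog is not at the end
    "dot": found(r"do.", "dot"),                       # . needs one character
    "dot_miss": found(r"do.", "do"),                   # nothing left for . to match
    "escaped_dot": found(r"do\.", "dog"),              # \. is a literal full stop
    "char_class": re.findall(r"[fld]og", "fog log dog hog"),
    "grouped_alt": found(r"(d|f)og", "fog"),
    "ungrouped_alt": found(r"d|fog", "dog"),           # only the d alternative matches
    "lookahead": found(r"dog(?=,)", "dog, cat"),       # comma is checked, not consumed
    "lookahead_miss": found(r"dog(?=,)", "dog"),
    "lookbehind": found(r"(?<=big )dog", "big dog"),
    "lookbehind_miss": found(r"(?<=big )dog", "dog"),
}

for name, value in results.items():
    print(name, repr(value))
```

Note that the lookahead and look behind results contain only dog: the surrounding text has to be present, but it is not part of the match.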

Quantifiers

Quantifiers define how many times the preceding item should be matched.

Character Meaning Example
* Match 0 or more times dog* matches do, dog, dogg, or doggg, and so on
+ Match 1 or more times dog+ matches dog, dogg, or doggg, and so on; but not do
? Match 0 or 1 times only dog? matches do or dog only
{n} Match exactly n times dog{2} matches dogg only
{n,} Match n or more times dog{2,} matches dogg, or doggg, and so on
{n,m} Match n up to m times dog{2,4} matches dogg, doggg, or dogggg only
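
The quantifier examples above can be checked with a small sketch such as this one (Python is an assumed example language, and the helper name `whole_matches` is hypothetical). It uses `re.fullmatch` so that each candidate must match in its entirety, which is what the "only" in the table implies.

```python
import re

candidates = ["do", "dog", "dogg", "doggg", "dogggg", "doggggg"]

def whole_matches(pattern):
    """List the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = whole_matches(r"dog*")             # 0 or more g
plus = whole_matches(r"dog+")             # 1 or more g
optional = whole_matches(r"dog?")         # 0 or 1 g
exactly_two = whole_matches(r"dog{2}")    # exactly 2 g
two_or_more = whole_matches(r"dog{2,}")   # 2 or more g
two_to_four = whole_matches(r"dog{2,4}")  # between 2 and 4 g

# The quantifier applies to the preceding item only (here the g).
# Group the item to repeat more than one character.
repeated_word = bool(re.fullmatch(r"(dog){2}", "dogdog"))

print(star, plus, optional, exactly_two, two_or_more, two_to_four, repeated_word)
```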

Character Classes

Character Classes define a set of characters that should be matched

Sequence Matches Example
[...] Any character contained in set ... [fld]og matches fog, log, or dog only
[:...:] Any character defined by POSIX class ... (used inside a bracketed set) [[:digit:]] matches any single decimal digit
\w Any word character (alphanumeric and underscore)
\W Any non-word character
\s Any white-space character
\S Any non-whitespace character
\d Any decimal digit character
\D Any non-decimal digit character
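
A short sketch of the shorthand classes in use, again assuming Python as the example language; the sample text is made up. In Python 3 these classes are Unicode-aware for text strings unless the `re.ASCII` flag is given, so other implementations may differ slightly on non-ASCII input.

```python
import re

# Hypothetical sample text containing a tab between "shipped" and "to".
text = "Order_42 shipped\tto bay 7"

words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W+", text)   # the separators between them
digits = re.findall(r"\d+", text)      # runs of decimal digits
fields = re.split(r"\s+", text)        # split on any white-space
visible = re.findall(r"\S+", text)     # runs of non-whitespace
digits_only = re.sub(r"\D", "", text)  # strip everything that is not a digit
bracketed = re.findall(r"[fld]og", "fog log dog hog")

print(words, non_words, digits, fields, visible, digits_only, bracketed)
```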

POSIX Classes

POSIX classes are generally not supported by Microsoft implementations.

Class Matches Equivalent to
alpha Any alphabetical character [A-Za-z]
alnum Any alphanumeric character [A-Za-z0-9]
ascii Any ASCII character
blank A space or tab
cntrl Control characters
digit Any decimal digit [0-9] or \d
graph Any graphical/visible character
lower Any lower-case alphabetical character [a-z]
print Any printable character, including a space
punct Any punctuation character
space Any white-space character
upper Any upper-case character [A-Z]
word Any word character [A-Za-z0-9_] or \w
xdigit Any hexadecimal digit [0-9a-fA-F]
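
Where POSIX classes are unavailable, the equivalents in the table above can be substituted by hand. Python's standard `re` module is one such implementation without POSIX class support, so the sketch below expands the class names before compiling. The names `POSIX_BODIES` and `expand_posix` are hypothetical, only classes with a simple ASCII equivalent are covered, and this is one possible workaround rather than a standard API.

```python
import re

# ASCII bracket-set bodies for the POSIX classes that have a simple equivalent.
POSIX_BODIES = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Replace each [:name:] with its bracket body; unknown names are left as-is."""
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: POSIX_BODIES.get(m.group(1), m.group(0)),
        pattern,
    )

hex_pattern = expand_posix(r"^[[:xdigit:]]+$")
identifier = expand_posix(r"[[:alpha:]_][[:word:]]*")

is_hex = bool(re.search(hex_pattern, "DEADbeef42"))
not_hex = bool(re.search(hex_pattern, "xyz"))
good_name = bool(re.fullmatch(identifier, "my_var1"))
bad_name = bool(re.fullmatch(identifier, "1abc"))

print(hex_pattern, identifier, is_hex, not_hex, good_name, bad_name)
```

The expansion is purely textual, so it only makes sense where the class appears inside a bracketed set, as in [[:xdigit:]].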