Regular Expressions
Latest revision as of 14:24, 5 June 2013
Most Regular Expression (Regex) implementations are based on, or bear similarities to, the PCRE library. Regular Expressions are normally used to replace or extract bits of data from a larger string. The way you invoke them varies between languages (e.g. the functions or constructs that let you use Regex in Perl, PHP, PowerShell, etc.), but the syntax of the actual Regex will generally be consistent. The secret to creating a good Regex is thorough testing, for things that shouldn't match as much as for things that should.
The following provides a run-through of the general syntax (it's not exhaustive and is a bit of a work in progress, but it contains most of the common stuff).
See the Regex Examples page for some real-world examples.
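As a rough illustration of the extract, replace, and "test what shouldn't match" ideas above, here is a minimal sketch in Python (chosen only for illustration; the page itself is language-neutral). The sample string, the date pattern, and all variable names are assumptions made up for this example.

```python
import re

text = "Order 1234 shipped on 2013-06-05 to big dog kennels"

# Extract: pull an ISO-style date out of the larger string.
date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
extracted = date_match.group(0) if date_match else None

# Replace: mask the digits that directly follow "Order ".
replaced = re.sub(r"(?<=Order )\d+", "####", text)

# Test both sides: strings that should match and strings that should not.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
should_match = ["2013-06-05", "1999-12-31"]
should_not_match = ["2013-6-5", "20130605", "13-06-05"]

all_ok = (
    all(date_pattern.fullmatch(s) for s in should_match)
    and not any(date_pattern.fullmatch(s) for s in should_not_match)
)

print(extracted)
print(replaced)
print(all_ok)
```

Keeping the "should not match" list next to the pattern makes it easier to spot a Regex that is too permissive.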
Metacharacters
Metacharacters define how an item should match (or not).
| Character | Meaning | Example |
|---|---|---|
| `\` | Escape (treat the following metacharacter as a literal character) | |
| `^` | Match the beginning of the line | `^dog` matches `dog` but not `big dog` |
| `$` | Match the end of the line | `dog$` matches `dog` but not `dog run` |
| `.` | Match any single character | `do.` matches `dog` or `dot`, but not `do` or `dogs` |
| `(...)` | Grouping, match entire set `...` | |
| `[...]` | Bracketed Character class, any character contained in set `...` | `[fld]og` matches `fog`, `log`, or `dog` only |
| `\|` | Or, alternation | `(d\|f)og` matches `dog` or `fog` |
| `(?=...)` | Lookahead | `dog(?=,)` matches the `dog` in `dog,` but won't find a match in `dog` |
| `(?<=...)` | Lookbehind | `(?<=big )dog` matches the `dog` in `big dog` but won't find a match in `dog` |
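The metacharacters in the table above can be tried out with a short sketch like the following. It uses Python's `re` module purely as an assumed test bed (other engines should behave similarly for these basics), and the variable names are illustrative only.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_dog = bool(re.search(r"^dog", "dog run"))
not_at_start = bool(re.search(r"^dog", "big dog"))
ends_with_dog = bool(re.search(r"dog$", "big dog"))

# . stands for exactly one character, so "do" is too short and "dogs" too long.
dot_matches = [s for s in ["do", "dog", "dot", "dogs"] if re.fullmatch(r"do.", s)]

# Without a group, the alternation splits the whole pattern into "d" or "fog".
words = ["dog", "fog", "d", "og"]
grouped = [s for s in words if re.fullmatch(r"(d|f)og", s)]
ungrouped = [s for s in words if re.fullmatch(r"d|fog", s)]

# Lookahead and lookbehind check the context without consuming it.
ahead = re.search(r"dog(?=,)", "dog, cat")
ahead_span = ahead.span() if ahead else None
ahead_missing = re.search(r"dog(?=,)", "dog")
behind = re.search(r"(?<=big )dog", "big dog")
behind_missing = re.search(r"(?<=big )dog", "dog")

# Escaping turns the "any character" dot into a literal full stop.
literal_dot = bool(re.fullmatch(r"3\.14", "3x14"))
any_char_dot = bool(re.fullmatch(r"3.14", "3x14"))

print(dot_matches, grouped, ungrouped, ahead_span)
```

Note that `re.fullmatch` requires the whole string to match, whereas `re.search` finds the pattern anywhere in the string.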
Quantifiers
Quantifiers define how many times the preceding item should be matched.
| Character | Meaning | Example |
|---|---|---|
| `*` | Match 0 or more times | `dog*` matches `do`, `dog`, `dogg`, or `doggg`, and so on |
| `+` | Match 1 or more times | `dog+` matches `dog`, `dogg`, or `doggg`, and so on; but not `do` |
| `?` | Match 0 or 1 times only | `dog?` matches `do` or `dog` only |
| `{n}` | Match exactly `n` times | `dog{2}` matches `dogg` only |
| `{n,}` | Match `n` or more times | `dog{2,}` matches `dogg`, or `doggg`, and so on |
| `{n,m}` | Match `n` up to `m` times | `dog{2,4}` matches `dogg`, `doggg`, or `dogggg` only |
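The quantifier examples above can be checked with a small sketch such as this one. It assumes Python's `re` module and uses a hypothetical helper, `full_matches`, that reports which candidate strings a pattern matches in full.

```python
import re

candidates = ["do", "dog", "dogg", "doggg", "dogggg", "doggggg"]


def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


# In every pattern the quantifier applies only to the preceding item, the "g".
star = full_matches(r"dog*")
plus = full_matches(r"dog+")
optional = full_matches(r"dog?")
exactly_two = full_matches(r"dog{2}")
two_or_more = full_matches(r"dog{2,}")
two_to_four = full_matches(r"dog{2,4}")

for name, result in [
    ("*", star), ("+", plus), ("?", optional),
    ("{2}", exactly_two), ("{2,}", two_or_more), ("{2,4}", two_to_four),
]:
    print(name, result)
```

To repeat a longer item, wrap it in a group first, for example `(dog){2}` for `dogdog`.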
Character Classes
Character Classes define a set of characters that should be matched.
| Sequence | Matches | Example |
|---|---|---|
| `[...]` | Any character contained in set `...` | `[fld]og` matches `fog`, `log`, or `dog` only |
| `[:...:]` | Any character defined by POSIX class `...` (written inside a bracket expression) | `[[:digit:]]` matches any decimal digit |
| `\w` | Any word character (alphanumeric and underscore) | |
| `\W` | Any non-word character | |
| `\s` | Any white-space character | |
| `\S` | Any non-whitespace character | |
| `\d` | Any decimal digit character | |
| `\D` | Any non-decimal digit character | |
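The shorthand classes above can be explored with a sketch along these lines. It assumes Python's `re` module and plain ASCII input (in Python 3, `\w`, `\d`, and `\s` also match non-ASCII Unicode characters by default); the sample string is made up.

```python
import re

sample = "Rex_2 weighs 31 kg!"

words = re.findall(r"\w+", sample)        # runs of word characters
non_words = re.findall(r"\W+", sample)    # runs of everything else
digits = re.findall(r"\d+", sample)       # runs of decimal digits
digits_only = re.sub(r"\D", "", sample)   # strip every non-digit
tokens = re.split(r"\s+", sample)         # split on white-space
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace

# A bracketed class lists the acceptable characters explicitly.
animals = re.findall(r"[fld]og", "fog log dog bog")

print(words, non_words, digits, digits_only, tokens, animals)
```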
POSIX Classes
POSIX classes are generally not supported by Microsoft implementations.
| Class | Matches | Equivalent to |
|---|---|---|
| `alpha` | Any alphabetical character | `[A-Za-z]` |
| `alnum` | Any alphanumeric character | `[A-Za-z0-9]` |
| `ascii` | Any ASCII character | |
| `blank` | A space or tab | |
| `cntrl` | Control characters | |
| `digit` | Any decimal digit | `[0-9]` or `\d` |
| `graph` | Any graphical/visible character | |
| `lower` | Any lower-case alphabetical character | `[a-z]` |
| `print` | Any printable character, including a space | |
| `punct` | Any punctuation character | |
| `space` | Any white-space character | |
| `upper` | Any upper-case alphabetical character | `[A-Z]` |
| `word` | Any word character | `[A-Za-z0-9_]` or `\w` |
| `xdigit` | Any hexadecimal digit | `[0-9a-fA-F]` |
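Where an engine lacks POSIX classes (Python's `re` module is one such engine, alongside the Microsoft implementations mentioned above), the "Equivalent to" column can be used instead. The sketch below assumes ASCII input; the lookup table and the `posix_class` helper are hypothetical names invented for this example, not part of any library.

```python
import re

# Hypothetical lookup: POSIX class name -> the contents of an equivalent
# bracket expression, taken from the "Equivalent to" column above.
POSIX_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
}


def posix_class(name):
    """Return a bracket expression standing in for the named POSIX class."""
    return "[" + POSIX_EQUIVALENTS[name] + "]"


# Six hexadecimal digits after a hash, e.g. a colour code.
hex_colour = re.compile("#" + posix_class("xdigit") + "{6}")
is_hex = bool(hex_colour.fullmatch("#1a2B3c"))
not_hex = bool(hex_colour.fullmatch("#1a2B3g"))

# Runs of upper-case letters.
upper_runs = re.findall(posix_class("upper") + "+", "PCRE and POSIX rock")

print(is_hex, not_hex, upper_runs)
```

Only the classes with a listed equivalent are covered here; the others would need their character ranges spelling out by hand.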