Regular Expressions: Difference between revisions

Latest revision as of 14:24, 5 June 2013

Most Regular Expression (Regex) implementations are based on, or bear similarities to, the PCRE library. Regular Expressions are normally used to replace or extract bits of data from a larger string. The way you invoke them varies between languages (e.g. the functions or constructs that let you use Regex in Perl, PHP, PowerShell, etc.), but the syntax of the actual Regex will generally be consistent. The secret to creating good Regex is thorough testing, as much for things that shouldn't match as for things that should.

The following provides a run-through of the general syntax (it's not exhaustive and is a bit of a work in progress, but it covers most of the common stuff).

See the Regex Examples page for some real world examples.
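
The advice above about testing for non-matches can be sketched in Python as follows. The `LOGFILE` pattern, the sample file names, and the `failures` helper are illustrative assumptions rather than anything defined on this page; the point is only that the reject list is checked as carefully as the accept list.

```python
import re

# Hypothetical pattern: four-digit year, three-letter month, one-digit week.
LOGFILE = re.compile(r"\d{4}-[A-Za-z]{3}-Week\d\.log")

should_match = ["2010-Feb-Week4.log", "2009-Dec-Week2.log"]
should_not_match = [
    "2010-Feb-Week4xlog",   # only rejected because the dot is escaped
    "10-Feb-Week4.log",     # year too short
    "2010-Feb-Week42.log",  # week has two digits
]

def failures(pattern, accept, reject):
    """Return every input where the pattern behaves unexpectedly."""
    wrong = [s for s in accept if not pattern.fullmatch(s)]
    wrong += [s for s in reject if pattern.fullmatch(s)]
    return wrong

print(failures(LOGFILE, should_match, should_not_match))
```

Leaving the dot unescaped (`Week\d.log`) would let `2010-Feb-Week4xlog` through, which is exactly the kind of slip that only the reject list catches.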

Metacharacters

Metacharacters define how an item should match (or not).

Character	Meaning	Example
`\`	Escape
`^`	Match the beginning of the line	`^dog` matches `dog` but not `big dog`
`$`	Match the end of the line	`dog$` matches `dog` but not `dog run`
`.`	Match any single character (except a newline, by default)	`do.` matches `dog` or `dot`, but not `do`
`(...)`	Grouping, treat `...` as a single unit	`(dog)+` matches `dog`, `dogdog`, and so on
`[...]`	Bracketed Character class, any one character contained in set `...`	`[fld]og` matches `fog`, `log`, or `dog` only
`|`	Or, alternation	`(d|f)og` matches `dog` or `fog`; ungrouped, `d|fog` matches `d` or `fog`
`?=`	Lookahead	`dog(?=,)` matches the `dog` in `dog,` but won't find a match in `dog`
`?<=`	Look behind	`(?<=big )dog` matches the `dog` in `big dog` but won't find a match in `dog`
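
The rows above can be checked with a short Python sketch. It assumes Python's built-in `re` module, whose syntax is close to PCRE for these constructs; the `finds` helper and the sample strings are made up for illustration.

```python
import re

def finds(pattern, text):
    """Return the matched substring, or None when the pattern is not found."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Each pair is (result on a string that should match, result on one that should not).
results = {
    "start anchor": (finds(r"^dog", "dog run"), finds(r"^dog", "big dog")),
    "end anchor": (finds(r"dog$", "big dog"), finds(r"dog$", "dog run")),
    "any character": (finds(r"do.", "dot"), finds(r"do.", "do")),
    # Without the group, the alternatives are "^d" and "fog$", so "dog" yields "d".
    "alternation": (finds(r"^(d|f)og$", "fog"), finds(r"^d|fog$", "dog")),
    "lookahead": (finds(r"dog(?=,)", "dog, cat"), finds(r"dog(?=,)", "dog")),
    "lookbehind": (finds(r"(?<=big )dog", "big dog"), finds(r"(?<=big )dog", "dog")),
}

for name, pair in results.items():
    print(name, pair)
```

Note that the lookahead and lookbehind results contain only `dog`: the comma and `big ` are required to be present but are not part of the match.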

Quantifiers

Quantifiers define how many times the preceding item should be matched.

Character	Meaning	Example
`*`	Match 0 or more times	`dog*` matches `do`, `dog`, `dogg`, or `doggg`, and so on
`+`	Match 1 or more times	`dog+` matches `dog`, `dogg`, or `doggg`, and so on; but not `do`
`?`	Match 0 or 1 times only	`dog?` matches `do` or `dog` only
`{n}`	Match exactly `n` times	`dog{2}` matches `dogg` only
`{n,}`	Match `n` or more times	`dog{2,}` matches `dogg`, or `doggg`, and so on
`{n,m}`	Match `n` up to `m` times	`dog{2,4}` matches `dogg`, `doggg`, or `dogggg` only
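
As a rough check of the table above, the following Python sketch runs each quantifier against the same candidate strings. It uses `re.fullmatch` so that "matches" means the whole string, as in the examples; the `whole_match` helper and the candidate list are assumptions made for illustration.

```python
import re

def whole_match(pattern, text):
    """True when the pattern accounts for the entire string."""
    return re.fullmatch(pattern, text) is not None

candidates = ["do", "dog", "dogg", "doggg", "dogggg", "doggggg"]
patterns = [r"dog*", r"dog+", r"dog?", r"dog{2}", r"dog{2,}", r"dog{2,4}"]

# Map each pattern to the candidates it matches in full.
table = {p: [c for c in candidates if whole_match(p, c)] for p in patterns}

for pattern, matched in table.items():
    print(pattern, matched)
```

In every case the quantifier applies only to the preceding `g`, not to the whole word; quantifying `dog` as a unit needs a group, as in `(dog)+`.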

Character Classes

Character Classes define a set of characters that should be matched.

Sequence	Matches	Example
`[...]`	Any one character contained in set `...`	`[fld]og` matches `fog`, `log`, or `dog` only
`[:...:]`	Any character defined by POSIX class `...` (used inside a bracket expression)	`[[:digit:]]` matches any decimal digit
`\w`	Any word character (alphanumeric and underscore)
`\W`	Any non-word character
`\s`	Any white-space character
`\S`	Any non-whitespace character
`\d`	Any decimal digit character
`\D`	Any non-decimal digit character
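
A minimal Python sketch of the shorthand classes above, using a made-up sample string. One assumption worth flagging: in Python 3, `\w`, `\d`, and `\s` are Unicode-aware for `str` patterns, so the `re.ASCII` flag is passed here to keep them to the ASCII sets described in the table.

```python
import re

sample = "Week 4: 12 files, 3_tmp"

digits = re.findall(r"\d+", sample, flags=re.ASCII)   # runs of decimal digits
words = re.findall(r"\w+", sample, flags=re.ASCII)    # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample, flags=re.ASCII)    # individual white-space characters
non_words = re.findall(r"\W+", sample, flags=re.ASCII)  # runs of everything \w excludes

# A bracketed class matches exactly one character from the set.
animals = re.findall(r"[fld]og", "fog log dog bog")

print(digits, words, len(spaces), non_words, animals)
```

The upper-case forms are simply the complements of the lower-case ones, so `\W+` returns the separators that `\w+` skipped.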

POSIX Classes

POSIX classes are generally not supported by Microsoft implementations.

Class	Matches	Equivalent to
`alpha`	Any alphabetical character	`[A-Za-z]`
`alnum`	Any alphanumeric character	`[A-Za-z0-9]`
`ascii`	Any ASCII character
`blank`	A space or tab
`cntrl`	Control characters
`digit`	Any decimal digit	`[0-9]` or `\d`
`graph`	Any graphical/visible character
`lower`	Any lower-case alphabetical character	`[a-z]`
`print`	Any printable character, including a space
`punct`	Any punctuation character
`space`	Any white-space character
`upper`	Any upper-case alphabetical character	`[A-Z]`
`word`	Any word character	`[A-Za-z0-9_]` or `\w`
`xdigit`	Any hexadecimal digit	`[0-9a-fA-F]`
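
Where POSIX classes are unavailable, the "Equivalent to" column can stand in for them. The sketch below assumes Python's `re` module, which does not support `[:name:]` classes; the `POSIX_EQUIVALENTS` table and the `expand_posix` helper are hypothetical names, cover only the classes with an equivalent listed above, and are ASCII-only.

```python
import re

# Hypothetical lookup built from the "Equivalent to" column (ASCII ranges only).
POSIX_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Replace each [:name:] with its ASCII range so `re` can compile the result.

    Raises KeyError for a class name that has no entry in the lookup.
    """
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: POSIX_EQUIVALENTS[m.group(1)],
        pattern,
    )

hex_pattern = expand_posix(r"[[:xdigit:]]+")
print(hex_pattern)
print(bool(re.fullmatch(hex_pattern, "DEADbeef")))
```

Because the replacement happens inside the surrounding brackets, classes can still be combined, for example `[[:alpha:][:digit:]_]`.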

@@ Line 1: / Line 1: @@
+Most Regular Expression (Regex) implementations are based on, or bear similarities to, the [[Acronyms#P|PCRE]] library.  Regular Expressions are normally used to replace or extract bits of data from a larger string.  The way you invoke them varies between languages (e.g. the functions or constructs that let you use Regex in Perl, PHP, PowerShell, etc.), but the syntax of the actual Regex will ''generally'' be consistent.  The secret to creating good Regex is thorough testing, as much for things that ''shouldn't'' match as for things that ''should''.
-== Useful/Standard RegEx ==
+The following provides a run-through of the general syntax (it's not exhaustive and is a bit of a work in progress, but it covers most of the common stuff).
-{|cellpadding="2" cellspacing="0" border="1"
-|- style="background-color:#bbddff;"
+See the [[Regex Examples]] page for some real world examples.
-! Matches                  !!  Expression
+== Metacharacters ==
+Metacharacters define how an item should match (or not)
+{|class="vwikitable"
+|-
+! Character		!! Meaning			!! Example
+|-
+| <code>\</code>	|| Escape
+|-
+| <code>^</code>	|| Match the beginning of the line	|| <code>^dog</code> matches <code>dog</code> but not <code>big dog</code>
+|-
+| <code>$</code>	|| Match the end of the line	|| <code>dog$</code> matches <code>dog</code> but not <code>dog run</code>
+|-
+| <code>.</code>	|| Match any single character (except a newline, by default)		|| <code>do.</code> matches <code>dog</code> or <code>dot</code>, but not <code>do</code>
+|-
+| <code>(...)</code>	|| Grouping, match entire set <code>...</code>	||
 |-
-| '''IP Address'''
+| <code>[...]</code>	|| Bracketed [[#Character_Classes|Character class]], any one character contained in set <code>...</code>	|| <code>[fld]og</code> matches <code>fog</code>, <code>log</code>, or <code>dog</code> only
-| <code><nowiki> ^\b((25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.){3}(25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\b </nowiki></code>
 |-
-| '''Hostname''' (no domain)
+| <code><nowiki>|</nowiki></code>	|| Or, alternation		|| <code>(d<nowiki>|</nowiki>f)og</code> matches <code>dog</code> or <code>fog</code>
-| <code><nowiki> \A(\w|-)+ </nowiki></code>
 |-
-| '''Email address'''
+| <code>?=</code>       || Lookahead                     || <code>dog(?=,)</code> matches the <code>dog</code> in <code>dog,</code> but won't find a match in <code>dog</code>
-| <code><nowiki> \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b </nowiki></code>
+|-
+| <code>?<=</code>      || Look behind                   || <code>(?<=big )dog</code> matches the <code>dog</code> in <code>big dog</code> but won't find a match in <code>dog</code>
 |}
-== Examples ==
+== Quantifiers ==
-=== Logfile Name ===
+Quantifiers define how many times the preceding item should be matched.
-''' <code> \d{4}-[A-Za-z]{3}-Week\d{1}.log </code> '''
+{|class="vwikitable"
+|-
+! Character		!! Meaning			!! Example
+|-
+| <code>*</code>	|| Match 0 or more times	|| <code>dog*</code> matches <code>do</code>, <code>dog</code>, <code>dogg</code>, or <code>doggg</code>, and so on
+|-
+| <code>+</code>	|| Match 1 or more times	|| <code>dog+</code> matches <code>dog</code>, <code>dogg</code>, or <code>doggg</code>, and so on; but not <code>do</code>
+|-
+| <code>?</code>	|| Match 0 or 1 times only	|| <code>dog?</code> matches <code>do</code> or <code>dog</code> only
+|-
+| <code>{n}</code>	|| Match exactly <code>n</code> times	|| <code>dog{2}</code> matches <code>dogg</code> only
+|-
+| <code>{n,}</code>	|| Match <code>n</code> or more times	|| <code>dog{2,}</code> matches <code>dogg</code>, or <code>doggg</code>, and so on
+|-
+| <code>{n,m}</code>	|| Match <code>n</code> up to <code>m</code> times	|| <code>dog{2,4}</code> matches <code>dogg</code>, <code>doggg</code>, or <code>dogggg</code> only
+|}
-Example matches...
+== Character Classes ==
-* <code> 2010-Feb-Week4.log </code>
+Character Classes define a set of characters that should be matched
-* <code> 2009-Dec-Week2.log </code>
+{|class="vwikitable"
-* <code> 1234-aBc-Week0.log </code>
+|-
+! Sequence		!! Matches			!! Example
+|-
+| <code>[...]</code>	|| Any one character contained in set <code>...</code>	|| <code>[fld]og</code> matches <code>fog</code>, <code>log</code>, or <code>dog</code> only
+|-
+| <code>[:...:]</code>	|| Any character defined by [[#POSIX Classes|POSIX class]] <code>...</code>	||
+|-
+| <code>\w</code>	|| Any word character (alphanumeric and underscore)	||
+|-
+| <code>\W</code>	|| Any non-word character	||
+|-
+| <code>\s</code>	|| Any white-space character	||
+|-
+| <code>\S</code>	|| Any non-whitespace character	||
+|-
+| <code>\d</code>	|| Any decimal digit character	||
+|-
+| <code>\D</code>	|| Any non-decimal digit character	||
+|}
-=== Between Parentheses ===
+=== POSIX Classes ===
-''' <code> (?<=\[)(.*?)(?=\]) </code> '''
+[[Acronyms#P|POSIX]] classes are generally not supported by Microsoft implementations.
+{|class="vwikitable"
-Matches everything between <code> [ </code> and <code> ]</code>, so for example...
+|-
-* <code> VMFS_SCSI_DS_01 </code> is matched from <code> [VMFS_SCSI_DS_01] My_VM/MyVM.vmdk
+! Class			!! Matches			!! Equivalent to
+|-
-Its essentially done via three chunks of the regex...
+| <code>alpha</code>	|| Any alphabetical character	|| <code>[A-Za-z]</code>
-# <code><nowiki> (?<=\[) </nowiki></code>
+|-
-#* Requires that <code> [ </code> immediately proceeds the match
+| <code>alnum</code>	|| Any alphanumeric character	|| <code>[A-Za-z0-9]</code>
-# <code> (.*?) </code>
+|-
-#* Matches everything
+| <code>ascii</code>	|| Any [[Acronyms#A|ASCII]] character	||
-# <code> (?=\]) </code>
+|-
-#* Requires that <code> [ </code> immediately follows the match
+| <code>blank</code>	|| A space or tab	||
+|-
-=== VMHBA LUN ID ===
+| <code>cntrl</code>	|| Control characters	||
-'''<code> (?<=:)([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$ </code>
+|-
+| <code>digit</code>	|| Any decimal digit	|| <code>[0-9]</code> or <code>\d</code>
-Finds a number between 0 and 255 immediately after a <code>:</code> at the end of a line, specifically intended to get the LUN ID from a VMware canonical path, so for example...
+|-
-* <code>13</code> is matched from <code>vmhba3:0:13</code>
+| <code>graph</code>	|| Any graphical/visible character	||
+|-
-Stepping through the regex...
+| <code>lower</code>	|| Any lower-case alphabetical character	|| <code>[a-z]</code>
-# <code><nowiki> (?<=:) </nowiki></code>
+|-
-#* Requires that <code> : </code> immediately proceeds the match
+| <code>print</code>	|| Any printable character, including a space	||
-# <code>([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])</code>
+|-
-#* Matches any number between 0 and 255
+| <code>punct</code>	|| Any punctuation character	||
-# <code>$</code>
+|-
-#* Ensures that the proceeding match occurs at the end of a line
+| <code>space</code>	|| Any white-space character	||
+|-
-=== vRanger Backup text in VM Notes ===
+| <code>upper</code>	|| Any upper-case character	|| <code>[A-Z]</code>
-'''<code>\s?\bvRanger.*Repository \[.*\]\s?</code>'''
+|-
+| <code>word</code>	|| Any word character	|| <code>[A-Za-z0-9_]</code> or <code>\w</code>
-Finds the text created by vRanger in a VM's notes (so you can strip it out)...
+|-
-* EG <code>vRanger Pro Backup: Type [Full] Result [Success] Time [27/09/2010 06:46:54] Repository [VC_Server]</code>
+| <code>xdigit</code>	|| Any hexadecimal digit	|| <code>[0-9a-fA-F]</code>
+|}
-Stepping through the regex...
+[[Category:Regex]]
-# <code>\s?</code>
-#* Matches any white-space at the start of the match
-# <code>\bvRanger.*</code>
-#* Matches <code>vRanger</code> at the start of a word and anything after until...
-# <code>Repository \[.*\]</code>
-#* Matches the end of a vRanger text segment, <code>Repository [hostname]</code>
-# <code>\s?</code>
-#* Matches any white-space at the end of the match
