Regular Expressions—The Full Story

Introduction

PART I: The Automaton

Transition Diagram

Two-up is a traditional Australian casino game.

Alphabet

Four different alphabets: coin, English lowercase, binary digits, and two dots.
Four different glyphs.
Every symbol has a unique index number.

States and Transitions

Transitions Table

Finite automaton: all strings that don’t end with two blue dots.
The intersection of the state B and the white dot symbol says: “Go to state A.”
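
A transition table maps almost directly onto a nested lookup structure. The following is a minimal sketch, assuming three states for the automaton in the caption: A (start, no trailing blue dot), B (one trailing blue dot), and a rejecting state for two or more trailing blue dots. The state name C, the symbol names :white and :blue, and the helper accepts? are illustrative assumptions and are not taken from the figure.

```ruby
# Transition table: TABLE[state][symbol] is the next state.
# States A and B appear in the caption; C is an assumed name for
# "the input read so far ends with two blue dots".
TABLE = {
  'A' => { white: 'A', blue: 'B' },
  'B' => { white: 'A', blue: 'C' },
  'C' => { white: 'A', blue: 'C' }
}.freeze
ACCEPTING = %w[A B].freeze # every state except C accepts

# Runs the automaton over an array of symbols, starting in state A.
def accepts?(symbols)
  final = symbols.reduce('A') { |state, symbol| TABLE.fetch(state).fetch(symbol) }
  ACCEPTING.include?(final)
end

accepts?(%i[blue white blue]) #=> true
accepts?(%i[white blue blue]) #=> false
TABLE['B'][:white]            #=> "A"
```

The last line is the table cell described above: state B combined with the white dot leads to state A.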

Nondeterministic Finite Automaton

NFA and DFA: all strings in which no white dot can ever be preceded by a blue dot
NFA and DFA: any string that ends with a sequence of first one white, then one blue dot.
Venn diagram: All DFAs are NFAs. Some NFAs are DFAs.
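
One way to run an NFA is to follow all possible paths at once by tracking the set of active states. The sketch below does this for the second caption ("ends with one white, then one blue dot"). The state numbers 0 to 2, the letters 'w' and 'b' for the two dots, and the helper nfa_accepts? are assumptions made for illustration.

```ruby
require 'set'

# NFA for "ends with one white dot followed by one blue dot".
# 'w' = white, 'b' = blue. State 0 has two possible moves on 'w',
# which is what makes this automaton nondeterministic.
NFA = {
  [0, 'w'] => [0, 1],
  [0, 'b'] => [0],
  [1, 'b'] => [2]
}.freeze

# Follows every path at once by keeping the set of active states.
def nfa_accepts?(input)
  states = Set[0]
  input.each_char do |symbol|
    states = states.flat_map { |state| NFA.fetch([state, symbol], []) }.to_set
  end
  states.include?(2) # state 2 is the only accepting state
end

nfa_accepts?('bwb') #=> true
nfa_accepts?('wbw') #=> false
```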

Greedy and Reluctant

Automaton for Two-up. Symbol H represents heads. Symbol T represents tails.

NFA to DFA

Nondeterministic finite automaton.
Deterministic finite automaton.
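
The standard way to turn an NFA into a DFA is the subset construction: every DFA state is a set of NFA states. The sketch below applies it to a small assumed NFA for "ends with white, then blue" (letters 'w' and 'b', states 0 to 2). The automaton, its names, and the helper subset_construction are illustrative and are not read off the figures.

```ruby
require 'set'

# Assumed NFA: state 0 loops on every symbol, 0 -w-> 1, 1 -b-> 2 (accepting).
NFA_DELTA = {
  [0, 'w'] => [0, 1],
  [0, 'b'] => [0],
  [1, 'b'] => [2]
}.freeze
ALPHABET = %w[w b].freeze

# Returns a hash: DFA state (a Set of NFA states) => { symbol => DFA state }.
def subset_construction(start_state)
  dfa = {}
  queue = [Set[start_state]]
  until queue.empty?
    current = queue.shift
    next if dfa.key?(current) # already expanded

    dfa[current] = {}
    ALPHABET.each do |symbol|
      target = current.flat_map { |s| NFA_DELTA.fetch([s, symbol], []) }.to_set
      dfa[current][symbol] = target
      queue << target
    end
  end
  dfa
end

dfa = subset_construction(0)
dfa.keys.map(&:to_a) #=> [[0], [0, 1], [0, 2]]
```

Any DFA state that contains NFA state 2 is accepting, so here only the state [0, 2] accepts.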

Pushdown Automata

Pushdown automaton.
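
A pushdown automaton is a finite automaton with a stack added. As a minimal sketch of what the stack buys, the hypothetical helper below accepts strings with balanced parentheses, a language that no finite automaton can recognize. The example language is an assumption chosen for illustration, not the one in the figure.

```ruby
# Accepts strings whose parentheses are balanced; other characters are ignored.
def balanced?(input)
  stack = []
  input.each_char do |symbol|
    case symbol
    when '('
      stack.push(symbol) # remember an open parenthesis
    when ')'
      return false if stack.empty? # nothing left to close

      stack.pop
    end
  end
  stack.empty? # accept only if every '(' was closed
end

balanced?('(()())') #=> true
balanced?('(()')    #=> false
balanced?(')(')     #=> false
```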

PART II: Two Operations and One Function

History of Regular Expressions

The regular expressions pioneers.

Match One Character

'a'.match /a/ #=> #<MatchData "a">
# string a matched by regex /a/
''.match /a/ #=> nil
# empty string not matched by /a/
'b'.match /a/ #=> nil
# string b not matched by /a/
'a'.match // #=> #<MatchData "">
# empty regex ε matches the empty string at the start of a

Four More Rules

Operation №1: Concatenation

'moda'.match /m/ #=> #<MatchData "m">
'moda'.match /o/ #=> #<MatchData "o">
'moda'.match /mo/ #=> #<MatchData "mo"> m×o is mo
'moda'.match /da/ #=> #<MatchData "da"> d×a is da
'moda'.match /moda/ #=> #<MatchData "moda"> mo×da is moda
'moda'.match /mado/ #=> nil -- mado is not moda
'aaa'.match /aaa/ #=> #<MatchData "aaa">
'aaa'.match /a{3}/ #=> #<MatchData "aaa">
# yes, the string includes 3 concatenated a
'aaa'.match /a{4}/ #=> nil
# no, the string doesn't include 4 a
'aa'.match /a?/ #=> #<MatchData "a">
# optional match, written as question mark
'b'.match /a?/ #=> #<MatchData "">
# zero repeats of a matches empty string
'aa'.match /a{,2}/ #=> #<MatchData "aa"> at most two a
'aa'.match /a{1,2}/ #=> #<MatchData "aa">
# at least one a and at most two a
'a'.match /a{1,2}/ #=> #<MatchData "a">

Operation №2: Alternation

Venn diagram showing exclusive disjunction (XOR).
'a'.match /a|b/ #=> #<MatchData "a"> a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> leftmost chosen
'ba'.match /a|b/ #=> #<MatchData "b"> leftmost chosen
'c'.match /a|b/ #=> nil -- here we found neither a nor b
'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">
'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">
'10'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'moda'.match /moda|/ #=> #<MatchData "moda">
# either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData "">
# either mado (not moda) or nothing is nothing

The Function: Kleene Star

'110'.match /(0|1)*0(0|1)*/ #=> #<MatchData "110">
# all strings with at least one zero
'1111'.match /(0|1)*0(0|1)*/ #=> nil
'1001'.match /((10*1)|0*)*/ #=> #<MatchData "1001">
# all strings with an even number of ones
'11001'.match /((10*1)|0*)*/ #=> #<MatchData "1100">
''.match /((10*1)|0*)*/ #=> #<MatchData "">
# even empty string has even number of ones
'1001'.match /((10*1)|0)|/ #=> #<MatchData "1001">
# same building blocks without the star: at most one block, still an even number of ones
'11001'.match /((10*1)|0)|/ #=> #<MatchData "11">
''.match /((10*1)|0)|/ #=> #<MatchData "">
'1'.match /((10*1)|0)|/ #=> #<MatchData "">
'010'.match /((10*1)|0)|/ #=> #<MatchData "0">
'01'.match /((10*1)|0)|/ #=> #<MatchData "0">

Precedence

'fifth row'.match /third|fifth row/
#=> #<MatchData "fifth row">
'third row'.match /third|fifth row/
#=> #<MatchData "third">
'fifth row'.match /(third|fifth) row/
#=> #<MatchData "fifth row">
'third row'.match /(third|fifth) row/
#=> #<MatchData "third row">
'third row'.match /(third|(four|fif)th) row/ 
#=> #<MatchData "third row">
'fourth row'.match /(third|(four|fif)th) row/
#=> #<MatchData "fourth row">
'fifth row'.match /(third|(four|fif)th) row/
#=> #<MatchData "fifth row">

Some Examples With *, |, and ×

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">
'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">
'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">
'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">
'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">
'0110101'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0011">

Regular Expressions Are Finite Automata

How to design finite automata for concatenation (top left), alternation (top right), Kleene star (bottom left), and a combined pattern (bottom right).
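
The three building blocks can also be composed in code. The sketch below is not the construction shown in the figure. It is a simplified stand-in in which each matcher returns every position where a match could end, much like an NFA tracks a set of states. The helper names lit, cat, alt, star, and full_match? are made up for this illustration.

```ruby
# Each matcher is a lambda: (string, start index) -> array of possible end indexes.

# A single literal character.
def lit(char)
  ->(s, i) { s[i] == char ? [i + 1] : [] }
end

# Concatenation: run b from every position where a can stop.
def cat(a, b)
  ->(s, i) { a.call(s, i).flat_map { |j| b.call(s, j) }.uniq }
end

# Alternation: union of both result sets.
def alt(a, b)
  ->(s, i) { (a.call(s, i) + b.call(s, i)).uniq }
end

# Kleene star: zero or more repetitions; stops when no new position appears.
def star(a)
  lambda do |s, i|
    seen = [i]
    frontier = [i]
    until frontier.empty?
      frontier = frontier.flat_map { |j| a.call(s, j) }.uniq - seen
      seen += frontier
    end
    seen
  end
end

# True if the matcher can consume the whole string.
def full_match?(matcher, s)
  matcher.call(s, 0).include?(s.length)
end

one_then_zeros = cat(lit('1'), star(lit('0'))) # like /\A10*\z/
full_match?(one_then_zeros, '100') #=> true
full_match?(one_then_zeros, '00')  #=> false

bits = star(alt(lit('0'), lit('1')))           # like /\A(0|1)*\z/
full_match?(bits, '0110') #=> true
full_match?(bits, '012')  #=> false
```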

Traits

Architecture

fa = Regexp.compile('10*') #=> /10*/
rs = '1 100 00 10'.scan fa #=> ["1", "100", "10"]
rs.each { |x| puts x }
# 1
# 100
# 10
#=> ["1", "100", "10"]
'1 100 00 10'.scan(/10*/) { |x| puts x }
# 1
# 100
# 10
#=> "1 100 00 10"

Functions

Verify (e.g. a card number), find, replace, filter, and parse are five regex applications.
/^ab*$/ === 'abbb' #=> true
/^ab*$/ === 'baaa' #=> false
'ab b abb a ba'.scan /ab*/ #=> ["ab", "abb", "a", "a"]
'ab b abb a ba'.gsub(/ab*/, '¤') #=> "¤ b ¤ ¤ b¤"
'ab b abb a ba'.gsub(/ab*/, '') #=> " b   b"
'ab b abb a ba'.split /ab*/ #=> ["", " b ", " ", " b"]

PART III: Syntactic Sugar, Abstractions, and Extensions

Quantifiers

'caalery'.sub /a*/,'e' #=> "ecaalery"
'caalery'.sub /a+/,'e' #=> "celery"
'chickpea chicken chickpeas'.scan /chickpeas?/
#=> ["chickpea", "chickpeas"]
'LaLaLaLaLaLa'.sub /(La){1,4}/, 'oh'
#=> "ohLaLa"
'ohLaLaLaLaLaLa'.sub /(La){,4}/, 'oh'
#=> "ohohLaLaLaLaLaLa"
'OhLaLaLaLaLaLa'.sub /(La){,4}/, 'oh'
#=> "ohOhLaLaLaLaLaLa"
'LaLaLaLaLaLa'.sub /(La){1,}/, 'oh'
#=> "oh"
'LaLaLaLaLaLa'.sub /(La){2}/, 'oh'
#=> "ohLaLaLaLa"

Quantifier Equations

Reluctant Quantifiers

Greedy (left) and reluctant (right) Kleene star.
'<div>a</div><span>c</span><div>b</div>'.scan /<div>.*<\/div>/
#=> ["<div>a</div><span>c</span><div>b</div>"]
'<div>a</div><span>c</span><div>b</div>' \
.scan /<div>.*?<\/div>/
#=> ["<div>a</div>", "<div>b</div>"]
'aa'.match /a?/ #=> #<MatchData "a">
'aa'.match /a??/ #=> #<MatchData "">
'aaaaa'.match /a{2,4}/
#=> #<MatchData "aaaa">
# at least 2, at most 4, as much as possible
'aaaaa'.match /a{2,4}?/
#=> #<MatchData "aa">
# at least 2, at most 4, as little as possible
'aaaaa'.match /a{2,}/ #=> #<MatchData "aaaaa">
'aaaaa'.match /a{2,}?/ #=> #<MatchData "aa">
'aaaaa'.match /a{,4}/ #=> #<MatchData "aaaa">
'aaaaa'.match /a{,4}?/ #=> #<MatchData "">

Possessive Quantifier

'b'.sub /a?+b/, '¤' #=> "¤"
'b'.sub /a?b/, '¤' #=> "¤"
'b'.sub /.?+b/, '¤' #=> "b"
'b'.sub /.?b/, '¤' #=> "¤"
'ab'.sub /.?+b/, '¤' #=> "¤"
'ab'.sub /.?b/, '¤' #=> "¤"
'b'.sub /a*+b/, '¤' #=> "¤"
'b'.sub /a*b/, '¤' #=> "¤"
'b'.sub /.*+b/, '¤' #=> "b"
'b'.sub /.*b/, '¤' #=> "¤"
'ab'.sub /.*+b/, '¤' #=> "ab"
'ab'.sub /.*b/, '¤' #=> "¤"
'b'.sub /a++b/, '¤' #=> "b"
'b'.sub /a+b/, '¤' #=> "b"
'b'.sub /.++b/, '¤' #=> "b"
'b'.sub /.+b/, '¤' #=> "b"
'ab'.sub /.++b/, '¤' #=> "ab"
'ab'.sub /.+b/, '¤' #=> "¤"
ruby> 'aab'.sub /.?+b/, '¤' #=> "a¤"
ruby> 'aab'.sub /.{0,1}+b/, '¤' #=> "¤"
# Warning: nested repeat operators '?' and '+'
# were replaced with '*' in regular expression: /.{0,1}+b/
scala> ".{0,1}+b".r.replaceAllIn("aab", "¤")
res3: String = a¤
scala> ".?+b".r.replaceAllIn("aab", "¤")
res4: String = a¤

Literal vs. Metacharacters

The twelve literals that don’t match literally.
'kadıköy karaköy köyun'.scan /köy/
#=> ["köy", "köy", "köy"]
'Sentence.'.scan /\./
#=> ["."]
'Sentence.'.scan /./
#=> ["S", "e", "n", "t", "e", "n", "c", "e", "."]
user_input1 = 'first'
#=> "first"
regex = Regexp.compile('id="' + user_input1 + '"')
#=> /id="first"/
'<span name="secret" id="first"/>'.scan regex
#=> ["id=\"first\""]
user_input2 = '|name=".*?"|'
#=> "|name=\".*?\"|"
regex = Regexp.compile('id="' + user_input2 + '"')
#=> /id="|name=".*?"|"/
'<span name="secret" id="first"/>'.scan regex
#=> ["name=\"secret\"", "id=\"", "\""]
user_input2 = '|name=".*?"|'
#=> "|name=\".*?\"|"
regex = Regexp.compile('id="' + Regexp.escape(user_input2) + '"')
#=> /id="\|name="\.\*\?"\|"/
'<span name="secret" id="first"/>'.scan regex
#=> []
perl -e 'print "match" if "2.71828" =~ /\Q2.71\E/'
#=> match
perl -e 'print "match" if "2-71828" =~ /\Q2.71\E/'
#=> nothing
perl -e 'print "match" if "2-71828" =~ /2.71/'
#=> match

Character Code Points

'a ; a - a , Ω . a : A'.scan /Ω/ #=> ["Ω"]
'a ; a - a , Ω . a : A'.scan /\u2126/ #=> ["Ω"]
'123 ABC'.scan /\101/ #=> ["A"]
'123 ABC'.scan /\63/ #=> ["3"]
'123 ABC'.scan /\063/ #=> ["3"]
'123 45
6'.scan /\d\cJ\d/ #=> ["5\n6"]

Character Aliases

'123 45
6'.scan /\d\n\d/ #=> ["5\n6"]
'123 45
6'.scan /\d\r\d/ #=> []
"a\nb a\r\nb a\n\rb a\rb".scan /a\rb/ #=> ["a\rb"]
"a\nb a\r\nb a\n\rb a\rb".scan /a\nb/ #=> ["a\nb"]
"a\nb a\r\nb a\n\rb a\rb".scan /a\Rb/
#=> ["a\nb", "a\r\nb", "a\rb"]

The period symbol

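The period symbol matches any character except line breaks. In Ruby, that means every character except \n unless the m flag is set, while JavaScript also excludes \r.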
'mama 2 ##'.gsub /a|2|#/, '¤' #=> "m¤m¤ ¤ ¤¤"
'mama 2 ##'.gsub /./, '¤' #=> "¤¤¤¤¤¤¤¤¤"
"grey gr y gr\ny gr\ry gray".scan /gr.y/
#=> ["grey", "gr y", "gr\ry", "gray"]
"grey gr y gr\ny gray gr\ry".scan /gr.y/
#=> ["grey", "gr y", "gray", "gr\ry"]
"grey gr y gr\ny gray gr\ry".scan /gr.y/m
#=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]
JavaScript> 'grey gr y gr\ny gray gr\ry'
.match(/gr.y/g);
[ 'grey', 'gr y', 'gray' ]
JavaScript> 'grey gr y gr\ny gray gr\ry'
.match(/gr[\s\S]y/g);
[ 'grey', 'gr y', 'gr\ny', 'gray', 'gr\ry' ]
'12:34 09.00 24.56.33'.scan /(\d\d.\d\d(.\d\d)?)/
#=> [["12:34 09", " 09"], ["00 24.56", ".56"]]
'12:34 09.00 24.56.33'.scan /(\d\d[.:]\d\d([.:]\d\d)?)/
#=> [["12:34", nil], ["09.00", nil], ["24.56.33", ".33"]]

Shrthnd

'L8 love 2 u 4-ever'.scan /\d/ #=> ["8", "2", "4"]
'123def789 müde'.scan /\d/
#=> ["1", "2", "3", "7", "8", "9"]
'123def789 müde'.scan /\D/
#=> ["d", "e", "f", " ", "m", "ü", "d", "e"]
'123def789 müde'.scan /\w/
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# "m", "d", "e"]
'123def789 müde'.scan /\W/ #=> [" ", "ü"]
'123def789 müde'.scan /\s/ #=> [" "]
'123def789 müde'.scan /\S/
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# "m", "ü", "d", "e"]
'123def789 müde'.scan /./
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# " ", "m", "ü", "d", "e"]
"gr\ny".scan /./ #=> ["g", "r", "y"]
"gr\ny".scan /\s|\S/ #=> ["g", "r", "\n", "y"]

Unicode Categories, Scripts, and Blocks

'a; a-a, Ω. a: A€₺¥'.scan /\p{L}/
#=> ["a", "a", "a", "Ω", "a", "A"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{P}/
#=> [";", "-", ",", ".", ":"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{S}/
#=> ["€", "₺", "¥"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Z}/
#=> [" ", " ", " ", " "]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Ll}/
#=> ["a", "a", "a", "a"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Lu}/
#=> ["Ω", "A"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Sc}/
#=> ["€", "₺", "¥"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Sm}/
#=> []
'a; a-a, Ω. a: A€'.scan /\p{Greek}/
#=> ["Ω"]
'a; a-a, Ω. a: A€'.scan /\p{Latin}/
#=> ["a", "a", "a", "a", "A"]
'a; a-a, Ω. a: A€'.scan /\P{Ll}/
#=> [";", " ", "-", ",", " ", "Ω",
# ".", " ", ":", " ", "A", "€"]
'a; a-a, Ω. a: A€'.scan /\P{Latin}/
#=> [";", " ", "-", ",", " ", "Ω",
# ".", " ", ":", " ", "€"]
'a; a-a, Ω. a: A4€'.scan /[\p{L}\p{N}]/
#=> ["a", "a", "a", "Ω", "a", "A", "4"]
perl -e '"gr\ny" =~ /\X/; print $&' 
# g
perl -e '"gr\ny" =~ /\X+/; print $&'
# gr
# y

Generic Character Class

Alternation (top left), character class (bottom left), character class range (top right), and negated character class (bottom right).
'istanbul constantinople'.scan /a|e|i|o|u|y/
#=> ["i", "a", "u", "o", "a", "i", "o", "e"]
'istanbul constantinople'.scan /[aeiouy]/
#=> ["i", "a", "u", "o", "a", "i", "o", "e"]
'12d343ea3'.scan /[abcdef]/ #=> ["d", "e", "a"]
'12d343ea3'.scan /[a-f]/ #=> ["d", "e", "a"]
'john.smith@company.com'.scan /[7-d]/
#=> ["@", "c", "a", "c"]
'john.smith@company.com'.scan /[j7-dh]/
#=> ["j", "h", "h", "@", "c", "a", "c"]
'pera-beyoğlu'.scan /[a-z]/
#=> ["p", "e", "r", "a", "b", "e", "y", "o", "l", "u"]
'pera-beyoğlu'.scan /[-az]/ #=> ["a", "-"]
'pera-beyoğlu'.scan /[az-]/ #=> ["a", "-"]
'quick2 ball5 good4you1 money1'.scan /[a-z]+[0-9]/
#=> ["quick2", "ball5", "good4", "you1", "money1"]
'quick2 ball5 good4you1 money1'.scan /[a-z0-9]+[0-9]/
#=> ["quick2", "ball5", "good4you1", "money1"]

Generic Character Class Negated and Tweaked

'<span rel="info" class="people">'.scan /class=".*"/
#=> ["class=\"people\""]
'<span class="people" rel="info">'.scan /class=".*"/
#=> ["class=\"people\" rel=\"info\""]
'<span class="people" rel="info">'.scan /class="[^"]*"/
#=> ["class=\"people\""]
'<span class="people" rel="info">'.scan /class=['"][^'"]*"/
#=> ["class=\"people\""]
'Kadıköy^Chalcedon^Χαλκηδών'.scan /[^A-Za-z]/
#=> ["ı", "ö", "^", "^",
# "Χ", "α", "λ", "κ", "η", "δ", "ώ", "ν"]
'Kadıköy^Chalcedon^Χαλκηδών'.scan /[A-Z^a-z]/
#=> ["K", "a", "d", "k", "y", "^", "C", "h",
# "a", "l", "c", "e", "d", "o", "n", "^"]

Generic Character Class Escape

The five literals that do not always match literally inside a character class.
'I know that 3 - 2 is 1'.scan /[a-z]/
#=> ["k", "n", "o", "w", "t", "h", "a", "t", "i", "s"]
'I know that 3 - 2 is 1'.scan /[-az]/
#=> ["a", "-"]
'This is. That is.'.scan /[t.]/ #=> [".", "t", "."]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3^]/ #=> ["^", "3", "^"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[^3]/
#=> ["1", " ", "-", " ", "2", " ", "^", " ", " ", "\\",
# " ", "4", " ", "[", " ", "5", " ", "^", " ", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\^3]/ #=> ["^", "3", "^"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[]3]/ #=> ["3"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3]]/ #=> []
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3\]]/ #=> ["3"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[-24]]/ #=> []
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[-24]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[2-4]/ #=> ["2", "3", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[24-]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[2\-4]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\da-z]/
#=> ["1", "2", "3", "4", "5", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\Sa-z]/
#=> ["1", "-", "2", "^", "3", "\\", "4", "[", "5",
# "^", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\w-]/
#=> ["1", "-", "2", "3", "4", "5", "6"]
'Five $ (dollar) + one gull { *|. ?'.scan /(|)*?.{0}+$/
#=> [[nil]]
'Five $ (dollar) + one gull { *|. ?'.scan /[(|)*?.{+$]/
#=> ["$", "(", ")", "+", "{", "*", "|", ".", "?"]

Posix Character Class

'abc123efg'.scan /[[:digit:]]/ #=> ["1", "2", "3"]
'abc123efg'.scan /[:digit:]/ #=> ["g"]
'abc123efgå '.scan /[[:lower:]]/
#=> ["a", "b", "c", "e", "f", "g", "å"]
'abc123efgå '.scan /[\d\p{L}]/
#=> ["a", "b", "c", "1", "2", "3", "e", "f", "g", "å"]
'abc123efgå '.scan /[[:alnum:]]/
#=> ["a", "b", "c", "1", "2", "3", "e", "f", "g", "å"]

Grouping

'12(3)4'.gsub /(\d)\d/, '¤' #=> "¤(3)4"
'12(3)4'.gsub /\(\d\)\d/, '¤' #=> "12¤"
'1234'.gsub /1\d+/, '¤' #=> "¤"
'1234'.gsub /(1\d)+/, '¤' #=> "¤34"

Capture and Back Reference

'abcde' =~ /a((bc)((d)e))/ #=> 0
$1 #=> "bcde"
$2 #=> "bc"
$3 #=> "de"
$4 #=> "d"
'Rūmiyyat al-kubra'.sub /(.)\1/, "¤"
#=> "Rūmi¤at al-kubra"
'You may may do that'.sub /(\S+)\s\1/, "¤"
#=> "You ¤ do that"
'abcde' =~ /a((?:bc)((d)e))/ #=> 0
$1 #=> "bcde"
$2 #=> "de"
$3 #=> "d"
'12/31/1999'.sub %r!(\d\d)/(\d\d)/(\d{4})!, '\3-\1-\2'
#=> "1999-12-31"
'aba'.match /(\2b|(a)){2}/ #=> nil
'aab'.match /(\2b|(a)){2}/ #=> #<MatchData "aab">
'abb'.match /(a)(b\k<-2>)/ #=> nil
'aba'.match /(a)(b\k<-2>)/ #=> #<MatchData "aba">

Named groups

'12/31/1999'.sub \
%r!(?<month>\d\d)/(?<day>\d\d)/(?<year>\d{4})!,
'\k<year>-\k<month>-\k<day>'
#=> "1999-12-31"
'You may may do that'.sub /(?<word>\S+)\s\k<word>/, '\k<word>'
#=> "You may do that"
csharp> Regex regex = new Regex("v(?'letter'[aeiouy])|" +
"c(?'letter'[b-df-hj-np-tv-xz])");
csharp> regex.Match("va");
va
csharp> regex.Match("vb");
csharp> regex.Match("ca");
csharp> regex.Match("cb");
cb
csharp> Regex.Replace("ab", "(?<first>.)(.)",
"Group 1 is '$1'");
"Group 1 is 'b'"
ruby> 'ab'.sub(/(?<first>.)(.)/) { "Group 1 is '#{$1}'" }
#=> "Group 1 is 'a'"

Atomic Groups

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab'.scan /(?>a+a+)+b/
#=> ["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab'.scan /(a+a+)+b/
#=> [["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.scan /(?>a+a+)+b/
#=> []
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.scan /(a+a+)+b/
#=> []
'123° 456 789°'.scan /\d+°/ #=> ["123°", "789°"]
'123° 456 789°'.scan /(?>\d+)°/ #=> ["123°", "789°"]
'123° 456 789°'.scan /(?>.+)°/ #=> []

Anchors

'This to the fair Critias.'.scan /[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /^[A-Z][a-z]+/
#=> ["This"]
"This to the fair\nCritias.".scan /^[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /[A-Z][a-z]+$/
#=> []
"This to the fair Critias\n.".scan /[A-Z][a-z]+$/
#=> ["Critias"]
'This to the fair Critias.'.scan /^[A-Z][a-z]+$/
#=> []
'This to the fair Critias.'.scan /[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /\A[A-Z][a-z]+/
#=> ["This"]
"This to the fair\nCritias.".scan /\A[A-Z][a-z]+/
#=> ["This"] — no Critias match after line break
'This to the fair Critias.'.scan /[A-Z][a-z]+\Z/
#=> []
"This to the fair Critias\n.".scan /[A-Z][a-z]+\Z/
#=> [] — no Critias match before line break
'This to the fair Critias.'.scan /\A[A-Z][a-z]+\Z/
#=> []
'This to the fair Critias.'.scan /\w+/
#=> ["This", "to", "the", "fair", "Critias"]
'This to the fair Critias.'.scan /t\w+/
#=> ["to", "the", "tias"]
'This to the fair Critias.'.scan /\bt\w+\b/
#=> ["to", "the"]

Lookarounds

“This clock-shaped machine solves the slash-slash problem. The square window at the top of the machine reads a tape — from left to right — consisting of the input string, i.e., the C code. We have a state machine of type Nondeterministic Finite Automaton (NFA). To visualize and represent the automaton, we use a transition graph, in which the vertices represent states and the edges represent transitions. The initial state is identified by an incoming unlabeled arrow not originating at any vertex. The acceptance state is surrounded by a circle. This is the graph that you see printed on the clock dial. Now, whenever a new symbol is read from the tape, the clock dial rotates so that the drooping peak, just below the reading window, points at the current state. Ah, and this particular machine also has a special lookahead feature: it’s a long arm with an eye in the end and a light bulb. This eye can look ahead and tell if there’s any double slash. If there is, the bulb will glow and the machine will understand that it doesn’t matter if the input ends in a pair of parentheses — it’s in a comment anyway.” Source: De Morgan to the Rescue.
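
The lookahead arm in the description can be imitated with a negative lookahead. The sketch below is one hypothetical reading of the slash-slash problem: a line qualifies only if it ends in a pair of parentheses and contains no double slash anywhere. The constant name ENDS_IN_CALL and the sample inputs are assumptions and are not taken from the cited source.

```ruby
# (?!.*//) looks ahead from the start of the line and fails if "//" occurs
# anywhere; only then does the rest of the pattern check for a trailing "()".
ENDS_IN_CALL = %r{\A(?!.*//).*\(\)\z}

ENDS_IN_CALL.match?('run()')          #=> true
ENDS_IN_CALL.match?('x = 1 // run()') #=> false
ENDS_IN_CALL.match?('// run()')       #=> false
ENDS_IN_CALL.match?('run();')         #=> false
```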
'Cheese slicer invented by carpenter Thor'.scan \
/\b\w+?\sThor/
#=> ["carpenter Thor"] -- too bad, Thor included
'Cheese slicer invented by carpenter Thor'.scan \
/\b\w+?(?=\s+Thor\b)/
#=> ["carpenter"] -- yes, Thor excluded from match
'The cheese slicer invented by carpenter Thor'.scan /c\w+/
#=> ["cheese", "cer", "carpenter"]
'The cheese slicer invented by carpenter Thor'.scan /\sc\w+/
#=> [" cheese", " carpenter"]
'The cheese slicer invented by carpenter Thor' \
.scan /(?<=\s)c\w+/
#=> ["cheese", "carpenter"] – preceded by whitespace
'The cheese slicer invented by carpenter Thor' \
.scan /(?<!\s)c\w+/
#=> ["cer"] – starting with c, but no whitespace before

Afterword
