Regular Expressions—The Full Story

Introduction

PART I: The Automaton

Transition Diagram

Two-up is a traditional Australian casino game.
  • States: The small circles. We’re jumping from state to state. The only thing that’s certain is that at a particular time, exactly one state is the current one. The state with a small arrow pointing at it is the start state. All states surrounded by a ring are accept states.
  • Alphabet: Only two symbols exist in this alphabet: Heads and Tails. Now, you might wonder why they’re drawn multiple times. It’s because the same symbol can be part of many transitions.
  • Transitions: The long arrows between states are called transitions. They’re armed with at least one symbol from the alphabet. Transitions are the rules for how we can travel, e.g., we may go from the start state to the accept state in two transitions if we first get a tails, then another tails.

Alphabet

Four different alphabets: coin, English lowercase, binary digits, and two dots.
  • A coin alphabet comprising the symbols H (heads), T (tails).
  • The English lowercase alphabet comprises the symbols a, b, c, … z.
  • The binary digits alphabet comprises the symbols 𝟶 and 𝟷.
  • The ASCII alphabet comprises 128 symbols (characters), including some that are not printable.
  • A Five-Emojis alphabet comprising the symbols 😜, 🤔, 🙄, 🤭, 🙃.
  • The Unicode alphabet comprises more than 100,000 symbols (characters) from hundreds of scripts.
  • HTTHHT from the coin alphabet.
  • artichoke and grxsttt from the English lowercase alphabet.
  • 𝟶𝟷𝟷𝟶𝟷, 𝟷𝟶, and 𝟷 from the binary digits alphabet.
  • 4$, a@a, and BIG from the ASCII alphabet.
  • 😜🤔🤔, 🙄🤔, and 🙄🤭🙃 from the Five-Emojis alphabet.
  • fänkål, apple, and teşekkürler from the Unicode alphabet.
Four different glyphs.
Every symbol has a unique index number.

States and Transitions

  • Tickets are symbols.
  • A trip with departure and arrival in the same country is a loop.
  • The countries are the states of the automaton.
  • The ticket booklet is a string.
  • Your home country is the start state.
  • The countries where you have permanent residence are accept states.
  • A successful tour is a string (of symbols from your alphabet) that the automaton accepts.
  • A failed tour is a string that the automaton doesn’t accept.
  • The trips are termed transitions, and a tour is termed a walk.

Transitions Table

Finite automaton: all strings that don’t end with two blue dots.
The intersection of the state B and the white dot symbol says: “Go to state A.”

Nondeterministic Finite Automaton

  1. If we’re in a specific state and have a specific upcoming input symbol, then we specify a set of possible transitions.
  2. We have spontaneous transitions — denoted as ε — shifting from one state to another, without consuming any input symbol!
  3. The sets in (1) can be empty. When we’re in a specific state and have a specific symbol, we sometimes lack any possible transition away from there.
NFA and DFA: all strings in which no white dot can ever be preceded by a blue dot
NFA and DFA: any string that ends with a sequence of first one white, then one blue dot.
Venn diagram: All DFAs are NFAs. Some NFAs are DFAs.

Greedy and Reluctant

Automaton for Two-up. Symbol H represents heads. Symbol T represents tails.
  • With the greedy strategy, we try — as long as possible — to be in a self-transitioning loop of the current state. In this strategy, each state is greedy and tries to consume as much as possible from the input string.
  • With the reluctant strategy, we try — as quickly as possible — to transition from the current state to another state, ultimately to arrive early in an accept state. Each state is reluctant to consume more of the input string and, if possible, tries to pass the initiative to the next state instead.

NFA to DFA

  1. Only the integer zero can start with the digit 𝟶.
  2. The integer zero can’t be preceded by a plus or minus sign.
Nondeterministic finite automaton.
Deterministic finite automaton.
  • Transitioning in NFAs may include alternatives, but in DFAs the path is deterministic.
  • All DFAs are NFAs, but all NFAs are not DFAs.
  • We have an algorithm to transform any given NFA to a DFA.

Pushdown Automata

  • First Step Option 1: Read a symbol from the input string.
  • First Step Option 2: Pop a symbol from the top of the stack.
  • First Step Option 3: Read a symbol AND pop a symbol.
  • Second Step Option 1: Initiate a state transition.
  • Second Step Option 2: Push a symbol onto the top of the stack.
  • Second Step Option 3: Initiate a state transition AND push a symbol.
Pushdown automaton.
  1. Read the blue dot and make a transition from the start state, through the third stack icon from the top — i.e., a blue dot is pushed to the stack — and back to the start state.
  2. Read the white dot and make a transition from the start state, through the first stack icon — i.e., a blue dot is popped from the stack — and back to the start state. Now the stack is empty.
  3. Read the white dot and make a transition from the start state, through the second stack icon from the top — i.e., a white dot is pushed to the stack — and back to the start state.
  4. Read the blue dot and make a transition from the start state, through the fourth stack icon — i.e., a white dot is popped from the stack — and to the accept state. Now the stack is empty again, and all the input is read. It’s a match.

PART II: Two Operations and One Function

History of Regular Expressions

The regular expressions pioneers.

Match One Character

  • The binary alphabet {0, 1} contains only two symbols: 1 and 0.
  • The American Standard Code for Information Interchange (ASCII) contains 128 symbols, only 95 of which are printable. The others, such as Backspace (ASCII 8), are also symbols in our definition.
  • The set of 95 printable symbols in ASCII is also an alphabet.
  • Unicode contains over 100,000 symbols that include Arabic, Chinese, Latin, and many other types. However, even if it’s a very large number, it’s still a finite number of symbols. Therefore, Unicode is an alphabet.
  • The set of the Cyrillic symbols in Unicode is an alphabet.
  • We can construct an alphabet comprising a blue dot and a white dot, i.e., an alphabet comprising two symbols, just like the binary alphabet.
  • 0110 and 1001 are two strings from the binary alphabet.
  • The strings kadikoy and uskudar are constructed with symbols from the ASCII alphabet.
  • футбол is a string from the Cyrillic alphabet.
  • اللوز is a string from the Arabic alphabet.
  • From any alphabet, it’s possible to create an empty string, i.e., a string comprising zero symbols. By convention, we denote that string with the character ε.
  • All binary numbers ending in 0, e.g., 10, 100, 110, etc.
  • All palindromes — strings that are the same forward and backward — constructed from ASCII symbols.
  • All strings with an even number of symbols from Unicode.
  • A finite set {dog, cat, bird}.
  • An empty set, i.e., a language with no strings. Such an alphabet is by convention denoted as Ø.
  • The language {ε}, which comprises only one symbol: the empty string ε. (Note that {ε} is very different from Ø, for example in string cardinality.)
'a'.match /a/ #=> #<MatchData "a">
# string a matched by regex /a/
''.match /a/ #=> nil
# empty string not matched by /a/
'b'.match /a/ #=> nil
# string b not matched by /a/
'a'.match // #=> #<MatchData "">
# string a not matched by empty regex ε

Four More Rules

  • For each regular expression, we may construct at least one DFA and at least one NFA, so that all three (regular expression, DFA, and NFA) solve the same problem.
  • For every finite automaton — deterministic (DFA) as well as non-deterministic (NFA) — we may write a regular expression, so that both (automaton and regular expression) solve the same problem.

Operation №1: Concatenation

'moda'.match /m/ #=> #<MatchData "m">
'moda'.match /o/ #=> #<MatchData "o">
'moda'.match /mo/ #=> #<MatchData "mo"> m×o is mo
'moda'.match /da/ #=> #<MatchData "da"> d×a is da
'moda'.match /moda/ #=> #<MatchData "moda"> mo×da is moda
'moda'.match /mado/ #=> nil -- mado is not moda
  • A prefix is the substring we leave if we remove zero or more symbols from the end of a string. The strings “m”, “mo”, “mod”, and “moda” are all prefixes of the string “moda”. Even the empty string ε is a prefix of “moda”.
  • A suffix is the substring that remains if we remove zero or more characters from the beginning of the string. The strings “moda”, “oda”, “da”, “a” and ε are all possible suffixes of the string “moda”.
  • A substring is what we have left if we remove a prefix and a suffix from a string. Note that even ε is a valid prefix and suffix. Symbols in the substrings must be consecutive in the original string. The strings “od” and “moda”, but not “mda”, are substrings of “moda”.
'aaa'.match /aaa/ #=> #<MatchData "aaa">
'aaa'.match /a{3}/ #=> #<MatchData "aaa">
# yes, the string includes 3 concatenated a
'aaa'.match /a{4}/ #=> nil
# no, the string doesn't include 4 a
'aa'.match /a?/ #=> #<MatchData "a">
# optional match, written as question mark
'b'.match /a?/ #=> #<MatchData "">
# zero repeats of a matches empty string
'aa'.match /a{,2}/ #=> #<MatchData "aa"> at least two a
'aa'.match /a{1,2}/ #=> #<MatchData "aa">
# at least one a and at most two a
'a'.match /a{1,2}/ #=> #<MatchData "a">

Operation №2: Alternation

Venn diagram showing exclusive disjunction (XOR).
'a'.match /a|b/ #=> #<MatchData "a"> a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> leftmost chosen
'ba'.match /a|b/ #=> #<MatchData "b"> leftmost chosen
'c'.match /a|b/ #=> nil -- here we found neither a nor b
'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">
'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">
'10'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'moda'.match /moda|/ #=> #<MatchData "moda">
# either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData "">
# either mado (not moda) or nothing is nothing

The Function: Kleene Star

'110'.match /(0|1)*0(0|1)*/ #=> #<MatchData "110">
# all strings with at least one zero
'1111'.match /(0|1)*0(0|1)*/ #=> nil
'1001'.match /((10*1)|0*)*/ #=> #<MatchData "1001">
# all strings with an even number of ones
'11001'.match /((10*1)|0*)*/ #=> #<MatchData "1100">
''.match /((10*1)|0*)*/ #=> #<MatchData "">
# even empty string has even number of ones
'1001'.match /((10*1)|0)|/ #=> #<MatchData "1001">
# again 'an even number of ones'
'11001'.match /((10*1)|0)|/ #=> #<MatchData "11">
''.match /((10*1)|0)|/ #=> #<MatchData "">
'1'.match /((10*1)|0)|/ #=> #<MatchData "">
'010'.match /((10*1)|0)|/ #=> #<MatchData "0">
'01'.match /((10*1)|0)|/ #=> #<MatchData "0">

Precedence

'fifth row'.match /third|fifth row/
#=> #<MatchData "fifth row">
'third row'.match /third|fifth row/
#=> #<MatchData "third">
'fifth row'.match /(third|fifth) row/
#=> #<MatchData "fifth row">
'third row'.match /(third|fifth) row/
#=> #<MatchData "third row">
'third row'.match /(third|(four|fif)th) row/ 
#=> #<MatchData "third row">
'fourth row'.match /(third|(four|fif)th) row/
#=> #<MatchData "fourth row">
'fifth row'.match /(third|(four|fif)th) row/
#=> #<MatchData "fifth row">
  • Operator precedence is an ordered list of all operators. It tells us which operator must be executed before another operator in a regular expression. Several operators can have the same priority. In mathematics, the terms inside parentheses have the highest priority. Multiplication and division have a lower priority, and addition and subtraction have the lowest. This is why 6+6/(2+1) = 8.
  • Operator position indicates where the operands are located in relation to the operator. The position can be prefix, infix, or postfix. If the operator is prefix, then the operand resides to the right of the operator, as the mathematical unary minus sign in, e.g., -3. An infix operator has an operand on each side, as in addition, e.g., 1+2. Finally, a postfix operator stands to the right of its operand, as the exclamation point that represents the faculty operator, e.g., 5!.
  • Operator associativity tells us how to group two operators on the same precedence level. An infix operator can be right-associative, left-associative, or non-associative. In mathematics, the infix operations addition and subtraction have the same precedence. Considering that both are left-associative, the following equation holds: 1–2+3 = (1–2)+3 = 2. Prefix or postfix operators are either associative or non-associative. If they’re associative, we start with the operator closest to the operand. A non-associative operator can’t compete with operators of the same precedence.

Some Examples With *, |, and ×

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">
'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">
'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">
'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">
'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">
'0110101'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/
#=> #<MatchData "0011">

Regular Expressions Are Finite Automata

How to design finite automata for concatenation (top left), alternation (top right), kleene star (bottom left), and a combined pattern (bottom right).
  • The empty string ε is a regular expression corresponding to a finite automaton with a start state, a transition that accepts the empty string ε and leads from the start state to an accept state. We’ll call this an ε-transition.
  • The empty set Ø is equivalent to a regular expression that can’t match any single string — not even the empty string ε. It’s the same as a two-state automaton, with no single transition. One state is the start state, and the other is the accept state. Unfortunately, they’re not linked.
  • A regular expression that only matches the symbol 𝚋 corresponds to a finite automaton with two states: start and accept. There’s a transition from start to accept, and this transition only accepts the symbol 𝚋.
  • Concatenation of 𝚙 and 𝚚 means that we first match a string with 𝚙, directly followed by a string matched with 𝚚. To create this finite automaton, we first add ε-transitions from every accept state in 𝚜 to the start state of 𝚝, then we remove the accept status from all accept states in s. We also withdraw the start status of the start state in 𝚝, i.e., 𝚜 cannot accept a string, and 𝚝 cannot start reading. However, 𝚜 and 𝚝 now are connected, and we may start in 𝚜 and accept in 𝚝. In the top left graph above, this is the method we use to transform the two regular expressions, /○/ and /●/, into /○●/.
  • Alternation of 𝚙 and 𝚚, i.e., 𝚙|𝚚, is like a finite automaton with a new start state that has ε-transitions to all start states of 𝚜 and 𝚝. The new finite automaton also has a new accept state reached with ε-transitions from all accept states of 𝚜 and 𝚝. Thus, the former start and accept states of 𝚜 and 𝚝 aren’t start and accept states in our new automaton. In this way, we transform the two regular expressions, /○/ and /●/, into /○|●/ in the top right graph.
  • Kleene star is the concatenation closure. Assume that 𝚙 = 𝚚*. Then 𝚜 is the finite automaton that we get when we take 𝚝 and add two states and four transitions as follows: One new state is the start state, and the other is an accept state. All accept states of 𝚜 lose their accept status and instead get an ε-transition to the new accept state. We add two ε-transitions from the new initial state — one to the old start state and one to the new accept state. Furthermore, we insert one ε-transition from each of the old accept states to the old start state. Using this method helps us transform the regular expression /●/ into /●*/ in the lower left graph.

Traits

Architecture

  1. Compile a regex pattern.
  2. Feed input into an application.
  3. Iterate the matched results.
fa = Regexp.compile('10*') #=> /10*/
rs = '1 100 00 10'.scan fa #=> ["1", "100", "10"]
rs.each { |x| puts x }
# 1
# 100
# 10
#=> ["1", "100", "10"]
'1 100 00 10'.scan(/10*/) { |x| puts x }
# 1
# 100
# 10
#=> "1 100 00 10"

Functions

Verify (e.g. a card number), find, replace, filter, and parse are five regex applications.
/^ab*$/ === 'abbb' #=> true
/^ab*$/ === 'baaa' #=> false
'ab b abb a ba'.scan /ab*/ #=> ["ab", "abb", "a", "a"]
'ab b abb a ba'.gsub(/ab*/, '¤') #=> "¤ b ¤ ¤ b¤"
'ab b abb a ba'.gsub(/ab*/, '') #=> " b   b"
'ab b abb a ba'.split /ab*/ #=> ["", " b ", " ", " b"]

PART III: Syntactic Sugar, Abstractions, and Extensions

Quantifiers

'caalery'.sub /a*/,'e' #=> "ecaalery"
'caalery'.sub /a+/,'e' #=> "celery"
'chickpea chicken chickpeas'.scan /chickpeas?/
#=> ["chickpea", "chickpeas"]
'LaLaLaLaLaLa'.sub /(La){1,4}/, 'oh'
#=> "ohLaLa"
'ohLaLaLaLaLaLa'.sub /(La){,4}/, 'oh'
#=> "ohohLaLaLaLaLaLa"
'OhLaLaLaLaLaLa'.sub /(La){,4}/, 'oh'
#=> "ohOhLaLaLaLaLaLa"
'LaLaLaLaLaLa'.sub /(La){1,}/, 'oh'
#=> "oh"
'LaLaLaLaLaLa'.sub /(La){2}/, 'oh'
#=> "ohLaLaLaLa"
  • Concatenation: /𝚊𝚋{𝟷, 𝟸}/ equals /𝚊(𝚋{𝟷, 𝟸})/
  • Concatenation: /𝚊{𝟷, 𝟸}𝚋/ equals / (𝚊{𝟷, 𝟸})𝚋/
  • Alternation: /𝚊|𝚋{𝟷, 𝟸}/ equals /𝚊|(𝚋{𝟷, 𝟸})/
  • Alternation: /𝚊{𝟷, 𝟸}|𝚋/ equals /(𝚊{𝟷, 𝟸})|𝚋/

Quantifier Equations

  • 𝚊{𝟹} equals 𝚊{𝟹, 𝟹}
  • 𝚊* equals 𝚊{𝟶, }
  • 𝚊+ equals 𝚊{𝟷, }
  • 𝚊? equals 𝚊{𝟶, 𝟷}
  • (𝚊*)* equals 𝚊*
  • (𝚊+)+ equals 𝚊+
  • (𝚊?)? equals 𝚊?
  • (𝚊*)+ equals (𝚊+)* equals 𝚊*
  • (𝚊*)? equals (𝚊?)* equals 𝚊*
  • (𝚊+)? equals (𝚊?)+ equals 𝚊*
  • 𝚊* equals (𝚊+|)
  • 𝚊? equals (𝚊|)? equals (𝚊|)
  • (𝚊|)* equals 𝚊*

Reluctant Quantifiers

Greedy (left) and reluctant (right) kleene star.
'<div>a</div><span>c</span><div>b</div>'.scan /<div>.*<\/div>/
#=> ["<div>a</div><span>c</span><div>b</div>"]
  • Positive closure function with reluctant modifier: at least one, as few as possible: +?
  • Optional quantifier function with reluctant modifier: zero or one, preferably zero: ??
  • Generic quantifier function with reluctant modifier: between three and five and as few as possible: {𝟹, 𝟻}?
'<div>a</div><span>c</span><div>b</div>' \
.scan /<div>.*?<\/div>/
#=> ["<div>a</div>", "<div>b</div>"]
'aa'.match /a?/ #=> #<MatchData "a">
'aa'.match /a??/ #=> #<MatchData "">
'aaaaa'.match /a{2,4}/
#=> #<MatchData "aaaa">
# at least 2, at most 4, as much as possible
'aaaaa'.match /a{2,4}?/
#=> #<MatchData "aa">
# at least 2, at most 4, as little as possible
'aaaaa'.match /a{2,}/ #=> #<MatchData "aaaaa">
'aaaaa'.match /a{2,}?/ #=> #<MatchData "aa">
'aaaaa'.match /a{,4}/ #=> #<MatchData "aaaa">
'aaaaa'.match /a{,4}?/ #=> #<MatchData "">

Possessive Quantifier

'b'.sub /a?+b/, '¤' #=> "¤"
'b'.sub /a?b/, '¤' #=> "¤"
'b'.sub /.?+b/, '¤' #=> "b"
'b'.sub /.?b/, '¤' #=> "¤"
'ab'.sub /.?+b/, '¤' #=> "¤"
'ab'.sub /.?b/, '¤' #=> "¤"
'b'.sub /a*+b/, '¤' #=> "¤"
'b'.sub /a*b/, '¤' #=> "¤"
'b'.sub /.*+b/, '¤' #=> "b"
'b'.sub /.*b/, '¤' #=> "¤"
'ab'.sub /.*+b/, '¤' #=> "ab"
'ab'.sub /.*b/, '¤' #=> "¤"
'b'.sub /a++b/, '¤' #=> "b"
'b'.sub /a+b/, '¤' #=> "b"
'b'.sub /.++b/, '¤' #=> "b"
'b'.sub /.+b/, '¤' #=> "b"
'ab'.sub /.++b/, '¤' #=> "ab"
'ab'.sub /.+b/, '¤' #=> "¤"
ruby> 'aab'.sub /.?+b/, '¤' #=> "a¤"
ruby> 'aab'.sub /.{0,1}+b/, '¤' #=> "¤"
# Warning: nested repeat operators '?' and '+'
# were replaced with '*' in regular expression: /.{0,1}+b/
scala> ".{0,1}+b".r.replaceAllIn("aab", "¤")
res3: String = a¤
scala> ".?+b".r.replaceAllIn("aab", "¤")
res4: String = a¤

Literal vs. Metacharacters

The twelve literals who doesn‘t match literally.
'kadıköy karaköy köyun'.scan /köy/
#=> ["köy", "köy", "köy"]
  • Caret ^ and dollar $ assert a position at the beginning or the end.
  • Left and right parantheses ( and ) start and end a capturing group.
  • Asterisk *, plus +, questionmark ?, and left brace { repeat the preceding subexpression.
  • Period . matches any character except line breaks.
  • Left square bracket [ starts a character class.
  • Backslash \ lets a metacharacter be matched literally.
  • Vertical line | is an alternation.
'Sentence.'.scan /\./
#=> ["."]
'Sentence.'.scan /./
#=> ["S", "e", "n", "t", "e", "n", "c", "e", "."]
user_input1 = 'first'
#=> "first"
regex = Regexp.compile('id="' + user_input1 + '"')
#=> /id="first"/
'<span name="secret" id="first"/>'.scan regex
#=> ["id=\"first\""]
user_input2 = '|name=".*?"|'
#=> "|name=\".*?\"|"
regex = Regexp.compile('id="' + user_input2 + '"')
#=> /id="|name=".*?"|"/
'<span name="secret" id="first"/>'.scan regex
#=> ["name=\"secret\"", "id=\"", "\""]
user_input2 = '|name=".*?"|'
#=> "|name=\".*?\"|"
regex = Regexp.compile('id="' + Regexp.escape(user_input2) + '"')
#=> /id="\|name="\.\*\?"\|"/
'<span name="secret" id="first"/>'.scan regex
#=> []
perl -e 'print "match" if "2.71828" =~ /\Q2.71\E/'
#=> match
perl -e 'print "match" if "2-71828" =~ /\Q2.71\E/'
#=> nothing
perl -e 'print "match" if "2-71828" =~ /2.71/'
#=> match

Character Code Points

  • Grapheme: The smallest units of meaning in written language, e.g., letters or numbers, are called graphemes. A single grapheme can be visualized in different ways, i.e., the grapheme for the number seven can be crossed or not crossed. Thus, grapheme is something semantic, but not anything visual.
  • Glyph: The different ways we can visualize a grapheme on paper or on screen are called glyphs. A glyph is a shape associated with an abstract character, e.g., the grapheme for the Ohm sign may have a glyph that looks like this: Ω. Several different glyphs may represent the same grapheme.
  • Code Point: A code point normally is assigned to an abstract character, which is a level above encodings of bits and bytes. Think of it as an index in an array of characters. In Unicode, the Ohm sign code point is U+2126, which is hexadecimal. Thus, the decimal index is 8486.
  • Character encoding: A pairing of codes, e.g., bitmaps or natural numbers and characters is called character encoding. The most popular encodings are UTF-8, UTF-16, and UCS-2. Unicode can be implemented with different encodings. The ohm sign is encoded as 𝟶𝚡𝙴𝟸 𝟶𝚡𝟾𝟺 𝟶𝚡𝙰𝟼 in UTF-8 and as 𝟶𝚡𝟸𝟷𝟸𝟼 in UTF-16.
'a ; a - a , Ω . a : A'.scan /Ω/ #=> ["Ω"]
'a ; a - a , Ω . a : A'.scan /\u2126/ #=> ["Ω"]
  • Python: \𝚡𝙳𝙵
  • Perl: \𝚡𝙳, \𝚡𝙳𝙵, and \𝚡{𝟸𝟷𝟸𝟼}
  • Java and C#: \𝚡𝙳𝙵 and \𝚞𝟸𝟷𝟸𝟼
'123 ABC'.scan /\101/ #=> ["A"]
'123 ABC'.scan /\63/ #=> ["3"]
'123 ABC'.scan /\063/ #=> ["3"]
'123 45
6'.scan /\d\cJ\d/ #=> ["5\n6"]

Character Aliases

  • \𝚊 — Sound alert, ASCII index 0x07
  • \𝚎 — Escape-character, ASCII index 0x1B (Visible as \.)
  • \𝚏 — Form feed, ASCII index 0x0C
  • \𝚗 — Newline, ASCII index 0x0A on Windows/Linux and 0x0D on OSX
  • \𝚛 — Carriage return, ASCII index 0x0D on Windows/Linux and 0x0A on OSX
  • \𝚁 — Any line break and also CRLF as a combo
  • \𝚝 — Horizontal tab, ASCII index 0x09
  • \𝚟 — Vertical tab, ASCII index 0x0B
'123 45
6'.scan /\d\n\d/ #=> ["5\n6"]
'123 45
6'.scan /\d\r\d/ #=> []
"a\nb a\r\nb a\n\rb a\rb".scan /a\rb/ #=> ["a\rb"]
"a\nb a\r\nb a\n\rb a\rb".scan /a\nb/ #=> ["a\nb"]
"a\nb a\r\nb a\n\rb a\rb".scan /a\Rb/
#=> ["a\nb", "a\r\nb", "a\rb"]

The period symbol

The period symbol matches any character except line breaks.
'mama 2 ##'.gsub /a|2|#/, '¤' #=> "m¤m¤ ¤ ¤¤"
'mama 2 ##'.gsub /./, '¤' #=> "¤¤¤¤¤¤¤¤¤"
  1. The character class period . and the kleene star function * are together and separately the most abused features of regex. If you use them frequently without thinking, you’ll often end up with overly general regexes — sometimes they are even incorrect. Every time you intend to write period ., asterisk *, or even the combo .*, you first may want to consider whether you really mean something more specific.
  2. Most regex books, including the most popular ones, are unclear or even entirely incorrect, as they claim that “period symbol matches any character.” In most cases, the period symbol matches “any character except line breaks,” which is a very, very important difference.
"grey gr y gr\ny gr\ry gray".scan /gr.y/
#=> ["grey", "gr y", "gr\ry", "gray"]
"grey gr y gr\ny gray gr\ry".scan /gr.y/
#=> ["grey", "gr y", "gray", "gr\ry"]
"grey gr y gr\ny gray gr\ry".scan /gr.y/m
#=> ["grey", "gr y", "gr\ny", "gray", "gr\ry"]
JavaScript> 'grey gr y gr\ny gray gr\ry'
.match(/gr.y/g);
[ 'grey', 'gr y', 'gray' ]
JavaScript> 'grey gr y gr\ny gray gr\ry'
.match(/gr[\s\S]y/g);
[ 'grey', 'gr y', 'gr\ny', 'gray', 'gr\ry' ]
  • Time should always include hours and minutes, sometimes even seconds.
  • Hours, minutes, and seconds should always be written with two digits.
  • We don’t have to ignore impossible numbers, such as minute 61.
  • In between hours, minutes, and seconds, there should be one of the separators period . or colon :.
'12:34 09.00 24.56.33'.scan /(\d\d.\d\d(.\d\d)?)/
#=> [["12:34 09"], ["00 24.56"]]
'12:34 09.00 24.56.33'.scan /(\d\d[.:]\d\d([.:]\d\d)?)/
#=> [["12:34"], ["09.00"], ["24.56.33"]]

Shrthnd

  • Period symbol . matches any character except line breaks.
  • \𝚜 matches any whitespace character, such as tabs, line breaks, and spaces.
  • \𝚂 matches any character that’s not a whitespace; it’s the complement of \𝚜.
  • \𝚍 matches any digit, e.g., 3 or 9.
  • \𝙳 matches any character that isn’t a digit; it’s the complement of \𝚍.
  • \𝚠 means word character and is normally equivalent to the character class [𝚊-𝚣𝙰-𝚉𝟶-𝟿_], but some regex dialects also include letters and digits from other scripts, e.g., å or ñ.
  • \𝚆 matches any character that’s not matched by \𝚠.
'L8 love 2 u 4-ever'.scan /\d/ #=> ["8", "2", "4"]
'123def789 müde'.scan /\d/
#=> ["1", "2", "3", "7", "8", "9"]
'123def789 müde'.scan /\D/
#=> ["d", "e", "f", " ", "m", "ü", "d", "e"]
'123def789 müde'.scan /\w/
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# "m", "d", "e"]
'123def789 müde'.scan /\W/ #=> [" ", "ü"]
'123def789 müde'.scan /\s/ #=> [" "]
'123def789 müde'.scan /\S/
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# "m", "ü", "d", "e"]
'123def789 müde'.scan /./
#=> ["1", "2", "3", "d", "e", "f", "7", "8", "9",
# " ", "m", "ü", "d", "e"]
"gr\ny".scan /./ #=> ["g", "r", "y"]
"gr\ny".scan /\s|\S/ #=> ["g", "r", "\n", "y"]

Unicode Categories, Scripts, and Blocks

  • Other: \𝚙{𝙲}
  • Letter: \𝚙{𝙻}
  • Mark: \𝚙{𝙼}
  • Number: \𝚙{𝙽}
  • Punctuation: \𝚙{𝙿}
  • Symbol: \𝚙{𝚂}
  • Separator: \𝚙{𝚉}
'a; a-a, Ω. a: A€₺¥'.scan /\p{L}/
#=> ["a", "a", "a", "Ω", "a", "A"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{P}/
#=> [";", "-", ",", ".", ":"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{S}/
#=> ["€", "₺", "¥"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Z}/
#=> [" ", " ", " ", " "]
  • Letter, Lowercase \𝚙{𝙻𝚕}.
  • Punctuation, Dash \𝚙{𝙿𝚍}.
  • Symbol, Currency \𝚙{𝚂𝚌}.
'a; a-a, Ω. a: A€₺¥'.scan /\p{Ll}/
#=> ["a", "a", "a", "a"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Lu}/
#=> ["Ω", "A"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Sc}/
#=> ["€", "₺", "¥"]
'a; a-a, Ω. a: A€₺¥'.scan /\p{Sm}/
#=> []
'a; a-a, Ω. a: A€'.scan /\p{Greek}/
#=> ["Ω"]
'a; a-a, Ω. a: A€'.scan /\p{Latin}/
#=> ["a", "a", "a", "a", "A"]
  • U+0600…U+06FF == \𝚙{𝙸𝚗𝙰𝚛𝚊𝚋𝚒𝚌}
  • U+0000…U+007F == \𝚙{𝙸𝚗𝙱𝚊𝚜𝚒𝚌_𝙻𝚊𝚝𝚒𝚗}
'a; a-a, Ω. a: A€'.scan /\P{Ll}/
#=> [";", " ", "-", ",", " ", "Ω",
# ".", " ", ":", " ", "A", "€"]
'a; a-a, Ω. a: A€'.scan /\P{Latin}/
#=> [";", " ", "-", ",", " ", "Ω",
# ".", " ", ":", " ", "€"]
'a; a-a, Ω. a: A4€'.scan /[\p{L}\p{N}]/
#=> ["a", "a", "a", "Ω", "a", "A", "4"]
perl -e '"gr\ny" =~ /\X/; print $&' 
# g
perl -e '"gr\ny" =~ /\X+/; print $&'
# gr
# y

Generic Character Class

Alternation (top left), character class (bottom left), character class range (top right), and negated character class (bottom right).
'istanbul constantinople'.scan /a|e|i|o|u|y/
#=> ["i", "a", "u", "o", "a", "i", "o", "e"]
'istanbul constantinople'.scan /[aeiouy]/
#=> ["i", "a", "u", "o", "a", "i", "o", "e"]
'12d343ea3'.scan /[abcdef]/ #=> ["d", "e", "a"]
'12d343ea3'.scan /[a-f]/ #=> ["d", "e", "a"]
'john.smith@company.com'.scan /[7-d]/
#=> ["@", "c", "a", "c"]
  • ‘7’ has code point 55 in Unicode.
  • ’@’ has code point 64 in Unicode.
  • ‘d’ has code point 100 in Unicode.
'john.smith@company.com'.scan /[j7-dh]/
#=> ["j", "h", "h", "@", "c", "a", "c"]
'pera-beyoğlu'.scan /[a-z]/
#=> ["p", "e", "r", "a", "b", "e", "y", "o", "l", "u"]
'pera-beyoğlu'.scan /[-az]/ #=> ["a", "-"]
'pera-beyoğlu'.scan /[az-]/ #=> ["a", "-"]
'quick2 ball5 good4you1 money1'.scan /[a-z]+[0-9]/
#=> ["quick2", "ball5", "good4", "you1", "money1"]
'quick2 ball5 good4you1 money1'.scan /[a-z0-9]+[0-9]/
#=> ["quick2", "ball5", "good4you1", "money1"]

Generic Character Class Negated and Tweaked

'<span rel="info" class="people">'.scan /class=".*"/
#=> ["class=\"people\""]
'<span class="people" rel="info">'.scan /class=".*"/
#=> ["class=\"people\" rel=\"info\""]
'<span class="people" rel="info">'.scan /class="[^"]*"/
#=> ["class=\"people\""]
'<span class="people" rel="info">'.scan /class=['"][^'"]*"/
#=> ["class=\"people\""]
'Kadıköy^Chalcedon^Χαλκηδών'.scan /[^A-Za-z]/
#=> ["ı", "ö", "^", "^",
# "Χ", "α", "λ", "κ", "η", "δ", "ώ", "ν"]
'Kadıköy^Chalcedon^Χαλκηδών'.scan /[A-Z^a-z]/
#=> ["K", "a", "d", "k", "y", "^", "C", "h",
# "a", "l", "c", "e", "d", "o", "n", "^"]
  • A character class union is written as a nested character class. For example, [[𝚊-𝚏][𝟶-𝟿]] will match any lower hexadecimal digit.
  • A character class intersection adds the meta-sequence double-ampersand, &&, to find what’s common in the two sets: [\𝚠&&[\𝚍𝚊-𝚣]].
  • Character class subtraction has at least two different syntaxes. In some regex dialects, we type a hyphen, -, before the set we want to subtract. The expression [\da-z[g-z]] will match any lower hexadecimal digit. In other regex dialects, an intersection is written as the main set and the complement of what we want to remove (do you remember De Morgan’s laws?): [\𝚍𝚊-𝚣&&[^𝚐-𝚣]]

Generic Character Class Escape

The five literals which does not always match literally inside a character class.
'I know that 3 - 2 is 1'.scan /[a-z]/
#=> ["k", "n", "o", "w", "t", "h", "a", "t", "i", "s"]
'I know that 3 - 2 is 1'.scan /[-az]/
#=> ["a", "-"]
  • A slash, \, indicates that the next character is a metacharacter, e.g., newline \n or the digit character class shorthand \d.
  • A caret, ^, creates a negative character class — the complement of the characters listed — if the caret is positioned first in the character class.
  • A hyphen, -, is put in between two characters to describe a range, which entails the hyphen is matched literally if positioned first or last in the character class.
  • A right square bracket, ], marks the end of a character class.
  • A left square bracket, [, marks the beginning of a character class union, subtraction, or intersection — if our regex dialect supports that functionality.
'This is. That is.'.scan /[t.]/ #=> [".", "t", "."]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3^]/ #=> ["^", "3", "^"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[^3]/
#=> ["1", " ", "-", " ", "2", " ", "^", " ", " ", "\\",
# " ", "4", " ", "[", " ", "5", " ", "^", " ", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\^3]/ #=> ["^", "3", "^"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[]3]/ #=> ["3"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3]]/ #=> []
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[3\]]/ #=> ["3"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[-24]]/ #=> []
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[-24]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[2-4]/ #=> ["2", "3", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[24-]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[2\-4]/ #=> ["-", "2", "4"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\da-z]/
#=> ["1", "2", "3", "4", "5", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\Sa-z]/
#=> ["1", "-", "2", "^", "3", "\\", "4", "[", "5",
# "^", "6"]
'1 - 2 ^ 3 \ 4 [ 5 ^ 6'.scan /[\w-]/
#=> ["1", "-", "2", "3", "4", "5", "6"]
'Five $ (dollar) + one gull { *|. ?'.scan /(|)*?.{0}+$/
#=> [[nil]]
'Five $ (dollar) + one gull { *|. ?'.scan /[(|)*?.{+$]/
#=> ["$", "(", ")", "+", "{", "*", "|", ".", "?"]

Posix Character Class

  • [:𝚊𝚕𝚗𝚞𝚖:] English alphanumeric characters, equivalent to [𝚊-𝚣𝙰-𝚉𝟶-𝟿] 𝚘𝚛 [\𝚍\𝚙{𝙻𝚕}]
  • [:𝚊𝚕𝚙𝚑𝚊:] English alphabetic characters, equivalent to [𝚊-𝚣𝙰-𝚉] 𝚘𝚛 \𝚙{𝙻}
  • [:𝚊𝚜𝚌𝚒𝚒:] The seven-bit ASCII characters, equivalent to [\𝚡𝟶𝟶-\𝚡𝟽𝙵]
  • [:𝚋𝚕𝚊𝚗𝚔:] Spaces and tabs, but not line breaks, equivalent to [ \𝚝]
  • [:𝚌𝚗𝚝𝚛𝚕:] Control characters, equivalent to [\𝚡𝟶𝟶-\𝚡𝟷𝙵\𝚡𝟽𝙵]
  • [:𝚍𝚒𝚐𝚒𝚝:] Digits, equivalent to \𝚍
  • [:𝚐𝚛𝚊𝚙𝚑:] Visible characters, i.e., not spaces, control characters
  • [:𝚕𝚘𝚠𝚎𝚛:] Lowercase alphabetic characters, equivalent to [𝚊-𝚣] or \𝚙{𝙻𝚕}
  • [:𝚙𝚛𝚒𝚗𝚝:] Same as [:𝚐𝚛𝚊𝚙𝚑:], but including the space characters
  • [:𝚙𝚞𝚗𝚌𝚝:] Punctuation characters
  • [:𝚜𝚙𝚊𝚌𝚎:] Whitespace characters, equivalent to \𝚜
  • [:𝚞𝚙𝚙𝚎𝚛:] Uppercase alphabetic characters, equivalent to [𝙰-𝚉] or \𝚙{𝙻𝚞}
  • [:𝚠𝚘𝚛𝚍:] Word characters, equivalent to \𝚠
  • [:𝚡𝚍𝚒𝚐𝚒𝚝:] Hexadecimal digits, equivalent to [𝙰-𝙵𝚊-𝚏𝟶-𝟿]
'abc123efg'.scan /[[:digit:]]/ #=> ["1", "2", "3"]
'abc123efg'.scan /[:digit:]/ #=> ["g"]
'abc123efgå '.scan /[[:lower:]]/
#=> ["a", "b", "c", "e", "f", "g", "å"]
'abc123efgå '.scan /[\d\p{L}]/
#=> ["a", "b", "c", "1", "2", "3", "e", "f", "g", "å"]
'abc123efgå '.scan /[[:alnum:]]/
#=> ["a", "b", "c", "1", "2", "3", "e", "f", "g", "å"]

Grouping

'12(3)4'.gsub /(\d)\d/, '¤' #=> "¤(3)4"
'12(3)4'.gsub /\(\d\)\d/, '¤' #=> "12¤"
'1234'.gsub /1\d+/, '¤' #=> "¤"
'1234'.gsub /(1\d)+/, '¤' #=> "¤34"
  • Index 1: (𝚋𝚌)((𝚍)𝚎)
  • Index 2: 𝚋𝚌
  • Index 3: (𝚍)𝚎
  • Index 4: 𝚍

Capture and Back Reference

'abcde' =~ /a((bc)((d)e))/ #=> 0
$1 #=> "bcde"
$2 #=> "bc"
$3 #=> "de"
$4 #=> "d"
'Rūmiyyat al-kubra'.sub /(.)\1/, "¤"
#=> "Rūmi¤at al-kubra"
'You may may do that'.sub /(\S+)\s\1/, "¤"
#=> "You ¤ do that"
'abcde' =~ /a((?:bc)((d)e))/ #=> 0
$1 #=> "bcde"
$2 #=> "de"
$3 #=> "d"
  • Atomic grouping: (?>
  • Comments: (?#
  • Conditional: (?(
  • Mode modifiers: (?i:, (?s:, and (?m:
  • Named capture: (?<, (?P<, or (?’
  • Negative Lookahead: (?!
  • Positive Lookahead: (?=
  • Positive Lookbehind: (?<=
  • Negative Lookbehind: (?<!
'12/31/1999'.sub %r!(\d\d)/(\d\d)/(\d{4})!, '\3-\1-\2'
#=> "1999-12-31"
'aba'.match /(\2b|(a)){2}/ #=> nil
'aab'.match /(\2b|(a)){2}/ #=> #<MatchData "aab">
'abb'.match /(a)(b\k<-2>)/ #=> nil
'aba'.match /(a)(b\k<-2>)/ #=> #<MatchData "aba">

Named groups

'12/31/1999'.sub \
%r!(?<month>\d\d)/(?<day>\d\d)/(?<year>\d{4})!,
'\k<year>-\k<month>-\k<day>'
#=> "1999-12-31"
'You may may do that'.sub /(?<word>\S+)\s\k<word>/, '\k<word>'
#=> "You may do that"
csharp> Regex regex = new Regex("v(?'letter'[aeiouy])|" +
"c(?'letter'[b-df-hj-np-tv-xz])");
csharp> regex.Match("va");
va
csharp> regex.Match("vb");
csharp> regex.Match("ca");
csharp> regex.Match("cb");
cb
csharp> Regex.Replace("ab", "(?<first>.)(.)",
"Group 1 is '$1'");
"Group 1 is 'b'"
ruby> 'ab'.sub /(?<first>.)(.)/, "Group 1 is '" + $1 + "'"
#=> "Group 1 is 'a'"

Atomic Groups

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab'.scan /(?>a+a+)+b/
#=> ["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab'.scan /(a+a+)+b/
#=> [["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"]]
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.scan /(?>a+a+)+b/
#=> []
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.scan /(a+a+)+b/
#=> []
'123° 456 789°'.scan /\d+°/ #=> ["123°", "789°"]
'123° 456 789°'.scan /(?>\d+)°/ #=> ["123°", "789°"]
'123° 456 789°'.scan /(?>.+)°/ #=> []

Anchors

  • Caret ^ matches the position before the first character in the input string and usually also directly after a line break.
  • Dollar $ matches the position after the last character in the input string and usually also directly before a line break.
  • \𝙰 matches only at the start of the input string and ignores line breaks.
  • \𝚉 matches only at the end of the input string and ignores line breaks.
  • \𝚋 matches word boundaries, mostly based on what \𝚠 matches
'This to the fair Critias.'.scan /[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /^[A-Z][a-z]+/
#=> ["This"]
"This to the fair\nCritias.".scan /^[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /[A-Z][a-z]+$/
#=> []
"This to the fair Critias\n.".scan /[A-Z][a-z]+$/
#=> ["Critias"]
'This to the fair Critias.'.scan /^[A-Z][a-z]+$/
#=> []
'This to the fair Critias.'.scan /[A-Z][a-z]+/
#=> ["This", "Critias"]
'This to the fair Critias.'.scan /\A[A-Z][a-z]+/
#=> ["This"]
"This to the fair\nCritias.".scan /\A[A-Z][a-z]+/
#=> ["This"] — no Critias match after line break
'This to the fair Critias.'.scan /[A-Z][a-z]+\Z/
#=> []
"This to the fair Critias\n.".scan /[A-Z][a-z]+\Z/
#=> [] — no Critias match before line break
'This to the fair Critias.'.scan /\A[A-Z][a-z]+\Z/
#=> []
'This to the fair Critias.'.scan /\w+/
#=> ["This", "to", "the", "fair", "Critias"]
'This to the fair Critias.'.scan /t\w+/
#=> ["to", "the", "tias"]
'This to the fair Critias.'.scan /\bt\w+\b/
#=> ["to", "the"]

Lookarounds

“This clock-shaped machine solves the slash-slash problem. The square window at the top of the machine reads a tape — from left to right — consisting of the input string, i.e., the C code. We have a state machine of type Nondeterministic Finite Automaton (NFA). To visualize and represent the automaton, we use a transition graph, in which the vertices represent states and the edges represent transitions. The initial state is identified by an incoming unlabeled arrow not originating at any vertex. The acceptance state is surrounded by a circle. This is the graph that you see printed on the clock dial. Now, whenever a new symbol is read from the tape, the clock dial rotates so that the drooping peak, just below the reading window, points at the current state. Ah, and this particular machine also has a special lookahead feature: it’s a long arm with an eye in the end and a light bulb. This eye can look ahead and tell if there’s any double slash. If there is, the bulb will glow and the machine will understand that it doesn’t matter if the input ends in a pair of parentheses — it’s in a comment anyway.” Source: De Morgan to the Rescue.
  • Positive lookahead (?= ) — checks whether a pattern is present to the right
  • Negative lookahead (?! ) — checks whether a pattern isn’t present to the right
  • Positive lookbehind (?<= ) — checks whether a pattern is present to the left
  • Negative lookbehind (?<! ) — checks whether a pattern isn’t present to the left
'Cheese slicer invented by carpenter Thor'. scan  \
/\b\w+?\sThor/
#=> ["carpenter Thor"] -- too bad, Thor included
'Cheese slicer invented by carpenter Thor'. scan \
/\b\w+?(?=\s+Thor\b)/
#=> ["carpenter"] -- yes, Thor extracted from match
'The cheese slicer invented by carpenter Thor'. scan /c\w+/
#=> ["cheese", "cer", "carpenter"]
'The cheese slicer invented by carpenter Thor'. scan /\sc\w+/
#=> [" cheese", " carpenter"]
'The cheese slicer invented by carpenter Thor'. \
scan /(?<=\s)c\w+/
#=> ["cheese", "carpenter"] – preceded by whitespace
'The cheese slicer invented by carpenter Thor'. \
scan /(?<!\s)c\w+/
#=> ["cer"] – starting with c, but no whitespace before

Afterword

--

--

🌱 Twenty Years of Agile Coaching and Leadership • Monotasking and Pomodoro books (700.000 copies sold)

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Staffan Nöteberg

🌱 Twenty Years of Agile Coaching and Leadership • Monotasking and Pomodoro books (700.000 copies sold)