Regular Expressions for Humanists

Edinburgh. 2nd March 2017.

A book

Professor Martin Paul Eve, Birkbeck, University of London

What are regular expressions?

  • Text extraction and replacement language
  • Usually shortened to “regex”; PCRE (Perl Compatible Regular Expressions) is one widely used flavour

Why use regex?

  • Extract dates
  • Find email addresses
  • Find street addresses
  • Validate input
  • Capture spelling variations

How to use regex

  • Word processors
  • Sed and awk
  • Language-specific engines (Python, Perl, C#, C++, Java etc.)

Online test environments

  • https://regex101.com : a powerful online tool that can explain regular expression operation in human-readable terms while providing a real-time editing environment for testing. Allows the user to specify the “flavour” of regular expression implementation that they wish to use (e.g. “python”). This is my favourite tool.
  • http://regexr.com : another good live online tool. The advantage of this site is the quick online “cheatsheet” in the left-hand menu.
  • http://www.regexpal.com : another good tool. There isn't much in it between this and other tools.

What do you need to use regex?

  • A regex engine
  • A regular expression
  • A body of text on which to operate

What does a regex look like?

  • Open regex101.com
  • In the bottom box, type: in the Year of our Lord 992 the Justified Ancients of Mu Mu Set sail in their longboats on a voyage to rediscover the lost continent
  • In the top box, type: \d{1,4}
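
The same first experiment can be tried outside the browser. The following is a minimal sketch, assuming Python's built-in re module as the regex engine; the variable names are only illustrative.

```python
import re

# The sample sentence from the slide above.
text = ("in the Year of our Lord 992 the Justified Ancients of Mu Mu "
        "Set sail in their longboats on a voyage to rediscover the lost continent")

# \d{1,4} matches a run of between one and four digits.
matches = re.findall(r"\d{1,4}", text)
print(matches)  # ['992']
```

The three ingredients are all here: an engine (re), an expression (\d{1,4}) and a body of text.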

Basic regular expression syntax

. Matches any character (except a newline, by default)
\n Matches a newline character
\t Matches a tab
\d Matches a digit
\w Matches an alphanumeric character or an underscore
\W Matches any character that is not alphanumeric or an underscore
\s Matches a whitespace character
\S Matches a non-whitespace character
\ Escapes special characters

Anchors

^ Matches the start of a string (or of each line, in multiline mode)
$ Matches the end of a string (or of each line, in multiline mode)

Quantifiers

* Matches the preceding element 0 or more times
+ Matches the preceding element 1 or more times
? Matches the preceding element 0 or 1 times
{x} Matches the preceding element x times
{x,y} Matches the preceding element between x and y times
{x,} Matches the preceding element at least x times
{,y} Matches the preceding element between 0 and y times (works in Python; some flavours require {0,y})
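
The quantifiers in this table can be checked one by one. This is a small sketch, assuming Python's re module; the helper fits and the sample strings are made up for illustration and are not part of the original slides.

```python
import re

def fits(pattern, candidate):
    """Return True when the WHOLE candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

# (pattern, candidate string, whether we expect the whole string to match)
checks = [
    (r"ab*", "a", True),         # * allows zero b's
    (r"ab+", "a", False),        # + needs at least one b
    (r"ab+", "abbb", True),
    (r"ab?", "abb", False),      # ? allows at most one b
    (r"ab{2}", "abb", True),     # exactly two b's
    (r"ab{2,3}", "abbbb", False),  # four b's is too many
    (r"ab{2,}", "abbbb", True),  # two or more b's
    (r"ab{,2}", "a", True),      # zero to two b's (Python syntax)
]

for pattern, candidate, expected in checks:
    print(pattern, repr(candidate), fits(pattern, candidate))
```

re.fullmatch is used here rather than re.search so that each quantifier has to account for the entire string, which makes the upper and lower bounds visible.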

Exercise

Match the last word of a string that ends in a full stop.

Solution

\w+\.$

  • The \w element specifies that we are looking for alphanumeric characters.
  • The + quantifier repeats \w one or more times. The run stops at any whitespace, because whitespace is not an alphanumeric character.
  • The \. element specifies that we are looking for a full stop after the repetition of the alphanumeric characters.
  • The $ element specifies that we want to look for this sequence only at the end of an input string.
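
The solution can be tried as follows. This is a sketch, assuming Python's re module; the two sample sentences are invented for the purpose.

```python
import re

pattern = r"\w+\.$"

sentence = "Match the last word of a string that ends in a full stop."
found = re.search(pattern, sentence)
last_word = found.group(0)
print(last_word)  # stop.

# Without a final full stop there is nothing for \.$ to match.
no_match = re.search(pattern, "No full stop here")
print(no_match)  # None
```

Note that the match includes the full stop itself, since \. is part of the expression.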

Greediness

Quantifiers in regular expressions are greedy by default. They will match as much as they can.

  • Text: <a href="https://www.martineve.com/">Some text</a> <p>Some more text</p>
  • Try to write an expression to match both of the HTML tags separately.

A commonly attempted solution

<.+\/.+>

  • Match “<” literally.
  • .+ means: match any character one or more times.
  • Match “/” literally.
  • .+ means: match any character one or more times.
  • Match “>” literally.

Why does it fail?

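Because the quantifiers are greedy. Each .+ grabs as much text as it can, so the expression matches the whole string as a single match.

We need to make the quantifiers lazy by appending ? to them: +? and *?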

A better solution

<.+?>.*?<\/.+?>

  • Match “<” literally.
  • .+? means: lazily match any character one or more times until we find the subsequent literal “>”.
  • Match “>” literally.
  • .*? means: lazily match any character zero or more times until we find the subsequent literal “<”.
  • Match “<” literally.
  • Match “/” literally.
  • .+? means: lazily match any character one or more times until we find the subsequent literal “>”.
  • Match “>” literally.
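
The difference between the two attempts is easy to see side by side. This is a minimal sketch, assuming Python's re module; the variable names greedy and lazy are illustrative.

```python
import re

text = '<a href="https://www.martineve.com/">Some text</a> <p>Some more text</p>'

# Greedy attempt: each .+ takes as much as possible,
# so both elements are swallowed into a single match.
greedy = re.findall(r"<.+\/.+>", text)
print(len(greedy))  # 1

# Lazy version: each quantifier stops at the first point
# where the rest of the expression can succeed.
lazy = re.findall(r"<.+?>.*?<\/.+?>", text)
for element in lazy:
    print(element)
```

The lazy version returns the a element and the p element as two separate matches. This is fine for a teaching example; real HTML is usually better handled with a dedicated parser.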

Character groups

Match a set of characters: e.g. [abc] will match a, b, or c

[ Begins a character group.
] Closes a character group.
Any character except ^ - ] \ Matches itself literally.
\\ A literal backslash.
\^ \- \] Escaped forms that match ^, - and ] literally.
^ Negates the character group when placed first. That is, match everything NOT in the group.
- A character range. e.g. a-z.
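
Ranges, negation and escapes inside a character group can be tried as below. This is a sketch, assuming Python's re module; the sample strings are invented for illustration.

```python
import re

text = "Room 42b, Floor 3"

# A range: runs of lowercase letters.
lowercase_runs = re.findall(r"[a-z]+", text)
print(lowercase_runs)  # ['oom', 'b', 'loor']

# Another range: runs of digits.
digit_runs = re.findall(r"[0-9]+", text)
print(digit_runs)  # ['42', '3']

# Negation: runs of anything that is NOT a lowercase letter or a space.
others = re.findall(r"[^a-z ]+", text)
print(others)  # ['R', '42', ',', 'F', '3']

# Escapes: match the special characters ^ - ] \ literally.
specials = re.findall(r"[\^\-\]\\]", r"a^b-c]d\e")
print(specials)  # ['^', '-', ']', '\\']
```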

Exercise

Write a single expression that will match both verbs:

  • Immanentize the eschaton
  • Immanentise the eschaton

Solution

\w+i[sz]e

  • The \w element specifies that we are looking for alphanumeric characters.
  • The + quantifier repeats \w one or more times. The run stops at any whitespace, because whitespace is not an alphanumeric character.
  • The literal “i” says: look for the character “i”.
  • The character group block [sz] says, look for either an s or a z.
  • The literal “e” says: look for the character “e”.
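
Both spellings can be checked with the same expression. This is a short sketch, assuming Python's re module.

```python
import re

lines = [
    "Immanentize the eschaton",
    "Immanentise the eschaton",
]

# [sz] accepts either spelling of the verb ending.
verbs = [re.search(r"\w+i[sz]e", line).group(0) for line in lines]
print(verbs)  # ['Immanentize', 'Immanentise']
```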

Capturing matches

(?P<group_name>.+)
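
The (?P<name>...) form is Python's syntax for a named capturing group; plain parentheses capture by number instead. As a sketch, assuming Python's re module, with the group name year chosen purely for illustration:

```python
import re

text = "in the Year of our Lord 992 the Justified Ancients of Mu Mu"

# Capture the digits after the phrase into a group called "year".
found = re.search(r"Year of our Lord (?P<year>\d+)", text)

by_name = found.group("year")   # look the capture up by its name
by_number = found.group(1)      # named groups are also numbered
as_dict = found.groupdict()     # all named captures as a dictionary

print(by_name, by_number, as_dict)  # 992 992 {'year': '992'}
```

Names tend to make longer expressions easier to read than positional numbers.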

Backreferences

  • You can reference a captured group using: \1, \2 etc.
  • Useful for when you want to find all instances of a previous capture

Backreference exercise

Work on this text: Is this equal to that? Is this equal to this?

Write an expression that captures the word "this" following the first "Is".

Then extend it to match only the sentence whose second part reads "equal to this", without typing the word "this" in the expression.

Backreference solution

Is (\w+) equal to \1\?

  • Match “Is ” literally (including the trailing space).
  • (\w+) means: match an alphanumeric character once or more and capture it in a group.
  • Match “ equal to ” literally.
  • \1 means: match the same as was captured in group 1.
  • \? means: match a question mark literally.
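
The backreference solution can be run against the exercise text. This is a minimal sketch, assuming Python's re module.

```python
import re

text = "Is this equal to that? Is this equal to this?"

# \1 must repeat exactly what group 1 captured.
found = re.search(r"Is (\w+) equal to \1\?", text)

sentence = found.group(0)
captured = found.group(1)
print(sentence)  # Is this equal to this?
print(captured)  # this
```

The first sentence is skipped because "that" is not the same text as the captured "this".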

Lookahead and lookbehind

Match characters ahead or behind without including them in the match

(?=zzz) Lookahead. True if the next part of the string is “zzz”.
(?<=zzz) Lookbehind. True if the preceding part of the string is “zzz”.
(?!zzz) Negative lookahead. True if the next part of the string is not “zzz”.
(?<!zzz) Negative lookbehind. True if the preceding part of the string is not “zzz”.

Lookaround exercise

Use the text: In the year 210AD there were 394 chickens roaming the plains.

Match the date 210AD but not the number 394, extracting just the digits, without typing "210" literally in the expression.

Lookaround solution

\d+(?=AD)

  • \d+ matches a digit one or more times.
  • (?=AD) means: only match the preceding item (\d+) if it is followed by the literal “AD” (lookahead).
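
The lookahead solution can be checked as follows. This is a sketch, assuming Python's re module.

```python
import re

text = "In the year 210AD there were 394 chickens roaming the plains."

# The lookahead checks for "AD" but does not include it in the match.
years = re.findall(r"\d+(?=AD)", text)
print(years)  # ['210']
```

The number 394 is rejected because it is followed by a space rather than "AD".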

The End

Thank you!

Presentation licensed under a CC BY-SA 3.0 license. All institutional images excluded from CC license. Available to view online at http://meve.io/Regex2017.