5 Regular Expressions – Python One-Liners

5
REGULAR EXPRESSIONS

Are you an office worker, student, software developer, manager, blogger, researcher, author, copywriter, teacher, or self-employed freelancer? Most likely, you’re spending many hours in front of your computer, day after day. Improving your daily productivity—only by a small fraction of a percentage—will mean a gain of thousands, if not tens of thousands, of dollars of productivity and hundreds of hours of additional free time over the years.

This chapter shows you an undervalued technique that helps master coders be more efficient when working with textual data: using regular expressions. You’ll see 10 ways of using regular expressions to solve everyday problems with less effort, time, and energy. Study the chapter carefully; it’ll be worth your time!

Finding Basic Textual Patterns in Strings

This section introduces regular expressions using the re module and its important re.findall() function. I’ll start by explaining several basic regular expressions.

The Basics

A regular expression (regex, for short) formally describes a search pattern that you can use to match sections of text. The simple example in Figure 5-1 shows a search of Shakespeare’s text Romeo and Juliet for the pattern Juliet.

Figure 5-1: Searching Shakespeare’s Romeo and Juliet for the pattern Juliet

Figure 5-1 shows that the most basic regular expression is a simple string pattern. The string 'Juliet' is a perfectly valid regular expression.

Regular expressions are incredibly powerful, and can do much more than regular text search, but they’re built using only a handful of basic commands. Learn these basic commands and you’ll be able to understand and write complex regular expressions. In this section, we’ll focus on the three most important regex commands that extend the functionality of simple search of string patterns in a given text.

The Dot Regex

First, you need to know how to match an arbitrary character by using the dot regex, the . character. The dot regex matches any character (including whitespace characters). You can use it to indicate that you don’t care which character matches, as long as exactly one matches:

import re

text = '''A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography.
'''

print(re.findall('b...k', text))
# ['block', 'block', 'block']

This example uses the findall() function of the re module. The first argument is the regex itself: you search for any string pattern starting with the character 'b', followed by three arbitrary characters, ... , followed by the character 'k'. This regex b...k matches the word 'block' but also 'boook', 'b erk', and 'bloek'. The second argument to findall() is the text you’re searching. The string variable text contains three matching patterns, as you can see in the output of the print statement.
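As a quick check of the claim above, the following sketch tests the pattern b...k against a handful of candidate words. The list candidates is made up for illustration, and the sketch uses re.fullmatch() (covered later in this chapter) so that a word counts only if the pattern covers it from the first to the last character.

```python
import re

# Hypothetical candidate strings, chosen for illustration only.
candidates = ['block', 'boook', 'b erk', 'bloek', 'bk', 'blocks']

# Keep the candidates that the pattern matches from first to last character.
matching = [word for word in candidates if re.fullmatch('b...k', word)]

print(matching)
# ['block', 'boook', 'b erk', 'bloek']
```

The strings 'bk' and 'blocks' drop out because the dot regex stands for exactly one character, so the whole pattern always describes exactly five characters.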

The Asterisk Regex

Second, say you want to match text that begins and ends with the character 'y' and has an arbitrary number of characters in between. How do you accomplish this? You can do this by using the asterisk regex, the * character. Unlike the dot regex, the asterisk regex can’t stand on its own; it modifies the meaning of another regex. Consider the following example:

print(re.findall('y.*y', text))
# ['yptography']

The asterisk operator applies to the regex immediately in front of it. In this example, the regex pattern starts with the character 'y', followed by an arbitrary number of characters, .*, followed by the character 'y'. As you can see, the word 'cryptography' contains one such instance of this pattern: 'yptography'.

You may wonder why this code doesn’t find the long substring between 'originally' and 'cryptography', which should also match the regex pattern y.*y. The reason is simply that the dot operator matches any character except the newline character. The string stored in the variable text is a multiline string with three new lines. You can also use the asterisk operator in combination with any other regex. For example, you can use the regex abc* to match the strings 'ab', 'abc', and 'abcc'. In the string 'abccdc', it matches only the prefix 'abcc'.
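If you want to see the newline behavior for yourself, here is a small sketch. It assumes you pass the flag re.DOTALL, which the text above doesn’t use; with that flag, the dot also matches newline characters. The last line of the sketch shows that in abc* the asterisk applies only to the 'c' directly in front of it.

```python
import re

text = '''A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography.
'''

# By default the dot stops at a newline, so only the last line can match.
print(re.findall('y.*y', text))
# ['yptography']

# With re.DOTALL the dot also matches newlines, so one long match spans lines.
long_matches = re.findall('y.*y', text, flags=re.DOTALL)
print(len(long_matches), long_matches[0][:13])
# 1 y block chain

# In abc*, the asterisk applies only to the 'c' directly in front of it.
abc_matches = re.findall('abc*', 'ab abc abcc abccdc')
print(abc_matches)
# ['ab', 'abc', 'abcc', 'abcc']
```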

The Zero-or-one Regex

Third, you need to know how to match zero or one characters by using the zero-or-one regex, the ? character. Just like the asterisk operator, the question mark modifies another regex, as you can see in the following example:

print(re.findall('blocks?', text))
# ['block', 'block', 'blocks']

The zero-or-one regex, ?, applies to the regex immediately in front of it. In our case, this is the character s. The zero-or-one regex says that the pattern it modifies is optional.

There is another use of the question mark in Python’s re package, but it has nothing to do with the zero-or-one regex: the question mark can be combined with the asterisk operator, *?, to allow for nongreedy pattern matching. For example, if you use the regex .*?, Python searches for a minimal number of arbitrary characters. In contrast, if you use the asterisk operator * without the question mark, it greedily matches as many characters as possible.

Let’s look at an example. When searching the HTML string '<div>hello world</div>' by using the regex <.*>, it matches the whole string '<div>hello world</div>' rather than only the prefix '<div>'. If you want only the prefix, you can use the nongreedy regex <.*?>:

txt = '<div>hello world</div>'

print(re.findall('<.*>', txt))
# ['<div>hello world</div>']

print(re.findall('<.*?>', txt))
# ['<div>', '</div>']

Equipped with these three tools—the dot regex ., the asterisk regex *, and the zero-or-one regex ?—you’re now able to comprehend the next one-liner solution.

The Code

Our input is a string, and our goal is to use a nongreedy approach to find all patterns that start with the character 'p', end with the character 'r', and have at least one occurrence of the character 'e' (and, possibly, an arbitrary number of other characters) in between!

These types of text queries occur quite frequently—especially in companies that focus on text processing, speech recognition, or machine translation (such as search engines, social networks, or video platforms). Take a look at Listing 5-1.

## Dependencies
import re


## Data
text = 'peter piper picked a peck of pickled peppers'


## One-Liner
result = re.findall('p.*?e.*?r', text)


## Result
print(result)

Listing 5-1: One-liner solution to search for specific phrases (nongreedy)

This code prints a list of all matching phrases in the text. What are they?

How It Works

The regex search query is p.*?e.*?r. Let’s break this down. You’re looking for a phrase that starts with the character 'p' and ends with the character 'r'. Between those two characters, you require one occurrence of the character 'e'. Apart from that, you allow an arbitrary number of characters (whitespace or not). However, you match in a nongreedy manner by using .*?, which means Python will search for a minimal number of arbitrary characters. Here’s the solution:

## Result
print(result)
# ['peter', 'piper', 'picked a peck of pickled pepper']

Compare this solution with the one you’d get when using the greedy regex p.*e.*r:

result = re.findall('p.*e.*r', text)
print(result)
# ['peter piper picked a peck of pickled pepper']

The first greedy asterisk operator .* matches almost the whole string before it terminates.

Writing Your First Web Scraper with Regular Expressions

In the previous section, you learned about the most powerful way to find arbitrary text patterns in strings: regular expressions. This section will further motivate your use of regular expressions and develop your knowledge with a practical example.

The Basics

Suppose you’re working as a freelance software developer. Your client is a fintech startup that needs to stay updated about the latest developments in cryptocurrency. They hire you to write a web scraper that regularly pulls the HTML source code of news websites and searches it for words starting with 'crypto' (for example, 'cryptocurrency', 'crypto-bot', 'crypto-crash', and so on).

Your first attempt is the following code snippet:

import urllib.request

search_phrase = 'crypto'

with urllib.request.urlopen('https://www.wired.com/') as response:
   html = response.read().decode("utf8") # convert to string
   first_pos = html.find(search_phrase)
   print(html[first_pos-10:first_pos+10])

The function urlopen() (from the module urllib.request) pulls the HTML source code from the specified URL. Because the result is a bytes object, you have to first convert it to a string by using the decode() method. Then you use the string method find() to return the position of the first occurrence of the searched string. With slicing (see Chapter 2), you carve out a substring that returns the immediate environment of the position. The result is the following string:

# ,r=window.crypto||wi

Aw. That looks bad. As it turns out, the search phrase is ambiguous—most words containing 'crypto' are semantically unrelated to cryptocurrencies. Your web scraper generates false positives (it finds string results that you originally didn’t mean to find). So how can you fix it?

Luckily, you’ve just read this Python book, so the answer is obvious: regular expressions! Your idea to remove false positives is to search for occurrences in which the word 'crypto' is followed by up to 30 arbitrary characters, followed by the word coin. Roughly speaking, the search query is crypto + <up to 30 arbitrary characters> + coin. Consider the following two examples:

  • 'crypto-bot that is trading Bitcoin'—yes
  • 'cryptographic encryption methods that can be cracked easily with quantum computers'—no

So how do you solve this problem of allowing up to 30 arbitrary characters between two strings? This goes beyond a simple string search. You can’t enumerate every exact string pattern because a virtually infinite number of matches is allowed. For example, the search pattern must match all of the following: 'cryptoxxxcoin', 'crypto coin', 'crypto bitcoin', 'crypto is a currency. Bitcoin', and all other character combinations with up to 30 characters between the two strings. Even if you had only 26 characters in the alphabet, the number of strings that would theoretically match our requirement exceeds 26^30 = 2,813,198,901,284,745,919,258,621,029,615,971,520,741,376. In the following, you’ll learn how to search a text for a regex pattern that corresponds to a large number of possible string patterns.
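The size of that number is easy to reproduce with Python’s arbitrary-precision integers. The following sketch is only a back-of-the-envelope count under the same simplifying assumption of a 26-letter alphabet; the variable names are chosen for illustration.

```python
# Count the strings over a 26-letter alphabet that could fill the gap.
alphabet_size = 26
max_gap = 30

# Strings with exactly 30 characters.
exactly_30 = alphabet_size ** max_gap

# Strings with 1 to 30 characters (every shorter length counts as well).
up_to_30 = sum(alphabet_size ** k for k in range(1, max_gap + 1))

print(f'{exactly_30:,}')
print(up_to_30 > exactly_30)
# True
```

The second count is larger because every shorter gap is allowed too, which is why the text says the number of matching strings exceeds the 30th power of 26.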

The Code

Here, given a string, you will find occurrences in which the string 'crypto' is followed by up to 30 arbitrary characters, followed by the string 'coin'. Let’s first look at Listing 5-2 before discussing how the code solves the problem.

## Dependencies
import re


## Data
text_1 = "crypto-bot that is trading Bitcoin and other currencies"
text_2 = "cryptographic encryption methods that can be cracked easily with quantum computers"


## One-Liner
pattern = re.compile("crypto(.{1,30})coin")


## Result
print(pattern.match(text_1))
print(pattern.match(text_2))

Listing 5-2: One-liner solution to find text snippets in the form crypto(some text)coin

This code searches two string variables, text_1 and text_2. Does the search query (pattern) match them?

How It Works

First, you import the standard module for regular expressions in Python, called re. The important stuff happens in the one-liner where you compile the search query crypto(.{1,30})coin. This is the query that you can use to search various strings. You use the following special regex characters. Read them from top to bottom and you’ll understand the meaning of the pattern in Listing 5-2:

  • () matches whatever regex is inside.
  • .  matches an arbitrary character.
  • {1,30} matches between 1 and 30 occurrences of the previous regex.
  • (.{1,30}) matches between 1 and 30 arbitrary characters.
  • crypto(.{1,30})coin matches the regex consisting of three parts: the word 'crypto', an arbitrary sequence with 1 to 30 chars, followed by the word 'coin'.

We say that the pattern is compiled because Python creates a pattern object that can be reused in multiple locations—much as a compiled program can be executed multiple times. Now, you call the function match() on our compiled pattern and the text to be searched. This leads to the following result:

## Result
print(pattern.match(text_1))
# <re.Match object; span=(0, 34), match='crypto-bot that is trading Bitcoin'>

print(pattern.match(text_2))
# None

The string variable text_1 matches the pattern (indicated by the resulting match object), but text_2 doesn’t (indicated by the result None). Although the textual representation of the first matching object doesn’t look pretty, it gives a clear hint that the given string 'crypto-bot that is trading Bitcoin' matches the regular expression.
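One caveat is worth trying out: match() looks for the pattern only at the beginning of the string. The following sketch assumes a slightly different input, a hypothetical headline in which 'crypto' does not come first, and uses the method search() on the same compiled pattern instead.

```python
import re

pattern = re.compile("crypto(.{1,30})coin")

# Hypothetical input in which the phrase does not start the string.
text = "News: crypto-bot that is trading Bitcoin and other currencies"

# match() anchors at position 0, so it finds nothing here.
print(pattern.match(text))
# None

# search() scans the whole string for the first occurrence.
m = pattern.search(text)
print(m.group(0))
# crypto-bot that is trading Bitcoin
print(m.group(1))
# -bot that is trading Bit
print(m.span())
# (6, 40)
```

The call m.group(1) returns the text captured by the parentheses, that is, the 1 to 30 characters between 'crypto' and 'coin'.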

Analyzing Hyperlinks of HTML Documents

In the preceding section, you learned how to search a string for a large number of patterns by using the regex pattern .{x,y}. This section goes further, introducing many more regular expressions.

The Basics

Knowing more regular expressions will help you solve real-world problems quickly and concisely. So what are the most important regular expressions? Study the following list carefully because we’ll use all of them in this chapter. Just view the ones you’ve already seen as a small repetition exercise.

  • The dot regex . matches an arbitrary character.
  • The asterisk regex <pattern>* matches an arbitrary number of the regex <pattern>. Note that this includes zero matching instances.
  • The at-least-one regex <pattern>+ can match an arbitrary number of <pattern> but must match at least one instance.
  • The zero-or-one regex <pattern>? matches either zero or one instances of <pattern>.
  • The nongreedy asterisk regex <pattern>*? matches as few instances of <pattern> as possible to match the overall regex.
  • The regex <pattern>{m} matches exactly m copies of <pattern>.
  • The regex <pattern>{m,n} matches between m and n copies of <pattern>.
  • The regex <pattern_1>|<pattern_2> matches either <pattern_1> or <pattern_2>.
  • The regex <pattern_1><pattern_2> matches <pattern_1> and then <pattern_2>.
  • The regex (<pattern>) matches <pattern>. The parentheses group regular expressions so you can control the order of execution (for example, (<pattern_1><pattern_2>)|<pattern_3> is different from <pattern_1>(<pattern_2>|<pattern_3>)). The parentheses regex also creates a matching group, as you’ll see later in the section.

Let’s consider a short example. Say you create the regex b?(.a)*. Which patterns will the regex match? The regex matches all patterns starting with zero or one b and an arbitrary number of two-character-sequences ending in the character 'a'. Hence, the strings 'bcacaca', 'cadaea', '' (the empty string), and 'aaaaaa' would all match the regex.
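You can verify this example with a few lines of code. The sketch below uses re.fullmatch() so that the regex has to cover the entire string; the three counterexamples in should_not_match are made up for illustration.

```python
import re

regex = 'b?(.a)*'

should_match = ['bcacaca', 'cadaea', '', 'aaaaaa']
should_not_match = ['bb', 'cab', 'abc']  # hypothetical counterexamples

# Map each string to whether the regex matches it completely.
results = {s: re.fullmatch(regex, s) is not None
           for s in should_match + should_not_match}

for s, ok in results.items():
    print(repr(s), ok)
```

The counterexamples fail because at least one character is left over that is neither the optional leading 'b' nor part of a two-character sequence ending in 'a'.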

Before diving into the next one-liner, let’s quickly discuss when to use which regex function. The three most important regex functions are re.match(), re.search(), and re.findall(). You’ve already seen two of them, but let’s study them more thoroughly in this example:

import re

text = '''
"One can never have enough socks", said Dumbledore.
"Another Christmas has come and gone and I didn't
get a single pair. People will insist on giving me books."
Christmas Quote
'''

regex = 'Christ.*'

print(re.match(regex, text))
# None

print(re.search(regex, text))
# <re.Match object; span=(62, 102), match="Christmas has come and gone and I didn't">

print(re.findall(regex, text))
# ["Christmas has come and gone and I didn't", 'Christmas Quote']

All three functions take the regex and the string to be searched as an input. The match() and search() functions return a match object (or None if the regex did not match anything). The match object stores the position of the match and more advanced meta-information. The function match() does not find the regex in the string (it returns None). Why? Because the function looks for the pattern only at the beginning of the string. The function search() searches for the first occurrence of the regex anywhere in the string. Therefore, it finds the match "Christmas has come and gone and I didn't".

The findall() function has the most intuitive output, but it’s also the least useful for further processing. The result of findall() is a sequence of strings rather than a match object—so it doesn’t give us information about the precise location of the match. That said, findall() has its uses: in contrast to the match() and search() methods, the function findall() retrieves all matched patterns, which is useful when you want to quantify how often a word appears in a text (for example, the string 'Juliet' in the text 'Romeo and Juliet' or the string 'crypto' in an article about cryptocurrency).
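If you need both all occurrences and their positions, the re module also offers the function re.finditer(), which this chapter doesn’t otherwise use. The following sketch, on a made-up sample text, contrasts it with findall().

```python
import re

text = 'Juliet and Romeo, Romeo and Juliet'  # hypothetical sample text

# findall() returns only the matched strings, which is enough for counting.
count = len(re.findall('Juliet', text))
print(count)
# 2

# finditer() yields one match object per occurrence, including its position.
spans = [m.span() for m in re.finditer('Juliet', text)]
print(spans)
# [(0, 6), (28, 34)]
```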

The Code

Say your company asks you to create a small web bot that crawls web pages and checks whether they contain links to the domain finxter.com. They also ask you to make sure the hyperlink descriptions contain the strings 'test' or 'puzzle'. In HTML, hyperlinks are enclosed in an <a></a> tag environment. The hyperlink itself is defined as the value of the href attribute. So more precisely, the goal is to solve the following problem, depicted in Listing 5-3: given a string, find all hyperlinks that point to the domain finxter.com and contain the strings 'test' or 'puzzle' in the link description.

## Dependencies
import re


## Data
page = '''
<!DOCTYPE html>
<html>
<body>

<h1>My Programming Links</h1>
<a href="https://app.finxter.com/">test your Python skills</a>
<a href="https://blog.finxter.com/recursion/">Learn recursion</a>
<a href="https://nostarch.com/">Great books from NoStarchPress</a>
<a href="http://finxter.com/">Solve more Python puzzles</a>

</body>
</html>
'''

## One-Liner
practice_tests = re.findall("(<a.*?finxter.*?(test|puzzle).*?>)", page)


## Result
print(practice_tests)

Listing 5-3: One-liner solution to analyze web page links

This code finds two occurrences of the regular expression. Which ones?

How It Works

The data consists of a simple HTML web page (stored as a multiline string) containing a list of hyperlinks (the tag environment <a href="">link text</a>). The one-liner solution uses the function re.findall() to check the regular expression (<a.*?finxter.*?(test|puzzle).*?>). This way, the regular expression returns all occurrences in the tag environment <a...> with the following restrictions.

After the opening tag, you match an arbitrary number of characters (nongreedily, to prevent the regex from “chewing up” multiple HTML tag environments), followed by the string 'finxter'. Next, you match an arbitrary number of characters (nongreedily), followed by one occurrence of either the string 'test' or the string 'puzzle'. Again, you match an arbitrary number of characters (nongreedily), followed by the character '>' that ends the closing tag. This way, you find all hyperlink tags that contain the respective strings. Note that this regex also matches tags where the strings 'test' or 'puzzle' occur within the link itself. Please also note that you use only nongreedy asterisk operators '.*?' to ensure that you always search for minimal matches rather than matching, for example, a very long string enclosed in multiple nested tag environments.

The result of the one-liner is the following:

## Result
print(practice_tests)
# [('<a href="https://app.finxter.com/">test your Python skills</a>', 'test'),
#  ('<a href="http://finxter.com/">Solve more Python puzzles</a>', 'puzzle')]

Two hyperlinks match our regular expression: the result of the one-liner is a list with two elements. However, each element is a tuple of strings rather than a simple string. This is different from the results of findall(), which we’ve discussed in previous code snippets. What’s the reason for this behavior? The return type is a list of tuples—with one tuple value for each matching group enclosed in (). For instance, the regex (test|puzzle) uses the parentheses notation to create a matching group. If you use matching groups in your regex, the function re.findall() will add one tuple value for every matched group. The tuple value is the substring that matches this particular group. For example, in our case, the substring 'puzzle' matches the group (test|puzzle). Let’s dive more deeply into the topic of matching groups to clarify this concept.
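If you only want the matching hyperlinks as plain strings, one option is to avoid capturing groups altogether. The following sketch is a variation on the one-liner, not the listing itself: it drops the outer parentheses and turns (test|puzzle) into the non-capturing group (?:test|puzzle), a regex feature not covered in the list above. The shortened page string is for illustration only.

```python
import re

# Shortened, hypothetical version of the page from the listing.
page = '''
<a href="https://app.finxter.com/">test your Python skills</a>
<a href="https://blog.finxter.com/recursion/">Learn recursion</a>
<a href="http://finxter.com/">Solve more Python puzzles</a>
'''

# (?:...) groups the alternatives without creating a matching group,
# so findall() returns the whole match as a simple string.
links = re.findall('<a.*?finxter.*?(?:test|puzzle).*?>', page)

for link in links:
    print(link)
# <a href="https://app.finxter.com/">test your Python skills</a>
# <a href="http://finxter.com/">Solve more Python puzzles</a>
```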

Extracting Dollars from a String

This one-liner shows you another practical application of regular expressions. Here, you’re working as a financial analyst. As your company considers acquiring another company, you’re assigned to read the other company’s reports. You’re particularly interested in all dollar figures. Now, you could scan the whole document manually, but the work is tedious, and you don’t want to spend your best hours of the day doing tedious work. So you decide to write a small Python script. But what’s the best way of doing it?

The Basics

Fortunately, you’ve read this regex tutorial, so instead of wasting a lot of time writing your own lengthy, error-prone Python parser, you go for the clean solution with regular expressions—a wise choice. But before you dive into the problem, let’s discuss three more regex concepts.

First, sooner or later you want to match a special character that’s also used as a special character by the regex language. In this case, you need to use the prefix \ to escape the meaning of the special character. For example, to match the parenthesis character '(', which is normally used for regex groups, you need to escape it with the regex \(. This way, the regex character '(' loses its special meaning.

Second, the square bracket environment [ ] allows you to define a range of specific characters to be matched. For example, the regex [0-9] matches one of the following characters: '0', '1', '2', . . . , '9'. Another example is the regex [a-e], which matches one of the following characters: 'a', 'b', 'c', 'd', 'e'.

Third, as we discussed in the previous one-liner section, the parentheses regex (<pattern>) indicates a group. Every regex can have one or multiple groups. When using the re.findall() function on a regex with groups, only the matched groups are returned rather than the whole matched string: a plain string if the regex has a single group, and a tuple of strings (one for each group) if it has several. For example, the regex hello(world) called on the string 'helloworld' would match the whole string but return only the matched group world. On the other hand, when using two nested groups in the regex (hello(world)), the result of the re.findall() function would contain a tuple of all matched groups ('helloworld', 'world'). Study the following code to understand nested groups completely:

string = 'helloworld'

regex_1 = 'hello(world)'
regex_2 = '(hello(world))'

res_1 = re.findall(regex_1, string)
res_2 = re.findall(regex_2, string)

print(res_1)
# ['world']
print(res_2)
# [('helloworld', 'world')]
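The first two concepts, escaping and the square bracket environment, can be tried out in the same way. The sample string below is made up for illustration. The sketch writes the escaped pattern as a raw string (prefix r), a common Python habit that keeps the backslash from being interpreted by Python before it reaches the regex engine.

```python
import re

sample = 'f(x) = a1 + b2 (see e5)'  # hypothetical sample string

# An escaped parenthesis matches the literal character '('.
parens = re.findall(r'\(', sample)
print(parens)
# ['(', '(']

# A character class matches exactly one character from the given range.
pairs = re.findall('[a-e][0-9]', sample)
print(pairs)
# ['a1', 'b2', 'e5']
```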

Now, you know everything you need to know to understand the following code snippet.

The Code

To recap, you want to investigate all monetary numbers from a given company report. Specifically, your goal is to solve the following problem: given a string, find a list of all occurrences of dollar amounts with optional decimal values. The following example strings are valid matches: $10, $10., or $10.00021. How can you achieve this efficiently in a single line of code? Take a look at Listing 5-4.

## Dependencies
import re


## Data
report = '''
If you invested $1 in the year 1801, you would have $18087791.41 today.
This is a 7.967% return on investment.
But if you invested only $0.25 in 1801, you would end up with $4521947.8525.
'''


## One-Liner
dollars = [x[0] for x in re.findall(r'(\$[0-9]+(\.[0-9]*)?)', report)]


## Result
print(dollars)

Listing 5-4: One-liner solution to find all dollar amounts in a text

Take a guess: what’s the output of this code snippet?

How It Works

The report contains four dollar values in various formats. The goal is to develop a regex that matches all of them. You design the regex (\$[0-9]+(\.[0-9]*)?) that matches the following patterns. First, it matches the dollar sign $ (you escape it because it’s a special regex character). Second, it matches a number with an arbitrary number of digits between 0 and 9 (but at least one digit). Third, it matches an arbitrary number of decimal values after the (escaped) dot character '.' (this last match is optional as indicated by the zero-or-one regex ?).

On top of that, you use list comprehension to extract only the first tuple value of all four resulting matches. Again, the default result of the re.findall() function is a list of tuples, with one tuple for each successful match and one tuple value for each group within the match:

[('$1', ''), ('$18087791.41', '.41'), ('$0.25', '.25'), ('$4521947.8525', '.8525')]

You’re interested in only the global group—the first value in the tuple. You filter out the other values by using list comprehension and get the following result:

## Result
print(dollars)
# ['$1', '$18087791.41', '$0.25', '$4521947.8525']
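As a possible next step, you may want the amounts as numbers rather than strings. The sketch below is one way to do that, not part of the listing: it assumes a non-capturing group (?:...) for the decimal part, so that findall() returns plain strings and no list comprehension over tuples is needed, and then strips the dollar sign before converting.

```python
import re

report = '''
If you invested $1 in the year 1801, you would have $18087791.41 today.
This is a 7.967% return on investment.
But if you invested only $0.25 in 1801, you would end up with $4521947.8525.
'''

# Without capturing groups, findall() returns the full matches as strings.
dollars = re.findall(r'\$[0-9]+(?:\.[0-9]*)?', report)
print(dollars)
# ['$1', '$18087791.41', '$0.25', '$4521947.8525']

# Drop the leading '$' and convert each amount to a float.
amounts = [float(d[1:]) for d in dollars]
print(amounts[0], amounts[2])
# 1.0 0.25
```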

It’s worth noting again that implementing even a simple parser without the powerful capabilities of regular expressions would be difficult and error-prone!

Finding Nonsecure HTTP URLs

This one-liner shows you how to solve one of those small, time-intensive problems that web developers often run into. Say you own a programming blog and you’ve just moved your website from the unsecure protocol http to the (more) secure protocol https. However, your old articles still point to the old URLs. How can you find all occurrences of the old URLs?

The Basics

In the preceding section, you learned how to use square bracket notation to specify an arbitrary range of characters. For example, the regular expression [0-9] matches a single-digit number with a value from 0 to 9. However, the square bracket notation is more powerful than that. You can use an arbitrary combination of characters within the square brackets to specify exactly which characters match—and which don’t. For example, the regular expression [0-3a-c]+ matches the strings '01110' and '01c22a' but not the strings '443' and '00cd'. You can also specify a fixed set of characters not to match by using the symbol ^: the regular expression [^0-3a-c]+ matches the strings '4444d' and 'Python' but not the strings '001' and '01c22a'.
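The examples above are easy to check in code. The following sketch uses re.fullmatch() so that every character of the string has to come from the set (or, with ^, from outside the set).

```python
import re

tests = ['01110', '01c22a', '443', '00cd', '4444d', 'Python', '001']

# Strings made up only of characters inside the set 0-3 and a-c.
inside = [s for s in tests if re.fullmatch('[0-3a-c]+', s)]

# Strings made up only of characters outside that set.
outside = [s for s in tests if re.fullmatch('[^0-3a-c]+', s)]

print(inside)
# ['01110', '01c22a', '001']
print(outside)
# ['4444d', 'Python']
```

The strings '443' and '00cd' appear in neither list because they mix characters from inside and outside the set.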

The Code

Here our input is a (multiline) string, and our aim is to find all occurrences of valid URLs that start with the prefix http://. However, don’t consider invalid URLs without a top-level domain (there has to be at least one . in the found URL). Take a look at Listing 5-5.

## Dependencies
import re


## Data
article = '''
The algorithm has important practical applications
http://blog.finxter.com/applications/
in many basic data structures such as sets, trees,
dictionaries, bags, bag trees, bag dictionaries,
hash sets, https://blog.finxter.com/sets-in-python/
hash tables, maps, and arrays. http://blog.finxter.com/
http://not-a-valid-url
http:/bla.ba.com
http://bo.bo.bo.bo.bo.bo/
http://bo.bo.bo.bo.bo.bo/333483--33343-/
'''


## One-Liner
stale_links = re.findall(r'http://[a-z0-9_\-.]+\.[a-z0-9_\-/]+', article)


## Results
print(stale_links)

Listing 5-5: One-liner solution to find valid http:// URLs

Again, try to come up with the output the code will produce before looking up the correct output that follows.

How It Works

In the regular expression, you analyze a given multiline string (potentially an old blog article) to find all URLs that start with the string prefix http://. After the prefix, the regular expression expects a positive number of (lowercase) characters, numbers, underscores, hyphens, or dots ([a-z0-9_\-.]+), followed by a literal dot (\.), followed by a positive number of (lowercase) characters, numbers, underscores, hyphens, or slashes ([a-z0-9_\-/]+). Note that you need to escape the hyphen (\-) inside the square brackets because it normally indicates a range there. Similarly, you need to escape the dot outside the square brackets (\.) because you actually want to match the dot and not an arbitrary character; inside the square brackets, the dot already stands for itself. This results in the following output:

## Results
print(stale_links)
# ['http://blog.finxter.com/applications/',
#  'http://blog.finxter.com/',
#  'http://bo.bo.bo.bo.bo.bo/',
#  'http://bo.bo.bo.bo.bo.bo/333483--33343-/']

Four valid URLs may need to be moved to the more secure HTTPS protocol.
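Once you’ve found the stale links, you might also want to rewrite them. The following sketch goes one step beyond the listing and assumes the function re.sub(), which replaces every match of a regex in a string; the short article string is made up for illustration. The parentheses capture everything after the prefix, and \1 in the replacement inserts that captured text again.

```python
import re

# Hypothetical one-line article with one stale and one secure link.
article = 'See http://blog.finxter.com/ and https://blog.finxter.com/sets-in-python/'

# Same URL pattern as before, with a group around the part after the prefix.
url_regex = r'http://([a-z0-9_\-.]+\.[a-z0-9_\-/]+)'

# Replace the prefix and keep the captured rest of the URL.
updated = re.sub(url_regex, r'https://\1', article)
print(updated)
# See https://blog.finxter.com/ and https://blog.finxter.com/sets-in-python/
```

Whether every old URL is actually reachable over https is a separate question that a regex can’t answer, so treat the result as a list of candidates to verify.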

At this point, you’ve already mastered the most important features of regular expressions. But there’s a level of deep understanding that you’ll reach only by practicing and studying a lot of examples—and regular expressions are no exception. Let’s study a few more practical examples of how regular expressions can make your life easier.

Validating the Time Format of User Input, Part 1

Let’s learn to check the correctness of user-input formatting. Say you write a web application that calculates health statistics based on the sleep duration of your users. Your users enter the time they went to bed and the time they wake up. An example for a correct time format is 12:45, but because web bots are spamming your user input fields, a lot of “dirty” data is causing unnecessary processing overhead on your servers. To address this issue, you write a time-format checker that determines whether the input is worth processing further with your backend application. With regular expressions, writing the code takes only a few minutes.

The Basics

In the previous few sections, you’ve learned about the re.search(), re.match(), and re.findall() functions. These are not the only regex functions. In this section, you’ll use re.fullmatch(regex, string), which checks whether the regex matches the full string as the name suggests.

Furthermore, you’ll use the regex syntax pattern{m,n} that matches between m and n instances of the regex pattern, but no more and no less. Note that it attempts to match the maximal number of occurrences of pattern. Here’s an example:

import re
print(re.findall('x{3,5}y', 'xy'))
# []
print(re.findall('x{3,5}y', 'xxxy'))
# ['xxxy']
print(re.findall('x{3,5}y', 'xxxxxy'))
# ['xxxxxy']
print(re.findall('x{3,5}y', 'xxxxxxy'))
# ['xxxxxy']

Using the brace notation, the code doesn’t match substrings with fewer than three or more than five 'x' characters.
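To see how re.fullmatch() differs from re.match(), you can run both on the same pattern. The candidate strings in this sketch are made up for illustration.

```python
import re

regex = 'x{3,5}y'
candidates = ['xxxy', 'xxxyz', 'zxxxy']  # hypothetical inputs

# For each candidate, record (match at the start?, match of the full string?).
results = {c: (re.match(regex, c) is not None,
               re.fullmatch(regex, c) is not None)
           for c in candidates}

for c, (prefix_ok, full_ok) in results.items():
    print(c, prefix_ok, full_ok)
# xxxy True True
# xxxyz True False
# zxxxy False False
```

The function match() is satisfied as soon as the beginning of the string fits the pattern, whereas fullmatch() also rejects any leftover characters at the end.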

The Code

Our goal is to write a function input_ok that takes a string argument and checks whether it has the (time) format XX:XX, where X is a number from 0 to 9; see Listing 5-6. Note that, for now, you accept semantically wrong time formats such as 12:86, but the next one-liner section tackles this more advanced problem.

## Dependencies
import re


## Data
inputs = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']


## One-Liner
input_ok = lambda x: re.fullmatch('[0-9]{2}:[0-9]{2}', x) is not None


## Result
for x in inputs:
    print(input_ok(x))

Listing 5-6: One-liner solution to check whether a given user input matches the general time format XX:XX

Before you move on, try to determine the results of the six function calls in this code.

How It Works

The data consists of six input strings as received by the frontend of your web application. Are they correctly formatted? To check this, you create the function input_ok by using a lambda expression with one input argument x and a Boolean output. You use the function fullmatch(regex, x) and attempt to match the input argument x by using our time-formatting regex. If you couldn’t match it, the result takes the value None and the Boolean output becomes False. Otherwise, the Boolean output is True.

The regex is simple: [0-9]{2}:[0-9]{2}. This pattern matches two leading numbers from 0 to 9, followed by the colon:, followed by two trailing numbers from 0 to 9. Thus, the result of Listing 5-6 is the following:

## Result
for x in inputs:
    print(input_ok(x))
'''
True
True
False
False
False
True
'''

The function input_ok correctly identifies the correct formats of the time inputs. In this one-liner, you’ve learned how highly practical tasks—that would otherwise take multiple lines of code and more effort—can be finished successfully in a few seconds with the right tool set.

Validating Time Format of User Input, Part 2

In this section, you’ll dive deeper into validating the time format of user inputs to solve the problem of the previous section: invalid time inputs such as 99:99 should not be considered valid matches.

The Basics

A useful strategy to solve problems is to address them hierarchically. First, strip down the problem to its core and solve the easier variant. Then, refine the solution to match your specific (and more complicated) problem. This section refines the previous solution in an important way: it doesn’t allow invalid time inputs such as 99:99 or 28:66. Hence, the problem is more specific (and more complicated), but you can reuse parts of our old solution.

The Code

Our goal is to write a function input_ok that takes a string argument and checks whether it has the (time) format XX:XX, where X is a digit from 0 to 9; see Listing 5-7. Additionally, the given time must be a valid 24-hour time, ranging from 00:00 to 23:59.

## Dependencies
import re


## Data
inputs = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']


## One-Liner
input_ok = lambda x: re.fullmatch('([01][0-9]|2[0-3]):[0-5][0-9]', x) is not None

## Result
for x in inputs:
    print(input_ok(x))

Listing 5-7: One-liner solution to check whether a given user input matches the general time format XX:XX and is a valid 24-hour time

This code prints six lines. What are they?

How It Works

As mentioned in the introduction of this section, you can reuse the solution of the previous one-liner to solve this problem easily. The code stays the same—you modified only the regular expression ([01][0-9]|2[0-3]):[0-5][0-9]. The first part ([01][0-9]|2[0-3]) is a group that matches all possible hours of the day. You use the or operator | to differentiate hours 00 to 19 on the one hand, and hours 20 to 23 on the other hand. The second part [0-5][0-9] matches the minutes of the day from 00 to 59. The result is, therefore, as follows:

## Result
for x in inputs:
    print(input_ok(x))

'''
True
True
False
False
False
False
'''

Note that the sixth line of the output indicates that the time 99:99 is no longer considered a valid user input. This one-liner shows how to use regular expressions to check whether the user input matches the semantic requirements of your application.
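
As a quick sanity check, you can probe the regex at the limits of the 24-hour clock. The sketch below uses made-up edge cases, and it also compares the regex against a hypothetical variant without parentheses to show why the group matters.

```python
import re

grouped = '([01][0-9]|2[0-3]):[0-5][0-9]'  # regex from Listing 5-7
ungrouped = '[01][0-9]|2[0-3]:[0-5][0-9]'  # hypothetical variant without the group

input_ok = lambda x: re.fullmatch(grouped, x) is not None

# Hypothetical edge cases around the limits of the 24-hour clock.
edge_cases = ['00:00', '19:59', '23:59', '24:00', '12:60', '7:30']
results = [input_ok(t) for t in edge_cases]
print(results)
# [True, True, True, False, False, False]

# Without parentheses, | splits the whole pattern into
# '[01][0-9]' OR '2[0-3]:[0-5][0-9]'.
ungrouped_time = re.fullmatch(ungrouped, '18:29') is not None
ungrouped_hour = re.fullmatch(ungrouped, '18') is not None
print(ungrouped_time)  # False
print(ungrouped_hour)  # True
```

The or operator has low precedence, so the parentheses are what restrict the alternatives to the hour part of the pattern.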

Duplicate Detection in Strings

This one-liner introduces an exciting capability of regular expressions: reusing parts you’ve already matched later in the same regex. This powerful extension allows you to solve a new set of problems, including detecting strings with duplicated characters.

The Basics

This time, you’re working as a computational linguistics researcher analyzing how certain word usages change over time. You use published books to classify and track word usage. Your professor asks you to analyze whether there’s a trend toward a more frequent use of duplicate characters in words. For example, the word 'hello' contains the duplicate character 'l', while the word 'spoon' contains the duplicate character 'o'. However, the word 'mama' would not be counted as a word with a duplicate character 'a'.

The naive solution to this problem is to enumerate all possible duplicate characters 'aa', 'bb', 'cc', 'dd', . . . , 'zz' and combine them in an either-or regex. This solution is tedious and not easily generalized. What if your professor changes their mind and asks you to check for repeat characters with up to one character in between (for example, the string 'mama' would now be a match)?

No problem: there’s a simple, clean, and effective solution if you know the regex feature of named groups. You’ve already learned about groups that are enclosed in parentheses (...). As the name suggests, a named group is just a group with a name. For instance, you can define a named group around the pattern ... with the name name by using the syntax (?P<name>...). After you define a named group, you can use it anywhere in your regular expression with the syntax (?P=name). Consider the following example:

import re

pattern = '(?P<quote>[\'"]).*(?P=quote)'
text = 'She said "hi"'
print(re.search(pattern, text))
# <re.Match object; span=(9, 13), match='"hi"'>

In the code, you search for substrings that are enclosed in either single or double quotes. To accomplish that, you first match the opening quote by using the regex ['"] (in the Python string, you escape the single quote as \' so that Python doesn’t mistake it for the end of the string). Then, you reuse the named group with (?P=quote) to match a closing quote of the same kind (either a single or a double quote).
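
The following sketch illustrates that the backreference really insists on the same quote character. It writes the pattern as a raw triple-quoted string to avoid the escape, and the second test string is a made-up example with mismatched quotes.

```python
import re

# Same pattern as above, written as a raw triple-quoted string.
pattern = r'''(?P<quote>['"]).*(?P=quote)'''

match = re.search(pattern, 'She said "hi"')
print(match.group('quote'))  # "
print(match.group())         # "hi"

# Hypothetical input: opens with a double quote, closes with a single quote.
mismatch = re.search(pattern, "She said \"hi'")
print(mismatch)  # None
```

You can read the character captured by a named group with match.group('quote'), which is handy when you need to know which kind of quote was used.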

Before diving into the code, note that you can match arbitrary whitespaces with the regex \s. Also, you can match characters that are not in a set Y by using the syntax [^Y]. That’s everything you need to know to solve our problem.
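
Here is a small sketch of these two features on a made-up string that mixes tabs, newlines, and spaces. The string and the variable names are assumptions for illustration.

```python
import re

sample = 'a1\tb2\nc3 d4'  # hypothetical string with a tab, a newline, and a space

# \s matches any single whitespace character.
whitespace = re.findall(r'\s', sample)
print(whitespace)  # ['\t', '\n', ' ']

# [^0-9\s] matches any character that is neither a digit nor whitespace.
letters = re.findall(r'[^0-9\s]', sample)
print(letters)  # ['a', 'b', 'c', 'd']

# [^\s]+ matches maximal runs of non-whitespace characters.
tokens = re.findall(r'[^\s]+', sample)
print(tokens)  # ['a1', 'b2', 'c3', 'd4']
```

The patterns are written as raw strings (with the r prefix) so that Python passes the backslash in \s to the regex engine unchanged.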

The Code

Consider the problem illustrated in Listing 5-8: given a text, find all words that contain duplicate characters. A word in this case is defined as any series of non-whitespace characters separated by an arbitrary number of whitespace characters.

## Dependencies
import re


## Data
text = '''
It was a bright cold day in April, and the clocks were
striking thirteen. Winston Smith, his chin nuzzled into
his breast in an effort to escape the vile wind, slipped
quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty
dust from entering along with him.
-- George Orwell, 1984
'''


## One-Liner
duplicates = re.findall(r'([^\s]*(?P<x>[^\s])(?P=x)[^\s]*)', text)


## Results
print(duplicates)

Listing 5-8: One-liner solution to find all words with duplicate characters

What are the words with duplicate characters found in this code?

How It Works

The regex (?P<x>[^\s]) defines a new group with the name x. The group consists of only a single arbitrary character that is not the whitespace character. The regex (?P=x) immediately follows the named group x. It simply matches the same character matched by the group x. You’ve found the duplicate characters! However, the goal is not to find duplicate characters, but words with duplicate characters. So you match an arbitrary number of non-whitespace characters [^\s]* before and after the duplicate characters.

The output of Listing 5-8 is the following:

## Results
print(duplicates)
'''
[('thirteen.', 'e'), ('nuzzled', 'z'), ('effort', 'f'),
('slipped', 'p'), ('glass', 's'), ('doors', 'o'),
('gritty', 't'), ('--', '-'), ('Orwell,', 'l')]
'''

The regex finds all words with duplicate characters in the text. Note that there are two groups in the regex of Listing 5-8, so every element returned by the re.findall() function consists of a tuple of matched groups. You’ve already seen this behavior in previous sections.
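
If you need only the words, you can unpack the tuples. The sketch below does this on a short made-up word list, and it also tries one possible variant for the professor's second request (repeat characters with up to one character in between). The relaxed regex is an assumption for illustration, not the book's solution.

```python
import re

words = 'hello spoon mama tree dog'  # hypothetical input

# Regex from Listing 5-8: each result is a tuple (word, duplicate character).
pairs = re.findall(r'([^\s]*(?P<x>[^\s])(?P=x)[^\s]*)', words)
strict = [word for word, _ in pairs]
print(strict)
# ['hello', 'spoon', 'tree']

# Assumed variant: allow at most one arbitrary character between the repeats.
relaxed_pairs = re.findall(r'([^\s]*(?P<x>[^\s])[^\s]?(?P=x)[^\s]*)', words)
relaxed = [word for word, _ in relaxed_pairs]
print(relaxed)
# ['hello', 'spoon', 'mama', 'tree']
```

The only change is the optional [^\s]? between the named group and its backreference, which is the kind of small generalization that the enumeration of 'aa', 'bb', and so on would make painful.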

In this section, you’ve enhanced your regex tool set with one powerful tool: named groups. In combination with two minor regex features of matching arbitrary whitespace characters with \s and defining a set of characters that are not matched with the operator [^...], you’ve made serious progress toward Python regex proficiency.

Detecting Word Repetitions

In the preceding section, you learned about named groups. The goal of this section is to show you more advanced ways of using this powerful feature.

The Basics

While working as a researcher over the last few years, I spent most of my time writing, reading, and editing research papers. A colleague who edited my papers used to complain that I was using the same words repeatedly (and too close together in the text). Wouldn’t it be useful to have a tool that checks your writing programmatically?

The Code

You’re given a string consisting of lowercase, whitespace-separated words, without special characters. Find a matching substring in which the first and the last word are the same (a repetition) and at most 10 words lie in between. See Listing 5-9.

## Dependencies
import re


## Data
text = 'if you use words too often words become used'


## One-Liner
style_problems = re.search(r'\s(?P<x>[a-z]+)\s+([a-z]+\s+){0,10}(?P=x)\s', ' ' + text + ' ')


## Results
print(style_problems)

Listing 5-9: One-liner solution to find word repetitions

Does this code find word repetitions?

How It Works

Again, you assume that a given text consists of only whitespace-separated, lowercase words. Now, you search the text by using a regular expression. It might look complex at first, but let’s break it down piece by piece:

r'\s(?P<x>[a-z]+)\s+([a-z]+\s+){0,10}(?P=x)\s'

You start with a single whitespace character. This is important to ensure that you start with a whole word (and not with a suffix of a word). Then, you match a named group x that consists of a positive number of lowercase characters from 'a' to 'z', followed by a positive number of whitespaces.

You proceed with 0 to 10 words, where each word consists of a positive number of lowercase characters from 'a' to 'z', followed by a positive number of whitespaces.

You finish with the named group x, followed by a whitespace character to ensure that the last match is a whole word (and not only a prefix of a word).

The following is the output of the code snippet:

## Results
print(style_problems)
# <re.Match object; span=(11, 34), match=' words too often words '>

You found a matching substring that may (or may not) be considered bad style.
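
If the regex still feels dense, one option is to write it with the re.VERBOSE flag, which ignores whitespace inside the pattern and allows comments. The sketch below is an alternative spelling of the same regex, not the one-liner itself; the variable names are chosen for illustration.

```python
import re

# Same regex as in Listing 5-9, spread over several commented lines.
pattern = re.compile(r'''
    \s                   # one whitespace character, so the match starts at a word boundary
    (?P<x>[a-z]+)        # named group x: the first word
    \s+                  # whitespace after the first word
    ([a-z]+\s+){0,10}    # 0 to 10 words in between, each followed by whitespace
    (?P=x)               # the same word as matched by group x
    \s                   # one whitespace character, so the match ends at a word boundary
''', re.VERBOSE)

text = 'if you use words too often words become used'
match = pattern.search(' ' + text + ' ')

repeated_word = match.group('x')
print(repeated_word)          # words
print(match.group().strip())  # words too often words
```

In verbose mode, the escape \s still matches whitespace in the searched text; only literal whitespace inside the pattern is ignored.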

In this one-liner, you stripped down the problem of finding duplicate words to its core and solved this easier variant. Note that in practice, you’d have to include more complicated cases such as special characters, a mix of lowercase and uppercase characters, numbers, and so on. Alternatively, you could do some preprocessing to bring the text into the desired form of lowercase, whitespace-separated words, without special characters.
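
The preprocessing route could look roughly like the following sketch. The input sentence and the cleaning rules (lowercase everything, turn every character that is not a lowercase letter or whitespace into a space) are assumptions for illustration; real text may need more careful handling.

```python
import re

raw = 'If you use Words too often, words become used.'  # hypothetical input

# Lowercase the text, replace everything except a-z and whitespace with a space,
# then collapse runs of whitespace into single spaces.
lowered = raw.lower()
letters_only = re.sub(r'[^a-z\s]', ' ', lowered)
cleaned = re.sub(r'\s+', ' ', letters_only).strip()
print(cleaned)
# if you use words too often words become used

# Now the regex from Listing 5-9 applies unchanged.
match = re.search(r'\s(?P<x>[a-z]+)\s+([a-z]+\s+){0,10}(?P=x)\s', ' ' + cleaned + ' ')
print(match.group('x'))
# words
```

Note that this approach throws away punctuation and digits entirely, which may or may not be acceptable for your use case.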

EXERCISE 5-1

Write a Python script that allows for more special characters, such as characters to structure your sentences (period, colon, comma).

Modifying Regex Patterns in a Multiline String

In the final regex one-liner, you’ll learn how to modify a text rather than matching only parts of it.

The Basics

To replace all occurrences of a certain regex pattern with a new string replacement in a given text, use the regex function re.sub(regex, replacement, text). This way, you can quickly edit large text bases without a lot of manual labor.
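
As a minimal illustration of re.sub(), the sketch below masks every digit in a made-up string. The string and the variable names are assumptions for illustration.

```python
import re

note = 'Call 555 1234 or 555 9876'  # hypothetical text

# Replace every single digit with the character 'X'.
masked = re.sub('[0-9]', 'X', note)
print(masked)
# Call XXX XXXX or XXX XXXX
```

The function returns a new string and leaves the original string untouched.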

In the previous sections, you learned how to match patterns that occur in the text. But what if you don’t want to match a certain pattern if another pattern occurs? The negative lookahead regex pattern A(?!X) matches a regex A if the regex X does not match afterward. For example, the regex not (?!good) would match the string 'this is not great' but would not match the string 'this is not good'.
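
You can check this example directly. The sketch below runs the regex on the two strings mentioned above; the variable names are chosen for illustration.

```python
import re

pattern = 'not (?!good)'

hit = re.search(pattern, 'this is not great')
miss = re.search(pattern, 'this is not good')

print(repr(hit.group()))  # 'not '
print(miss)               # None
```

Note that the match is only 'not ' and doesn't include 'great'. The lookahead checks what follows but doesn't consume any characters.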

The Code

Our data is a string, and our task is to replace all occurrences of Alice Wonderland with 'Alice Doe', but not to replace occurrences of 'Alice Wonderland' (enclosed in single quotes). See Listing 5-10.

## Dependencies
import re


## Data
text = '''
Alice Wonderland married John Doe.
The new name of former 'Alice Wonderland' is Alice Doe.
Alice Wonderland replaces her old name 'Wonderland' with her new name 'Doe'.
Alice's sister Jane Wonderland still keeps her old name.
'''


## One-Liner
updated_text = re.sub("Alice Wonderland(?!')", 'Alice Doe', text)


## Result
print(updated_text)

Listing 5-10: One-liner solution to replace patterns in a text

This code prints the updated text. What is it?

How It Works

You replace all occurrences of Alice Wonderland with Alice Doe, but not the ones that end with the single quote '. You do this by using a negative lookahead. Note that you check only whether the closing quote exists. For example, a string with an opening quote but without a closing quote would match, and you’d simply replace it. This may not be desired in general, but it leads to the desired behavior in our example string:

## Result
print(updated_text)
'''
Alice Doe married John Doe.
The new name of former 'Alice Wonderland' is Alice Doe.
Alice Doe replaces her old name 'Wonderland' with her new name 'Doe'.
Alice's sister Jane Wonderland still keeps her old name.
'''

You can see that the original name of 'Alice Wonderland' is left unchanged when enclosed in single quotes—which was the goal of this code snippet.
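
To make the caveat about checking only the closing quote concrete, here is a sketch on two made-up sentences. Both sentences are assumptions for illustration and are not part of the chapter's data.

```python
import re

# Hypothetical inputs that probe the limits of the lookahead.
samples = [
    "'Alice Wonderland signed the form.",   # opening quote only
    "Alice Wonderland's form was signed.",  # possessive apostrophe
]

updated = [re.sub("Alice Wonderland(?!')", 'Alice Doe', s) for s in samples]
for line in updated:
    print(line)
# 'Alice Doe signed the form.
# Alice Wonderland's form was signed.
```

The first sentence is changed although the name follows an opening quote, and the second is left alone because the possessive apostrophe looks like a closing quote. Whether that's acceptable depends on your text.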

Summary

This chapter covered a lot of ground. You’ve learned about regular expressions, which you can use to match patterns in a given string. In particular, you’ve learned about the functions re.compile(), re.match(), re.fullmatch(), re.search(), re.findall(), and re.sub(). Together, they cover a high percentage of regular expression use cases. You can pick up other functions as you apply regular expressions in practice.

You’ve also learned about various basic regular expressions that you can combine (and recombine) in order to create more advanced regular expressions. You’ve learned about whitespaces, escaped characters, greedy/nongreedy operators, character sets (and negative character sets), grouping and named groups, and negative lookaheads. And finally, you’ve learned that it’s often better to solve a simplified variant of the original problem than to generalize too early.

The only thing left is to apply your new regex skill in practice. A good way of getting used to regular expressions is to start using them in your favorite text editor. Most advanced text and code editors (including Notepad++) ship with powerful regular expression functionality. Also, consider regular expressions when working with textual data (for example when writing emails, blog articles, books, and code). Regular expressions will make your life easier and save you many hours of tedious work.

In the next chapter, we’ll dive into the supreme discipline of coding: algorithms.