This past week I’ve found myself working on a directory project at work as part of an experimental outreach strategy for link building. One of the main aspects of this project was scraping existing listings from several different established directories and sites and collating these into a single directory. The most challenging part was getting all the data into a consistent form and format.
As anyone who does any web scraping will know, it can be tricky getting all your scraped data into a consistent format, as you’ll often find that the data isn’t as consistent as it might be when you find it. In my case, the big problem was with addresses, and in particular with phone numbers, email addresses, and company URLs sometimes being mixed in, which I needed to identify and filter out. The solution was to use regular expressions, or “regex”.
I’ve created this post primarily as a reference for myself, so I have a quick and easy way to get hold of these examples when the time inevitably comes again, saving myself a handful of Google searches. That said, I also wanted to create a resource for anyone else with a task similar to mine, especially for anyone new to both regex and programming.
I also just wanted an excuse to play around with regex, as it’s not something I find myself ever really needing. That’s why some of the examples given below arguably venture off into the more obscure: I was looking for some I could write myself, to help me get a better understanding and some first-hand practice at writing regular expressions.
Useful regex examples
Before I get into any of these regex examples, I’ll warn you that these are all very UK-centric—I’m UK-based and work for a British company, which is why this is the case. If you’re not looking for UK-specific examples, then these might not be of much use to you.
I’d also add that while I’ve written and modified some of the examples below myself, they’re not all my original work, and for the life of me I can’t recall where each of these originally came from. So if any of the examples below is one of yours, please leave a link to the original source and I’ll add that with the respective example below.
Website URL regex:
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)
Source: https://gist.github.com/SavannahF/61b12a2c880183e1612488f8ec8b4aa3
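As a rough sketch of how the URL pattern above might be used to pull a website out of a scraped address field in Python: the helper name pull_url, the sample address, and the re.IGNORECASE flag are my own assumptions for illustration, not part of the original pattern.

```python
import re

# URL pattern copied from above. re.IGNORECASE is an addition (assumption),
# since the pattern's character classes only list lowercase letters.
URL_RE = re.compile(
    r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)",
    re.IGNORECASE,
)


def pull_url(field):
    """Return (field_without_url, url); url is None when nothing matches.

    pull_url is a hypothetical helper name used only for this example.
    """
    match = URL_RE.search(field)
    if match is None:
        return field.strip(), None
    url = match.group(0).strip()  # group(0) is the whole matched text
    # Stitch together whatever sits either side of the match.
    remainder = (field[:match.start()] + field[match.end():]).strip(" ,")
    return remainder, url


address, url = pull_url("12 High Street, Leeds, https://www.example.co.uk/contact")
print(address)  # 12 High Street, Leeds
print(url)      # https://www.example.co.uk/contact
```

One thing to watch: the last group of the pattern allows spaces, so a URL in the middle of a sentence will probably drag the following words along with it. It seems to behave best when the URL sits at the end of the field, as it does here.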
Email address regex:
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Source: https://emailregex.com/
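Here is a minimal sketch of using the email pattern above from Python. I have split it at the "@" into two strings purely for readability, and the names LOCAL_PART, DOMAIN_PART, and find_emails are hypothetical. The re.IGNORECASE flag is also my addition, since the pattern itself only lists lowercase letters.

```python
import re

# The two halves of the email pattern above, split at the "@" for readability.
# Triple-quoted raw strings are used because the pattern contains both ' and ".
LOCAL_PART = r"""(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")"""
DOMAIN_PART = r"""(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

EMAIL_RE = re.compile(LOCAL_PART + "@" + DOMAIN_PART, re.IGNORECASE)


def find_emails(text):
    """Return every email address found in text, in order of appearance."""
    # Every group in the pattern is non-capturing, so findall returns
    # the full matched addresses rather than pieces of them.
    return EMAIL_RE.findall(text)


print(find_emails("Contact sales@example.com or Support@Example.org"))
# ['sales@example.com', 'Support@Example.org']
```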
UK phone number regex:
(\s*\(?0\d{4}\)?\s*\d{6}\s*)|(\s*\(?0\d{3}\)?\s*\d{3}\s*\d{4}\s*)|(\s*\(?0\d{2}\)?\s*\d{8}\s*)|(\s*\(?0\d{2}\)?\s*\d{4}\s*\d{4}\s*)|(\s*\(?0\d{4}\)?\s*\d{3}\s*\d{3}\s*)
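A quick sketch of the phone pattern above in Python, assuming you want every number in a block of text. The name find_uk_phones is made up for this example, and the sample numbers come from the ranges Ofcom reserves for TV and radio drama, so they should not belong to anyone.

```python
import re

# The phone pattern from above, broken over several lines (one per alternative).
PHONE_RE = re.compile(
    r"(\s*\(?0\d{4}\)?\s*\d{6}\s*)"
    r"|(\s*\(?0\d{3}\)?\s*\d{3}\s*\d{4}\s*)"
    r"|(\s*\(?0\d{2}\)?\s*\d{8}\s*)"
    r"|(\s*\(?0\d{2}\)?\s*\d{4}\s*\d{4}\s*)"
    r"|(\s*\(?0\d{4}\)?\s*\d{3}\s*\d{3}\s*)"
)


def find_uk_phones(text):
    """Return each matched phone number with surrounding whitespace removed."""
    # finditer + group(0) gives the whole match. findall would instead return
    # a tuple of the five capture groups for every match.
    return [m.group(0).strip() for m in PHONE_RE.finditer(text)]


print(find_uk_phones("Tel: 020 7946 0018 or (0161) 496 0123"))
# ['020 7946 0018', '(0161) 496 0123']
```

The pattern has no anchors or word boundaries, so it can also match part of a longer run of digits. If that matters for your data, checking each candidate field with fullmatch instead of searching inside it is one option.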
UK post code regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})
Source: https://stackoverflow.com/questions/164979/regex-for-matching-uk-postcodes
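As an illustration, this is roughly how the postcode pattern above could be used to peel a postcode off the end of an address in Python. The function name split_postcode is hypothetical, and the sketch assumes the postcode is the last thing in the field.

```python
import re

# Postcode pattern copied from above.
POSTCODE_RE = re.compile(
    r"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})"
)


def split_postcode(address):
    """Return (address_before_postcode, postcode); postcode is None if absent."""
    match = POSTCODE_RE.search(address)
    if match is None:
        return address.strip(), None
    # Keep only the text before the match and tidy trailing commas and spaces.
    street = address[:match.start()].strip(" ,")
    return street, match.group(0).upper()


print(split_postcode("10 Downing Street, London SW1A 2AA"))
# ('10 Downing Street, London', 'SW1A 2AA')
print(split_postcode("No postcode here"))
# ('No postcode here', None)
```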
UK company number regex:
/^((AC|ZC|FC|GE|LP|OC|SE|SA|SZ|SF|GS|SL|SO|SC|ES|NA|NZ|NF|GN|NL|NC|R0|NI|EN|\d{2}|SG|FE)\d{5}(\d|C|R))|((RS|SO)\d{3}(\d{3}|\d{2}[WSRCZF]|\d(FI|RS|SA|IP|US|EN|AS)|CUS))|((NI|SL)\d{5}[\dA])|(OC(([\dP]{5}[CWERTB])|([\dP]{4}(OC|CU))))$/
Source: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1
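A minimal Python sketch for validating a single value against the company number pattern above. The surrounding slashes are JavaScript-style delimiters, so they are dropped here. As far as I can tell, the ^ and $ in the pattern only bind to the first and last alternatives, so this example leans on re.fullmatch to make sure the whole value has to match. The name is_company_number and the sample values are made up.

```python
import re

# Company number pattern from above, without the JavaScript / delimiters,
# broken over several lines (one per top-level alternative).
COMPANY_NUMBER = (
    r"^((AC|ZC|FC|GE|LP|OC|SE|SA|SZ|SF|GS|SL|SO|SC|ES|NA|NZ|NF|GN|NL|NC|R0|NI|EN|\d{2}|SG|FE)\d{5}(\d|C|R))"
    r"|((RS|SO)\d{3}(\d{3}|\d{2}[WSRCZF]|\d(FI|RS|SA|IP|US|EN|AS)|CUS))"
    r"|((NI|SL)\d{5}[\dA])"
    r"|(OC(([\dP]{5}[CWERTB])|([\dP]{4}(OC|CU))))$"
)


def is_company_number(value):
    """Return True when the whole value looks like a UK company number."""
    # Normalise first: scraped values often carry stray spaces or lowercase.
    candidate = value.strip().upper()
    # fullmatch insists that one alternative covers the entire string.
    return re.fullmatch(COMPANY_NUMBER, candidate) is not None


print(is_company_number("SC123456"))      # True
print(is_company_number(" 12345678 "))    # True
print(is_company_number("1234567"))       # False (only seven digits)
print(is_company_number("SC123456 Ltd"))  # False (trailing text)
```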
UK VAT number regex:
\(?(GB|XI)\d{9}
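A short sketch of the VAT pattern above in Python. It also shows a gotcha worth knowing about: because the pattern contains a capture group, findall returns only the captured prefix, so finditer with group(0) is used to get the whole number. The name find_vat_numbers and the sample VAT numbers are invented for the example.

```python
import re

# VAT pattern copied from above.
VAT_RE = re.compile(r"\(?(GB|XI)\d{9}")


def find_vat_numbers(text):
    """Return each VAT number found in text, minus any leading bracket."""
    # group(0) is the whole match, which may start with the optional "(".
    return [m.group(0).lstrip("(") for m in VAT_RE.finditer(text)]


sample = "VAT Reg (GB123456789), NI branch XI987654321"

print(VAT_RE.findall(sample))    # ['GB', 'XI'] - just the capture group
print(find_vat_numbers(sample))  # ['GB123456789', 'XI987654321']
```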
How to use regex in Python and JavaScript
In the following sections we’ll cover a few super basic examples of how to extract regex matches from strings, and how to split these strings, helping you clean up your scraped data, getting the right values into their correct columns.
For anyone with any experience in either of these languages, these examples aren’t going to teach you anything new. They’re really for the novices out there, giving them simple, working snippets they can quickly utilise and repurpose as required.
Python examples
1. Extract all regex matches from a string into a list using Python
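import re
regex = r"\(?(fox|dog)"  # raw string, so the backslash reaches the regex engine untouched
string = "The quick brown fox jumps over the lazy dog"
found = re.findall(regex, string)  # with one capture group, findall returns the captured text
if found:
    print(found[0])  # prints "fox"
    print(found[1])  # prints "dog"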
2. Split string at first match in list using Python
import re
regex = r"\(?(fox|dog)"  # raw string, so the backslash reaches the regex engine untouched
string = "The quick brown fox jumps over the lazy dog"
found = re.findall(regex, string)
if found:
    split = string.split(found[0], 1)  # the 1 limits the split to the first occurrence
    print(split[0].strip())  # prints "The quick brown"
    print(split[1].strip())  # prints "jumps over the lazy dog"
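If you would rather split on the whole match than on the captured text, one alternative is to use re.search and slice the string around the match. This is only a sketch, and split_at_first_match is a hypothetical helper name. The difference shows up when the optional bracket is actually present: the match is "(fox", while the captured group is just "fox".

```python
import re


def split_at_first_match(pattern, text):
    """Return (before, matched, after) for the first match, or None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    before = text[:match.start()].strip()  # everything left of the match
    after = text[match.end():].strip()     # everything right of the match
    return before, match.group(0), after


regex = r"\(?(fox|dog)"

print(split_at_first_match(regex, "The quick brown fox jumps over the lazy dog"))
# ('The quick brown', 'fox', 'jumps over the lazy dog')
print(split_at_first_match(regex, "A (fox ran"))
# ('A', '(fox', 'ran')
```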
JavaScript examples
1. Extract all regex matches from a string into an array using JavaScript
const regex = /\(?(fox|dog)/g
let string = 'The quick brown fox jumps over the lazy dog'
let found = string.match(regex)
if(found) {
  console.log(found[0]) // prints "fox"
  console.log(found[1]) // prints "dog"
}
2. Split string at first match in array using JavaScript
const regex = /\(?(fox|dog)/g
let string = 'The quick brown fox jumps over the lazy dog'
let found = string.match(regex)
if(found) {
  const split = string.split(found[0]) // declared with const so it isn't an implicit global
  console.log(split[0].trim()) // prints "The quick brown"
  console.log(split[1].trim()) // prints "jumps over the lazy dog"
}
Resources for learning, writing, and testing your own regular expressions
For anyone new to regex and who needs more than what’s covered by the working examples in this post, there are a ton of great resources to get you started—some of which I’ve linked to below:
Learning resources:
https://www.regular-expressions.info/
Tools for testing and practicing:
If anyone has any other great regex resources, please let me know in the comments and I’ll get them added to the list above. I’ll also try to keep expanding and updating this list as I come across more useful resources.
Summing it up
Regex is one of those things that looks infinitely more complicated at first glance than it turns out to be. Not only are there plenty of resources out there to help you better understand and write your own, but there is also a wealth of working examples for all sorts of common use cases, such as those I’ve included in this post for reference.
While it’s useful to clean up your data, it’s better to get the right data in the first place—it certainly would have saved me a significant amount of time. All of the examples and snippets given in this article can be reworked and incorporated into your scraper script with minimal effort, which is certainly something I plan on taking a little extra time to do in the beginning, saving me all the hassle of fixing my data up after the fact.
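To give a flavour of what building this into a scraper might look like, here is a rough Python sketch that sorts the parts of a comma-separated listing into columns as it comes in. It reuses the phone, postcode, and URL patterns from earlier in the post. The function name sort_listing, the column names, and the assumption that fields arrive comma-separated are all mine, so treat it as a starting point rather than a finished solution.

```python
import re

# Patterns copied from the examples earlier in the post.
PHONE_RE = re.compile(
    r"(\s*\(?0\d{4}\)?\s*\d{6}\s*)|(\s*\(?0\d{3}\)?\s*\d{3}\s*\d{4}\s*)"
    r"|(\s*\(?0\d{2}\)?\s*\d{8}\s*)|(\s*\(?0\d{2}\)?\s*\d{4}\s*\d{4}\s*)"
    r"|(\s*\(?0\d{4}\)?\s*\d{3}\s*\d{3}\s*)"
)
POSTCODE_RE = re.compile(
    r"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})"
)
URL_RE = re.compile(
    r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)", re.IGNORECASE
)


def sort_listing(raw):
    """Sort the comma-separated parts of a scraped listing into columns.

    Hypothetical helper: assumes each value sits in its own comma-separated part.
    """
    record = {"address": [], "postcode": None, "phone": None, "url": None}
    for part in (p.strip() for p in raw.split(",")):
        if not part:
            continue
        # fullmatch means the whole part must fit the pattern, not just a piece.
        if PHONE_RE.fullmatch(part):
            record["phone"] = part
        elif POSTCODE_RE.fullmatch(part):
            record["postcode"] = part.upper()
        elif URL_RE.fullmatch(part):
            record["url"] = part
        else:
            record["address"].append(part)
    record["address"] = ", ".join(record["address"])
    return record


row = sort_listing("1 High Street, Leeds, LS1 4AP, 0113 496 0123, www.example.co.uk")
print(row["address"])   # 1 High Street, Leeds
print(row["postcode"])  # LS1 4AP
print(row["phone"])     # 0113 496 0123
print(row["url"])       # www.example.co.uk
```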
And it’s not just web scraping or tidying up data sets where it can be useful. For anyone who works in SEO as I do, many of the tools we use on a day-to-day basis, like those from Google—including Analytics and Search Console—support regex, making it easier to drill down into the data these tools provide. This makes it a worthwhile skill to pick up and develop, even for those of you who aren’t primarily developers.
If you’ve made it this far, thank you for reading! I hope you’ve found this article of some use, especially if you’re new to regular expressions, and that I’ve gone some way to offering you a shortcut or a good jumping-off point for developing your understanding of regex further.