Anyone who has visited my blog before will know that I’m a big advocate of SEO automation, in particular of SEOs learning to code so they can script away much of the manual busy work they do. This has been a common theme and, if you’ve been here before, it’s probably because of my previous PageSpeed Automation script written in Node.
I’ve recently been playing around with a number of different tools, taking advantage of the numerous SEO free trials on offer, and ended up experimenting a fair bit with SERP APIs, which are essentially a way of pulling data from Google’s SERP programmatically via an API. These tools are great: easy to use, cheap, and handy for a good number of automation tasks, especially those that involve advanced search operators, which is exactly what this script relies on.
But enough of the fluff, this article is all about my latest script, which automatically generates lists of sites accepting guest posts for a given keyword, pulling email addresses and minimum word counts (where possible) along with site metrics, and spitting everything out as a CSV file. If you’re someone who uses Google a fair bit to find link opportunities, which I’m sure many of us do, this script could save you a ton of time by doing most of the donkey work for you.
Right, with the intro out of the way, I’m going to start this in reverse—first let’s take a look at the output you can expect when using this script.
What the output looks like
When the script is done running, it outputs a CSV file with columns for the site, its guest post guidelines URL, contact details, minimum word count, Domain Authority, and referring domains. Once imported into Google Sheets or similar, it will look like the reference image below:

Things you’ll need
To use the script, you’ll need to set up a couple of accounts. Don’t worry, they’re both free.
- ScrapingRobot: you can create a forever-free account on ScrapingRobot here. This product is a SERP API and offers a surprising 5k free credits per month, which is an amazing deal if you ask me. I’ll probably do a few more of these scripts using this tool over the coming weeks and months, as I think it’s well worth taking advantage of.
- Mozscape API: you can open a free account using this link. It’s important to note that you’ll need to provide a credit card, even for the forever-free tier. No idea why, it seems unnecessary to me, but it is what it is. With the free account, you can query 2,400 rows free each month, a row being a site in our case.
The code
While I’m not going to hold anyone’s hand through this—or indeed any other similar posts in the future—I will make more of an effort to separate out the various components of these scripts into individual files, so that it’s easier to follow along with and understand.
The script is made up of six separate files: index.py, scrapingrobot.py, decodeemail.py, mozapi.py, datestamp.py, and writecsv.py. Each is given below for reference.
index.py
Not a great deal to explain here. All you really need to do is provide your settings, including API keys for ScrapingRobot and the Mozscape API, the keyword for your search query, and the number of results you want returned from Google.
I’ve indicated where these go in the script with inline comments.
index.py
import time
from scrapingrobot import *
from urllib.parse import urlparse
from mozapi import *
from datestamp import *
from writecsv import *
#API CREDS
scrapingrobot_token = "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" #YOUR SCRAPINGROBOT TOKEN
moz_access_id = "mozscape-xxxxxxxxxx" #YOUR MOZ_ACCESS_ID
moz_secret_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" #YOUR MOZ_SECRET_KEY
#SERP API SETTINGS
keyword = "seo" #YOUR SEARCH KEYWORD
country = 'US' #GEOLOCATION SETTING
results = 10 #NUMBER OF RESULTS (MAX 50)
def main():
    print('finding guest post opportunities...')
    # scrape the SERP for pages matching the keyword + advanced search operator
    urls = get_guest_post_opportunities(keyword, country, results, scrapingrobot_token)
    guest_post_opportunities = []
    guest_post_urls = []
    for x, url in enumerate(urls):
        time.sleep(1)
        print("processing " + str(x + 1) + " of " + str(len(urls)) + "...")
        guest_post = guest_post_details(url[0], scrapingrobot_token)
        if guest_post:
            guest_post_opportunities.append(guest_post)
            guest_post_urls.append(guest_post["url"])
    # build a quoted, comma-separated list of root domains for the Moz API
    url_list = []
    for url in urls:
        parsed_uri = urlparse(url[0]).netloc
        url_list.append(parsed_uri)
    urls_string = ', '.join(f'"{url}"' for url in url_list)
    print('getting moz site metrics...')
    moz_site_metrics = get_moz(urls_string, moz_access_id, moz_secret_key)
    # merge the Moz metrics back into the scraped guest post details
    guest_post_opportunities_with_metrics = []
    for moz_site in moz_site_metrics:
        for guest_post in guest_post_opportunities:
            if moz_site["url"] in guest_post["url"]:
                guest_post_with_metrics = {
                    "url": moz_site["url"],
                    "guidelines": guest_post["url"],
                    "contact": guest_post["email"],
                    "word_count": guest_post["words"],
                    "domain_authority": moz_site["domain_authority"],
                    "referring_domains": moz_site["referring_domains"]
                }
                guest_post_opportunities_with_metrics.append(guest_post_with_metrics)
    filename = keyword.replace(" ", "_") + '_guest_posts_' + get_datestamp() + '.csv'
    write_csv(guest_post_opportunities_with_metrics, filename)

if __name__ == "__main__":
    main()
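One thing worth flagging before moving on: as written, main() handles a single keyword per run (see the notes towards the end of the post). If you did want to batch several queries, one option is to change main() to accept the keyword as an argument and loop over a list. A rough sketch, assuming that change has been made and using made-up example keywords:

# Rough sketch only: assumes main() has been modified to take the keyword as a parameter
# and to use that parameter for both the query and the output filename.
example_keywords = ["seo", "content marketing", "link building"]  # made-up examples

if __name__ == "__main__":
    for kw in example_keywords:
        main(kw)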
scrapingrobot.py
There are two parts to this—the first is the “get_guest_post_opportunities” function, which scrapes the SERP using your keyword and an advanced search operator.
The second is the “guest_post_details” function, which uses ScrapingRobot to scrape each page found, retrieving contact details and minimum word counts for guest posts (where these can be found) using regex.
scrapingrobot.py
import requests
from bs4 import BeautifulSoup
from decodeemail import *
import re
def get_guest_post_opportunities(keyword, country, results, token):
    # build the search query using an advanced search operator footprint
    query = '"' + keyword + '" + inurl:"write-for-us"'
    url = "https://api.scrapingrobot.com?token=" + token
    payload = {
        "url": "https://www.google.com",
        "module": "GoogleScraper",
        "params": {
            "proxyCountry": country.upper(),
            "countryCountry": country.lower(),
            "query": query,
            "num": results,
        }
    }
    headers = {
        "accept": "application/json",
        "content-type": "application/json"
    }
    response = requests.post(url, json=payload, headers=headers)
    data = response.json()
    sites = []
    if "result" in data.keys():
        results = data["result"]["organicResults"]
        for result in results:
            result = [result["url"]]
            sites.append(result)
    return sites
def guest_post_details(url, token):
    request_url = "https://api.scrapingrobot.com/?token=" + token + "&url=" + url
    headers = {"accept": "application/json"}
    response = requests.get(request_url, headers=headers)
    doc = response.json()
    if "result" in doc.keys():
        soup = BeautifulSoup(doc["result"], "html.parser")
        # strip the header and footer so boilerplate links/emails aren't picked up
        for each in ['header', 'footer']:
            s = soup.find(each)
            if s:
                s.extract()
        elems = []
        for elem in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li']):
            elems.append(elem.text)
        elems_string = ' '.join(elems)
        # Cloudflare-protected emails are stored in a data-cfemail attribute
        protected_email = soup.find("span", {"data-cfemail": True})
        email = url
        count = '-'
        if protected_email:
            attrs = protected_email.attrs
            cfemail_clean = attrs['data-cfemail'].replace('"', '').replace('\\', '')
            email = cfDecodeEmail(cfemail_clean)
        else:
            # otherwise try a plain-text email, falling back to a contact form link
            email_regex = r"([A-Za-z0-9]+[._-]*[A-Za-z0-9]+\s*@\s*[A-Za-z0-9-]+\.[A-Za-z]{2,})"
            found_email = re.findall(email_regex, elems_string)
            form_url = soup.select_one("a[href*=contact]")
            if found_email:
                email = found_email[0]
            elif form_url:
                email = form_url.get("href")
        # look for a minimum word count (e.g. "1,500+ words"), ignoring dollar amounts
        count_regex = r"(?i)(?<!\$)([1-9][0-9]00|[1-9],[0-9]00|[1-9]00)(?=(?:\+|-|\s*word|\s*to))"
        found_count = re.findall(count_regex, elems_string)
        if found_count:
            found_count_clean = []
            for c in found_count:
                found_count_clean.append(c.replace(',', ''))
            found_counts = list(map(int, found_count_clean))
            found_counts.sort()
            count = found_counts[0]
        return {"url": url, "email": email.replace(" ", ""), "words": count}
decodeemail.py
This file contains the “cfDecodeEmail” function, which I borrowed from a blog post titled “Decoding Cloudflare-protected Emails“.
All it does is decode email addresses that have been obfuscated by Cloudflare’s email protection on sites that use it.
decodeemail.py
def cfDecodeEmail(encodedString):
    # the first byte is the XOR key; each following byte pair is a character XORed with it
    r = int(encodedString[:2], 16)
    email = ''.join([chr(int(encodedString[i:i + 2], 16) ^ r) for i in range(2, len(encodedString), 2)])
    return email
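If you want to sanity-check it, here’s a quick round-trip using a hypothetical encoder that applies the same scheme (key byte first, then each character XORed with it); the key value is arbitrary:

# Hypothetical encoder for testing only, mirroring the obfuscation scheme above
def cf_encode_email(email, key=0x42):
    return f"{key:02x}" + ''.join(f"{ord(c) ^ key:02x}" for c in email)

print(cfDecodeEmail(cf_encode_email("hello@example.com")))  # hello@example.com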
mozapi.py
This file contains the “get_moz“ function, which takes the root domains of the URLs scraped from the SERP and returns basic site metrics via an SEO API provided by Moz, the Mozscape Links API.
I’ve only included Domain Authority (DA) and Referring Domains (root_domains_to_root_domain) in my script, but you can edit this to add any of the other metrics that are available to free accounts from Moz’s links API.
mozapi.py
import requests
def get_moz(urls, access_id, secret_key):
    auth = (access_id, secret_key)
    request_url = "https://lsapi.seomoz.com/v2/url_metrics"
    # urls is already a comma-separated string of quoted domains
    data = """{
        "targets": [""" + urls + """]
    }"""
    request = requests.post(request_url, data=data, auth=auth)
    moz_data = request.json()
    all_site_metrics = []
    for site_metrics in moz_data["results"]:
        site_metrics_row = {
            "url": site_metrics["page"].rstrip('/'),
            "domain_authority": site_metrics["domain_authority"],
            "referring_domains": site_metrics["root_domains_to_root_domain"]
        }
        all_site_metrics.append(site_metrics_row)
    return all_site_metrics
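If you want extra columns, the change is just additional keys in site_metrics_row (plus matching fieldnames in write_csv). Something along these lines would do it; note that page_authority and spam_score are fields I believe the url_metrics endpoint also returns, so double-check them against your own API response before relying on them:

# Assumed extra fields from the Moz response; verify against your own API output
site_metrics_row = {
    "url": site_metrics["page"].rstrip('/'),
    "domain_authority": site_metrics["domain_authority"],
    "referring_domains": site_metrics["root_domains_to_root_domain"],
    "page_authority": site_metrics.get("page_authority"),
    "spam_score": site_metrics.get("spam_score")
}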
datestamp.py
This file just contains a simple function to retrieve the current date, which is then used in the filename for the exported CSV file.
datestamp.py
from datetime import date
def get_datestamp():
    return str(date.today())
writecsv.py
This file contains the “write_csv” function, which just writes the data to CSV.
writecsv.py
import csv
def write_csv(data, filename):
    file = open(filename, 'w', newline='')
    writer = csv.DictWriter(file, fieldnames=['url', 'guidelines', 'contact', 'word_count', 'domain_authority', 'referring_domains'])
    writer.writeheader()
    writer.writerows(data)
    file.close()
    print("written to csv file...done!")
Couple of notes on using the script
I wanted to include a couple of notes on using the script—particularly for anyone wanting to use it as-is, without any modification.
- You’ll obviously need to install Python if you don’t have it already. I’m not going to go into detail on how to do that; it’s best to check the official Python getting started guide.
- You’ll also need to install a couple of third-party dependencies, “requests“ and “bs4“ (Beautiful Soup). The rest of the imports (“time“, “urllib.parse“, “datetime“, “re“ and “csv“) are part of the Python standard library.
- Put all the files in the same directory, then open a terminal, navigate to the location, and run the script from there using “py index.py” in the command line.
- The script is set to run a single query at a time. You could of course modify this by adding a loop (there’s a rough sketch of this just after the index.py listing above), but while I was building this using free accounts, I didn’t want to blow all my credits in the process.
- You can also change the advanced operator used to find guest posts. Right now, it uses inurl:”write-for-us”, but there are plenty of others you could play around with or find examples of online. You’ll want to modify the “query” string in the “get_guest_post_opportunities” function, found in the scrapingrobot.py file.
- You can only query 50 target URLs at a time with the Mozscape Links API. I could chunk the data to run more, but I haven’t yet, so don’t exceed “50” in your settings in the index.py file.
- If you want to modify the script yourself to do the above, keep in mind that the Mozscape Links API has a 10 second rate limit for free accounts. There’s a rough sketch of what chunking could look like just after this list.
- The output is in CSV and can be found in the same directory—which you’ll want to open or import into Microsoft Excel, Google Sheets, or whatever spreadsheet software you use.
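For anyone who does want to go past 50 domains, here’s a minimal sketch of how the Moz calls could be chunked; the chunked_get_moz helper is just something I’ve made up for illustration, and the 10-second sleep is there for the free-tier rate limit mentioned above:

import time
from mozapi import *

# Hypothetical helper: query Moz in batches of 50 domains, sleeping between requests
# to respect the free-tier rate limit.
def chunked_get_moz(url_list, access_id, secret_key, batch_size=50):
    all_metrics = []
    for i in range(0, len(url_list), batch_size):
        batch = url_list[i:i + batch_size]
        urls_string = ', '.join(f'"{url}"' for url in batch)
        all_metrics.extend(get_moz(urls_string, access_id, secret_key))
        time.sleep(10)  # one request every 10 seconds on the free tier
    return all_metrics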
Summing it up
That’s all really. With any luck the steps outlined above are easy enough to follow, and even those completely unfamiliar with coding should be able to get the script up and running without too much difficulty. Although this script is in many ways still a work in progress, I felt it was far enough along to share for feedback and comment—it’s definitely useable, although of course, it could be better.
While it may still be a little buggy—especially when it comes to pulling out emails, as there are plenty of different ways people try to conceal their emails from scrapers—it’s certainly in a usable state. There are other things that could be done here, such as checking if there is a form present on the page, or trying to find general emails and contact forms when none are found—but I’ll work on this later, as the early attempts at this were super unreliable.
On that note, if you have any issues or feedback, or better yet, examples of sites where the script struggled to pull data that is present on the page, please drop these in the comments so I can look to implement fixes and changes that make the script more useful for everyone. I’m fairly new to Python, so I’m happy to welcome criticism (constructive, of course) that can help me improve and produce better code. Feel free to drop that in the comments too.
Well that’s all folks—thanks for reading! I hope that you’ve found this useful.