Automate the alphabet soup method with Node.js and Puppeteer


It’s been a while since my last post, over a month in fact, which is pretty poor form. Shame on me! I’ve been fairly busy, as I expect we’ve all been, with work and a few side projects, a number of which involve web scraping.

This got me thinking about daily SEO tasks that could be easily automated with something similar, which led me to the alphabet soup method popularised by Income School.

For anyone unfamiliar, this method essentially uses Google’s autocomplete to identify opportunities for content generation. It’s a staple of search analysis, especially for anyone just starting out who doesn’t yet have the budget for the typical assortment of costly SEO tools most of us use for keyword research.

For the purposes of this article, we’re going to keep things as simple as possible, homing in on the most basic example of how to do this. This should be good enough for most people, and if you’re looking for a more sophisticated implementation, it offers a reasonable base from which to flesh out any additional requirements.

How it works

The script itself is fairly simple. It takes a keyword, a question modifier (“how”, “who”, “why”, “what”, etc.), a regional Google URL, and a set of proxy credentials. With these, it scrapes Google, automatically typing in search queries to populate Google’s autocomplete.

It creates six variants of these search queries, reordering the modifier and keyword, as well as using the wildcard (*) operator, so that as many phrases as possible are spat out and captured. These variations can be seen in the snippet below:

let queries = [
     `${modifier} ${keyword} ${letter}`,
     `${modifier} ${letter} ${keyword}`,
     `${modifier} * ${keyword} ${letter}`,
     `${modifier} ${keyword} * ${letter}`,
     `${modifier} * ${letter} ${keyword}`,
     `${modifier} ${letter} * ${keyword}`
]
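
For illustration: with example values of “how” for the modifier, “sourdough bread” for the keyword, and the letter “a” (my values, not the script’s), these templates expand to:

how sourdough bread a
how a sourdough bread
how * sourdough bread a
how sourdough bread * a
how * a sourdough bread
how a * sourdough bread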

The results are then merged, duplicates are removed, and the final set of unique phrases is written to a text file, which you can then cluster and prioritise, either manually or using a third-party clustering tool.
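
In isolation, that last step boils down to something like this minimal sketch, with a dummy input array standing in for the scraped results:

const fs = require('fs')

// Example input: the merged (and messy) array of scraped phrases
const results = ['how to make sourdough bread', 'how to make sourdough bread']

// Keep the first occurrence of each phrase, then write one per line
const unique = [...new Set(results)]
fs.writeFileSync('./data/phrases.txt', unique.join('\n'))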

What you’ll need

Not much: just Node and a rotating proxy service, to stop your IP getting blocked by Google.

Rotating proxy service

While it’s not completely necessary, if you intend to use the script heavily, you’ll want to use a rotating proxy service to prevent your IP being blocked by Google. If you don’t use a proxy, you’ll also need to make a couple of changes to the script; these changes are given in the “How to use it” section of this post.

Personally, I use BotProxy. I find their product super straightforward, affordable, and more than up to what I need it for. There are plenty of other options out there; however, for something on this sort of scale, you’ll struggle to find a better alternative. They also offer a free trial, which not all similar services do.

I’m subscribed to their lowest tier, which gives me up to 55 different IPs per day and up to 10GB of usage for $10 per month: more than enough for what I need it for. If you’re not primarily focused on the US market, you might want to consider the next tier up, which includes local IPs and not just US ones. You can find pricing for BotProxy here.

That said, for this task, provided you use the correct local Google variant, include the correct location modifier in the query string (e.g. &gl=uk), and aren’t absolutely ragging it with thousands upon thousands of queries, you should be fine with the entry pricing tier.
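
For example, targeting the UK might look something like this (an illustrative value, not something the script sets for you):

const googleUrl = `https://www.google.co.uk/?gl=uk`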

The code

Like our previous SEO-related post, PageSpeed Insights testing with Node, this isn’t supposed to be a programming tutorial, so I won’t spend too much time on the specifics of how it works.

If you’re someone who wants to dive into the code a little, you can see it below. If you’re not, and are more interested in just using it, you can jump ahead to the “How to use it” section in this post.

const puppeteer = require('puppeteer')
const fs = require('fs')

// Keyword, question modifier, and the Google URL to scrape
const keyword = `KEYWORD HERE`
const modifier = `QUESTION MODIFIER HERE`
const googleUrl = `GOOGLE URL HERE`

// Proxy settings
const proxyUser = `PROXY USERNAME HERE`
const proxyPassword = `PROXY PASSWORD HERE`
const proxyServer = `PROXY SERVER URL HERE`

const proxySettings = [proxyUser,proxyPassword,proxyServer]

// Letters to append; the leading empty string captures suggestions for the bare query
const alphabet = ['','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

// Type a query into Google's search box and wait for the autocomplete list
async function type(page,text) {
    try {
        let searchInput = await page.$('input[name=q]')
        await searchInput.type(text, {delay: 50})
        await page.waitForSelector('ul[role=listbox]')
        await delay(500)
    }
    catch(e) {
        console.log(e)
    }
}

// Clear the search box: a triple-click selects the text, Backspace deletes it
async function clear(page) {
    try {
        let searchInput = await page.$('input[name=q]') 
        await searchInput.click({clickCount: 3})
        await searchInput.press('Backspace')
    }
    catch(e) {
        console.log(e)
    }
}

// Simple promise-based sleep
function delay(time) {
    return new Promise(function(resolve) {
        setTimeout(resolve, time)
    })
}

// Launch Chromium through the proxy, run each query variant through Google's
// search box, and collect the autocomplete suggestions
async function google(alphabet,keyword,modifier,url,proxySettings) {
    try {
        const browser = await puppeteer.launch({
            headless: true,
            args: [`--proxy-server=${proxySettings[2]}`]
        })
        const page = await browser.newPage()
        await page.authenticate({        
            username: proxySettings[0],
            password: proxySettings[1]
        })
        await page.goto(url)
        // Exposed functions run in Node rather than the page, so this callback
        // can keep driving the page while page.evaluate below awaits it
        await page.exposeFunction('autocompleteQuery', async (alphabet,keyword,modifier) => {
            let results = []
            for (const letter of alphabet) {
                let queries = [
                    `${modifier} ${keyword} ${letter}`,
                    `${modifier} ${letter} ${keyword}`,
                    `${modifier} * ${keyword} ${letter}`,
                    `${modifier} ${keyword} * ${letter}`,
                    `${modifier} * ${letter} ${keyword}`,
                    `${modifier} ${letter} * ${keyword}`
                ]
                for (const query of queries) {
                    await type(page,query)
                    // Grab the text of every suggestion in the autocomplete dropdown
                    let getAutocomplete = await page.$$eval('ul[role=listbox] span', elems => elems.map(elem => elem.innerText))
                    // Keep only suggestions containing both the keyword and the modifier
                    getAutocomplete.forEach(function(value) {
                        if(value.indexOf(keyword.toLowerCase()) !== -1 && value.indexOf(modifier.toLowerCase()) !== -1) {
                            results.push(value)
                        }
                    })
                    await clear(page)
                }
            }
            return results
        })
        const results = await page.evaluate(async (alphabet,keyword,modifier) => {
            let results = await autocompleteQuery(alphabet,keyword,modifier)
            return results
        },alphabet,keyword,modifier)
        await browser.close()
        return results
    }
    catch(e) {
        console.log(e)
    }
}

// Filesystem-safe timestamp for file names (colons swapped for hyphens)
async function timestamp() {
    try {
        const isoDateTime = date => date.toISOString().slice(0, 19)
        let dateTime = isoDateTime(new Date())
        return dateTime.replace('T','_').replace(/:/g,'-')
    }
    catch(e) {
        console.log(e)
    }
}

// Build the output file name from the keyword, modifier, and timestamp
async function fileName(keyword,modifier) {
    try {
        let time = await timestamp()
        let fileName = `${keyword.replace(/ /g,'_')}_${modifier}_${time}`.toLowerCase()
        return fileName
    }
    catch(e) {
        console.log(e)
    }
}

// Write the phrases to ./data/<name>.txt, one per line
async function writeFile(keyword,modifier,results) {
    try {
        let file = await fileName(keyword,modifier)
        let data = results.join('\n')
        fs.writeFile(`./data/${file}.txt`, data, (e) => { if (e) throw e })
        return file
    }
    catch(e) {
        console.log(e)
    }
}

// Strip duplicate phrases, keeping the first occurrence of each
async function removeDuplicates(results) {
    try {
        return [...new Set(results)]
    }
    catch(e) {
        console.log(e)
    }
}

// Scrape, dedupe, and write out the results
async function init(alphabet,keyword,modifier,url,proxySettings) {
    try {
        let results = await google(alphabet,keyword,modifier,url,proxySettings)
        let uniqueResults = await removeDuplicates(results)
        await writeFile(keyword,modifier,uniqueResults)
    }
    catch(e) {
        console.log(e)
    }
}

init(alphabet,keyword,modifier,googleUrl,proxySettings)

How to use it

The script is fairly easy to set up, even without much in the way of technical know-how. I’ve outlined the steps below to make it super easy to get up and running with it.

  1. If you don’t already have Node on your machine, you’ll need to install it. You can grab the installers on the download page of the official Node.js website.
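
    You can check whether it’s already installed, and which version you have, with the following commands in a terminal:

    node -v
    npm -v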

  2. Download the “index.js” and “package.json” files for the script here. Create a new directory on your machine and copy these two files to it.

  3. Create a directory named “data” inside the directory you created in the previous step.

  4. Open “index.js” in a text editor and enter your own “keyword”, “modifier”, “googleUrl”, “proxyUser”, “proxyPassword” and “proxyServer”, then save and close the file.

    const keyword = `KEYWORD HERE`   
    const modifier = `QUESTION MODIFIER HERE`   
    const googleUrl = `GOOGLE URL HERE`   
    
    const proxyUser = `PROXY USERNAME HERE`   
    const proxyPassword = `PROXY PASSWORD HERE`   
    const proxyServer = `PROXY SERVER URL HERE`
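
    For illustration, a filled-in config might look something like this (every value below is a made-up example; use your own keyword and your proxy provider’s details):

    const keyword = `sourdough bread`
    const modifier = `how`
    const googleUrl = `https://www.google.com`

    const proxyUser = `exampleUser`
    const proxyPassword = `examplePassword`
    const proxyServer = `http://proxy.example.com:8080`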

    If you’re going to use it without a rotating proxy service, you’ll need to make the following edits to the “google” function in the index.js file:

    async function google(alphabet,keyword,modifier,url,proxySettings) {
        try {
            const browser = await puppeteer.launch({headless: false}) // headless: false opens a visible browser window; set it to true to hide it
            const page = await browser.newPage()
            await page.goto(url)
            await page.exposeFunction('autocompleteQuery', async (alphabet,keyword,modifier) => {
                let results = []
                for (const letter of alphabet) {
                    let queries = [
                        `${modifier} ${keyword} ${letter}`,
                        `${modifier} ${letter} ${keyword}`,
                        `${modifier} * ${keyword} ${letter}`,
                        `${modifier} ${keyword} * ${letter}`,
                        `${modifier} * ${letter} ${keyword}`,
                        `${modifier} ${letter} * ${keyword}`
                    ]
                    for (const query of queries) {
                        await type(page,query)
                        let getAutocomplete = await page.$$eval('ul[role=listbox] span', elems => elems.map(elem => elem.innerText))
                        getAutocomplete.forEach(function(value) {
                            if(value.indexOf(keyword.toLowerCase()) !== -1 && value.indexOf(modifier.toLowerCase()) !== -1) {
                                results.push(value)
                            }
                        })
                        await clear(page)
                    }
                }
                return results
            })
            const results = await page.evaluate(async (alphabet,keyword,modifier) => {
                let results = await autocompleteQuery(alphabet,keyword,modifier)
                return results
            },alphabet,keyword,modifier)
            await browser.close()
            return results
        }
        catch(e) {
            console.log(e)
        }
    }

  5. Open a terminal, navigate to the directory you just created, and enter “npm install” to install the dependencies.

  6. Once the dependencies are installed, enter “npm start” to run the script.

  7. After the script has finished, you’ll find a text file in the “data” folder that you created, containing the final list of unique phrases, one per line.

Summing it up and next steps

The obvious next step for this script would be to rework it to handle multiple queries at once, running them asynchronously across multiple instances of Chromium, along the lines of the sketch below. Beyond that, I’ve got a couple of ideas to further expand and add utility to it, making it more of a robust tool for search analysis.
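
As a very rough sketch, assuming the google, removeDuplicates, and writeFile functions from the script above, a concurrent version might look something like this (untested, and all the more reason to use a rotating proxy):

// Run several keywords at once, each in its own Chromium instance
const keywords = ['sourdough bread', 'rye bread'] // example keywords

async function initMany(alphabet, keywords, modifier, url, proxySettings) {
    try {
        await Promise.all(keywords.map(async (keyword) => {
            const results = await google(alphabet, keyword, modifier, url, proxySettings)
            const uniqueResults = await removeDuplicates(results)
            await writeFile(keyword, modifier, uniqueResults)
        }))
    }
    catch(e) {
        console.log(e)
    }
}

initMany(alphabet, keywords, modifier, googleUrl, proxySettings)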

Those ideas include things like identifying ranking competitors, assessing the competition, and other similar factors that would have a significant impact on picking and prioritising which phrases to go after first. If anyone out there has any additional thoughts or ideas on what they’d like to see added in terms of functionality, I’d love to hear them!

If you’ve got any questions or comments, or need a little help getting it working, drop them in the comments and I’ll do my best to respond. I hope someone found this useful, and thanks for reading!

