Get total count of internal links to each page of a site using Node


It’s been just over a month since my last post, so I figured it was about time I pulled my finger out and posted something.

As I often do, I was browsing SEO subs on Reddit—in particular r/TechSEO—where I came across a post from user slow___show, who asked about tools to track the change in the count of internal links to help with reporting on SEO activity.


I thought it was an interesting use case and a decent micro task worth having a stab at with Node, so I whipped up a little script to do what the OP was after, letting you create a CSV file containing all pages of a site with a count of all the internal links pointed at them, using puppeteer and sitemap-xml-parser.

How it works

There’s not a great deal to it—as you’ll see from the steps broken down below. As usual with my posts, this isn’t a tutorial, so don’t expect much of an in-depth explanation. These posts are all about usable, working code snippets that you can use to automate bits and pieces in your SEO workflow.

That said, the code is of course given in full in the “The Code” section, so feel free to go through it, edit it, and use it however you like.

  1. First it uses sitemap-xml-parser to create a list of URLs from the website’s XML sitemap.

  2. Each of these URLs is then looped through, using puppeteer to extract all of the hrefs from each page.

  3. These are then filtered to remove external links, as well as any images, videos, or files—I’ve only included common ones, but you could always add whatever other file extensions you might find on a particular site.

    For this, you’d just need to add any other extensions to the regex given below (there’s an example of an extended version after this list):

    /(#|\.jpg|\.jpeg|\.png|\.gif|\.svg|\.webp|\.webm|\.mp4|\.pdf)/
  4. All of the internal links found are then processed to return a count, which is then written to CSV. The file name also has a date stamp appended, making it easy to run the script once before work is done and again once you’re finished, giving you a simple before-and-after count.
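For instance, if a site also linked out to Word documents and ZIP archives (hypothetical additions here; swap in whatever file types your site actually serves), the extended regex would look like this:

/(#|\.jpg|\.jpeg|\.png|\.gif|\.svg|\.webp|\.webm|\.mp4|\.pdf|\.docx|\.zip)/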

The Code

I’ve given the code below in full, although you’ll of course need to install the dependencies—which, as already mentioned, are puppeteer and sitemap-xml-parser.
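If you’re starting from a fresh folder, installing them is a one-liner (assuming you’re using npm):

npm install puppeteer sitemap-xml-parser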

How long it takes to run the script will depend on the size of your site (obviously). If you’re working with a fairly large site, you’ll probably want to rework the script, chunking the URLs so you can scrape multiple pages simultaneously, speeding the whole thing up.
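As a rough sketch of what that chunking could look like (this isn’t part of the script below, just one way you might approach it, and the chunk size of 5 is an arbitrary pick), you could slice the page list and scrape each slice with Promise.all:

// Rough sketch: scrape pages in chunks rather than one at a time
async function getAllLinks(pages, chunkSize = 5) {
  let allLinks = []
  for (let i = 0; i < pages.length; i += chunkSize) {
    const chunk = pages.slice(i, i + chunkSize)
    // Each page in the chunk is scraped at the same time
    const results = await Promise.all(chunk.map(page => getInternalLinks(page)))
    allLinks = allLinks.concat(...results)
  }
  return allLinks
}

Bear in mind that getInternalLinks launches a fresh browser for every page, so before running anything concurrently you’d also want to launch a single browser up front and share it between pages.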

Another thing you might want to change is the selector, as you may only want to pull out links from within the body of the page, ignoring those in the nav, sidebar, and footer. This is equally simple to implement.
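For example, assuming the site wraps its content in a main element (you’d need to check this against the actual markup of the site you’re working on), the page.evaluate call would become:

const links = await page.evaluate(
  () => Array.from(
    // 'main a[href]' is an assumption; use whatever wraps your body content
    document.querySelectorAll('main a[href]'),
    a => a.getAttribute('href')
  )
)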

index.js

const puppeteer = require('puppeteer');
const SitemapXMLParser = require('sitemap-xml-parser');
const url = require('url');
const fs = require('fs');

const sitemap = 'https://www.example.com/sitemap.xml'
const domain = url.parse(sitemap).hostname

// Fetch the sitemap and return an array of page URLs
async function getPages(sitemap) {
  const sitemapXMLParser = new SitemapXMLParser(sitemap, {delay: 3000, limit: 5})
  const results = await sitemapXMLParser.fetch()
  return results.map(result => result.loc)
}

// Load a page in headless Chrome and return the href of every anchor on it
async function getInternalLinks(url) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url, {
    waitUntil: 'networkidle2'
  });
  const links = await page.evaluate(
    () => Array.from(
      document.querySelectorAll('a[href]'),
      a => a.getAttribute('href')
    )
  )
  await browser.close()
  return links
}

// Return today's date as a zero-padded YYYY-MM-DD stamp for the file name
function getDate() {
  const date_ob = new Date()
  const date = String(date_ob.getDate()).padStart(2, '0')
  const month = String(date_ob.getMonth() + 1).padStart(2, '0')
  const year = date_ob.getFullYear()
  return `${year}-${month}-${date}`
}

// Tally the links pointing at each URL, sort descending, and write to CSV
function writeCSV(internalLinks) {
  const count = {}
  internalLinks.forEach(element => {
    count[element] = (count[element] || 0) + 1;
  })
  const countSorted = Object.keys(count)
    .sort((a, b) => count[b] - count[a])
    .map((key) => [key, count[key]])
  let csv = 'page url,link count\n';
  for (let row of countSorted) {
    csv += row.join(',') + "\n"
  }
  fs.writeFile(`./internal-links-${getDate()}.csv`, csv, 'utf8', function (err) {
    if (err) {
      console.log('Some error occurred...')
    } else {
      console.log('Done...written to CSV!')
    }
  })
}

// Crawl every page from the sitemap, keep internal links only, and count them
async function init(sitemap, domain){
  let pages = await getPages(sitemap)
  let internalLinks = []
  for (let [i, page] of pages.entries()) {
    console.log(`processing page ${i+1} of ${pages.length}...`)
    const links = await getInternalLinks(page)
    for(let link of links) {
      const parsed = url.parse(link)
      // Internal means it contains our domain, or is relative (no hostname and
      // no protocol; the protocol check filters out mailto: and tel: links)
      if(link.indexOf(domain) != -1 || (parsed.hostname === null && parsed.protocol === null)) {
        // Skip links containing a fragment and common image/video/file extensions
        if(!link.match(/(#|\.jpg|\.jpeg|\.png|\.gif|\.svg|\.webp|\.webm|\.mp4|\.pdf)/)) {
          // Make relative links absolute so they count alongside absolute ones
          if(parsed.hostname === null) {
            link = `https://${domain}${link}`
          }
          internalLinks.push(link)
        }
      }
    }
  }
  writeCSV(internalLinks)
}

init(sitemap, domain)
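With the sitemap constant at the top pointed at your own site’s sitemap, running it is just:

node index.js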

The output

I’ve included a screenshot below of the output you can expect. There’s nothing fancy here, just a humble CSV with two columns—one for “page url”, the other for “link count”—but what more do you really need?

Once it’s finished running, the CSV file of pages and counts will be saved to the same location that you run the script from.

Nothing fancy—just a simple CSV with each page and its internal link count, in descending order
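For illustration, the first few rows of the file look something like this (the URLs and counts here are made-up placeholders):

page url,link count
https://www.example.com/,142
https://www.example.com/blog/,56
https://www.example.com/contact/,12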

Summing it up

Well, that’s it really. Simply run the script before you start adding links and again after, giving you a snapshot of the change in count across a site at the page level.

While I was writing the script, I did think that it would be fairly easy to add anchor text to it, and that would definitely add a lot more utility. I’d certainly find it more useful with anchors, especially given the scope and scale of the sites I have to deal with as part of the day job.
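As a sketch of what that could look like (this isn’t in the current script), you’d return the anchor text alongside each href from page.evaluate, then adjust the filtering and counting to work with objects instead of plain strings:

// Hypothetical variant: capture the anchor text alongside each href
const links = await page.evaluate(
  () => Array.from(
    document.querySelectorAll('a[href]'),
    a => ({ href: a.getAttribute('href'), anchor: a.textContent.trim() })
  )
)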

As well as the addition of anchors, and since I know the SEO community isn’t exactly the biggest fan of JavaScript, I may rewrite it in Python, as I could do with the practice. If there’s any interest in this, let me know in the comments and I’ll get around to it, although no promises.

If you’ve made it this far, I hope you found this post useful. If you’ve got any thoughts or questions, feel free to give me a shout in the comments below. Thanks for reading!

