Checking a webpage (or sitemap) for broken links with wget

As the internet gets bigger, link rot gets badder. I still have a gigantic folder of bookmarks from pages I liked on StumbleUpon over a decade ago, and it's sad to see how many of them now lead to nowhere. I've been making a real effort over the past couple of years to archive the things I've enjoyed, but since nobody lets you know when a little blog drops offline, I wanted something that could occasionally scan through the links on my websites and email me when one breaks so I could replace it if possible.

There are plenty of free and commercial products that already do this, but I prefer the big-folder-full-of-shell-scripts approach to getting things done and this is an easy task. To download a webpage, scan it for links and follow them to check for errors, you can run the following wget command:

wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https://url

I've written all the arguments in long form to make it a bit easier to read. The --spider option just checks pages are there instead of downloading them, but it still creates the directory structure so we also add --no-directories. To make it follow the links it finds, we use --recursive, but set --level 1 so it only goes one level deep. This is ideal for me as I only want to run my script against single webpages, but play with the number if you need more. For example, to automate this across your whole site, you could grab the sitemap.xml with wget, extract the URLs then pass them back to wget to scan each in turn (edit: see the bottom for an example). But back to what we're doing: we also need --span-hosts to allow wget to visit different sites, and --no-verbose cuts out most of the junk from the output that we don't need. Finally, we add --timeout 10 --tries 1 so it doesn't take forever when a site is temporarily down, and --execute robots=off because some sites reject wget entirely with robots.txt and it politely complies. Maybe it's a bit rude to ignore that, but our intent is not to hammer anything here so I've decided it's okay.

Our wget output is still quite verbose, so let's clean it up a bit. wget writes its messages to stderr rather than stdout, so we add 2>&1 before the first pipe to let grep see them (more on that further down):

wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https://url 2>&1 | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'

When wget finds a broken link, it returns something like this in the output:

https://asdfghjkl.me.uk/nonexistent:
Remote file does not exist -- broken link!!!

The first grep only matches lines containing "broken link!". This isn't very helpful on its own, so we add --before-context 1 to also return the line with the URL immediately above. With this option grep puts a line with "--" between matches, which we turn off with --no-group-separator so it looks cleaner. We then pipe through grep again but this time match the inverse, to remove the "broken link!" lines we no longer need. And just to be pedantic, we finally run our list of URLs through sed to remove the colon from the end of each URL ($ matches the end of a line so it leaves the https:// alone).
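
To see the pipeline work without touching the network, here's a small sketch that feeds it a hand-written sample instead of real wget output. Only the two-line "broken link" format comes from the output shown above; the filler lines, the second URL and the function name extract_broken_links are made up for illustration.

```shell
# Hand-written stand-in for wget's output; not a real log.
sample_output='some unrelated log line
https://asdfghjkl.me.uk/nonexistent:
Remote file does not exist -- broken link!!!
another unrelated log line
https://example.com/gone:
Remote file does not exist -- broken link!!!'

# The same three filters as above, wrapped in a (hypothetical) function.
extract_broken_links() {
    grep --before-context 1 --no-group-separator 'broken link!' |
        grep --invert-match 'broken link!' |
        sed --expression 's/:$//'
}

broken=$(printf '%s\n' "$sample_output" | extract_broken_links)
printf '%s\n' "$broken"
```

This should print the two URLs, one per line, with the trailing colons gone and the filler lines dropped.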

We're now left with a list of links, but we're not quite done yet. We want to automate this process so it can run semi-regularly without our input, and email any time it finds a broken link. We're also currently only looking for HTTP errors (404 etc) – if a whole domain disappears, we'd never know! So let's wrap the whole thing in a shell script so we can feed it a URL as an argument:

#!/bin/bash

# check for arguments
if [[ $# -eq 0 ]]; then
    echo 'No URL supplied'
    exit 1
fi

# scan URL for links, follow to see if they exist
# ($1 is quoted so URLs containing ? or & reach wget intact)
wget_output=$(wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 "$1" 2>&1)

# if wget exited with error (i.e. if any broken links were found)
if [[ $? -ne 0 ]]; then
    echo -e "Found broken links in ${1}:\n"
    # check for broken link line, return one line before, remove broken link line, remove colon from end of url
    echo "$wget_output" | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'
    # same again, but for failure to resolve
    echo "$wget_output" | grep 'unable to resolve' | sed --expression 's/^wget: //'
    # exit with error
    exit 1
fi

# otherwise, exit silently with success
exit 0

I saved this as check_links.sh and made it executable with chmod +x check_links.sh, so it runs as ./check_links.sh https://url. Here's how it all works:

We first check the number of arguments ($#) supplied to the script. If this is zero, no URL was supplied, so we exit with an error. We then run our wget command, feeding in the first argument to the script ($1) as the URL and saving its output to the variable wget_output. wget by default outputs its messages to stderr rather than stdout, so we add 2>&1 to redirect stderr to stdout so it'll end up in our variable. I could never remember what order these characters went in, so I'll break it down: 2 means stderr, > means "redirect to a file" (compare to |, which redirects to a command), and &1 means "reuse whatever stdout is using".
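
A quick way to convince yourself of what 2>&1 does is to capture the output of a command that writes to both streams. This is only a sketch: noisy is a hypothetical stand-in for wget, not anything wget provides.

```shell
# Hypothetical stand-in for a command that, like wget, logs to stderr.
noisy() {
    echo "result on stdout"
    echo "log message on stderr" >&2
}

# Without the redirect only stdout is captured; the stderr line goes
# straight to the terminal instead of into the variable.
stdout_only=$(noisy)

# With 2>&1, stderr is pointed at whatever stdout is using, so both are captured.
both=$(noisy 2>&1)

printf 'without 2>&1: %s\n' "$stdout_only"
printf 'with 2>&1:\n%s\n' "$both"
```

The first variable ends up holding only the stdout line, while the second holds both lines.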

We separated out wget from the rest because we want to now check its exit code. If it didn't find any broken links, it'll exit successfully with code 0. If it did, it'll exit with a different number. We compare the exit code of the last-run command ($?) with 0, and if they don't match, we can continue cleaning up its output. If they do, there's nothing more we need to do, so we exit successfully ourselves.
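
One thing worth knowing about $? is that every command overwrites it, so it has to be read straight after the command you care about. Here's a minimal sketch of that, using a hypothetical fake_wget function in place of the real thing (the exit code 8 is the one wget's manual lists for a server error response; the function and its message are otherwise invented for the demo).

```shell
# Hypothetical stand-in for wget finding a broken link.
fake_wget() {
    echo "Remote file does not exist -- broken link!!!" >&2
    return 8
}

scan() {
    # Declare first, assign second: "local output=$(...)" would report
    # the exit code of "local" itself rather than the command's.
    local output status
    output=$(fake_wget 2>&1)
    status=$?                 # saved immediately after the command
    echo "scan finished"      # this echo succeeds, resetting $? to 0
    echo "status after echo: $?, saved status: $status"
}

result=$(scan)
printf '%s\n' "$result"
```

The script above gets away with testing $? directly because the check comes on the very next line after the wget call.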

First we return the URL that was fed to the script, because we'll be running this on a schedule and we want our emails to say which page they were looking at. We use ${1} instead of $1 so we can put characters immediately after the variable without needing a space in between. \n adds an extra newline, which requires that echo be called with -e. We then send our output through the same series of greps as before. Something I didn't realise was that running echo "$variable" keeps the line breaks intact, whereas echo $variable strips them out (the difference between running it with one tall parameter, or a separate parameter for every word, since the shell splits an unquoted variable on spaces and newlines alike). You learn something new every day!
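
The quoting difference is easy to check for yourself. In this sketch count_args is a hypothetical helper that just reports how many parameters it received, and the default IFS is assumed.

```shell
# Hypothetical helper: print the number of parameters received.
count_args() {
    echo "$#"
}

text='first line
second line'

quoted_count=$(count_args "$text")   # one parameter, newline preserved
unquoted_count=$(count_args $text)   # split on whitespace into separate words
flattened=$(echo $text)              # echo rejoins the words with single spaces

echo "quoted: $quoted_count parameter(s)"
echo "unquoted: $unquoted_count parameter(s)"
echo "unquoted echo prints: $flattened"
```

Quoted, the whole two-line string arrives as a single parameter; unquoted, it arrives as four, and echo prints them back on one line.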

We also wanted to cover domains disappearing entirely. When wget can't resolve a domain, it leaves a one-line message like wget: unable to resolve host address ‘asdfghjkladssaddsa.com’. We run through our output again and use sed to take the wget: off the front (^ matches the start of a line), leaving behind a nice descriptive message. We can now exit with code 1, indicating that an error occurred.
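
The same kind of offline check works for the resolve failures. The sample below is hand-assembled from the two message formats quoted in this post, so treat it as an illustration rather than a real wget log.

```shell
# Hand-assembled sample; only the message formats come from the post.
sample_output='https://asdfghjkl.me.uk/nonexistent:
Remote file does not exist -- broken link!!!
wget: unable to resolve host address ‘asdfghjkladssaddsa.com’'

# Keep only the resolve failures and strip the "wget: " prefix.
unresolved=$(printf '%s\n' "$sample_output" |
    grep 'unable to resolve' |
    sed --expression 's/^wget: //')

printf '%s\n' "$unresolved"
```

Only the resolve failure survives, minus its prefix; the broken-link lines are left for the other pipeline to handle.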

To run this on a schedule, cron has us covered. Run crontab -e to edit your user's crontab, and add something like this:

0 5 */15 * * /home/asdfghjkl/check_links.sh "https://url"

This will run the script at 5:00am roughly twice a month: */15 in the day-of-month field matches the 1st, the 16th and the 31st, so months with 31 days get a third run. If you're unfamiliar with the format, check out crontab.guru for some examples – it's an incredibly useful piece of software to know and can accommodate the most complex schedules. It's best to include the full path to the script: cron should use your home directory as its working directory, but you never know.
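
If you want to see which days a step value like */15 lands on, seq can do the arithmetic. This is just a sketch of the counting, assuming the usual cron behaviour of stepping from the first value of the field's range (1 to 31 for day of month); it doesn't talk to cron at all.

```shell
# Day of month runs from 1 to 31; */15 means every 15th value starting at 1.
matching_days=$(seq 1 15 31 | tr '\n' ' ' | sed 's/ $//')
echo "*/15 matches days: $matching_days"

# To run on exactly two days a month, a crontab line could list them instead:
# 0 5 1,16 * * /home/asdfghjkl/check_links.sh "https://url"
```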

To email our results, there's no need to reinvent the wheel: cron can do it too. In your crontab, set the MAILTO variable, and make sure it's above the line you added:

MAILTO="email@address.com"

You just need to make sure your server can send emails. Now, I've run my own mailservers before, for fun mind you, and if you haven't, don't. It Is Hell. You spend an age getting postfix and a nice web interface set up perfectly, create your SPF records, generate your DKIM keys, check your IP on all the blacklists, and then everyone drops your mail in the spam box or rejects it outright anyway. Don't forget we're sending emails full of random links too, which never helps. No, email is one of those things I will happily pay (or trade for diet pill ads) to have dealt with for me. I use ssmtp, which quietly replaces the default mail/sendmail commands and only needs a simple config file filling with your SMTP details. That link has some tips on setting it up with a Gmail account; I use a separate address from an old free Google Apps plan so I'm not leaving important passwords floating about in cleartext.

The only problem with this approach is that cron is chatty. Okay, it's 45 years old and perfect as far as I'm concerned, but if a task outputs anything, it figures you want an email about it – even if it finished successfully and only printed to stdout. There are a few solutions to this: you can set the MAILTO variable more than once in your crontab, so you can set it just for this task and unset it afterwards:

MAILTO="email@address.com"
0 5 */15 * * /home/asdfghjkl/check_links.sh "https://url"
MAILTO=

Or you could go scorched-earth and redirect everything else to /dev/null:

0 0 * * * important_thing > /dev/null 2>&1

But if you still want other commands to email if something goes wrong, you want cronic. It's a small shell script to wrap commands in that suppresses output unless an error occurred, so that's the only time you'll get emails. If your distribution doesn't have a package for it, just drop it in /usr/local/bin and chmod +x it, then prepend your commands with cronic. You don't need it for our script, because we exit 0 without any output if we found nothing, but it works fine with or without.
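
To make the idea concrete, here's a heavily simplified sketch of that kind of wrapper. It is not cronic's actual code (cronic itself does more than this); quiet_unless_failed is a hypothetical function that only looks at the exit code.

```shell
# Run a command, swallow its output, and only print it if the command failed.
quiet_unless_failed() {
    local output status
    output=$("$@" 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$output"
    fi
    return "$status"
}

# A successful command prints nothing, so cron would have nothing to email.
quiet_unless_failed echo "all good"
```

A failing command, on the other hand, has its captured output printed and its exit code passed along, which is what would trigger the email.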

(P.S. if you've done that and the deluge hasn't ceased, also check root's crontab with sudo crontab -e and anything in /etc/cron.d, hourly, daily etc)

Bonus tip: add an alias to your ~/.bashrc to let you check a URL from the command line:

alias check-links="/home/asdfghjkl/check_links.sh"

Save and run bash to reload, then you can check-links https://url.horse to your heart's content.

Okay, that's it. This post turned out quite excessive for a simple script, so I'm sorry if it was a bit long. I find if I don't practice stuff like this regularly I start to forget the basics, which I'm sure everyone can relate to. But if I rubber duck it from first principles it's much easier to remember, and god knows my girlfriend has suffered enough, so into the void it goes. Have a good one.

Double bonus tip: since I'm still awake, here's a modified version of the script that ingests an XML sitemap instead of a single page to check. Many CMSs will generate these for you so it's an easy way to check links across your entire website without having to scrape it yourself. I made this for WordPress but it should work with any sitemap that meets the spec.

#!/bin/bash

# run as ./check_sitemap.sh https://example.com/wp-sitemap-posts-post-1.xml
# note: each wordpress sitemap contains max 2000 posts, scrape wp-sitemap.xml for the rest if you need. pages are in a separate sitemap.

# don't check URLs containing these patterns (supply a POSIX regex)
# these are some sane defaults to ignore for a wordpress install
ignore="xmlrpc.php|//fonts.googleapis.com/|//fonts.gstatic.com/|//secure.gravatar.com/|//akismet.com/|//wordpress.org/|//s.w.org/|//gmpg.org/"
# optionally also exclude internal links to the same directory as the sitemap
# e.g. https://example.com/blog/sitemap.xml excludes https://example.com/blog/foo but includes https://example.com/bar
ignore="${ignore}|$(echo "$1" | grep --perl-regexp --only-matching '//.+(?:/)')"
# optionally exclude internal links to the sitemap's entire domain
#ignore="${ignore}|$(echo "$1" | grep --extended-regexp --only-matching '//[^/]+')"

# check for arguments
if [[ $# -eq 0 ]]; then
    echo 'No URL supplied'
    exit 1
fi

# download sitemap.xml
sitemap_content=$(wget --execute robots=off --no-directories --no-verbose --timeout 10 --tries 1 --output-document - "$1")
# a non-zero exit code means the download failed, so stop here
if [[ $? -ne 0 ]]; then
    echo 'Failed to get sitemap URL'
    exit 1
fi

# extract URLs from <loc> tags, scan for links, follow to see if they exist
wget_output=$(echo "$sitemap_content" | grep --perl-regexp --only-matching '(?<=<loc>)https?://[^<]+' | wget --input-file - --reject-regex "$ignore" --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 --wait 3 2>&1)

# if wget exited with error (i.e. if any broken links were found)
if [[ $? -ne 0 ]]; then
    echo -e "Found broken links in ${1}:\n"
    # check for broken link line, return one line before, remove broken link line, remove colon from end of url
    echo "$wget_output" | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'
    # same again, but for failure to resolve
    echo "$wget_output" | grep 'unable to resolve' | sed --expression 's/^wget: //'
    # exit with error
    exit 1
fi

# otherwise, exit silently with success
exit 0

A short explanation of the changes: since wget can't extract links from XML files, we first download the sitemap to stdout (--output-document -) and search it for URLs. The ones we want are inside <loc> tags, so we grep for those: (?<=...) is a "positive lookbehind", which finds a tag located just before the rest of the match but doesn't include it in the result. We then match for http(s)://, then any number of characters until we reach a < symbol, signifying the start of the closing </loc>.
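
Here's the extraction step on its own, run against a tiny hand-written sitemap rather than a downloaded one. The example.com URLs and the lastmod date are placeholders; the grep is the same one used in the script, and it needs a grep built with Perl regex support.

```shell
# Minimal hand-written sitemap for illustration.
sitemap_content='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.com/blog/first-post/</loc><lastmod>2021-01-01T00:00:00+00:00</lastmod></url>
<url><loc>https://example.com/blog/second-post/</loc></url>
</urlset>'

# The lookbehind requires <loc> just before the URL but leaves it out of the match.
page_urls=$(printf '%s\n' "$sitemap_content" |
    grep --perl-regexp --only-matching '(?<=<loc>)https?://[^<]+')

printf '%s\n' "$page_urls"
```

Only the two post URLs come out. The xmlns URL on the second line is skipped because it isn't preceded by a <loc> tag.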

We pass our list of URLs to wget using --input-file - and scan each in turn for broken links as before. This time we add a 3-second wait between requests to avoid hitting anyone too fast, and also allow for ignoring certain URL patterns using --reject-regex. A CMS likely pulls in some external resources which we don't need to be warned about – for example, fonts.googleapis.com is linked here in the <head> to be DNS prefetched, but the URL itself will always 404. We don't need an email about it. I've prefilled the $ignore variable with some reasonable exclusions for a stock WordPress install: note the patterns don't need wildcards, so use //domain.com/ to ignore a whole domain and xmlrpc.php for a specific file.
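
If you're not sure whether a pattern will catch a given URL, you can approximate the check with grep. This is a sketch under an assumption: it treats a URL as ignored when the pattern matches anywhere inside it, which is how the patterns are described above, and it uses grep's extended regex as a stand-in for wget's own matching. The function would_ignore and the test URLs are hypothetical.

```shell
ignore="xmlrpc.php|//fonts.googleapis.com/|//fonts.gstatic.com/|//secure.gravatar.com/|//akismet.com/|//wordpress.org/|//s.w.org/|//gmpg.org/"

# Succeeds if the URL matches any of the ignore patterns.
would_ignore() {
    printf '%s\n' "$1" | grep --extended-regexp --quiet -- "$ignore"
}

decisions=$(
    for url in \
        "https://fonts.googleapis.com/" \
        "https://example.com/xmlrpc.php" \
        "https://example.com/some-post/"
    do
        if would_ignore "$url"; then
            echo "skip $url"
        else
            echo "check $url"
        fi
    done
)

printf '%s\n' "$decisions"
```

With these three URLs, the first two are skipped and the last one is checked.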

Something else you might like to ignore is your own site! You already have all the links to scan so there's little need to go through them again on each page, though maybe you'd like to check for typos or missing resources. I'm only interested in external links, so I use the second $ignore addition (line 11) to exclude everything from the same subdirectory as the sitemap. The grep command here takes our input URL, starts at the // of https://, and matches any character up until the final / is found. This removes just the sitemap filename and leaves the rest behind. So feeding it https://asdfghjkl.me.uk/blog/sitemap.xml would give //asdfghjkl.me.uk/blog/ as the exclusion, ignoring /blog and /blog/post but still checking links to other parts of the site like /shop or /. To instead exclude my entire domain I could switch it with line 13, where the regex starts at // and stops when it finds the first / (if it exists), leaving //asdfghjkl.me.uk as the exclusion.
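
Both exclusion patterns can be tried out on their own. This sketch runs the two grep commands from the script against the example sitemap URL mentioned above; the variable names are just for the demo.

```shell
sitemap_url="https://asdfghjkl.me.uk/blog/sitemap.xml"

# Greedy match from // to the last slash: the sitemap's directory.
same_directory=$(echo "$sitemap_url" | grep --perl-regexp --only-matching '//.+(?:/)')

# Match from // up to (not including) the next slash: the whole domain.
whole_domain=$(echo "$sitemap_url" | grep --extended-regexp --only-matching '//[^/]+')

echo "directory exclusion: $same_directory"
echo "domain exclusion:    $whole_domain"
```

The first gives //asdfghjkl.me.uk/blog/ and the second //asdfghjkl.me.uk, matching the exclusions described above.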

The only thing missing from this script variation is letting you know which specific page it found a broken link on – right now it just reports the sitemap URL. Instead of passing the list of URLs to wget in one go, you could loop through one at a time and output that for the "Found broken links" message. But that is left as an exercise to the reader. I'm out!
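
For anyone who'd rather not do the exercise, here's one possible shape for that loop. It's a sketch, not a tested replacement for the script: scan_page and check_pages are hypothetical names, $ignore is assumed to be set as in the sitemap script, and the list of page URLs is assumed to arrive on stdin, one per line.

```shell
# Scan a single page with the same wget options as the sitemap script.
# Assumes $ignore has already been set to a non-empty pattern.
scan_page() {
    wget --reject-regex "$ignore" --spider --recursive --execute robots=off \
        --no-directories --no-verbose --span-hosts --level 1 \
        --timeout 10 --tries 1 --wait 3 "$1" 2>&1
}

# Read page URLs from stdin and report problems under the page they were found on.
# Returns 1 if any page had a problem, 0 otherwise.
check_pages() {
    local page output found=0
    while IFS= read -r page; do
        # stdin is redirected so the scan can't eat the rest of the URL list
        if ! output=$(scan_page "$page" < /dev/null); then
            found=1
            printf 'Found broken links in %s:\n\n' "$page"
            echo "$output" | grep --before-context 1 --no-group-separator 'broken link!' |
                grep --invert-match 'broken link!' | sed --expression 's/:$//'
            echo "$output" | grep 'unable to resolve' | sed --expression 's/^wget: //'
        fi
    done
    return "$found"
}
```

In the sitemap script, the URL list produced by the <loc> grep could be piped into check_pages in place of the single wget call, with the script exiting 1 if it returns non-zero. The trade-off is that a link appearing on many pages gets checked once per page.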
