{"id":287,"date":"2021-03-09T06:23:38","date_gmt":"2021-03-09T06:23:38","guid":{"rendered":"https:\/\/asdfghjkl.me.uk\/blog\/?p=287"},"modified":"2021-03-09T10:02:13","modified_gmt":"2021-03-09T10:02:13","slug":"broken-links","status":"publish","type":"post","link":"https:\/\/asdfghjkl.me.uk\/blog\/broken-links\/","title":{"rendered":"Checking a webpage (or sitemap) for broken links with wget"},"content":{"rendered":"\n<p>As the internet gets bigger, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Link_rot\">link rot<\/a> gets badder. I still have a gigantic folder of bookmarks from pages I liked on StumbleUpon over a decade ago, and it's sad to see how many of them now lead to nowhere. I've been making a real effort over the past couple of years to archive the things I've enjoyed, but since nobody lets you know when a little blog drops offline, I wanted something that could occasionally scan through the links on my websites and email me when one breaks so I could replace it if possible.<\/p>\n\n\n\n<p>There are plenty of free and commercial products that already do this, but I prefer the big-folder-full-of-shell-scripts approach to getting things done and this is an easy task. 
To download a webpage, scan it for links and follow them to check for errors, you can run the following wget command:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https:\/\/url<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>I've written all the arguments in long form to make it a bit easier to read. The <code>--spider<\/code> option just checks pages are there instead of downloading them, but it still creates the directory structure so we also add <code>--no-directories<\/code>. To make it follow the links it finds, we use <code>--recursive<\/code>, but set <code>--level 1<\/code> so it only goes one level deep. This is ideal for me as I only want to run my script against single webpages, but play with the number if you need more. For example, to automate this across your whole site, you could grab the sitemap.xml with <code>wget<\/code>, extract the URLs then pass them back to <code>wget<\/code> to scan each in turn (<strong>edit:<\/strong> <a href=\"#doublebonus\">see the bottom<\/a> for an example). But back to what we're doing: we also need <code>--span-hosts<\/code> to allow wget to visit different sites, and <code>--no-verbose<\/code> cuts out most of the junk from the output that we don't need. 
Finally, we add <code>--timeout 10 --tries 1<\/code> so it doesn't take forever when a site is temporarily down, and <code>--execute robots=off<\/code> because some sites reject wget entirely with robots.txt and it politely complies. Maybe it's a bit rude to ignore that, but our intent is not to hammer anything here so I've decided it's okay.<\/p>\n\n\n\n<p>Our wget output is still quite verbose, so let's clean it up a bit:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https:\/\/url | grep --before-context 1 --no-group-separator <span class=\"hljs-string\">'broken link!'<\/span> | grep --invert-match <span class=\"hljs-string\">'broken link!'<\/span> | sed --expression <span class=\"hljs-string\">'s\/:$\/\/'<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>When wget finds a broken link, it returns something like this in the output:<\/p>\n\n\n\n<pre id=\"hterm:copy-to-clipboard-source\" class=\"wp-block-preformatted\">https:\/\/asdfghjkl.me.uk\/nonexistent:\nRemote file does not exist -- broken link!!!<\/pre>\n\n\n\n<p>The first <code>grep<\/code> only matches lines containing \"broken link!\". This isn't very helpful on its own, so we add <code>--before-context 1<\/code> to also return the line with the URL immediately above. With this option <code>grep<\/code> puts a line with \"<code>--<\/code>\" between matches, which we turn off with <code>--no-group-separator<\/code> so it looks cleaner. 
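To make the cleanup concrete, here's the same grep/sed pipeline run against a canned sample of wget's spider output (the URLs and the sample itself are made up for illustration):

```shell
# Hypothetical snippet of wget --spider output, with one broken link
sample='https://example.com/dead:
Remote file does not exist -- broken link!!!
https://example.com/alive:
Remote file exists.'

# Keep the URL line above each match, suppress the "--" separators,
# remove the match itself, then strip the trailing colon
echo "$sample" \
  | grep --before-context 1 --no-group-separator 'broken link!' \
  | grep --invert-match 'broken link!' \
  | sed --expression 's/:$//'
# prints: https://example.com/dead
```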
We then pipe through <code>grep<\/code> again but this time match the inverse, to remove the \"broken link!\" lines we no longer need. And just to be pedantic, we finally run our list of URLs through <code>sed<\/code> to remove the colon from the end of each URL (<code>$<\/code> matches the end of a line so it leaves the https:\/\/ alone).<\/p>\n\n\n\n<p>We're now left with a list of links, but we're not quite done yet. We want to automate this process so it can run semi-regularly without our input, and email any time it finds a broken link. We're also currently only looking for HTTP errors (404 etc) \u2013 if a whole domain disappears, we'd never know! So let's wrap the whole thing in a shell script so we can feed it a URL as an argument:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-code-table shcb-line-numbers shcb-wrap-lines\"><span class='shcb-loc'><span><span class=\"hljs-meta\">#!\/bin\/bash<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># check for arguments<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">if<\/span> &#91;&#91; <span class=\"hljs-variable\">$#<\/span> -eq 0 ]]; <span class=\"hljs-keyword\">then<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">'No URL supplied'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">exit<\/span> 1\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">fi<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># scan URL for links, follow to see if they exist<\/span>\n<\/span><\/span><span class='shcb-loc'><span>wget_output=$(wget --spider --recursive --execute robots=off --no-directories --no-verbose 
--span-hosts --level 1 --timeout 10 --tries 1 <span class=\"hljs-variable\">$1<\/span> 2&gt;&amp;1)\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># if wget exited with error (i.e. if any broken links were found)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">if<\/span> &#91;&#91; $? -ne 0 ]]; <span class=\"hljs-keyword\">then<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> -e <span class=\"hljs-string\">\"Found broken links in <span class=\"hljs-variable\">${1}<\/span>:\\n\"<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># check for broken link line, return one line before, remove broken link line, remove colon from end of url<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$wget_output<\/span>\"<\/span> | grep --before-context 1 --no-group-separator <span class=\"hljs-string\">'broken link!'<\/span> | grep --invert-match <span class=\"hljs-string\">'broken link!'<\/span> | sed --expression <span class=\"hljs-string\">'s\/:$\/\/'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># same again, but for failure to resolve<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$wget_output<\/span>\"<\/span> | grep <span class=\"hljs-string\">'unable to resolve'<\/span> | sed --expression <span class=\"hljs-string\">'s\/^wget: \/\/'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># exit with error<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">exit<\/span> 1\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">fi<\/span>\n<\/span><\/span><span 
class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># otherwise, exit silently with success<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-built_in\">exit<\/span> 0\n<\/span><\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>I saved this as check_links.sh and made it executable with <code>chmod +x check_links.sh<\/code>, so it runs as <code>.\/check_links.sh https:\/\/url<\/code>. Here's how it all works:<\/p>\n\n\n\n<p>We first check the number of arguments (<code>$#<\/code>) supplied to the script. If this is zero, no URL was supplied, so we exit with an error. We then run our <code>wget<\/code> command, feeding in the first argument to the script (<code>$1<\/code>) as the URL and saving its output to the variable <code>wget_output<\/code>. <code>wget<\/code> by default outputs its messages to stderr rather than stdout, so we add <code>2>&amp;1<\/code> to redirect stderr to stdout so it'll end up in our variable. I could never remember what order these characters went in, so I'll break it down: 2 means stderr, > means \"redirect to a file\" (compare to |, which redirects to a command), and &amp;1 means \"reuse whatever stdout is using\".<\/p>\n\n\n\n<p>We separated out <code>wget<\/code> from the rest because we want to now check its exit code. If it didn't find any broken links, it'll exit successfully with code 0. If it did, it'll exit with a different number. We compare the exit code of the last-run command (<code>$?<\/code>) with 0, and if they don't match, we can continue cleaning up its output. 
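If the exit-status convention is new to you, here's how <code>$?</code> behaves in miniature (nothing below is specific to our script):

```shell
# $? holds the exit status of the most recently run command
true
echo $?                               # prints: 0

# grep exits non-zero when it finds no matches, so searching an
# empty file reports failure
grep 'needle' /dev/null || echo $?    # prints: 1
```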
If they do, there's nothing more we need to do, so we exit successfully ourselves.<\/p>\n\n\n\n<p>First we return the URL that was fed to the script, because we'll be running this on a schedule and we want our emails to say which page they were looking at. We use <code>${1}<\/code> instead of <code>$1<\/code> so we can put characters immediately after the variable without needing a space in between. <code>\\n<\/code> adds an extra newline, which requires that <code>echo<\/code> be called with <code>-e<\/code>. We then send our output through the same series of <code>grep<\/code>s as before. Something I didn't realise was that running <code>echo \"$variable\"<\/code> keeps the line breaks intact, whereas <code>echo $variable<\/code> strips them out (the difference between running it with one tall parameter, or a separate parameter for every line). You learn something new every day!<\/p>\n\n\n\n<p>We also wanted to cover domains disappearing entirely. When <code>wget<\/code> can't resolve a domain, it leaves a one-line message like <code>wget: unable to resolve host address \u2018asdfghjkladssaddsa.com\u2019<\/code>. We run through our output again and use <code>sed<\/code> to take the <code>wget:<\/code> off the front (^ matches the start of a line), leaving behind a nice descriptive message. We can now exit with code 1, indicating that an error occurred.<\/p>\n\n\n\n<p>To run this on a schedule, <code>cron<\/code> has us covered. 
Run <code>crontab -e<\/code> to edit your user's crontab, and add something like this:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-4\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">0 5 *\/15 * * \/home\/asdfghjkl\/check_links.sh <span class=\"hljs-string\">\"https:\/\/url\"<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>This will run the script at 5:00am twice a month. If you're unfamiliar with the format, check out <a href=\"https:\/\/crontab.guru\/\">crontab.guru<\/a> for some examples \u2013 it's an incredibly useful piece of software to know and can accommodate the most complex schedules. It's best to include the full path to the script: <code>cron<\/code> should use your home directory as its working directory, but you never know.<\/p>\n\n\n\n<p>To email our results, there's no need to reinvent the wheel: <code>cron<\/code> can do it too. 
In your crontab, set the <code>MAILTO<\/code> variable, and make sure it's above the line you added:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-5\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">MAILTO=<span class=\"hljs-string\">\"email@address.com\"<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-5\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>You just need to make sure your server can send emails. Now, I've run my own mailservers before, for fun mind you, and if you haven't, don't. It Is Hell. You spend an age getting postfix and a nice web interface set up perfectly, create your SPF records, generate your DKIM keys, check your IP on all the blacklists, and then everyone drops your mail in the spam box or rejects it outright anyway. Don't forget we're sending emails full of random links too, which never helps. No, email is one of those things I will happily pay (or trade for diet pill ads) to have dealt with for me. I use <a href=\"https:\/\/wiki.archlinux.org\/index.php\/SSMTP\">ssmtp<\/a>, which quietly replaces the default mail\/sendmail commands and only needs a simple config file filling with your SMTP details. That link has some tips on setting it up with a Gmail account; I use a separate address from an old free Google Apps plan so I'm not leaving important passwords floating about in cleartext.<\/p>\n\n\n\n<p>The only problem with this approach is that cron is <em>chatty<\/em>. Okay, it's 45 years old and perfect as far as I'm concerned, but if a task outputs anything, it figures you want an email about it \u2013 even if it finished successfully and only printed to stdout. 
There are a few solutions to this: you can set the <code>MAILTO<\/code> variable more than once in your crontab, so you can set it just for this task and unset it afterwards:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-6\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">MAILTO=<span class=\"hljs-string\">\"email@address.com\"<\/span>\r\n0 5 *\/15 * * \/home\/asdfghjkl\/check_links.sh <span class=\"hljs-string\">\"https:\/\/url\"<\/span>\r\nMAILTO=<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-6\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Or you could go scorched-earth and redirect everything else to \/dev\/null:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-7\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\">0 0 * * * important_thing &gt; \/dev\/null 2&gt;&amp;1<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-7\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>But if you still want other commands to email if something goes wrong, you want <a href=\"https:\/\/habilis.net\/cronic\/\">cronic<\/a>. It's a small shell script to wrap commands in that suppresses output unless an error occurred, so that's the only time you'll get emails. 
If your distribution doesn't have a package for it, just drop it in <code>\/usr\/local\/bin<\/code> and <code>chmod +x<\/code> it, then prepend your commands with <code>cronic<\/code>. You don't need it for our script, because we <code>exit 0<\/code> without any output if we found nothing, but it works fine with or without.<\/p>\n\n\n\n<p>(P.S. if you've done that and the deluge hasn't ceased, also check root's crontab with <code>sudo crontab -e<\/code> and anything in <code>\/etc\/cron.d<\/code>, <code>hourly<\/code>, <code>daily<\/code> etc)<\/p>\n\n\n\n<p><strong>Bonus tip:<\/strong> add an alias to your <code>~\/.bashrc<\/code> to let you check a URL from the command line:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-8\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-wrap-lines\"><span class=\"hljs-built_in\">alias<\/span> check-links=<span class=\"hljs-string\">\"\/home\/asdfghjkl\/check_links.sh\"<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-8\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Save and run <code>bash<\/code> to reload, then you can <code>check-links https:\/\/url.horse<\/code> to your heart's content.<\/p>\n\n\n\n<p>Okay, that's it. This post turned out quite excessive for a simple script, so I'm sorry if it was a bit long. I find if I don't practice stuff like this regularly I start to forget the basics, which I'm sure everyone can relate to. But if I <a href=\"https:\/\/en.wikipedia.org\/wiki\/Rubber_duck_debugging\">rubber duck<\/a> it from first principles it's much easier to remember, and god knows my girlfriend has suffered enough, so into the void it goes. 
Have a good one.<\/p>\n\n\n\n<p id=\"doublebonus\"><strong>Double bonus tip:<\/strong> since I'm still awake, here's a modified version of the script that ingests an <a href=\"https:\/\/www.google.com\/sitemaps\/protocol.html\">XML sitemap<\/a> instead of a single page to check. Many CMSs will generate these for you so it's an easy way to check links across your entire website without having to scrape it yourself. I made this for WordPress but it should work with any sitemap that meets the spec.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-9\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-code-table shcb-line-numbers shcb-wrap-lines\"><span class='shcb-loc'><span><span class=\"hljs-meta\">#!\/bin\/bash<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># run as .\/check_sitemap.sh https:\/\/example.com\/wp-sitemap-posts-post-1.xml<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># note: each wordpress sitemap contains max 2000 posts, scrape wp-sitemap.xml for the rest if you need. 
pages are in a separate sitemap.<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># don't check URLs containing these patterns (supply a POSIX regex)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># these are some sane defaults to ignore for a wordpress install<\/span>\n<\/span><\/span><span class='shcb-loc'><span>ignore=<span class=\"hljs-string\">\"xmlrpc.php|\/\/fonts.googleapis.com\/|\/\/fonts.gstatic.com\/|\/\/secure.gravatar.com\/|\/\/akismet.com\/|\/\/wordpress.org\/|\/\/s.w.org\/|\/\/gmpg.org\/\"<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># optionally also exclude internal links to the same directory as the sitemap<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># e.g. https:\/\/example.com\/blog\/sitemap.xml excludes https:\/\/example.com\/blog\/foo but includes https:\/\/example.com\/bar<\/span>\n<\/span><\/span><span class='shcb-loc'><span>ignore=<span class=\"hljs-string\">\"<span class=\"hljs-variable\">${ignore}<\/span>|<span class=\"hljs-variable\">$(echo $1 | grep --perl-regexp --only-matching '\/\/.+(?:\/)<\/span>')\"<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># optionally exclude internal links to the sitemap's entire domain<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#ignore=\"${ignore}|$(echo $1 | grep --extended-regexp --only-matching '\/\/&#91;^\/]+')\"<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># check for arguments<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">if<\/span> &#91;&#91; <span class=\"hljs-variable\">$#<\/span> -eq 0 ]]; <span class=\"hljs-keyword\">then<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span 
class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">'No URL supplied'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">exit<\/span> 1\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">fi<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># download sitemap.xml<\/span>\n<\/span><\/span><span class='shcb-loc'><span>sitemap_content=$(wget --execute robots=off --no-directories --no-verbose --timeout 10 --tries 1 --output-document - <span class=\"hljs-variable\">$1<\/span>)\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">if<\/span> &#91;&#91; $? -ne 0 ]]; <span class=\"hljs-keyword\">then<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">'Failed to get sitemap URL'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">exit<\/span> 1\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">fi<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># extract URLs from &lt;loc&gt; tags, scan for links, follow to see if they exist<\/span>\n<\/span><\/span><span class='shcb-loc'><span>wget_output=$(<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$sitemap_content<\/span>\"<\/span> | grep --perl-regexp --only-matching <span class=\"hljs-string\">'(?&lt;=&lt;loc&gt;)https?:\/\/&#91;^&lt;]+'<\/span> | wget --input-file - --reject-regex <span class=\"hljs-variable\">$ignore<\/span> --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 --<span class=\"hljs-built_in\">wait<\/span> 3 2&gt;&amp;1)\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># if wget exited with error (i.e. 
if any broken links were found)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">if<\/span> &#91;&#91; $? -ne 0 ]]; <span class=\"hljs-keyword\">then<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> -e <span class=\"hljs-string\">\"Found broken links in <span class=\"hljs-variable\">${1}<\/span>:\\n\"<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># check for broken link line, return one line before, remove broken link line, remove colon from end of url<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$wget_output<\/span>\"<\/span> | grep --before-context 1 --no-group-separator <span class=\"hljs-string\">'broken link!'<\/span> | grep --invert-match <span class=\"hljs-string\">'broken link!'<\/span> | sed --expression <span class=\"hljs-string\">'s\/:$\/\/'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># same again, but for failure to resolve<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$wget_output<\/span>\"<\/span> | grep <span class=\"hljs-string\">'unable to resolve'<\/span> | sed --expression <span class=\"hljs-string\">'s\/^wget: \/\/'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-comment\"># exit with error<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\t<span class=\"hljs-built_in\">exit<\/span> 1\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-keyword\">fi<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># otherwise, exit silently with success<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-built_in\">exit<\/span> 
0\n<\/span><\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-9\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>A short explanation of the changes: since wget can't extract links from xml files, we first download the sitemap to stdout (<code>--output-document -<\/code>) and search it for URLs. The ones we want are inside <code>&lt;loc><\/code> tags, so we <code>grep<\/code> for those: <code>(?&lt;=...)<\/code> is a \"positive lookbehind\", which finds a tag located just before the rest of the match but doesn't include it in the result. We then match for http(s):\/\/, then any number of characters until we reach a &lt; symbol, signifying the start of the closing <code>&lt;\/loc><\/code>.<\/p>\n\n\n\n<p>We pass our list of URLs to <code>wget<\/code> using <code>--input-file -<\/code> and scan each in turn for broken links as before. This time we add a 3-second wait between requests to avoid hitting anyone too fast, and also allow for ignoring certain URL patterns using <code>--reject-regex<\/code>. A CMS likely pulls in some external resources which we don't need to be warned about \u2013 for example, <code>fonts.googleapis.com<\/code> is linked here in the <code>&lt;head><\/code> to be DNS prefetched, but the URL itself will always 404. We don't need an email about it. I've prefilled the <code>$ignore<\/code> variable with some reasonable exclusions for a stock WordPress install: note the patterns don't need wildcards, so use <code>\/\/domain.com\/<\/code> to ignore a whole domain and <code>xmlrpc.php<\/code> for a specific file.<\/p>\n\n\n\n<p>Something else you might like to ignore is your own site! 
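The two exclusion patterns built from the sitemap URL can be tried directly in a terminal; feeding in a made-up sitemap address shows what each grep extracts:

```shell
url='https://asdfghjkl.me.uk/blog/sitemap.xml'

# Subdirectory form: greedy .+ runs to the last /, dropping only the filename
echo "$url" | grep --perl-regexp --only-matching '//.+(?:/)'
# prints: //asdfghjkl.me.uk/blog/

# Whole-domain form: [^/]+ stops at the first / after the scheme
echo "$url" | grep --extended-regexp --only-matching '//[^/]+'
# prints: //asdfghjkl.me.uk
```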
You already have all the links to scan so there's little need to go through them again on each page, though maybe you'd like to check for typos or missing resources. I'm only interested in external links, so I use the second <code>$ignore<\/code> addition (line 11) to exclude everything from the same subdirectory as the sitemap. The <code>grep<\/code> command here takes our input URL, starts at the \/\/ of https:\/\/, and matches any character up until the final \/ is found. This removes just the sitemap filename and leaves the rest behind. So feeding it <code>https:\/\/asdfghjkl.me.uk\/blog\/sitemap.xml<\/code> would give <code>\/\/asdfghjkl.me.uk\/blog\/<\/code> as the exclusion, ignoring \/blog and \/blog\/post but still checking links to other parts of the site like \/shop or \/. To instead exclude my entire domain I could switch it with line 13, where the regex starts at \/\/ and stops when it finds the <em>first<\/em> \/ (if it exists), leaving <code>\/\/asdfghjkl.me.uk<\/code> as the exclusion.<\/p>\n\n\n\n<p>The only thing missing from this script variation is letting you know which specific page it found a broken link on \u2013 right now it just reports the sitemap URL. Instead of passing the list of URLs to wget in one go, you could loop through one at a time and output that for the \"Found broken links\" message. But that is left as an exercise to the reader. I'm out!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As the internet gets bigger, link rot gets badder. I still have a gigantic folder of bookmarks from pages I liked on StumbleUpon over a decade ago, and it's sad to see how many of them now lead to nowhere. 
I've been making a real effort over the past couple of years to archive the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/287"}],"collection":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/comments?post=287"}],"version-history":[{"count":27,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/287\/revisions"}],"predecessor-version":[{"id":324,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/287\/revisions\/324"}],"wp:attachment":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/media?parent=287"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/categories?post=287"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/tags?post=287"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}