Category Archives: Linux

Checking a webpage (or sitemap) for broken links with wget

As the internet gets bigger, link rot gets badder. I still have a gigantic folder of bookmarks from pages I liked on StumbleUpon over a decade ago, and it's sad to see how many of them now lead to nowhere. I've been making a real effort over the past couple of years to archive the things I've enjoyed, but since nobody lets you know when a little blog drops offline, I wanted something that could occasionally scan through the links on my websites and email me when one breaks so I could replace it if possible.

There are plenty of free and commercial products that already do this, but I prefer the big-folder-full-of-shell-scripts approach to getting things done and this is an easy task. To download a webpage, scan it for links and follow them to check for errors, you can run the following wget command:

wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https://url

I've written all the arguments in long form to make it a bit easier to read. The --spider option just checks pages are there instead of downloading them, but it still creates the directory structure so we also add --no-directories. To make it follow the links it finds, we use --recursive, but set --level 1 so it only goes one level deep. This is ideal for me as I only want to run my script against single webpages, but play with the number if you need more. For example, to automate this across your whole site, you could grab the sitemap.xml with wget, extract the URLs then pass them back to wget to scan each in turn (edit: see the bottom for an example). But back to what we're doing: we also need --span-hosts to allow wget to visit different sites, and --no-verbose cuts out most of the junk from the output that we don't need. Finally, we add --timeout 10 --tries 1 so it doesn't take forever when a site is temporarily down, and --execute robots=off because some sites reject wget entirely with robots.txt and it politely complies. Maybe it's a bit rude to ignore that, but our intent is not to hammer anything here so I've decided it's okay.

Our wget output is still quite verbose, so let's clean it up a bit:

wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 https://url 2>&1 | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'

When wget finds a broken link, it returns something like this in the output:

https://asdfghjkl.me.uk/nonexistent:
Remote file does not exist -- broken link!!!

The first grep only matches lines containing "broken link!". This isn't very helpful on its own, so we add --before-context 1 to also return the line with the URL immediately above. With this option grep puts a line with "--" between matches, which we turn off with --no-group-separator so it looks cleaner. We then pipe through grep again but this time match the inverse, to remove the "broken link!" lines we no longer need. And just to be pedantic, we finally run our list of URLs through sed to remove the colon from the end of each URL ($ matches the end of a line so it leaves the https:// alone).
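To see the filter on its own, here's a small sketch that runs the same grep, grep, sed chain over canned text instead of live wget output. The function name extract_broken_links and the sample lines are made up for illustration (the timestamped "URL:" lines only approximate what wget prints), and it assumes GNU grep and sed for the long options.

```shell
# Hypothetical helper: reads wget-style output on stdin, prints only the broken URLs.
extract_broken_links() {
    grep --before-context 1 --no-group-separator 'broken link!' \
        | grep --invert-match 'broken link!' \
        | sed --expression 's/:$//'
}

# Made-up sample output, loosely modelled on what wget prints in spider mode.
sample_output='2024-01-01 00:00:00 URL: https://example.com/ 200 OK
https://example.com/missing:
Remote file does not exist -- broken link!!!
2024-01-01 00:00:01 URL: https://example.org/ 200 OK
https://example.net/gone:
Remote file does not exist -- broken link!!!'

broken=$(echo "$sample_output" | extract_broken_links)
echo "$broken"
```

With this sample, only the two URLs sitting directly above a "broken link!" line should survive, each without its trailing colon.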

We're now left with a list of links, but we're not quite done yet. We want to automate this process so it can run semi-regularly without our input, and email any time it finds a broken link. We're also currently only looking for HTTP errors (404 etc) – if a whole domain disappears, we'd never know! So let's wrap the whole thing in a shell script so we can feed it a URL as an argument:

#!/bin/bash

# check for arguments
if [[ $# -eq 0 ]]; then
    echo 'No URL supplied'
    exit 1
fi

# scan URL for links, follow to see if they exist
wget_output=$(wget --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 "$1" 2>&1)

# if wget exited with error (i.e. if any broken links were found)
if [[ $? -ne 0 ]]; then

    echo -e "Found broken links in ${1}:\n"

    # check for broken link line, return one line before, remove broken link line, remove colon from end of url
    echo "$wget_output" | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'

    # same again, but for failure to resolve
    echo "$wget_output" | grep 'unable to resolve' | sed --expression 's/^wget: //'

    # exit with error
    exit 1
fi

# otherwise, exit silently with success
exit 0

I saved this as check_links.sh and made it executable with chmod +x check_links.sh, so it runs as ./check_links.sh https://url. Here's how it all works:

We first check the number of arguments ($#) supplied to the script. If this is zero, no URL was supplied, so we exit with an error. We then run our wget command, feeding in the first argument to the script ($1) as the URL and saving its output to the variable wget_output. wget by default outputs its messages to stderr rather than stdout, so we add 2>&1 to redirect stderr to stdout so it'll end up in our variable. I could never remember what order these characters went in, so I'll break it down: 2 means stderr, > means "redirect to a file" (compare to |, which redirects to a command), and &1 means "reuse whatever stdout is using".
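If the redirect order still won't stick, a quick experiment helps. This is only a sketch: noisy is a made-up stand-in for wget that writes one line to each stream, so nothing here touches the network.

```shell
# Stand-in for wget: writes one line to stdout and one to stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Without 2>&1, only stdout lands in the variable
# (stderr is thrown away here just to keep the demo quiet).
only_stdout=$(noisy 2>/dev/null)

# With 2>&1, stderr follows stdout into the command substitution.
both=$(noisy 2>&1)

# Order matters: this points stderr at the capture first, then sends stdout
# to /dev/null, so only the stderr line is captured.
only_stderr=$(noisy 2>&1 >/dev/null)

echo "$both"
```

Redirections are applied left to right, which is why 2>&1 has to come after any redirect of stdout if you want both streams to end up in the same place.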

We separated out wget from the rest because we want to now check its exit code. If it didn't find any broken links, it'll exit successfully with code 0. If it did, it'll exit with a different number. We compare the exit code of the last-run command ($?) with 0, and if they don't match, we can continue cleaning up its output. If they do, there's nothing more we need to do, so we exit successfully ourselves.
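The exit-code check can be tried offline too. In this sketch fake_wget is a hypothetical stand-in, and the 8 mimics the code wget documents for "server issued an error response"; a real run may return other non-zero codes (4 for a network failure, for instance). It also uses a slightly different idiom from the script above, || status=$?, which does the same job but keeps working if you ever add set -e.

```shell
# Stand-in for wget: complains on stderr and exits with status 8.
fake_wget() {
    sh -c 'echo "Remote file does not exist -- broken link!!!" >&2; exit 8'
}

# An assignment from $(...) takes on the exit status of the command inside,
# so the || branch only runs (and records the code) when the stand-in fails.
fake_status=0
fake_output=$(fake_wget 2>&1) || fake_status=$?

if [ "$fake_status" -ne 0 ]; then
    echo "exit code ${fake_status}: ${fake_output}"
fi
```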

First we return the URL that was fed to the script, because we'll be running this on a schedule and we want our emails to say which page they were looking at. We use ${1} instead of $1 so we can put characters immediately after the variable without needing a space in between. \n adds an extra newline, which requires that echo be called with -e. We then send our output through the same series of greps as before. Something I didn't realise was that running echo "$variable" keeps the line breaks intact, whereas echo $variable strips them out (the difference between running it with one tall parameter, or a separate parameter for every line). You learn something new every day!
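The quoting difference is easy to see with a throwaway variable. A minimal sketch, assuming the shell's default word splitting (IFS unchanged):

```shell
# A variable holding three lines.
lines=$(printf 'one\ntwo\nthree')

# Quoted: echo receives a single argument with the newlines still in it.
quoted=$(echo "$lines")

# Unquoted: the shell splits on whitespace, so echo receives three separate
# arguments and joins them with spaces.
unquoted=$(echo $lines)

echo "$quoted"
echo "$unquoted"
```

The first echo prints three lines; the second prints "one two three" on a single line.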

We also wanted to cover domains disappearing entirely. When wget can't resolve a domain, it leaves a one-line message like wget: unable to resolve host address ‘asdfghjkladssaddsa.com’. We run through our output again and use sed to take the wget: off the front (^ matches the start of a line), leaving behind a nice descriptive message. We can now exit with code 1, indicating that an error occurred.

To run this on a schedule, cron has us covered. Run crontab -e to edit your user's crontab, and add something like this:

0 5 */15 * * /home/asdfghjkl/check_links.sh "https://url"

This will run the script at 5:00am roughly twice a month: */15 in the day-of-month field matches the 1st and 16th, plus the 31st in months that have one. If you're unfamiliar with the format, check out crontab.guru for some examples – cron is an incredibly useful piece of software to know and can accommodate the most complex schedules. It's best to include the full path to the script: cron should use your home directory as its working directory, but you never know.

To email our results, there's no need to reinvent the wheel: cron can do it too. In your crontab, set the MAILTO variable, and make sure it's above the line you added:

MAILTO="email@address.com"

You just need to make sure your server can send emails. Now, I've run my own mailservers before, for fun mind you, and if you haven't, don't. It Is Hell. You spend an age getting postfix and a nice web interface set up perfectly, create your SPF records, generate your DKIM keys, check your IP on all the blacklists, and then everyone drops your mail in the spam box or rejects it outright anyway. Don't forget we're sending emails full of random links too, which never helps. No, email is one of those things I will happily pay (or trade for diet pill ads) to have dealt with for me. I use ssmtp, which quietly replaces the default mail/sendmail commands and only needs a simple config file filling with your SMTP details. That link has some tips on setting it up with a Gmail account; I use a separate address from an old free Google Apps plan so I'm not leaving important passwords floating about in cleartext.

The only problem with this approach is that cron is chatty. Okay, it's 45 years old and perfect as far as I'm concerned, but if a task outputs anything, it figures you want an email about it – even if it finished successfully and only printed to stdout. There are a few solutions to this: you can set the MAILTO variable more than once in your crontab, so you can set it just for this task and unset it afterwards:

MAILTO="email@address.com"
0 5 */15 * * /home/asdfghjkl/check_links.sh "https://url"
MAILTO=

Or you could go scorched-earth and redirect everything else to /dev/null:

0 0 * * * important_thing > /dev/null 2>&1

But if you still want other commands to email if something goes wrong, you want cronic. It's a small shell script to wrap commands in that suppresses output unless an error occurred, so that's the only time you'll get emails. If your distribution doesn't have a package for it, just drop it in /usr/local/bin and chmod +x it, then prepend your commands with cronic. You don't need it for our script, because we exit 0 without any output if we found nothing, but it works fine with or without.
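To make the idea concrete, here's a much-simplified sketch of that kind of wrapper. It is not cronic's actual code: quiet_unless_error is a hypothetical name, and the real script does more (I believe it also treats output on stderr as a failure), whereas this only looks at the exit code.

```shell
# Hypothetical, much-simplified stand-in for cronic: run a command, stay silent
# on success, and print the captured output only when it exits non-zero.
quiet_unless_error() {
    status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "Command failed with exit code ${status}: $*"
        echo "$output"
    fi
    return "$status"
}

# Succeeds, so nothing is printed and cron would have nothing to email.
quiet_unless_error echo "all good"

# Fails, so the captured output is printed and the exit code is passed on.
quiet_unless_error sh -c 'echo "something broke"; exit 3' || echo "wrapper returned $?"
```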

(P.S. if you've done that and the deluge hasn't ceased, also check root's crontab with sudo crontab -e and anything in /etc/cron.d, hourly, daily etc)

Bonus tip: add an alias to your ~/.bashrc to let you check a URL from the command line:

alias check-links="/home/asdfghjkl/check_links.sh"

Save and run bash to reload, then you can check-links https://url.horse to your heart's content.

Okay, that's it. This post turned out quite excessive for a simple script, so I'm sorry if it was a bit long. I find if I don't practice stuff like this regularly I start to forget the basics, which I'm sure everyone can relate to. But if I rubber duck it from first principles it's much easier to remember, and god knows my girlfriend has suffered enough, so into the void it goes. Have a good one.

Double bonus tip: since I'm still awake, here's a modified version of the script that ingests an XML sitemap instead of a single page to check. Many CMSs will generate these for you so it's an easy way to check links across your entire website without having to scrape it yourself. I made this for WordPress but it should work with any sitemap that meets the spec.

#!/bin/bash

# run as ./check_sitemap.sh https://example.com/wp-sitemap-posts-post-1.xml
# note: each wordpress sitemap contains max 2000 posts, scrape wp-sitemap.xml for the rest if you need. pages are in a separate sitemap.

# don't check URLs containing these patterns (supply a POSIX regex)
# these are some sane defaults to ignore for a wordpress install
ignore="xmlrpc.php|//fonts.googleapis.com/|//fonts.gstatic.com/|//secure.gravatar.com/|//akismet.com/|//wordpress.org/|//s.w.org/|//gmpg.org/"
# optionally also exclude internal links to the same directory as the sitemap
# e.g. https://example.com/blog/sitemap.xml excludes https://example.com/blog/foo but includes https://example.com/bar
ignore="${ignore}|$(echo "$1" | grep --perl-regexp --only-matching '//.+(?:/)')"
# optionally exclude internal links to the sitemap's entire domain
#ignore="${ignore}|$(echo "$1" | grep --extended-regexp --only-matching '//[^/]+')"

# check for arguments
if [[ $# -eq 0 ]]; then
    echo 'No URL supplied'
    exit 1
fi

# download sitemap.xml
sitemap_content=$(wget --execute robots=off --no-directories --no-verbose --timeout 10 --tries 1 --output-document - "$1")
# a non-zero exit code means the download failed, so give up here
if [[ $? -ne 0 ]]; then
    echo 'Failed to get sitemap URL'
    exit 1
fi

# extract URLs from <loc> tags, scan for links, follow to see if they exist
wget_output=$(echo "$sitemap_content" | grep --perl-regexp --only-matching '(?<=<loc>)https?://[^<]+' | wget --input-file - --reject-regex "$ignore" --spider --recursive --execute robots=off --no-directories --no-verbose --span-hosts --level 1 --timeout 10 --tries 1 --wait 3 2>&1)

# if wget exited with error (i.e. if any broken links were found)
if [[ $? -ne 0 ]]; then

    echo -e "Found broken links in ${1}:\n"

    # check for broken link line, return one line before, remove broken link line, remove colon from end of url
    echo "$wget_output" | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'

    # same again, but for failure to resolve
    echo "$wget_output" | grep 'unable to resolve' | sed --expression 's/^wget: //'

    # exit with error
    exit 1
fi

# otherwise, exit silently with success
exit 0

A short explanation of the changes: since wget can't extract links from xml files, we first download the sitemap to stdout (--output-document -) and search it for URLs. The ones we want are inside <loc> tags, so we grep for those: (?<=...) is a "positive lookbehind", which finds a tag located just before the rest of the match but doesn't include it in the result. We then match for http(s)://, then any number of characters until we reach a < symbol, signifying the start of the closing </loc>.
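Here's the lookbehind at work on a tiny hand-written sitemap. This is a sketch: extract_sitemap_urls is a hypothetical helper name, the XML is invented, and it assumes a grep built with PCRE support for --perl-regexp.

```shell
# Hypothetical helper: print every URL found inside a <loc> tag on stdin.
extract_sitemap_urls() {
    grep --perl-regexp --only-matching '(?<=<loc>)https?://[^<]+'
}

# Invented two-entry sitemap. The xmlns URL is not preceded by <loc>,
# so the lookbehind keeps it out of the results.
sample_sitemap='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.com/blog/first-post/</loc></url>
<url><loc>http://example.com/blog/second-post/</loc></url>
</urlset>'

sitemap_urls=$(echo "$sample_sitemap" | extract_sitemap_urls)
echo "$sitemap_urls"
```

Only the two post URLs come out, one per line, ready to be handed to wget.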

We pass our list of URLs to wget using --input-file - and scan each in turn for broken links as before. This time we add a 3-second wait between requests to avoid hitting anyone too fast, and also allow for ignoring certain URL patterns using --reject-regex. A CMS likely pulls in some external resources which we don't need to be warned about – for example, fonts.googleapis.com is linked here in the <head> to be DNS prefetched, but the URL itself will always 404. We don't need an email about it. I've prefilled the $ignore variable with some reasonable exclusions for a stock WordPress install: note the patterns don't need wildcards, so use //domain.com/ to ignore a whole domain and xmlrpc.php for a specific file.

Something else you might like to ignore is your own site! You already have all the links to scan so there's little need to go through them again on each page, though maybe you'd like to check for typos or missing resources. I'm only interested in external links, so I use the second $ignore addition (line 11) to exclude everything from the same subdirectory as the sitemap. The grep command here takes our input URL, starts at the // of https://, and matches any character up until the final / is found. This removes just the sitemap filename and leaves the rest behind. So feeding it https://asdfghjkl.me.uk/blog/sitemap.xml would give //asdfghjkl.me.uk/blog/ as the exclusion, ignoring /blog and /blog/post but still checking links to other parts of the site like /shop or /. To instead exclude my entire domain I could switch it with line 13, where the regex starts at // and stops when it finds the first / (if it exists), leaving //asdfghjkl.me.uk as the exclusion.
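Both exclusion patterns can be checked without running the full script. In this sketch the sitemap URL is an example, and is_ignored is a hypothetical helper that only approximates how wget applies --reject-regex, by searching for the pattern anywhere in the link with grep.

```shell
sitemap_url='https://example.com/blog/sitemap.xml'

# Same directory as the sitemap: the greedy .+ runs on to the last slash.
same_dir=$(echo "$sitemap_url" | grep --perl-regexp --only-matching '//.+(?:/)')

# Whole domain: stop at the first slash after the //.
whole_domain=$(echo "$sitemap_url" | grep --extended-regexp --only-matching '//[^/]+')

echo "$same_dir"
echo "$whole_domain"

# Hypothetical helper: succeeds when the link ($2) matches the pattern ($1).
is_ignored() {
    echo "$2" | grep --extended-regexp --quiet "$1"
}

if is_ignored "xmlrpc.php|${same_dir}" 'https://example.com/blog/post'; then
    echo "a link under /blog/ would be skipped"
fi
```

With the same-directory pattern, a link to https://example.com/shop is not matched and so would still be checked.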

The only thing missing from this script variation is letting you know which specific page it found a broken link on – right now it just reports the sitemap URL. Instead of passing the list of URLs to wget in one go, you could loop through one at a time and output that for the "Found broken links" message. But that is left as an exercise to the reader. I'm out!
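For anyone who does want to attempt it, one possible shape for that loop is sketched below. It's untested against a real site and leaves out the --reject-regex handling for brevity; check_page and report_broken_links are names I've made up, and the wget options are copied from the scripts above.

```shell
# Run the same wget spider as the single-page script against one page.
# stdin is closed off so wget can't swallow the URL list the loop is reading.
check_page() {
    wget --spider --recursive --execute robots=off --no-directories --no-verbose \
        --span-hosts --level 1 --timeout 10 --tries 1 --wait 3 "$1" 2>&1 </dev/null
}

# Read page URLs from stdin (one per line) and report broken links per page.
# Returns 1 if any page had a problem, 0 otherwise.
report_broken_links() {
    found=0
    while IFS= read -r page; do
        # a zero exit code means nothing was broken on this page
        page_output=$(check_page "$page") && continue
        found=1
        echo "Found broken links in ${page}:"
        echo "$page_output" | grep --before-context 1 --no-group-separator 'broken link!' | grep --invert-match 'broken link!' | sed --expression 's/:$//'
        echo "$page_output" | grep 'unable to resolve' | sed --expression 's/^wget: //'
    done
    return "$found"
}

# Possible use inside the sitemap script, in place of the single wget call:
# echo "$sitemap_content" | grep --perl-regexp --only-matching '(?<=<loc>)https?://[^<]+' | report_broken_links
```

The trade-off is one wget run per page instead of one for the whole list, so links shared by many pages get re-checked each time and the scan takes longer.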

Download YouTube videos quickly in countries with slow international links

My local ISP recently installed fibre in town, which freed us up from the horror that is 700kbit WiMAX connections. The sales rep came round and enthusiastically encouraged us to upgrade to an "up to 100mbit" plan, which turned out to be shared with the entire town.

Yep.

So in practice we get about 1mbit for international traffic, though national traffic is pretty fast at 8-25mbit. Google and Akamai have servers in Madagascar so Google services are super fast, Facebook works great and Windows updates come through fairly quickly, but everything else sorta plods along.

Spotify, Netflix and basically anything streaming are out, but YouTube works perfectly, even in HD, as long as you immediately refresh the page after the video first starts playing. It seems that the first time someone loads a video, it immediately gets cached in-country over what I can only assume is a super-secret super-fast Google link. The second time, it loads much quicker.

This is great in the office, but if you want to load up some videos to take home (internet is way too expensive to have at home) you're going to want to download them. I'm a big fan of youtube-dl, which runs on most OSs and lets you pick and choose your formats. You can start it going, immediately cancel and restart to download at full speed, but you have to do it separately for video and audio and it's generally pretty irritating. So here's a bit of bash script to do it for you!

First install youtube-dl and expect if you don't have them already:

sudo apt-get install youtube-dl expect

Then add something like this to your ~/.bashrc:

yt()
{
    expect -c 'spawn youtube-dl -f "bestvideo\[height<=480\]/best\[height<=480\]" -o /home/user/YouTube/%(title)s.f%(format_id)s.%(ext)s --no-playlist --no-mtime '"$1"'; expect " ETA " { close }'
    expect -c 'spawn youtube-dl -f "worstaudio" -o /home/user/YouTube/%(title)s.f%(format_id)s.%(ext)s --no-playlist --no-mtime '"$1"'; expect " ETA " { close }'
    youtube-dl -f "bestvideo[height<=480]+worstaudio/best[height<=480]" -o "/home/user/YouTube/%(title)s.%(ext)s" --no-playlist --no-mtime "$1"
}

Run bash to reload and use it like yt https://youtube.com/watch?v=whatever

The first two expect commands start downloading the video and audio respectively (I limit mine to 480p or below video and the smallest possible audio, but feel free to change it), killing youtube-dl as soon as they see " ETA " which appears once downloads start. The third command downloads the whole thing once it's been cached in-country.

The reason we include the format ID in the filename for the first two commands is because when downloading video and audio together, youtube-dl adds the format code to the temporary files as title.fcode.ext. When downloading just video or just audio, these aren't included by default. By adding these ourselves, the third command will resume downloading from the existing files and remove them automatically after combining them into one file.

I like to include --no-mtime so the downloaded files' modification date is when they were downloaded, rather than when the video was uploaded. This means I can easily delete them after a month with a crontab entry:

0 21 * * Sun root find /home/user/YouTube/ -type f -mtime +31 -print -delete

Ignore the running as root bit, it's on a NAS so everything runs as root. Woo.
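Before trusting -delete with a real folder, it's worth a dry run with -print alone. A minimal sketch using a throwaway directory and made-up file names, assuming GNU touch for --date:

```shell
# Throwaway directory with one file dated 40 days ago and one fresh file.
demo_dir=$(mktemp -d)
touch --date '40 days ago' "$demo_dir/old-video.mp4"
touch "$demo_dir/new-video.mp4"

# Same test as the crontab entry, minus -delete: list what would be removed.
would_delete=$(find "$demo_dir" -type f -mtime +31 -print)
echo "$would_delete"

# Tidy up the demo directory.
rm -r "$demo_dir"
```

Only the 40-day-old file is listed, since -mtime +31 matches files last modified more than 31 whole days ago.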

Bash one-liner: Add an Apache directory index to an aria2 download queue

I work in a country with terrible internet, so large downloads through browsers often break part way through. The solution is aria2, a command-line download utility with an optional web UI to queue up downloads. This runs on a server (i.e. a laptop on a shelf) with a few extra config options to make it handle dodgy electricity and dodgy connections a bit better.

A simple crontab entry starts it on boot:

@reboot screen -dmS aria2 aria2c --conf-path=/home/user/.aria2/aria2.conf

The config file /home/user/.aria2/aria2.conf adds some default options:

continue
dir=/home/user/downloads
enable-rpc
rpc-listen-all
rpc-secret=secret_token
check-certificate=false
enable-http-pipelining=true
max-tries=0
retry-wait=10
file-allocation=none
save-session=/home/user/.aria2/aria2.session
input-file=/home/user/.aria2/aria2.session
max-concurrent-downloads=1
always-resume=false

The three RPC options allow the web UI to connect (port 6800 by default), and the session file allows the download queue to persist across reboots (again, dodgy electricity).

Most downloads work fine, but others expire after a certain time, don't allow resuming or only allow a single HTTP request. For these I use a server on a fast connection that acts as a middleman - I can download files immediately there and bring them in later on the slow connection. This is easy enough for single files with directory indexes set up in Apache - right click, copy URL, paste into web UI, download. For entire folders it's a bit more effort to copy every URL, so here's a quick and dirty one-liner you can add to your .bashrc that will accept a URL to an Apache directory index and add every file listed to the aria2 queue.

dl()
{
wget --spider -r --no-parent --level=1 --reject index.html* -nd -e robots=off --reject-regex '(.*)\?(.*)' --user=apache_user --password=apache_password $1 2>&1 | grep '^--' | awk '{ print $3 }' | sed "s/'/%27/" | sed -e '1,2d' | sed '$!N; /^\(.*\)\n\1$/!P; D' | sed 's#^#http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data \x27{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["#' | sed 's#$#"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}\x27#' | xargs -L 1 curl
}

Add the above to your .bashrc and run bash to reload. Then, to add a directory:

dl https://website.com/directory/

By default this will add downloads paused - see below for more info.

The code is a bit of a mouthful, so here's what each bit does:

wget --spider -r --no-parent --level=1 --reject index.html* -nd -e robots=off --reject-regex '(.*)\?(.*)' --user=apache_user --password=apache_password $1 2>&1

--spider: Don't download anything, just check the page is there (this is later used to provide a list of links to download)
-r --no-parent --level=1: Retrieve recursively, so check all the links on the page, but don't download the parent directory and don't go any deeper than the current directory
--reject index.html*: Ignore the current page
-nd: Don't create a directory structure for downloaded files. wget needs to download at least the index page to check for links, but by default will create a directory structure like website.com/folder/file in the current folder. The --spider option deletes these files after they're created, but doesn't delete directories, leaving you with a bunch of useless empty folders. In theory you could instead output to a single temporary file with -O tmpfile, but for some reason this stops wget from parsing for further links.
-e robots=off: Ignore robots.txt in case it exists
--reject-regex '(.*)\?(.*)': ignore any link with a query string - this covers the ones which sort the listing by name, date, size or description
--user=apache_user --password=apache_password: if you're using Basic Authentication to secure the directory listing
$1: feeds in the URL from the shell
2>&1: wget writes to stderr by default, so we redirect all output to stdout

grep '^--' | awk '{ print $3 }' | sed "s/'/%27/" | sed -e '1,2d' | sed '$!N; /^\(.*\)\n\1$/!P; D'

grep '^--': lines containing URLs begin with the date enclosed in two hyphens (e.g. --2017-08-23 12:37:28--), so we match only lines which begin with two hyphens
awk '{ print $3 }': separates each line into columns separated by spaces, and outputs only the third one (e.g. --2017-08-23 12:37:28-- https://website.com/file)
sed "s/'/%27/": Apache doesn't urlencode single quote marks in URLs but the script struggles with them, so we convert them to their URL encoded equivalent
sed -e '1,2d': the first two URLs wget outputs are always the directory itself, so we remove the first two lines
sed '$!N; /^\(.*\)\n\1$/!P; D': occasionally you get consecutive duplicate lines coming out, so this removes them. You could use uniq. But this looks more impressive.
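The extraction steps above can be tried on canned text. In this sketch the wget output is invented (real output has more lines in between) and extract_index_urls is a hypothetical wrapper around the same grep, awk and sed chain:

```shell
# Hypothetical wrapper around the extraction half of the one-liner.
extract_index_urls() {
    grep '^--' | awk '{ print $3 }' | sed "s/'/%27/" | sed -e '1,2d' | sed '$!N; /^\(.*\)\n\1$/!P; D'
}

# Invented sample: the directory appears twice, then one file is repeated.
sample_spider_output=$(cat <<'EOF'
Spider mode enabled. Check if remote file exists.
--2017-08-23 12:37:28--  https://website.com/directory/
--2017-08-23 12:37:28--  https://website.com/directory/
--2017-08-23 12:37:29--  https://website.com/directory/it's.zip
--2017-08-23 12:37:29--  https://website.com/directory/it's.zip
--2017-08-23 12:37:30--  https://website.com/directory/notes.txt
EOF
)

index_urls=$(echo "$sample_spider_output" | extract_index_urls)
echo "$index_urls"
```

The two directory lines are dropped, the duplicate collapses into one, and the single quote comes out as %27.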

sed 's#^#http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data \x27{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["#'

Now it all gets a bit rough. We're creating an expression to feed to curl that will add each download to the queue. We want to run something like this for each line:

curl http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["http://website.com/file"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}'

So we use sed once to add the bits before the URL (s#^#whatever# replaces the start of the line). We use # in place of the normal / so it works okay with all the slashes in the URLs, and replace two of the single quotes with their ASCII equivalent \x27 because getting quotes to nest properly is hard and I don't like doing it.

sed 's#$#"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}\x27#'

We then use sed again to add the bits after the URL (s#$#whatever# replaces the end of the line).

xargs -L 1 curl

Once everything's put together, we feed each line to curl with xargs. A successful addition to the queue looks like this:

{"id":1,"jsonrpc":"2.0","result":"721db74ea91db42c"}
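If the nested sed quoting is more than you want to maintain, the JSON body can also be built with printf, one URL at a time. This is an alternative sketch rather than what the one-liner does: build_add_uri_payload is a made-up name, the http-user and http-passwd options are left out for brevity, and it assumes the token and URL contain no double quotes or backslashes.

```shell
# Hypothetical helper: print the aria2.addUri JSON body for one URL.
# $1 is the RPC secret token, $2 is the URL to queue (added paused).
build_add_uri_payload() {
    printf '{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:%s", ["%s"], {"pause":"true"}]}\n' "$1" "$2"
}

payload=$(build_add_uri_payload secret_token 'http://website.com/it%27s.zip')
echo "$payload"

# To submit it (not run here), something like:
# curl http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data "$payload"
```

Because the URL is passed as a printf argument rather than in the format string, the %27 goes through untouched.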

Why are downloads added paused?

Due to the limited bandwidth of our office connection, we only run big downloads outside of office hours and restrict speeds to avoid hitting our monthly cap. You can change "pause":"true" to "pause":"false" if you prefer.

To automatically start and stop downloads at certain times, you can add crontab entries to the server you host aria2 on:

# Pause aria2 downloads at 8am and 2pm, but remove the speed limit
0 8,14 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.pauseAll", "params":["token:secret_token"]}'
0 8,14 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.changeGlobalOption", "params":["token:secret_token",{"max-overall-download-limit":"0"}]}'

# Resume downloads at 12pm and 5pm but limit speed to 80KB/s
0 12,17 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.unpauseAll", "params":["token:secret_token"]}'
0 12,17 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.changeGlobalOption", "params":["token:secret_token",{"max-overall-download-limit":"80K"}]}'

Caveats

  • wget --spider will download text files, and any files missing a Content-Type header, to check them for further links. Apache will serve a header for most common types but does miss a few, and the DefaultType option has been deprecated so you can't set, say, application/octet-stream for anything unknown. It's therefore sensible to run this script on the server hosting the directory indexes so you're not waiting on downloads (even though they're deleted immediately afterwards).

USB tethering with Nokia N9 on Windows

After a few days of internet troubles at work, I decided to attempt USB tethering with my Nokia N9 before Facebook withdrawal killed me (I'd browse on mobile but the only place I get signal is hanging off my desk which makes typing a bit awkward). This is a little more involved than on other platforms - if you have wifi you can use the included hotspot app, but I couldn't be bothered to walk the whole 15 minutes home to grab a wireless card. I knew that the SDK app you get when you enable developer mode (you have done this, right? Settings -> Security -> Developer Mode and hit the button) lets you set up a network over USB so you can SSH to the N9, and figured I could simply set up an SSH tunnel and proxy all my PC traffic through that. Course, it's never that easy.
