Bash one-liner: Add an Apache directory index to an aria2 download queue

I work in a country with terrible internet, so large downloads through browsers often break part way through. The solution is aria2, a command-line download utility with an optional web UI to queue up downloads. This runs on a server (i.e. a laptop on a shelf) with a few extra config options to make it handle dodgy electricity and dodgy connections a bit better.

A simple crontab entry starts it on boot:

@reboot screen -dmS aria2 aria2c --conf-path=/home/user/.aria2/aria2.conf
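
Since it's running inside screen, you can attach to the session at any point to see aria2's console output (and detach again with Ctrl-A then D):

screen -r aria2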

The config file /home/user/.aria2/aria2.conf adds some default options:

continue=true
dir=/home/user/downloads
enable-rpc=true
rpc-listen-all=true
rpc-secret=secret_token
check-certificate=false
enable-http-pipelining=true
max-tries=0
retry-wait=10
file-allocation=none
save-session=/home/user/.aria2/aria2.session
input-file=/home/user/.aria2/aria2.session
max-concurrent-downloads=1
always-resume=false

The three RPC options allow the web UI to connect (port 6800 by default), and the session file lets the download queue persist across reboots (again, dodgy electricity).
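
To check the RPC server is up and the token works, aria2 exposes a getVersion method you can poke with curl (swap in your own token) - a successful reply is a small JSON blob with the version and enabled features:

curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" --data '{"jsonrpc": "2.0","id":1,"method": "aria2.getVersion","params":["token:secret_token"]}'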

Most downloads work fine, but others expire after a certain time, don't allow resuming or only allow a single HTTP request. For these I use a server on a fast connection that acts as a middleman - I can download files immediately there and bring them in later on the slow connection. This is easy enough for single files with directory indexes set up in Apache - right click, copy URL, paste into web UI, download. For entire folders it's a bit more effort to copy every URL, so here's a quick and dirty one-liner you can add to your .bashrc that will accept a URL to an Apache directory index and add every file listed to the aria2 queue.

dl()
{
wget --spider -r --no-parent --level=1 --reject 'index.html*' -nd -e robots=off --reject-regex '(.*)\?(.*)' --user=apache_user --password=apache_password "$1" 2>&1 | grep '^--' | awk '{ print $3 }' | sed "s/'/%27/g" | sed -e '1,2d' | sed '$!N; /^\(.*\)\n\1$/!P; D' | sed 's#^#http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data \x27{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["#' | sed 's#$#"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}\x27#' | xargs -L 1 curl
}

Add the above to your .bashrc and reload it with source ~/.bashrc (or just open a new shell). Then, to add a directory:

dl https://website.com/directory/

By default this will add downloads paused - see below for more info.

The code is a bit of a mouthful, so here's what each bit does:

wget --spider -r --no-parent --level=1 --reject 'index.html*' -nd -e robots=off --reject-regex '(.*)\?(.*)' --user=apache_user --password=apache_password "$1" 2>&1

--spider: Don't download anything, just check the page is there (this is later used to provide a list of links to download)
-r --no-parent --level=1: Retrieve recursively, so check all the links on the page, but don't download the parent directory and don't go any deeper than the current directory
--reject 'index.html*': Ignore the current page
-nd: Don't create a directory structure for downloaded files. wget needs to download at least the index page to check for links, but by default will create a directory structure like website.com/folder/file in the current folder. The --spider option deletes these files after they're created, but doesn't delete directories, leaving you with a bunch of useless empty folders. In theory you could instead output to a single temporary file with -O tmpfile, but for some reason this stops wget from parsing for further links.
-e robots=off: Ignore robots.txt in case it exists
--reject-regex '(.*)\?(.*)': Ignore any link with a query string - this covers the ones which sort the listing by name, date, size or description
--user=apache_user --password=apache_password: Supplies credentials if you're using Basic Authentication to secure the directory listing
"$1": Feeds in the URL passed to the function (quoted so any special characters in the URL don't trip up the shell)
2>&1: wget writes its log to stderr, so we redirect it to stdout for the rest of the pipeline to read
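
Put together, the output of this stage is wget's usual chatter, and the lines we care about are the ones logging each URL it checks - something like this (timestamps and filenames here are purely illustrative):

--2017-08-23 12:37:28--  https://website.com/directory/
--2017-08-23 12:37:28--  https://website.com/directory/
--2017-08-23 12:37:29--  https://website.com/directory/file1.zip
--2017-08-23 12:37:30--  https://website.com/directory/file2.zip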

grep '^--' | awk '{ print $3 }' | sed "s/'/%27/g" | sed -e '1,2d' | sed '$!N; /^\(.*\)\n\1$/!P; D'

grep '^--': lines containing URLs begin with the date enclosed in two hyphens (e.g. --2017-08-23 12:37:28--), so we match only lines which begin with two hyphens
awk '{ print $3 }': splits each line into whitespace-separated columns and prints only the third one, i.e. the URL (so --2017-08-23 12:37:28-- https://website.com/file becomes https://website.com/file)
sed "s/'/%27/g": Apache doesn't URL-encode single quote marks in URLs but the script struggles with them, so we convert any that appear to their URL-encoded equivalent
sed -e '1,2d': the first two URLs wget outputs are always the directory itself, so we remove the first two lines
sed '$!N; /^\(.*\)\n\1$/!P; D': occasionally you get consecutive duplicate lines coming out, so this removes them. You could use uniq. But this looks more impressive.
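
For the record, the boring version works just as well:

grep '^--' | awk '{ print $3 }' | sed "s/'/%27/g" | sed -e '1,2d' | uniq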

sed 's#^#http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data \x27{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["#'

Now it all gets a bit rough. We're building an expression to feed to curl that will add each download to the queue. We want to run something like this for each line:

curl http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["http://website.com/file"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}'

So we use sed once to add the bits before the URL (s#^#whatever# replaces the start of the line). We use # in place of the normal / so it works okay with all the slashes in the URLs, and write the two single quotes around the JSON as the hex escape \x27 (which GNU sed turns back into a literal quote) because getting quotes to nest properly is hard and I don't like doing it.

sed 's#$#"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}\x27#'

We then use sed again to add the bits after the URL (s#$#whatever# replaces the end of the line).
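
By this point, each line in the pipeline is a complete set of arguments for curl - the same as the example command above, minus the leading curl:

http://aria2_url:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1,"method": "aria2.addUri", "params":["token:secret_token", ["http://website.com/file"], {"pause":"true", "http-user":"apache_user", "http-passwd":"apache_password"}]}'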

xargs -L 1 curl

Once everything's put together, we feed each line to curl with xargs: -L 1 runs one curl per line, and xargs' default quote handling keeps the single-quoted JSON together as one argument. A successful addition to the queue looks like this:

{"id":1,"jsonrpc":"2.0","result":"721db74ea91db42c"}

Why are downloads added paused?

Due to the limited bandwidth of our office connection, we only run big downloads outside of office hours and restrict speeds to avoid hitting our monthly cap. You can change "pause":"true" to "pause":"false" if you prefer.

To automatically start and stop downloads at certain times, you can add crontab entries to the server you host aria2 on:

# Pause aria2 downloads at 8am and 2pm, but remove the speed limit
0 8,14 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.pauseAll", "params":["token:secret_token"]}'
0 8,14 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.changeGlobalOption", "params":["token:secret_token",{"max-overall-download-limit":"0"}]}'

# Resume downloads at 12pm and 5pm but limit speed to 80KB/s
0 12,17 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.unpauseAll", "params":["token:secret_token"]}'
0 12,17 * * 1-5 curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.changeGlobalOption", "params":["token:secret_token",{"max-overall-download-limit":"80K"}]}'
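
To check which limit is currently in force, aria2 also has a getGlobalOption method:

curl http://127.0.0.1:6800/jsonrpc -H "Content-Type: application/json" -H "Accept: application/json" --data '{"jsonrpc": "2.0","id":1, "method": "aria2.getGlobalOption", "params":["token:secret_token"]}'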

Caveats

  • wget --spider will download text files and any file missing a Content-Type header in order to check them for further links. Apache serves a header for most common types but does miss a few, and the DefaultType option has been deprecated so you can't set, say, application/octet-stream for anything unknown. It's therefore sensible to run this script on the server hosting the directory indexes so you're not waiting on those downloads (even though they're deleted immediately afterwards).
