Archiving everything I like with youtube-dl

Continuing on the theme of "link rot bad, hard drives cheap", a year or so ago I started archiving videos I'd liked or saved to YouTube playlists. You can do this manually without too much trouble but I chucked it in a shell script to run regularly, keeping as much metadata as possible. Here it is!

#!/bin/bash

# Archive youtube videos from a list of channels/playlists, in up to selected quality,
# with formatted filenames and all available metadata in sidecar files.
#
# Note: this probably relies on having an up-to-date youtube-dl, so we run
# youtube-dl -U in the root crontab an hour before this script runs

# Settings
quality='bestvideo[height<=?1080]+bestaudio/best[height<=?1080]'
# If we ever get infinite hard drive space:
#quality='bestvideo+bestaudio/best'
# Batch file of URLs to download
batch_file='youtube-list.txt'
# File to pull youtube cookies from (for private videos and liked playlist)
cookies_file='youtube-cookies.txt'
# Don't download anything absurdly-sized at all (to download oversized videos in worse
# quality instead, add a filter to the quality definition like [height<=?1080][filesize<10G])
max_filesize='10G'
# Clone the user agent of the browser that the account is logged in with
user_agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
# Bind to different IP in case of geo-blocks
# Country 1
source_IP='111.111.111.111'
# Country 2
#source_IP='222.222.222.222'
# ipv6 1 (etc)
#source_IP='2000:aaaa:b:cccc::'
# Limit download rate and sleep a random number of seconds between downloads to avoid IP blocks
rate_limit='5M'
sleep_min='10'
sleep_max='15'
# Set folder and filename format, and an archive file to avoid redownloading completed videos
filename_format='youtube/%(playlist)s/%(playlist_index)05d - %(title)s - %(id)s - %(upload_date)s.%(ext)s'
archive_file='youtube/youtube-dl-downloaded.txt'

# Change to directory this script is in (for cron etc)
cd "$(dirname "$0")" || { echo 'Failed to change directory, giving up'; exit 1; }

# Explanations
#-sv: simulate verbose for testing
#--playlist-items 1-3: first few only for testing
#--restrict-filenames: replace special characters in case files need transferring to Windows etc
#--no-overwrites: do not overwrite existing files
#--continue: resume partially downloaded files
#--ignore-errors: continue even if a video is unavailable (taken down etc)
#--ignore-config: don't read usual config files
#--download-archive $archive_file: use an archive file to avoid redownloading already-downloaded videos
#--yes-playlist: download the whole playlist, in case we pass a video+playlist link
#--playlist-reverse: may be necessary if index starts from most recent addition?
#--write-description: write video description to a .description file
#--write-info-json: write video metadata to a .info.json file
#--write-annotations: write annotations to a .annotations.xml file, why not
#--write-thumbnail: write thumbnail image to disk
#--write-sub: write subtitles (but not autogenerated)
#--embed-subs: also add them to the video file, why not
#--add-metadata: add metadata to video file

# Use --cookies to temporarily pass cookies (note: the file must use UNIX newlines;
# convert with Notepad++ or dos2unix if needed)
# Run via python3 to work around youtube-dl's cookie handling breaking under python2:
# https://github.com/ytdl-org/youtube-dl/issues/28640
python3 /usr/bin/youtube-dl \
--cookies "$cookies_file" \
--batch-file "$batch_file" \
--output "$filename_format" \
--format "$quality" \
--user-agent "$user_agent" \
--source-address "$source_IP" \
--max-filesize "$max_filesize" \
--limit-rate "$rate_limit" \
--sleep-interval "$sleep_min" \
--max-sleep-interval "$sleep_max" \
--restrict-filenames \
--no-overwrites \
--no-warnings \
--continue \
--ignore-errors \
--ignore-config \
--download-archive "$archive_file" \
--yes-playlist \
--playlist-reverse \
--write-description \
--write-info-json \
--write-annotations \
--write-thumbnail \
--write-sub \
--sub-lang en \
--embed-subs \
--add-metadata


You'll need the wonderful youtube-dl to run this. It should be fairly self-explanatory, but there are a few bits I find especially useful.
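As the note at the top of the script says, it relies on youtube-dl being up to date, so the self-updater runs from the root crontab an hour before the archive job. A minimal sketch of what that looks like (the times and script path here are placeholders, not my actual setup):

# Update youtube-dl at 3am, then run the archive script at 4am
0 3 * * * /usr/bin/youtube-dl -U
0 4 * * * /path/to/youtube-archive.sh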

I limit video quality to the best up-to-1080p possible, since 4K videos can be huge and I'm not fussed for an archive. I also put a hard limit on filesize to avoid downloading any 10-hour videos, but you have the option to get them in lower quality instead. I keep the URLs to download in a separate file: these can be individual videos, entire channels or playlists, one on each line.
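For reference, the batch file is just plain text, something like this (the URLs are made-up placeholders; youtube-dl skips lines starting with #):

# youtube-list.txt: one video, channel or playlist URL per line
https://www.youtube.com/playlist?list=PLxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
https://www.youtube.com/channel/UCxxxxxxxxxxxxxxxxxxxxxx
https://www.youtube.com/watch?v=dQw4w9WgXcQ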

You can make your own playlists unlisted if you don't want them public but still want to be able to download them with this script. Unfortunately there is one case where this doesn't work: your liked videos playlist is always private and can't be changed. youtube-dl does let you pass in your Google account username and password, but I find this rarely works, so instead you can export your YouTube cookies (using something like this extension on a YouTube page), dump them in a .txt file and point youtube-dl at them. It's probably sensible to clone your browser's user agent too, and set some rate limits so as not to abuse their hospitality too much.
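One gotcha mentioned in the script comments: the cookies file must use UNIX newlines. Notepad++ can convert it, or on the box itself something like this does the job (assuming the filename from the settings above):

# Strip Windows carriage returns from the exported cookies file
sed -i 's/\r$//' youtube-cookies.txt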

Since some videos will inevitably be geo-restricted and I have a few IPs pointing to my box that geolocate to different countries, I'll occasionally let it do a run from somewhere else to sweep up any videos that might have been missed.
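With the script as written, that just means swapping which source_IP line is uncommented and re-running. A small tweak (my suggestion, not part of the original script) would let an environment variable override it for a one-off sweep:

# In the settings section: use an override if one is set, else the default
source_IP="${SOURCE_IP_OVERRIDE:-111.111.111.111}"

# Then an occasional run from the second country (placeholder IP and script name):
SOURCE_IP_OVERRIDE='222.222.222.222' ./youtube-archive.sh

Since completed videos are recorded in the archive file, the second pass only fetches whatever the first one missed.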

Although I save metadata anyway, I try to make the output format descriptive enough that I could live without it. I save each video to a folder named for its playlist/channel, and name the video with its position in the playlist, title, video ID and upload date. Reversing the playlist order means the position index starts from the first video added to the playlist; otherwise, when more videos are added, the latest becomes the new number 1 and your index becomes useless.
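With the format string above (and --restrict-filenames swapping awkward characters for underscores), a downloaded file ends up looking something like this, where the playlist name, title, ID and date are invented for illustration:

youtube/My_Favourites/00001 - Some_Video_Title - dQw4w9WgXcQ - 20200101.mp4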

Next post: doing something with them!
