{"id":270,"date":"2021-10-27T19:00:37","date_gmt":"2021-10-27T18:00:37","guid":{"rendered":"https:\/\/asdfghjkl.me.uk\/blog\/?p=270"},"modified":"2021-10-27T19:02:04","modified_gmt":"2021-10-27T18:02:04","slug":"archive-youtube","status":"publish","type":"post","link":"https:\/\/asdfghjkl.me.uk\/blog\/archive-youtube\/","title":{"rendered":"Archiving everything I like with youtube-dl"},"content":{"rendered":"\n<p>Continuing on the theme of \"link rot bad, hard drives cheap\", a year or so ago I started archiving videos I'd liked or saved to YouTube playlists. You can do this manually without too much trouble but I chucked it in a shell script to run regularly, keeping as much metadata as possible. Here it is!<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"Bash\" data-shcb-language-slug=\"bash\"><span><code class=\"hljs language-bash shcb-code-table shcb-line-numbers shcb-wrap-lines\"><span class='shcb-loc'><span><span class=\"hljs-meta\">#!\/bin\/bash<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-meta\"><\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Archive youtube videos from a list of channels\/playlists, in up to selected quality,<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># with formatted filenames and all available metadata in sidecar files.<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Note: this probably relies on having an up-to-date youtube-dl, so we run<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># youtube-dl -U in the root crontab an hour before this script runs<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Settings<\/span>\n<\/span><\/span><span 
class='shcb-loc'><span>quality=<span class=\"hljs-string\">'bestvideo&#91;height&lt;=?1080]+bestaudio\/best&#91;height&lt;=?1080]'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># If we ever get infinite hard drive space:<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#quality='bestvideo+bestaudio\/best'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Batch file of URLs to download<\/span>\n<\/span><\/span><span class='shcb-loc'><span>batch_file=<span class=\"hljs-string\">'youtube-list.txt'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># File to pull youtube cookies from (for private videos and liked playlist)<\/span>\n<\/span><\/span><span class='shcb-loc'><span>cookies_file=<span class=\"hljs-string\">'youtube-cookies.txt'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Don't download anything absurdly-sized at all (if prefer to download but in worse quality,<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># add to quality definition instead like &#91;height&lt;=?1080]&#91;filesize&lt;10G]<\/span>\n<\/span><\/span><span class='shcb-loc'><span>max_filesize=<span class=\"hljs-string\">'10G'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Clone current useragent (that account is logged in as)<\/span>\n<\/span><\/span><span class='shcb-loc'><span>user_agent=<span class=\"hljs-string\">'Mozilla\/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/83.0.4103.116 Safari\/537.36'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Bind to different IP in case of geo-blocks<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Country 1<\/span>\n<\/span><\/span><span class='shcb-loc'><span>source_IP=<span 
class=\"hljs-string\">'111.111.111.111'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Country 2<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#source_IP='222.222.222.222'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># ipv6 1 (etc)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#source_IP='2000:aaaa:b:cccc::'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Limit download rate and sleep a random number of seconds between downloads to avoid IP blocks<\/span>\n<\/span><\/span><span class='shcb-loc'><span>rate_limit=<span class=\"hljs-string\">'5M'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>sleep_min=<span class=\"hljs-string\">'10'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>sleep_max=<span class=\"hljs-string\">'15'<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Set folder and filename format, and an archive file to avoid redownloading completed videos<\/span>\n<\/span><\/span><span class='shcb-loc'><span>filename_format=<span class=\"hljs-string\">'youtube\/%(playlist)s\/%(playlist_index)05d - %(title)s - %(id)s - %(upload_date)s.%(ext)s'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>archive_file=<span class=\"hljs-string\">'youtube\/youtube-dl-downloaded.txt'<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Change to directory this script is in (for cron etc); quoted so paths containing spaces work<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-built_in\">cd<\/span> \"$(dirname \"<span class=\"hljs-variable\">$0<\/span>\")\" || { <span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-string\">'Failed to change directory, giving up'<\/span>; <span class=\"hljs-built_in\">exit<\/span> 1; }\n<\/span><\/span><span 
class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Explanations<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#-sv: simulate verbose for testing<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--playlist-items 1-3: first few only for testing<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--restrict-filenames: replace special characters in case need to transfer to Windows etc<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--no-overwrites: do not overwrite existing files<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--continue: resume partially downloaded files<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--ignore-errors: continue even if a video is unavailable (taken down etc)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--ignore-config: don't read usual config files<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--download-archive $archive_file: use an archive file to avoid redownloading already-downloaded videos<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--yes-playlist: download the whole playlist, in case we pass a video+playlist link<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--playlist-reverse: may be necessary if index starts from most recent addition?<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--write-description: write video description to a .description file<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--write-info-json: write video metadata to a .info.json file<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span 
class=\"hljs-comment\">#--write-annotations: write annotations to a .annotations.xml file, why not<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--write-thumbnail: write thumbnail image to disk<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--write-sub: write subtitles (but not autogenerated)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--embed-subs: also add them to the video file, why not<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\">#--add-metadata: add metadata to video file<\/span>\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># Use --cookies to temporarily pass cookies (note must be in UNIX newline format, use notepad++ to convert)<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># fix youtube-dl not working with cookies in python2<\/span>\n<\/span><\/span><span class='shcb-loc'><span><span class=\"hljs-comment\"># https:\/\/github.com\/ytdl-org\/youtube-dl\/issues\/28640<\/span>\n<\/span><\/span><span class='shcb-loc'><span>python3 \/usr\/bin\/youtube-dl \\\n<\/span><\/span><span class='shcb-loc'><span>--cookies <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$cookies_file<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--batch-file <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$batch_file<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--output <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$filename_format<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--format <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$quality<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--user-agent <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$user_agent<\/span>\"<\/span> 
\\\n<\/span><\/span><span class='shcb-loc'><span>--<span class=\"hljs-built_in\">source<\/span>-address <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$source_IP<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--max-filesize <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$max_filesize<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--<span class=\"hljs-built_in\">limit<\/span>-rate <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$rate_limit<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--sleep-interval <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$sleep_min<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--max-sleep-interval <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$sleep_max<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--restrict-filenames \\\n<\/span><\/span><span class='shcb-loc'><span>--no-overwrites \\\n<\/span><\/span><span class='shcb-loc'><span>--no-warnings \\\n<\/span><\/span><span class='shcb-loc'><span>--<span class=\"hljs-built_in\">continue<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--ignore-errors \\\n<\/span><\/span><span class='shcb-loc'><span>--ignore-config \\\n<\/span><\/span><span class='shcb-loc'><span>--download-archive <span class=\"hljs-string\">\"<span class=\"hljs-variable\">$archive_file<\/span>\"<\/span> \\\n<\/span><\/span><span class='shcb-loc'><span>--yes-playlist \\\n<\/span><\/span><span class='shcb-loc'><span>--playlist-reverse \\\n<\/span><\/span><span class='shcb-loc'><span>--write-description \\\n<\/span><\/span><span class='shcb-loc'><span>--write-info-json \\\n<\/span><\/span><span class='shcb-loc'><span>--write-annotations \\\n<\/span><\/span><span class='shcb-loc'><span>--write-thumbnail \\\n<\/span><\/span><span class='shcb-loc'><span>--write-sub \\\n<\/span><\/span><span class='shcb-loc'><span>--sub-lang en \\\n<\/span><\/span><span 
class='shcb-loc'><span>--embed-subs \\\n<\/span><\/span><span class='shcb-loc'><span>--add-metadata\n<\/span><\/span><span class='shcb-loc'><span>\n<\/span><\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Bash<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">bash<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>You'll need the wonderful <a href=\"https:\/\/github.com\/ytdl-org\/youtube-dl\">youtube-dl<\/a> to run this. It should be fairly self-explanatory, but there are a few bits I find especially useful.<\/p>\n\n\n\n<p>I limit video quality to the best available up to 1080p, since 4K videos can be huge and I'm not fussed about resolution for an archive. I also put a hard limit on filesize to avoid downloading any 10-hour videos; if you'd rather get those in lower quality than skip them, add a filesize filter to the quality definition instead. I keep the URLs to download in a separate file: these can be individual videos, entire channels or playlists, one on each line.<\/p>\n\n\n\n<p>You can make your own playlists unlisted if you don't want them public but still want to be able to download them with this script. Unfortunately there is one case where this doesn't work \u2013 your liked videos playlist is always private and can't be changed. youtube-dl does let you pass in the username and password for your Google account, but I find this rarely works, so instead you can export your YouTube cookies (using something like <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/get-cookiestxt\/bgaddhkoddajcdgocldbbfleckgcbcid\">this extension<\/a> on a YouTube page), dump them in a .txt file and point youtube-dl to them. 
It's probably sensible to clone your browser's user agent too, and set some rate limits so you don't abuse their hospitality too much.<\/p>\n\n\n\n<p>Since some videos will inevitably be geo-restricted and I have a few IPs pointing to my box that geolocate to different countries, I'll occasionally let it do a run from somewhere else to sweep up any videos that might have been missed.<\/p>\n\n\n\n<p>Although I save metadata anyway, I try to make the output format descriptive enough that I could live without it. I save each video to a folder named for its playlist\/channel, and name the video with its position in the playlist, title, video ID and upload date. Reversing the playlist order means the position index starts from the first video added to the playlist \u2013 otherwise when more videos are added, the latest becomes the new number 1 and your index becomes useless.<\/p>\n\n\n\n<p>Next post: doing something with them!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Continuing on the theme of \"link rot bad, hard drives cheap\", a year or so ago I started archiving videos I'd liked or saved to YouTube playlists. You can do this manually without too much trouble but I chucked it in a shell script to run regularly, keeping as much metadata as possible. 
Here it [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[10,6],"tags":[],"_links":{"self":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/270"}],"collection":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/comments?post=270"}],"version-history":[{"count":4,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/270\/revisions"}],"predecessor-version":[{"id":336,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/posts\/270\/revisions\/336"}],"wp:attachment":[{"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/media?parent=270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/categories?post=270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/asdfghjkl.me.uk\/blog\/wp-json\/wp\/v2\/tags?post=270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}