2022-07-22
unarchive, unarchive_collection - download from the Internet Archive
unarchive [option…] item . . .
unarchive [option…] < items
unarchive_collection [option…] collection . . .
unarchive_collection [option…] < collections
unarchive is a tool to download items from the Internet Archive. Each item argument given to unarchive is the identifier of an item on the Internet Archive, with the corresponding URL https://archive.org/details/item. For each item, an index is retrieved from the Internet Archive, and the files contained within item are retrieved and verified, if they are not already present on the filesystem.
Each item is stored in its own directory. Only original files and metadata are downloaded - derivatives (transcoded files, music waveform images, video thumbnails, …) are not downloaded. Each item folder will have two extra files that are not part of the item on the server:
__unarchive.md5 containing checksums;
__unarchive.log containing download log.
unarchive_collection is a tool to download collections of items from the Internet Archive. Each collection argument is the identifier of a collection on the Internet Archive, with the corresponding URL https://archive.org/details/collection. For each collection, an index is retrieved from the Internet Archive, and the items contained within it are retrieved using unarchive.
Each collection is stored in its own directory, with each item of the collection stored in its own subdirectory. Each collection folder will have two extra files that are not part of the collection on the server:
__unarchive_collection.xml containing search results;
__unarchive_collection.log containing download log.
Metadata downloaded from https://archive.org is implicitly trusted, which may be a security risk, especially on untrusted networks. It is recommended to make backups and run with reduced privileges (e.g. using a chroot jail).
Later options take priority over earlier options. That is to say, unarchive -q -v will be verbose (not quiet). All options below apply to both unarchive and unarchive_collection, apart from those listed under Collection Options, which are only applicable for unarchive_collection.
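The priority rule can be illustrated with a minimal shell sketch. This is not the actual option parser of unarchive: the function name parse_verbosity is hypothetical, and only the -q and -v options mentioned above are handled.

```shell
# Hypothetical sketch: later options override earlier ones.
parse_verbosity() {
    verbose=1                  # verbose is the documented default
    for arg in "$@"; do
        case "$arg" in
            -q) verbose=0 ;;   # quieter
            -v) verbose=1 ;;   # louder
        esac
    done
    echo "$verbose"
}

parse_verbosity -q -v          # prints 1: -v came last, so verbose wins
parse_verbosity -v -q          # prints 0: -q came last, so quiet wins
```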
Print a help message and exit.
Display version information and exit.
Be quieter. This hides the messages saying checksums are OK. Other status messages (including errors) are still shown.
Be louder. This shows the messages saying checksums are OK. This is the default.
Colourize the output. This is the default if $NO_COLOR is unset or empty.
Don’t colourize the output. This is the default if $NO_COLOR is set and non-empty.
Show download progress bars. This is the default. The progress bar style can be adjusted by configuring wget(1).
Hide download progress bars.
Maximum number of items to retrieve from the index (default 10000). Items are sorted by date, newest first.
If $NO_COLOR is set and non-empty, it will have the same effect as if --no-colour were specified at the beginning of the command line.
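The "set and non-empty" condition corresponds to a standard shell test. The following is a minimal sketch of that check only; the function name colour_default is hypothetical and is not part of unarchive.

```shell
# Hypothetical helper: report the default colour setting implied by $NO_COLOR.
colour_default() {
    if [ -n "${NO_COLOR:-}" ]; then
        echo off               # set and non-empty: colour disabled by default
    else
        echo on                # unset or empty: colour enabled by default
    fi
}

( unset NO_COLOR; colour_default )   # prints on
( NO_COLOR=;      colour_default )   # prints on
( NO_COLOR=1;     colour_default )   # prints off
```

Because later options take priority, an explicit colour option on the command line would still override this default.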
Successful program execution.
One or more errors occurred.
There are two mutually exclusive ways to use unarchive and unarchive_collection: with identifiers given as command line arguments, or with identifiers read from standard input.
For example, the identifier of https://archive.org/details/lab08 is lab08. It is used in these examples because it is small (download size less than 5 MB).
To download an item:
unarchive lab08
To verify without downloading anything at all:
( cd lab08 && md5sum -c __unarchive.md5 )
Or, to verify that the local files are the same as those currently on the server:
unarchive lab08
If the files have already been downloaded and are still the same as on the server, no downloading occurs (apart from the _files.xml metadata, which is always requested but only downloaded if newer than the file on disk). Otherwise the download will be resumed.
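The offline check with md5sum -c can be tried without downloading anything. The sketch below is self-contained and assumes an md5sum with the -c option (as in GNU coreutils); file.txt and manifest.md5 are hypothetical stand-ins for an item file and __unarchive.md5.

```shell
demo=$(mktemp -d)
printf 'hello\n' > "$demo/file.txt"

# Record the checksum, as a stand-in for __unarchive.md5.
( cd "$demo" && md5sum file.txt > manifest.md5 )

# Verification of the untouched file succeeds.
first=$( cd "$demo" && LC_ALL=C md5sum -c manifest.md5 )
echo "$first"                  # file.txt: OK

# After the file changes, verification fails with a non-zero exit status.
printf 'broken\n' > "$demo/file.txt"
second=$( cd "$demo" && LC_ALL=C md5sum -c manifest.md5 2>/dev/null ) || true
echo "$second"                 # file.txt: FAILED

rm -r "$demo"
```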
Backups are made before partial downloads are resumed, in case the file has changed on the server (which would cause data corruption). If the file has not changed after all, the backup of the partial data is not deleted. If the file did change, and the result is a mishmash of both, there are two ways to proceed: restore from backup (for old version), or delete and retry (for new version). Inspect the output for FAILED messages (which are highlighted in red when colour is enabled). This seems to happen most commonly with _meta.xml files and similar.
To simulate a partial download (for example due to network issues):
echo '<?xml version="1.0" encoding="UTF-8"?>' > lab08/lab08_meta.xml
unarchive lab08
To simulate a changed file or otherwise broken download:
echo 'broken' > lab08/lab08_meta.xml
unarchive lab08
Deleting the broken file is the recommended way to proceed after a failed resume of a partial download:
rm lab08/lab08_meta.xml
unarchive lab08
Alternatively you can restore the old version from the backup.
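The backup-then-resume idea can be simulated locally. This is only a sketch of the general technique, not the way unarchive itself is implemented: the file names (old.bin, new.bin, data.bin) and the .bak suffix are hypothetical, and the actual backup naming used by unarchive is not stated here.

```shell
work=$(mktemp -d)
printf 'AAAAAAAA' > "$work/old.bin"               # file as it used to be on the server
printf 'BBBBBBBB' > "$work/new.bin"               # file as it is now on the server
expected=$(md5sum < "$work/new.bin" | cut -d' ' -f1)

head -c 4 "$work/old.bin" > "$work/data.bin"      # partial download of the old version
cp -p "$work/data.bin" "$work/data.bin.bak"       # back up before resuming (hypothetical name)
tail -c 4 "$work/new.bin" >> "$work/data.bin"     # resuming appends bytes of the new version

actual=$(md5sum < "$work/data.bin" | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
    outcome=verified
else
    # The result is a mix of both versions: restore the partial old data.
    cp -p "$work/data.bin.bak" "$work/data.bin"
    outcome=restored
fi
echo "$outcome"
```

Deleting data.bin and downloading again, instead of restoring, would yield the new version.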
unarchive works for single items. unarchive_collection downloads whole collections of items (specified on command line, or via standard input), using unarchive to download each item.
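Reading identifiers from standard input might look like the following sketch. The input format is an assumption (one identifier per line); the guard only makes the snippet harmless on systems where unarchive is not installed.

```shell
# Build a list of item identifiers (assumed format: one per line).
items_file=$(mktemp)
printf '%s\n' lab08 > "$items_file"

# Feed the list on standard input instead of passing arguments.
if command -v unarchive >/dev/null 2>&1; then
    unarchive < "$items_file"
else
    echo "unarchive not installed; would read: $(cat "$items_file")"
fi
```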
To download all the releases from the GOSUB10 netlabel (total download size 575MB, as of 2022-06-27):
unarchive_collection gosub10
unarchive_collection has an option to limit the number of items per collection (sorted newest first).
To download the most recent release from the Bump Foot netlabel (total download size 137MB, for bump221, as of 2022-06-27):
unarchive_collection -n 1 bumpfoot
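The effect of a per-collection limit on a newest-first index can be sketched with standard tools. The dates and identifiers below are made up for illustration, and the real index is an XML search result rather than this two-column text.

```shell
limit=2

# Hypothetical index: one "date identifier" pair per line, in arbitrary order.
index='2021-03-01 item-a
2022-06-27 item-c
2020-11-15 item-b'

# Sort newest first (ISO dates sort correctly as text), then keep the first $limit entries.
selected=$(printf '%s\n' "$index" | sort -r | head -n "$limit" | cut -d' ' -f2)
echo "$selected"
```

With limit=2 this keeps item-c and item-a, the two most recent entries, and drops item-b.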
2007 – original prototype
2008 – verify download checksums
2009 – whole collection unarchiver
2010 – command line arguments, release v0.3
2011, 2014, 2016 – various bugfixes
2022 – rewritten with many improvements, release 1.0
Copyright (C) 2022 Claude Heiland-Allen.
License AGPL-3.0-only: GNU Affero GPL version 3 https://www.gnu.org/licenses/agpl-3.0.html.
This is free software: you are free to change and redistribute it. There is NO WARRANTY.
The unarchive home page is at https://mathr.co.uk/unarchive.
unarchive(1), unarchive_collection(1), wget(1)