UNARCHIVE

Claude Heiland-Allen

2022-07-22

NAME

unarchive, unarchive_collection - download from the Internet Archive

SYNOPSIS

unarchive [option…] item…
unarchive [option…] < items
unarchive_collection [option…] collection…
unarchive_collection [option…] < collections

DESCRIPTION

unarchive is a tool to download items from the Internet Archive. Each item argument given to unarchive is the identifier of an item on the Internet Archive, with the corresponding URL https://archive.org/details/item. For each item, an index is retrieved from the Internet Archive, and the files contained within item are retrieved and verified, if they are not already present on the filesystem.

Each item is stored in its own directory. Only original files and metadata are downloaded; derivatives (transcoded files, music waveform images, video thumbnails, …) are not. Each item folder will have two extra files that are not part of the item on the server:

__unarchive.md5 containing checksums;
__unarchive.log containing the download log.

unarchive_collection is a tool to download collections of items from the Internet Archive. Each collection argument is the identifier of a collection on the Internet Archive, with the corresponding URL https://archive.org/details/collection. For each collection, an index is retrieved from the Internet Archive, and the items contained within it are retrieved using unarchive.

Each collection is stored in its own directory, with each item of the collection stored in its own subdirectory. Each collection folder will have two extra files that are not part of the collection on the server:

__unarchive_collection.xml containing search results;
__unarchive_collection.log containing the download log.

SECURITY

Metadata downloaded from https://archive.org is implicitly trusted, which may be a security risk, especially on untrusted networks. It is recommended to make backups and run with reduced privileges (e.g. using a chroot jail).

OPTIONS

Later options take priority over earlier options. That is to say, unarchive -q -v will be verbose (not quiet). All options below apply to both unarchive and unarchive_collection, apart from those listed under Collection Options, which are only applicable for unarchive_collection.

Program Information

-?, -h, --help: Print a help message and exit.
-V, --version: Display version information and exit.

Verbosity

-q, --quiet, --no-verbose: Be quieter. This hides the messages saying checksums are OK. Other status messages (including errors) are still shown.
-v, --verbose, --no-quiet: Be louder. This shows messages saying checksums are OK. This is the default.

Colour

-c, --color, --colour: Colourize the output. This is the default if $NO_COLOR is unset or empty.
-C, --no-color, --no-colour: Don’t colourize the output. This is the default if $NO_COLOR is set and non-empty.

Progress Bars

-p, --progress: Show download progress bars. This is the default. The progress bar style can be adjusted by configuring wget(1).
-P, --no-progress: Hide download progress bars.

Collection Options

-n n, --count n: Maximum number of items to retrieve from the index (default 10000). Items are sorted by date, newest first.

ENVIRONMENT

NO_COLOR: If $NO_COLOR is set and non-empty, it will have the same effect as if --no-colour were specified at the beginning of the command line.
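
The interaction between $NO_COLOR and the colour options can be sketched in a few lines of shell. This is only an illustration of the documented rule (environment first, then later options override earlier ones), not the code used by unarchive; the function name resolve_colour is hypothetical, and combined short options are not handled.

```shell
# resolve_colour [option...]: print "yes" if output would be colourized
# under the documented rules, "no" otherwise.
# The body runs in a subshell so its variables do not leak to the caller.
resolve_colour() (
    # Default from the environment: NO_COLOR set and non-empty disables colour.
    if [ -n "${NO_COLOR:-}" ]; then
        colour=no
    else
        colour=yes
    fi
    # Options are examined left to right, so the last one wins.
    for arg in "$@"; do
        case $arg in
            -c|--color|--colour) colour=yes ;;
            -C|--no-color|--no-colour) colour=no ;;
        esac
    done
    echo "$colour"
)
```

With NO_COLOR=1 in the environment, resolve_colour prints no, while resolve_colour --colour prints yes because the explicit option comes later than the implied --no-colour.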

EXIT STATUS

0: Successful program execution.
1: One or more errors occurred.
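
Because any error yields a non-zero exit status, a calling script can branch on the result. The following is a minimal sketch of one way to do that: retry is a hypothetical helper (not part of unarchive) that re-runs a command until it exits with status 0. It relies on the documented behaviour that files already present and verified are not downloaded again. A file that has changed on the server may keep failing until it is dealt with by hand.

```shell
# retry N CMD...: run CMD until it exits with status 0, at most N times.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    retry_max=$1
    shift
    retry_n=1
    while [ "$retry_n" -le "$retry_max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $retry_n of $retry_max failed" >&2
        retry_n=$((retry_n + 1))
    done
    return 1
}
```

For example:

retry 3 unarchive -q lab08 || echo "lab08 still has errors" >&2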

EXAMPLES

There are two mutually exclusive ways to use unarchive and unarchive_collection:

provide identifiers as command line arguments
provide identifiers on standard input

For example, the identifier of https://archive.org/details/lab08 is lab08. It is used in these examples because it is small (download size less than 5 MB).

Download

To download an item:

unarchive lab08
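
Identifiers can instead be supplied on standard input, which suits a list kept in a file. The sketch below assumes that unarchive reads one identifier per line (the input format is not specified above) and that '#' never occurs inside an identifier. The helper clean_ids and the file name items.txt are hypothetical.

```shell
# clean_ids: read a list of identifiers on standard input and print one per
# line, dropping '#' comments, surrounding whitespace and blank lines.
clean_ids() {
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}
```

For example, with an annotated list in items.txt:

clean_ids < items.txt | unarchive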

Verifying Downloads

To verify without downloading anything at all:

( cd lab08 && md5sum -c __unarchive.md5 )

To verify that the files match those currently on the server:

unarchive lab08

If the files have already been downloaded and still match those on the server, nothing is downloaded, apart from the _files.xml metadata, which is always requested but only downloaded if it is newer than the file on disk. Otherwise, the download is resumed.
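
The offline check can be repeated over several item directories at once. The following sketch assumes GNU md5sum (for the --quiet option, which suppresses the OK lines); verify_items is a hypothetical helper, not part of unarchive.

```shell
# verify_items DIR: check every item directory directly inside DIR against
# its __unarchive.md5 file, without contacting the server.
# Returns 0 if all items verify, 1 otherwise.
verify_items() {
    verify_status=0
    for verify_sums in "$1"/*/__unarchive.md5; do
        # Skip the unexpanded pattern when no item directories exist.
        [ -e "$verify_sums" ] || continue
        verify_dir=$(dirname "$verify_sums")
        # md5sum resolves file names relative to the current directory,
        # so run it from inside the item directory (in a subshell).
        if ! ( cd "$verify_dir" && md5sum -c --quiet __unarchive.md5 ); then
            echo "verification failed: $verify_dir" >&2
            verify_status=1
        fi
    done
    return "$verify_status"
}
```

Use verify_items . to check all items below the current directory, or pass a collection directory (for example verify_items gosub10) to check the items stored in its subdirectories.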

Resume Downloads (Partial File)

Backups are made before partial downloads are resumed, in case the file has changed on the server (which would cause data corruption). If the file has not changed after all, the backup of the partial data is not deleted. If the file did change, and the result is a mishmash of both versions, there are two ways to proceed: restore from the backup (to keep the old version), or delete the file and retry (to get the new version). Inspect the output for FAILED messages (which are highlighted in red when colour is enabled). This seems to happen most commonly with _meta.xml files and similar.

To simulate a partial download (for example due to network issues):

echo '<?xml version="1.0" encoding="UTF-8"?>' > lab08/lab08_meta.xml
unarchive lab08

To simulate a changed file or otherwise broken download:

echo 'broken' > lab08/lab08_meta.xml
unarchive lab08
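
After a resume, the files that still fail verification can be listed by re-running the offline check and keeping only the failing names. This is a sketch built on md5sum's own "name: FAILED" output lines; failed_files is a hypothetical helper, not part of unarchive.

```shell
# failed_files DIR: print the name of each file in item directory DIR whose
# checksum does not match __unarchive.md5 (missing files are listed too).
# The body runs in a subshell so the caller's directory is unchanged.
failed_files() (
    cd "$1" || exit 1
    # LC_ALL=C keeps md5sum's "FAILED" message untranslated.
    LC_ALL=C md5sum -c __unarchive.md5 2>/dev/null |
        sed -n 's/: FAILED.*$//p'
)
```

For example, failed_files lab08 prints nothing when every file matches, and one file name per line otherwise.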

Resume Downloads (Missing File)

Deleting the broken file is the recommended way to proceed after a failed resume of a partial download:

rm lab08/lab08_meta.xml
unarchive lab08

Alternatively you can restore the old version from the backup.

Unarchiving Whole Collections

unarchive works for single items. unarchive_collection downloads whole collections of items (specified on command line, or via standard input), using unarchive to download each item.

To download all the releases from the GOSUB10 netlabel (total download size 575MB, as of 2022-06-27):

unarchive_collection gosub10

Unarchiving Partial Collections

unarchive_collection has an option to limit the number of items per collection (sorted newest first).

To download the most recent release from the Bump Foot netlabel (total download size 137MB, for bump221, as of 2022-06-27):

unarchive_collection -n 1 bumpfoot

HISTORY

2007 – original prototype

2008 – verify download checksums

2009 – whole collection unarchiver

2010 – command line arguments, release v0.3

2011, 2014, 2016 – various bugfixes

2022 – rewritten with many improvements, release 1.0

COPYRIGHT

License AGPL-3.0-only: GNU Affero GPL version 3 https://www.gnu.org/licenses/agpl-3.0.html.

This is free software: you are free to change and redistribute it. There is NO WARRANTY.

The unarchive home page is at https://mathr.co.uk/unarchive.