Semi-automatic downloading from the Internet Archive

I've uploaded a lot of stuff to the Internet Archive (over 50 items at last count). Most of the audio I've uploaded is in FLAC format, and is currently sitting on a computer that no longer works (I turned it off, went away, came back, and it wouldn't turn on again). So while I figure out what to do (probably involving getting an external hard drive enclosure to rescue the data), I wanted to download the VBR MP3 versions so I can listen to them again. I didn't fancy manually downloading the files one by one with a web browser, so I ventured into the icky world of XML and XSLT, using bash to glue it all together.

Here's the script that glues it all together. Note that the Internet Archive kindly makes meta-data files available with predictable names, so they're easy to grab given an Internet Archive identifier. The script expects a list of such identifiers, one per line, on its standard input:

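# Download URL prefix (an assumption: adjust it if the Archive's layout differs).
base="https://archive.org/download"
while read -r ident
do
  mkdir -p "${ident}" &&
  wget -nv -O - "${base}/${ident}/${ident}_files.xml" |
  xsltproc unarchive.xsl - |
  while read -r file
  do
    wget -nv -c -P "${ident}" "${base}/${ident}/${file}"
  done
done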
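
To make the "predictable names" concrete, here's a small sketch that only prints the meta-data URL for each identifier instead of downloading anything. The `https://archive.org/download` prefix, the `metadata_url` helper and the two identifiers are assumptions made up for illustration.

```shell
#!/bin/sh
# Assumed download prefix; change it if the Archive's URL layout differs.
base="https://archive.org/download"

# Hypothetical helper: print the URL of the file listing for one identifier.
metadata_url() {
  printf '%s/%s/%s_files.xml\n' "$base" "$1" "$1"
}

# Read identifiers one per line, as the download script does.
urls=$(
  while read -r ident
  do
    metadata_url "$ident"
  done <<'EOF'
example-item
another-example
EOF
)
printf '%s\n' "$urls"
```

Swapping the here-document for a real list of identifiers on standard input gives a quick dry run before fetching anything.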

xsltproc is a tool that munges an XML file according to a stylesheet written in the XSLT language. unarchive.xsl generates a list of the filenames of all the files that are of format VBR MP3:

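<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
 <xsl:for-each select="files/file">
  <xsl:choose>
    <xsl:when test="format='VBR MP3'">
     <xsl:value-of select="@name" /><xsl:text>
</xsl:text>
    </xsl:when>
  </xsl:choose>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>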
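
To see what the stylesheet does without touching the network, here's a rough sketch that runs xsltproc over a tiny made-up file listing. The sample XML is not real Archive data; it is only shaped to match the XPath expressions (`files/file`, `format`, `@name`), and the embedded stylesheet is a compact variant that uses `xsl:if` for brevity.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
dir=$(mktemp -d)

# Made-up file listing with the structure the XPath expressions expect.
cat > "$dir/sample_files.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<files>
  <file name="track01.flac" source="original"><format>Flac</format></file>
  <file name="track01.mp3" source="derivative"><format>VBR MP3</format></file>
  <file name="track01.ogg" source="derivative"><format>Ogg Vorbis</format></file>
</files>
EOF

# Minimal stylesheet that prints the name of every VBR MP3 file.
cat > "$dir/unarchive.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
 <xsl:for-each select="files/file">
  <xsl:if test="format='VBR MP3'">
   <xsl:value-of select="@name" /><xsl:text>&#10;</xsl:text>
  </xsl:if>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
EOF

# Apply the stylesheet; only the MP3's name should come out.
names=$(xsltproc "$dir/unarchive.xsl" "$dir/sample_files.xml")
printf '%s\n' "$names"
rm -r "$dir"
```

With this sample the only line printed is `track01.mp3`, which is exactly the kind of list the inner download loop reads.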

If I'm bored one day I'll adapt it to download only "original" files (ignoring all the "derivative" files).

UPDATE: the adaptation is just changing "format='VBR MP3'" to "@source='original'".
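
If you'd rather keep both versions around, one possible way is to derive the second stylesheet with sed. This is only a sketch: the `to_original` helper and the output name `unarchive-original.xsl` are hypothetical, and it assumes the test is written in the stylesheet exactly as quoted above.

```shell
#!/bin/sh
# Rewrite the VBR MP3 test into a test for original files.
to_original() {
  sed "s/format='VBR MP3'/@source='original'/"
}

# Demonstrate on the one line that changes.
changed=$(printf '%s\n' "<xsl:when test=\"format='VBR MP3'\">" | to_original)
printf '%s\n' "$changed"

# With the real stylesheet in the current directory:
#   to_original < unarchive.xsl > unarchive-original.xsl
```

Pointing xsltproc at the derived stylesheet in the download script then fetches the originals instead of the MP3s.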