Automount for Hierarchical Storage Management

I first heard of storage hierarchies when learning the ideas behind the Tivoli Space Manager for AIX (formerly known as HSM, Hierarchical Storage Manager). There is similar software, such as SAMfs for Solaris or OTG Diskextender for Microsoft's operating systems, and others.

Hierarchical storage refers to the fact that there is a hierarchy of availability of data to a system. At the high end of the hierarchy is data in the CPU; at the low end is data on removable media.

This article suggests a way to deal with removable media while keeping them known to the system. It is not my objective to imitate the trademarked products mentioned above (they concentrate on moving files to tape). However, there are some similarities: HSM leaves a stub in the filesystem that indicates where to find the original file after it has been moved to tape or to another lower level of the hierarchy.

Who would need that?

Short answer: me - for my video and audio recordings.
Long answer: Everybody who systematically archives data on external media.

The data most frequently stored on external media, besides backups, is probably TV recordings, radio plays, MP3 music, etc. on CDROM or DVD. You do not want to put everything on hard disk for, say, 1 Euro/GB when you can use DVD for 0.1 Euro/GB and take it to friends or to the DVD player in the living room.

Another reason for wanting this is file serving. I have five hard disks and three DVD drives in my system (this concept does not need to stop at the system boundary, but that assumption makes the explanation simpler). Every time I am asked to put a DVD into a drive, I try the wrong drive first. I would like to access the media without caring about the drive. I want to replace media according to the media contents, not according to the drive properties.

Too theoretical? Here is an example: Why do I prefer SuSE Linux on DVD to CDROM? I start the update and go away for an hour. I do not like being asked to exchange the CDROM every 10 minutes. Even if I have six drives for all the CDROMs, I still have to exchange them in the first-named drive (or change the mount path every time). Here I describe a way to mount the correct volume no matter which drive is being used.

What are the restrictions?

Currently there is a first running implementation of this. I will try to improve the usability of the program interfaces, which has not been a focal point of the implementation efforts so far.

I also restricted myself to ISO filesystems. The main reason was that I found a very quick way to generate hash numbers from the contents without mounting the volumes. See below for the details.

In principle there is no need to consider a single system only. But there are other things to be done first. The current impetus is to mount available volumes quickly and to ask for those that are not accessible.

There are circumstances (e.g. 'ls' in color mode) where you do not want to be asked for media changes, but would prefer missing media to be ignored silently.

How does it work?

It is best described with an example. I start with a DVD that I burned myself and put it into a drive. I run the isoregister program, which generates the required links (yes, currently registration is manual). Then I look at my link directory:

# ls /video
.  ..  Godfell1.avi  Godfell2.avi
# ls -l /video
total 1024
drwxrwxrwt    9 aneuper  users        3256 2004-08-14 15:28 .
drwxr-xr-x    3 root     root          992 2004-08-13 20:24 ..
lrwxrwxrwx    1 aneuper  users          58 2004-03-20 18:20 Godfell1.avi
	 -> /hash/ae31386abe053e305ceb2b932e7bc005c6d89b70/Godfell1.avi
lrwxrwxrwx    1 aneuper  users          58 2004-03-20 18:20 Godfell2.avi
	 -> /hash/ae31386abe053e305ceb2b932e7bc005c6d89b70/Godfell2.avi
# mplayer /video/Godfell*

Whenever I now ask for /video/Godfell1.avi, the system tries to access the media with the SHA1 hash number ae31386abe053e305ceb2b932e7bc005c6d89b70. If the media is in a drive, I get its contents within a few seconds; otherwise I am asked to insert it.

Please note that this hash number is NOT computed over the complete image, but over the content table only. (Therefore it does not ensure the integrity of the content.)
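To see which volume a link will ask for, the hash number can be read back from the link itself. The following is only a sketch, assuming that link targets have the form /hash/hashnumber/mediapath as in the listing above; hash_of_link is a hypothetical helper name and not part of the distribution.

```shell
#!/bin/sh
# Sketch: find out which volume a link needs, assuming link targets
# look like /hash/<hashnumber>/<mediapath>.
# hash_of_link is a hypothetical helper, not part of the distribution.
hash_of_link() {
    target=$(readlink "$1") || return 1   # fails if $1 is not a symlink
    case "$target" in
    /hash/*/*)
        rest=${target#/hash/}             # strip the mount directory
        printf '%s\n' "${rest%%/*}"       # keep only the first component
        ;;
    *)  return 1 ;;                       # link does not point below /hash
    esac
}
```

Reading the link does not touch the automounter, so this works whether or not the media is in a drive.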

How is it implemented?

I do not want to talk about religion here, but I do believe in the KISS principle: Keep It Short and Simple. Therefore I restricted myself to four parts that the implementation builds on:

  1. shrinkfs, a wrapper for mkisofs/growisofs that helps to burn files to media and to replace them with the appropriate links afterwards.
  2. /etc/auto.hash, the automount config file (which is the most vital part)
  3. register, a script to create links (which is a useful helper)
  4. reference information files to identify the media (which is for convenience)

The common configuration data is (for SuSE installations) moved to a config file called /etc/sysconfig/automount.

I suggest keeping a conventional symbolic link that contains a hash number. The hash number is a reference to the media and should be unique. Each file on the media is available to the system by accessing:

/mountdir/hashnumber/mediapath

The mountdir is arbitrary in principle, but once you have selected it and generated links to it, you will hardly want to change it. The hashnumber is generated using the programs isoinfo and sha1sum. We may depend on certain releases of isoinfo, since changes in the layout of the '-l' output influence the checksum. The mediapath is identical to the path of the file on the filesystem when standard mount options apply.
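The hash generation and the path layout can be sketched in a few lines. This is a minimal sketch and not the code of the distribution: media_hash and link_target are hypothetical helper names, and the device name in the comments is only an example. media_hash reads the listing on standard input, so it works with live isoinfo output as well as with a saved listing.

```shell
#!/bin/sh
# Sketch of the hash generation: checksum the directory listing, not the image.
# media_hash is a hypothetical helper; it reads a listing on standard input
# and prints only the hexadecimal SHA1 digest.
media_hash() {
    sha1sum | cut -d' ' -f1
}

# link_target builds /mountdir/hashnumber/mediapath from its three parts.
# The mediapath is expected to start with a slash.
link_target() {
    printf '%s/%s%s\n' "$1" "$2" "$3"
}

# Typical use with a real drive (device name is an example):
#   HASH=$(isoinfo -l -i /dev/scd0 | media_hash)
#   ln -s "$(link_target /hash "$HASH" /Godfell1.avi)" /video/Godfell1.avi
```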

The Automounter Config File

The most vital part is mounting the media automatically when it is presented to the system.

For my current implementation I use the following line in /etc/auto.master:

/hash	program:/etc/auto.hash

Needless to say there should be an executable and correctly working /etc/auto.hash. I am currently testing this one:

#!/bin/sh
#
# This is currently for testing purposes only
#
KEY="$1"
LISTDIR=/hashlist
DEVICELIST="/images/*.iso /dev/scd0 /dev/hdc /dev/hdh"
# I ranked my drives by performance here,
# which is equivalent to how often I use them.
# The first hit will be returned (should we return all?)
#
if [ ! -d $LISTDIR ]
then
	mkdir $LISTDIR
fi
#
if [ -x /usr/bin/sha1sum ]
then
    HASHSUM=/usr/bin/sha1sum
else
    if [ -x /usr/bin/md5sum ]
    then
        HASHSUM=/usr/bin/md5sum
    else
        exit 1
    fi
fi
if [ -x /usr/bin/isoinfo ]
then
    ISOINFO=/usr/bin/isoinfo
else
    exit 1
fi
#
#
for DEVICE in $DEVICELIST
do
	HASHID=$($ISOINFO -l -i "$DEVICE" | $HASHSUM | /usr/bin/cut -d' ' -f1)
	if [ ! -f "$LISTDIR/$HASHID" ]
	then
		$ISOINFO -l -i "$DEVICE" >"$LISTDIR/$HASHID"
	fi
	if [ "$HASHID" = "$KEY" ]
	then
		case $(dirname $DEVICE) in
		/dev*)  echo -e "-fstype=iso9660,ro\t:$DEVICE"
			;;
		*)      echo -e "-fstype=iso9660,ro,loop\t:$DEVICE"
			;;
		esac
		exit 0
	fi
done
#/usr/X11R6/bin/xprompt -t $LISTDIR/$HASHID
exit 1
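The line that the script hands back to the automounter consists of the mount options, a tab, and the location. The following sketch isolates that step. map_entry is a hypothetical helper name; it uses printf because 'echo -e' is not understood by every /bin/sh, which is an assumption about systems other than the one this was tested on.

```shell
#!/bin/sh
# Sketch: print the automounter map entry for one device or image file.
# map_entry is a hypothetical helper; image files need the loop option,
# block devices below /dev do not.
map_entry() {
    case "$1" in
    /dev/*) opts="-fstype=iso9660,ro" ;;
    *)      opts="-fstype=iso9660,ro,loop" ;;
    esac
    # printf is used because 'echo -e' is not understood by every /bin/sh
    printf '%s\t:%s\n' "$opts" "$1"
}
```

The map program itself can be checked by hand in the same way: called with a hash number as its only argument, /etc/auto.hash should print one such line if the media is in a drive, and exit with status 1 otherwise.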

The Link Registration Script

The register script helps you to collect the information in a big link base, which is a kind of database. How the script should behave really depends on the media contents that you want to register.

Feel free to suggest new options (as long as you explain them).
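To give an idea of what a registration step involves, here is a rough sketch. It is NOT the register script of the distribution: register_links, its three arguments, and the flat link layout are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch of a registration step, NOT the register script itself:
# create one symbolic link per file found on a volume.
#   $1  mount directory (e.g. /hash)
#   $2  hash number of the volume
#   $3  link directory (e.g. /video)
# Assumes file names without newlines; links are created flat, by basename.
register_links() {
    mkdir -p "$3" || return 1
    find "$1/$2" -type f | while IFS= read -r file
    do
        # -f replaces a link that was registered earlier
        ln -sf "$file" "$3/$(basename "$file")"
    done
}
```

Because find walks /hash/hashnumber, the automounter mounts the volume on demand, so the media has to be in a drive while it is being registered.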

The Reference Directory

The reference directory is used to store information that helps you to find the requested CDROM. This distribution comes with a default collection of the CDROM/DVD header information (containing Volume ID, etc.) and a listing of the directory structure (produced by isoinfo).

This information pops up when you are asked to insert the media. Its presence is not vital if you have other means to identify the media (I recognise the media by the requested files, not by the hash number).
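Looking up the reference information for a requested volume could be sketched as follows. show_reference is a hypothetical helper name; the sketch assumes one listing file per volume, named after the hash number, as the automounter script writes them to its list directory.

```shell
#!/bin/sh
# Sketch: show what is known about a requested volume.
# The directory holds one listing per volume, named after the hash,
# as written by the automounter script. show_reference is hypothetical.
show_reference() {
    listdir=$1
    key=$2
    if [ -f "$listdir/$key" ]
    then
        echo "Please insert the volume containing:"
        cat "$listdir/$key"
    else
        echo "Please insert volume $key (no reference information stored)"
        return 1
    fi
}
```

A call such as show_reference /hashlist followed by the hash number prints the stored listing, or falls back to the bare hash number if the volume was never seen.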

How to do it better?

Whenever you build something that did not seem to exist before, you think afterwards that it could have been done better. However, I spent less time putting this together than I spent looking for an existing, working solution on the internet.

The point I am not happy about is the hash number. It is too proprietary from my point of view, since it allows only iso formats in the current version. I do not know a general solution yet. Do not forget, the identifier must be quick to obtain from the media and it must contain a component that identifies the media reliably.

Initially I wanted a human-readable identifier. But I found that the first three CDs I took had nearly identical ISO headers. Therefore I was looking for something unique, and I immediately thought of checksums.

You might immediately wonder how long it would take to calculate a checksum of a CDROM. You are right, time is relative, but nobody requires the whole CDROM to be examined. I suggest calculating a checksum of the directory structure. isoinfo can read the directory structure without mounting the volume. This is fast.

But there is a weak point: one minor change in the layout of the isoinfo output, and all checksums are wrong. Please suggest a better solution if you know one. Until then, I suppose it is best to check isoinfo against a standard volume.
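Such a check against a standard volume might look like the following sketch. listing_unchanged is a hypothetical helper name, and the image path and the variable KNOWN_HASH in the comments are placeholders: the idea is to record the hash number of one reference volume once and to compare it again after every isoinfo update.

```shell
#!/bin/sh
# Sketch: detect a changed isoinfo listing layout with a reference volume.
# listing_unchanged is a hypothetical helper. It reads a listing on standard
# input and succeeds only if its SHA1 equals the hash recorded earlier.
listing_unchanged() {
    expected=$1
    actual=$(sha1sum | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Possible use (image path and hash are placeholders):
#   isoinfo -l -i /images/reference.iso | listing_unchanged "$KNOWN_HASH" ||
#       echo "isoinfo output changed, existing hash numbers are invalid" >&2
```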

The way the hash numbers are currently implemented in the interface, only minor changes are necessary to use it with automated libraries, either disk or tape. You can easily replace the hash number with the cartridge label or the slot number (if you lack a barcode reader).

I suggested accessing the external media through a symbolic link. The media does not need to be present for the link to be read. Furthermore, a link does not take much space and can carry a lot of information in its filename. I think there is little room for improvement here.


©2004 Andreas Neuper