Sunday, April 27, 2008

MD5SUM reorganization revisited

I was running my shell script on a collection of over 10,000 files, and it had been going for over three days. The shell interpreter is just way too slow for this amount of work. So I rewrote the script in Perl, and it is now lightning fast because Perl hashes (associative arrays) let me look up each checksum directly instead of rescanning the list for every file.
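The core of the speedup is replacing a linear scan of the checksum list with a single hash lookup per file. A minimal sketch of that idea (the data here is made up for illustration; the key is the MD5 of the one-character string "a"):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch (hypothetical data): keying a hash by checksum turns
# "scan the whole checksum list for this sum" into one direct lookup.
my %sums = (
    "0cc175b9c0f1b6a831c399e269772661" => "docs/a.txt",
);

my $needle = "0cc175b9c0f1b6a831c399e269772661";
if (exists $sums{$needle}) {
    print "found: $sums{$needle}\n";
} else {
    print "not listed\n";
}
```

The full script below builds exactly this kind of hash from the md5sum output file.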



#!/usr/bin/perl

#
# This file takes a checksum file (output from md5sum utility)
# and attempts to reorganize the files in the directory to
# match the listing in the md5 file.
#
# Files not found in the md5 input are left alone.
# Files already in the right place are left alone.
# Other files have their checksums computed and, if they are found
# in the md5 input, they are moved to the appropriate location.
#
# WARNING: It confuses duplicate files!!!
#
use File::Basename;

if ( $#ARGV < 0 ) {
    die("Usage: $0 <checksum file> [<reorg path>]\n");
}
my $sumsFile = $ARGV[0];
my $reorgPath = $ARGV[1];
$reorgPath = "." unless defined($reorgPath);

my %sums;
open(SUMSFILE, $sumsFile) or die("Cannot open $sumsFile: $!\n");
my @lines = <SUMSFILE>;

close(SUMSFILE);

foreach my $line (@lines) {
    chomp($line);
    $sums{substr($line,0,32)} = substr($line,34);
}

print "Read in ".($#lines+1)." checksums and paths.\n";

&reorg($reorgPath);

sub reorg {
    my $dir = shift;
    #print "Recurring for $dir\n";

    opendir DIR, $dir or return;
    my @contents =
        map "$dir/$_",
        sort grep !/^\.\.?$/,
        readdir DIR;
    closedir DIR;
    foreach my $file (@contents) {
        #print "Considering $file\n";
        if ( -d $file ) {
            &reorg($file);
        } else {
            # my @args = ("md5sum", $file);
            # system(@args) == 0 or die("System @args failed: $?\n");
            $tmpLine = `md5sum "$file"`;
            chomp($tmpLine);
            $tmpHash = substr($tmpLine,0,32);
            $tmpPath = substr($tmpLine,34);

            #print "$tmpHash -> $tmpPath\n";
            # Now look up the hash; if the paths differ, move the file
            if (defined $sums{$tmpHash}) {
                #print $sums{$tmpHash};
                if ($tmpPath ne $sums{$tmpHash}) {
                    print "Moving ".$tmpPath." to ".$sums{$tmpHash}."\n";

                    # Make the target directory if it doesn't exist
                    @args = ("mkdir", "-p", dirname($sums{$tmpHash}));
                    system(@args) == 0 or print STDERR "Couldn't create directory @args: $!\n";
                    # Move the file to the path listed in the checksum file
                    @args = ("mv", $tmpPath, $sums{$tmpHash});
                    system(@args) == 0 or print STDERR "Couldn't move $tmpPath: $!\n";
                } else {
                    print "File $tmpPath is already in place.\n";
                }
            } else {
                print "No hash for $tmpPath found.\n";
            }
        }
    }
    return;
}

exit;


A few things still need to be fixed: inserting a new checksum into the hash table should fail if the key is already present (a hash collision or, far more likely, duplicate files). Also, a reverse lookup in the hash table could save having to compute the MD5 sum for files that are already in place, but at what computational cost?
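The duplicate check could look something like this sketch, which reuses the same substr-based parsing as the script above (the sample lines are made up; both use the MD5 of the empty file, so the second one is rejected):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: refuse duplicate checksums while loading the
# checksum file, instead of silently overwriting the earlier path.
my @lines = (
    "d41d8cd98f00b204e9800998ecf8427e  empty/a.txt",
    "d41d8cd98f00b204e9800998ecf8427e  empty/b.txt",
);

my %sums;
foreach my $line (@lines) {
    chomp($line);
    my $hash = substr($line, 0, 32);
    my $path = substr($line, 34);
    if (exists $sums{$hash}) {
        warn "Duplicate checksum $hash: keeping $sums{$hash}, skipping $path\n";
        next;
    }
    $sums{$hash} = $path;
}

print scalar(keys %sums), " unique checksum(s) loaded.\n";
```

For the reverse lookup, a second hash keyed by path (built in the same loop) would answer "is this file already where the listing says?" without shelling out to md5sum at all, trading some memory for the fork/exec and hashing cost.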