Finding the Biggest File

in Files & Directories, Perl Tips
by William Ward on July 29, 2008 4:28 pm

How do I find the biggest files under a directory? There are many ways to do this, but it isn’t always as easy as it sounds.

First of all, if the directory has no sub-directories, it’s easy: just list the files sorted by size, which any operating system can do. But if there are sub-directories, or if you’re talking about the entire filesystem, it’s not so easy. Here’s a way to do it using Perl:

The Unix command “find” can recursively scan all the directories under a given point and perform some action on each file. It has a lot of options, and combined with commands such as “ls” and “sort” it can find the biggest files, but it is not trivial to get right. It would be nice to do this within a program, so that we have the full power of Perl to work with as we scan these files. For this reason, the “File::Find” module was created. It ships with Perl, so you already have it on your system. And to make it easier to convert a Unix “find” command to a “File::Find” program, the script “find2perl” is included with Perl as well.

To get started, use the “find2perl” command to create the Perl script that scans the files:

% find2perl . -type f -print > find_biggest.pl

The file “find_biggest.pl” is created with a Perl script that just displays the files found. If we’re going to find the biggest files, we need to store the file sizes in some kind of data structure so that we can sort them by size. What I suggest is putting them into a hash with the filenames as keys and sizes as values. The “find_biggest.pl” script looks something like this so far:

#! /usr/bin/perl -w
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
if 0; #$running_under_some_shell

use strict;
use File::Find ();

# Set the variable $File::Find::dont_use_nlink if you're using AFS,
# since AFS cheats.

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name = *File::Find::name;
*dir = *File::Find::dir;
*prune = *File::Find::prune;

sub wanted;

# Traverse desired filesystems
File::Find::find({wanted => \&wanted}, '.');
exit;

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    print("$name\n");
}

The important parts to look at are the call to File::Find::find() and the subroutine wanted. File::Find will execute the subroutine once for each file that it finds. So what we need to do is modify the subroutine to record each file’s name and size, and then, after File::Find::find returns, sort the files.
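To make the callback mechanism concrete, here is a minimal sketch that collects the path of every plain file under a starting directory. The helper name list_files is made up for this example; it is not part of File::Find.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find ();

# Collect the path of every plain file under $top.
# list_files is a hypothetical helper name, not part of File::Find.
sub list_files {
    my ($top) = @_;
    my @found;
    File::Find::find(
        {
            wanted => sub {
                # File::Find has chdir()ed into the current directory, so $_
                # is the bare file name and $File::Find::name is the path
                # including $top. lstat() does not follow symlinks, and "_"
                # reuses the result of that lstat() call.
                push @found, $File::Find::name if lstat($_) && -f _;
            },
        },
        $top,
    );
    return @found;
}

print "$_\n" for list_files('.');
```

The anonymous subroutine plays the same role as wanted in the generated script: it runs once per directory entry, and anything it pushes onto @found is still there after find returns.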

First, we change wanted to store the filenames in a hash rather than print them. Change it to this:

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    ($size{$name} = -s _);
}

Since we’re introducing a new variable %size we need to declare it: add “my %size;” just before calling File::Find::find. Then we need to sort the keys of that hash according to the values. So, after calling File::Find::find but before exit, we want to do the following:

my @files = sort { $size{$a} <=> $size{$b} } keys %size;
print "Biggest file: $files[-1] (Size: $size{$files[-1]} bytes)\n";
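If you want more than the single biggest file, the same hash can be sorted in descending order and trimmed to the top few entries. The following is a sketch along those lines, not the downloadable script; the helper name biggest_files and the count of 5 are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find ();

# Return up to $count [path, size] pairs under $top, largest first.
# biggest_files is a hypothetical helper name used only for this sketch.
sub biggest_files {
    my ($top, $count) = @_;
    my %size;
    File::Find::find(
        sub {
            # lstat() so that symlinks are not followed; "_" reuses its result.
            # -s yields a false value for empty files, so store 0 in that case.
            $size{$File::Find::name} = (-s _) || 0 if lstat($_) && -f _;
        },
        $top,
    );
    # Numeric sort by size, descending; ties are broken by name.
    my @sorted = sort { $size{$b} <=> $size{$a} or $a cmp $b } keys %size;
    $#sorted = $count - 1 if @sorted > $count;   # keep only the first $count
    return map { [ $_, $size{$_} ] } @sorted;
}

for my $entry (biggest_files('.', 5)) {
    printf "%10d  %s\n", $entry->[1], $entry->[0];
}
```

Swapping $a and $b in the comparison is what turns the ascending sort from the article into a descending one, so the biggest file comes first instead of last.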

You can download the final script here.

A Better Way to Slurp

in Files & Directories, Perl Tips
by William Ward on July 22, 2008 7:52 am

In an earlier entry (was it really six years ago?) I talked about using $/ and the -0 command-line option to Perl to change the input delimiter. But there’s another way to read in “slurp” mode that isn’t described there: the File::Slurp Perl module.

File::Slurp provides a function read_file which, given a filename, returns its contents as a single string if called in scalar context (in list context, it returns a list of lines, as defined by whatever delimiter $/ is set to). It’s basically the same thing as setting $/ to undef and reading, but contained in a subroutine. (Setting $/ to the empty string is not the same: that selects paragraph mode, which splits the input on blank lines.)
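For comparison, here is roughly what such a subroutine looks like when written by hand with only core Perl. This is a sketch of the idea, not File::Slurp’s actual implementation, and the name slurp is made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);   # core module, used only to create a scratch file

# slurp is a hypothetical name for this sketch; it is not File::Slurp's code.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # List context: return the lines, split on the caller's current $/.
    return <$fh> if wantarray;
    # Scalar context: undefine $/ locally so a single read returns everything.
    local $/;
    return scalar <$fh>;
}

# Write a small scratch file, then read it back both ways.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "one\ntwo\nthree\n";
close $out or die "Cannot close $path: $!";

my $text  = slurp($path);   # whole file as one string
my @lines = slurp($path);   # one element per line
printf "%d characters, %d lines\n", length($text), scalar(@lines);
```

Using local means $/ goes back to its previous value as soon as the subroutine returns, so the rest of the program is not affected.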

There is also a subroutine write_file, which lets you “spew” the contents of a string into a file. It saves you a few lines of code: open, print, and close. The same function is also available under the name overwrite_file, and append_file adds to a file rather than overwriting it.

Finally, read_dir lets you get the contents of a directory in one go, which is a lot more convenient than using opendir/readdir.
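A short round trip through these functions might look like the sketch below. It assumes File::Slurp has been installed from CPAN (it does not ship with Perl); the scratch directory and the file name notes.txt are made up for the example.

```perl
use strict;
use warnings;
use File::Slurp;              # CPAN module; assumed to be installed
use File::Temp qw(tempdir);   # core module, used only for a scratch directory

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/notes.txt";  # hypothetical file name

write_file($file, "first line\n");      # create or overwrite the file
append_file($file, "second line\n");    # add to the end of the file

my $content = read_file($file);   # scalar context: the whole file as one string
my @rows    = read_file($file);   # list context: one element per line

my @entries = read_dir($dir);     # directory entries, without "." and ".."

print "Read ", scalar(@rows), " lines; directory contains: @entries\n";
```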