Archive for the ‘Files & Directories’ Category

Searching files with multi-line entries

Monday, October 20th, 2008

Say that you have a file that looks something like this:

2008-01-02: first entry
2008-02-03: second entry on two lines
    here is the additional line
2008-03-04: third entry
   has
   three
   extra lines
2008-04-05: fourth entry has just one line again

If you need to find all entries that have “line” in the text, and display the entire entry when found, you can’t just search line by line, because that prints only the lines that match rather than whole entries. It works for the first and fourth entries, which fit on one line. The second entry comes out complete only because both of its lines happen to contain “line”, and in the third entry the word “line” is on the fourth line, so you’d miss the first three.

What you need to do in a case like this is read line-by-line, but only process an entry once you’ve found the end of the entry. There are two ways to solve this, depending on your data and what your needs are:

  1. If the file is not very large (and never will be), and you need to do the search multiple times, then you could load the entire file into memory as an array of entries, and then search that array using grep or foreach.
  2. If the file is very large, or you only need to scan through it once to find one result, then just load each entry into a string, and display that string if it matches.

First I’ll show how to load the entire file since I think it’s easier to understand:

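# Assumes the filehandle IN is already open on the data file.
my @stuff;
while (<IN>) {
    # continuation line: append to the last entry (if there is one)
    if (/^\s/ && @stuff) { $stuff[-1] .= $_; }
    # anything else starts a new entry
    else { push @stuff, $_; }
}
print grep { /line/ } @stuff;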

If the line begins with whitespace, then it’s a continuation line, so modify the previous entry found (the last item of the array, using index -1) to add the text to it. If the line doesn’t begin with whitespace, it’s a new entry, so push it onto the end of the array. Once the entire file is read, each element in @stuff corresponds to one record, including any extra lines, so it’s easy to scan using grep to find what you need.
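To try this without a data file, here is a minimal self-contained sketch that reads the sample entries from an in-memory filehandle. The names $data, $in, @entries and @matches are illustrative and not part of the snippet above.

```perl
use strict;
use warnings;

# Sample data, read through an in-memory filehandle
# (a stand-in for a real file on disk).
my $data = <<'END';
2008-01-02: first entry
2008-02-03: second entry on two lines
    here is the additional line
2008-03-04: third entry
   has
   three
   extra lines
2008-04-05: fourth entry has just one line again
END

open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @entries;
while (my $line = <$in>) {
    if ($line =~ /^\s/ && @entries) {
        $entries[-1] .= $line;      # continuation: extend the last entry
    }
    else {
        push @entries, $line;       # start of a new entry
    }
}
close $in;

my @matches = grep { /line/ } @entries;
printf "%d entries, %d match\n", scalar @entries, scalar @matches;
print @matches;
```

Because the whole file is now held as one array element per entry, further searches are just more calls to grep on @entries.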

The second approach involves using a scalar, rather than an array, to build up each record. When the next new record starts, or end of file is reached, we check to see if the record we’ve just read matches the pattern:

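# Assumes the filehandle IN is already open on the data file.
my $last_entry = '';
while (<IN>) {
    if (/^\s/) {
        # continuation line: add it to the current entry
        $last_entry .= $_;
    }
    else {
        # a new entry starts, so the previous one is complete
        print $last_entry if $last_entry =~ /line/;
        $last_entry = $_;
    }
}
# end of file: check the final entry, which nothing has printed yet
print $last_entry if $last_entry =~ /line/;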
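The same idea can be wrapped in a reusable subroutine. This is only a sketch: search_entries, its arguments and the sample data are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback on every multi-line entry read from
# $fh that matches $pattern. Only one entry is held in memory at a time.
sub search_entries {
    my ($fh, $pattern, $callback) = @_;
    my $entry;
    while (my $line = <$fh>) {
        if ($line =~ /^\s/ && defined $entry) {
            $entry .= $line;                # continuation line
            next;
        }
        # A new entry starts here, so the previous one is complete.
        $callback->($entry) if defined $entry && $entry =~ $pattern;
        $entry = $line;
    }
    # End of file: the final entry has not been checked yet.
    $callback->($entry) if defined $entry && $entry =~ $pattern;
    return;
}

my $data = "a: no match\nb: spans\n  two lines\nc: last line\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @found;
search_entries($in, qr/line/, sub { push @found, $_[0] });
close $in;
print @found;
```

Passing a callback keeps the reading logic separate from what is done with each match, so the caller can print, count, or stop collecting as needed.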

Finding the Biggest File

Tuesday, July 29th, 2008

How do I find the biggest files under a directory? There are many ways to do this, but it isn’t always as easy as it sounds.

First of all, if the directory has no sub-directories, it’s easy: just list the files sorted by size, which any operating system can do. But if there are sub-directories, or if you’re talking about the entire filesystem, it’s not so easy. Here’s a way to do it using Perl.

The Unix command “find” can recursively scan all the directories under a given point and perform some action on each file. It has a lot of options, and combined with commands such as “ls” and “sort” it can find the biggest files, but getting that right is not trivial. It would be nice to do this within a program, so that we have the full power of Perl to work with as we scan these files. The “File::Find” module was created for this reason. It ships with Perl, so you already have it on your system. And to make it easier to convert a Unix “find” command to a “File::Find” program, the script “find2perl” is included with Perl as well (since Perl 5.22 it is distributed separately on CPAN as App::find2perl).

To get started, use the “find2perl” command to create the Perl script that scans the files:

% find2perl . -type f -print > find_biggest.pl

The file “find_biggest.pl” is created with a Perl script that just displays the files found. If we’re going to find the biggest files, we need to store the file sizes in some kind of data structure so that we can sort them by size. What I suggest is putting them into a hash with the filenames as keys and sizes as values. The “find_biggest.pl” script looks something like this so far:

#! /usr/bin/perl -w
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0; #$running_under_some_shell

use strict;
use File::Find ();

# Set the variable $File::Find::dont_use_nlink if you're using AFS,
# since AFS cheats.

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name = *File::Find::name;
*dir = *File::Find::dir;
*prune = *File::Find::prune;

sub wanted;

# Traverse desired filesystems
File::Find::find({wanted => \&wanted}, '.');
exit;

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    print("$name\n");
}

The important parts to look at are the call to File::Find::find() and the subroutine wanted. File::Find executes the subroutine once for each file and directory that it finds. So we need to modify the subroutine to record the file names and sizes, and then sort the files after File::Find::find returns.

First, we change wanted to store the filenames in a hash rather than print them. Change it to this:

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    ($size{$name} = -s _);
}

Since we’re introducing a new variable %size we need to declare it: add “my %size;” just before calling File::Find::find. Then we need to sort the keys of that hash according to the values. So, after calling File::Find::find but before exit, we want to do the following:

my @files = sort { $size{$a} <=> $size{$b} } keys %size;
print "Biggest file: $files[-1] (Size: $size{$files[-1]} bytes)\n";
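Putting the pieces together without the find2perl scaffolding, the same approach might look like the sketch below. It is not the downloadable script. The subroutine biggest_files, its arguments and the choice of the ten largest files are assumptions made for illustration, and the size is taken from lstat directly so that empty files are recorded as 0.

```perl
use strict;
use warnings;
use File::Find;

# Illustrative helper: returns up to $count [path, size] pairs for the
# largest plain files under $top, biggest first.
sub biggest_files {
    my ($top, $count) = @_;
    my %size;
    find(sub {
        # lstat does not follow symbolic links; "_" reuses its result
        my @st = lstat($_) or return;
        return unless -f _;
        $size{$File::Find::name} = $st[7];      # element 7 is the size in bytes
    }, $top);
    # Sort by size, largest first; break ties by name for a stable order.
    my @sorted = sort { $size{$b} <=> $size{$a} || $a cmp $b } keys %size;
    $#sorted = $count - 1 if @sorted > $count;  # keep only the first $count
    return map { [ $_, $size{$_} ] } @sorted;
}

my $top = @ARGV ? $ARGV[0] : '.';
printf "%12d  %s\n", $_->[1], $_->[0] for biggest_files($top, 10);
```

Run it with a directory as its argument, or with none to scan the current directory.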

You can download the final script here.

A Better Way to Slurp

Tuesday, July 22nd, 2008

In an earlier entry (was it really six years ago?) I talked about the usage of $/ and the -0 command-line option to Perl to change the input delimiter. But there’s another way to read in “slurp” mode that isn’t described there: the File::Slurp Perl module.

File::Slurp provides a function read_file which, given a filename, returns its contents as a single string if called in scalar context (in list context, it returns a list of lines, as defined by whatever delimiter $/ is set to). It’s basically the same thing as setting $/ to undef and reading, but contained in a subroutine.

There is also a subroutine write_file, which lets you “spew” the contents of a string into a file. It saves you a few lines of code: open, print, and close. The same function is also available under the name overwrite_file, and append_file adds to a file rather than overwriting it.

Finally, read_dir lets you get the contents of a directory in one go, which is a lot more convenient than using opendir/readdir.
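A short sketch of these four functions working together, assuming File::Slurp is installed from CPAN (it is not part of core Perl). The scratch directory and the file name notes.txt are illustrative.

```perl
use strict;
use warnings;
use File::Slurp qw(read_file write_file append_file read_dir);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);      # scratch directory for the demo
my $file = "$dir/notes.txt";           # illustrative file name

write_file($file, "first line\n");     # create or overwrite
append_file($file, "second line\n");   # add to the end

my $text  = read_file($file);          # scalar context: one string
my @lines = read_file($file);          # list context: one element per line

my @names = read_dir($dir);            # entries other than "." and ".."

print "Read ", length($text), " characters in ", scalar(@lines), " lines\n";
print "Directory contains: @names\n";
```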

Platform-Specific Perl

Monday, October 2nd, 2006

Because Perl is an interpreted language, Perl scripts can generally be run unmodified on any platform. But there are situations where the differences between platforms make it necessary to test which platform you are running on and act accordingly. (more…)
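One common way to test the platform is the built-in variable $^O, which holds the name of the operating system Perl was built for. The following is a minimal sketch and not necessarily the approach the full article takes. The helper list_command is hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Illustrative helper: map the OS name in $^O to the command that
# lists a directory on that platform.
sub list_command {
    my ($os) = @_;
    return $os eq 'MSWin32' ? 'dir' : 'ls';
}

print "Running on $^O; directory listing command: ", list_command($^O), "\n";

# Where possible, let a core module hide the difference instead of testing $^O.
print "Portable path: ", File::Spec->catfile('data', 'report.txt'), "\n";
```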

Finding the Largest File in a Directory

Tuesday, February 22nd, 2005

Here’s an easy way to find the largest file in a directory.

(more…)
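For a single directory with no recursion, one possible sketch uses opendir and readdir with the -f and -s file tests. The subroutine largest_file is a hypothetical name, and this is not necessarily the method the full article describes.

```perl
use strict;
use warnings;

# Illustrative helper: returns (name, size) of the largest plain file
# directly inside $dir, or an empty list if there are none.
sub largest_file {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my ($best, $best_size);
    while (defined(my $name = readdir $dh)) {
        my $path = "$dir/$name";
        next unless -f $path;           # skip directories and special files
        my $size = -s _ || 0;           # "_" reuses the stat done by -f
        ($best, $best_size) = ($name, $size)
            if !defined $best_size || $size > $best_size;
    }
    closedir $dh;
    return defined $best ? ($best, $best_size) : ();
}

my $dir = @ARGV ? $ARGV[0] : '.';
if (my ($name, $size) = largest_file($dir)) {
    print "Largest file in $dir: $name ($size bytes)\n";
}
else {
    print "No plain files in $dir\n";
}
```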

Modifying a File Without Changing Its Timestamp

Monday, October 25th, 2004

Your file system keeps track of when each file was last modified. But have you ever wanted to edit a file without affecting its timestamp? Using the “utime” function, which is built into Perl, you can! Here’s how:

(more…)
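The general idea can be sketched as follows: record the access and modification times with stat, change the file, then put the times back with utime. The helper edit_keeping_timestamp and the appended text are illustrative and are not taken from the full article.

```perl
use strict;
use warnings;

# Illustrative helper: runs $edit (a code ref that changes the file)
# and then restores the file's original access and modification times.
sub edit_keeping_timestamp {
    my ($file, $edit) = @_;
    # Elements 8 and 9 of stat's result are atime and mtime.
    my ($atime, $mtime) = (stat $file)[8, 9]
        or die "Cannot stat $file: $!";
    $edit->($file);
    utime $atime, $mtime, $file
        or die "Cannot restore timestamps on $file: $!";
    return;
}

# Usage: append a line to the file named on the command line, if any.
my $file = shift @ARGV;
if (defined $file) {
    edit_keeping_timestamp($file, sub {
        open my $fh, '>>', $_[0] or die "Cannot append to $_[0]: $!";
        print {$fh} "# edited\n";
        close $fh or die "Cannot close $_[0]: $!";
    });
}
```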

Writing Configuration Files in Perl

Thursday, July 17th, 2003

It is often useful to have a configuration file for a program, where you can specify certain variables that are used in the program. Examples of configuration parameters include the files, email addresses, usernames, or passwords the program uses. If your Perl program needs to read a configuration file, there are lots of ways to do it.

(more…)
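As one of those many ways, here is a minimal sketch that parses simple “key = value” lines into a hash. The file format, the subroutine read_config and the sample settings are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative parser for "key = value" lines; blank lines and lines
# starting with "#" are ignored.
sub read_config {
    my ($fh) = @_;
    my %config;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;         # comment or blank line
        my ($key, $value) = $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/
            or die "Bad configuration line $.: $line\n";
        $config{$key} = $value;
    }
    return %config;
}

# Sample configuration, read through an in-memory filehandle.
my $sample = <<'END';
# mail settings
admin_email = admin@example.com
log_file    = /var/log/myapp.log
END

open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my %config = read_config($in);
close $in;
print "$_ = $config{$_}\n" for sort keys %config;
```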

Input Delimiter

Monday, July 29th, 2002

Normally, reading from a file is done one line at a time. But sometimes that is not very convenient. What if you want to read in text one paragraph at a time? Or maybe your data is separated by TAB characters rather than newline characters?

(more…)
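Both cases come down to the input record separator $/. A small sketch, using a hypothetical helper read_records and in-memory sample strings:

```perl
use strict;
use warnings;

# Illustrative helper: reads all records from the string $text using
# $separator as the input record separator ($/).
sub read_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;     # restored automatically when the sub returns
    my @records = <$fh>;
    close $fh;
    chomp @records;            # chomp strips the separator currently in $/
    return @records;
}

my @paragraphs = read_records("one\ntwo\n\nthree\n", "");     # paragraph mode
my @fields     = read_records("a\tb\tc",             "\t");   # TAB-separated
my @whole      = read_records("one\ntwo\n",          undef);  # slurp mode

printf "%d paragraphs, %d fields, %d slurped string\n",
    scalar @paragraphs, scalar @fields, scalar @whole;
```

Using local keeps the change to $/ confined to the subroutine, so the rest of the program still reads line by line.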