The web comic xkcd once again points out the importance of Perl:
Perl in xkcd again
December 19th, 2008
A nice article from a recent student
October 30th, 2008
At the end of each class, we always ask our students to fill out an evaluation form. There are two reasons for this: to find areas where we need to improve, and in the hope that they’ll write some kind words that we can quote on the site’s testimonials page. A recent student, David “Zonker” Harris, took it one step further and wrote a glowing blog entry about his experiences. Thanks, Zonker!
Searching files with multi-line entries
October 20th, 2008
Say that you have a file that looks something like this:
2008-01-02: first entry
2008-02-03: second entry on two lines
    here is the additional line
2008-03-04: third entry
    has
    three
    extra lines
2008-04-05: fourth entry has just one on line again
If you need to search for all entries that have “line” in the text, and display the entire entry when found, you can’t just search line-by-line — that would work for the first and fourth entries, but the second entry would miss the additional line, and in the third entry the word “line” is on the fourth line so you’d miss the first three.
What you need to do in a case like this is read line-by-line, but only process an entry once you’ve found the end of the entry. There are two ways to solve this, depending on your data and what your needs are:
- If the file is not very large (and never will be), and you need to do the search multiple times, then you could load the entire file into memory as an array of entries, and then search that array using grep or foreach.
- If the file is very large, or you only need to scan through it once to find one result, then just load each entry into a string, and display that string if it matches.
First I’ll show how to load the entire file since I think it’s easier to understand:
my @stuff;
while (<IN>) {
    if (/^\s/) { $stuff[-1] .= $_; }
    else       { push @stuff, $_; }
}
print grep { /line/ } @stuff;
If the line begins with space, then it’s a continuation line, so modify the previous entry found (the last item of the array, using index -1) to add the text to it. If the line doesn’t begin with space, it’s a new entry so push it onto the end of the array. Once the entire file is read, each element in @stuff would correspond to one record, including the multiple extra lines, so it’s easy to scan using grep to find what you need.
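Here is a minimal, self-contained sketch of the same idea. It assumes the sample data shown earlier, with continuation lines indented, and reads it from an in-memory filehandle so that no external file is needed. The variable names ($data, @entries, @matches) are mine, and the extra check on @entries only guards against a file whose very first line is indented.

```perl
use strict;
use warnings;

# Sample data (assumed): continuation lines start with whitespace.
my $data = <<'END';
2008-01-02: first entry
2008-02-03: second entry on two lines
    here is the additional line
2008-03-04: third entry
    has
    three
    extra lines
2008-04-05: fourth entry has just one on line again
END

# Read from an in-memory filehandle so the sketch needs no external file.
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @entries;
while (my $line = <$in>) {
    if ($line =~ /^\s/ && @entries) {
        $entries[-1] .= $line;    # continuation: extend the last entry
    }
    else {
        push @entries, $line;     # anything else starts a new entry
    }
}
close $in;

# Each element is now one complete entry, so grep sees whole records.
my @matches = grep { /line/ } @entries;
print @matches;
```

Because every element of @entries holds a whole record, the same array can be searched again with a different pattern without re-reading the file.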
The second approach involves using a scalar, rather than an array, to build up each record. When the next new record starts, or end of file is reached, we check to see if the record we’ve just read matches the pattern:
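my $last_entry = '';
while (<IN>) {
    if (/^\s/) {
        # continuation line: add it to the entry being built
        $last_entry .= $_;
    }
    else {
        # a new entry starts, so the previous one is complete
        print $last_entry if $last_entry =~ /line/;
        $last_entry = $_;
    }
    # at end of file the final entry is complete too
    print $last_entry if eof(IN) && $last_entry =~ /line/;
}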
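The same streaming idea can be wrapped in a subroutine so that it works with any filehandle and any pattern. This is only a sketch: the name matching_entries and the sample string are made up for illustration, and it collects the matching entries in a list instead of printing them as it goes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every entry read from $fh that matches $re.
# Only the entry currently being built (plus the matches) is held in memory.
sub matching_entries {
    my ($fh, $re) = @_;
    my (@found, $entry);

    while (my $line = <$fh>) {
        if ($line =~ /^\s/ && defined $entry) {
            $entry .= $line;    # continuation line
            next;
        }
        # A new entry starts here, so the previous one is complete.
        push @found, $entry if defined $entry && $entry =~ $re;
        $entry = $line;
    }
    # End of file: the final entry is complete too.
    push @found, $entry if defined $entry && $entry =~ $re;

    return @found;
}

# Assumed sample data, read through an in-memory filehandle.
my $data = "a: one\n  line here\nb: two\nc: three line\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
print matching_entries($in, qr/line/);
close $in;
```

Checking the final entry after the loop, rather than calling eof on every pass, keeps the end-of-file handling in one place.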
Sorting in Reverse Order
August 8th, 2008
Say you have an array of names: @names=qw(Tom Dick Harry);
If you wanted to sort these, you could just use a simple sort() call: @sorted=sort(@names); By default that sorts in standard string (alphabetical) order. No sort criterion is given, but you get the same result from the longer form of the call: @sorted=sort {$a cmp $b} @names;
Here, $a and $b are special variables that hold the two values from @names being compared, to decide which should come first in the sort order. The cmp operator returns 1 if $a sorts after $b as a string, -1 if $a sorts before $b, and 0 if they are equal. By changing the block that follows the sort keyword you can change the order that sort produces.
If you wanted to sort in reverse order, you could just use Perl's reverse() function: @revsort=reverse(@sorted); or, in one statement, @revsort=reverse sort @names; In older versions of Perl this was inefficient, because it built a temporary copy of the sorted list before reversing it, which could get expensive for a large array (Perl 5.10 and later optimize reverse sort to avoid the intermediate list). Another way is to change the sort criterion itself to produce the reverse result: @revsort=sort { $b cmp $a } @names; Now the values returned by cmp are the opposite of what they were above, and so the sort order is the opposite.
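As a quick illustration, here are the forward and reverse sorts side by side. The numeric example is my own addition: for numbers you would normally swap cmp for the <=> operator, since a string comparison puts 10 before 9.

```perl
use strict;
use warnings;

my @names = qw(Tom Dick Harry);

my @sorted  = sort { $a cmp $b } @names;    # ascending string order
my @revsort = sort { $b cmp $a } @names;    # swap $a and $b to reverse

# For numbers, use <=> instead of cmp; cmp would put 10 before 9.
my @sizes      = (10, 9, 100);
my @descending = sort { $b <=> $a } @sizes;

print "@sorted\n";        # Dick Harry Tom
print "@revsort\n";       # Tom Harry Dick
print "@descending\n";    # 100 10 9
```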
Finding the Biggest File
July 29th, 2008
How do I find the biggest files under a directory? There are many ways to do this, but it isn’t always as easy as it sounds.
First of all, if the directory has no sub-directories, it’s easy. Just list the files sorted by size, which any operating system can do. But if there are sub-directories, or if you’re talking about the entire filesystem, it’s not so easy. Here’s a way to do it using Perl:
The Unix command “find” can recursively scan all the directories under a given point and perform some action on each file. It has a lot of options, and combined with commands such as “ls” and “sort” it can do the job, but it is not trivial to get right. It would be nice to do this within a program, so that we have the full power of Perl to work with as we scan these files. For this reason, the “File::Find” module was created. It ships with Perl, so you already have it on your system. And to make it easier to convert a Unix “find” command to a “File::Find” program, the script “find2perl” is included with Perl as well (up to Perl 5.20; since Perl 5.22 it has been distributed separately on CPAN as App::find2perl).
To get started, use the “find2perl” command to create the Perl script that scans the files:
% find2perl . -type f -print > find_biggest.pl
The file “find_biggest.pl” is created with a Perl script that just displays the files found. If we’re going to find the biggest files, we need to store the file sizes in some kind of data structure so that we can sort them by size. What I suggest is putting them into a hash with the filenames as keys and sizes as values. The “find_biggest.pl” script looks something like this so far:
#! /usr/bin/perl -w
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0; #$running_under_some_shell

use strict;
use File::Find ();

# Set the variable $File::Find::dont_use_nlink if you're using AFS,
# since AFS cheats.

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name   = *File::Find::name;
*dir    = *File::Find::dir;
*prune  = *File::Find::prune;

sub wanted;

# Traverse desired filesystems
File::Find::find({wanted => \&wanted}, '.');
exit;

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    print("$name\n");
}
The important parts to look at are the call to File::Find::find() and the subroutine wanted. File::Find will execute the subroutine once for each file that it finds. So what we need to do is modify the subroutine to record the file names, and then after File::Find::find exits, sort the files.
First, we change wanted to store the filenames in a hash rather than print them. Change it to this:
sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    ($size{$name} = -s _);
}
Since we’re introducing a new variable %size we need to declare it: add “my %size;” just before calling File::Find::find. Then we need to sort the keys of that hash according to the values. So, after calling File::Find::find but before exit, we want to do the following:
my @files = sort { $size{$a} <=> $size{$b} } keys %size;
print "Biggest file: $files[-1] (Size: $size{$files[-1]} bytes)\n";
You can download the final script here.
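If you want more than just the single biggest file, the same technique extends naturally. The following is a sketch only, not the downloadable script: the subroutine name biggest_files and its arguments are hypothetical. It takes the size from the lstat result instead of -s, because -s returns a false value for empty files.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper: returns [name, size] pairs for the $count largest
# plain files under $dir, biggest first.
sub biggest_files {
    my ($dir, $count) = @_;
    my %size;

    File::Find::find(sub {
        my @st = lstat($_) or return;         # skip anything we cannot stat
        return unless -f _;                   # plain files only (reuses the lstat result)
        $size{$File::Find::name} = $st[7];    # element 7 is the size in bytes
    }, $dir);

    # Sort descending by size and keep only the first $count names.
    my @by_size = sort { $size{$b} <=> $size{$a} } keys %size;
    splice @by_size, $count if @by_size > $count;
    return map { [ $_, $size{$_} ] } @by_size;
}

# Show the five largest files under the current directory.
for my $file (biggest_files('.', 5)) {
    printf "%10d  %s\n", $file->[1], $file->[0];
}
```

Sorting with $b before $a puts the largest file first, so the list can simply be cut off after the first few names.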
A Better Way to Slurp
July 22nd, 2008
In an earlier entry (was it really six years ago?) I talked about the usage of $/ and the -0 command-line option to Perl to change the input delimiter. But there’s another way to read in “slurp” mode that isn’t described there, the File::Slurp Perl module.
File::Slurp provides a function read_file which, given a filename, returns its contents as a single string if called in scalar context (in list context, it returns a list of lines, as defined by whatever delimiter $/ is set to). It’s basically the same thing as setting $/ to undef and reading, but contained in a subroutine.
There is also a subroutine write_file, which lets you “spew” the contents of a string into a file. It saves you a few lines of code: open, print, and close. It can also be called as overwrite_file as a synonym, or you can call append_file to add to rather than overwrite a file.
Finally, read_dir lets you get the contents of a directory in one go, which is a lot more convenient than using opendir/readdir.
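For comparison, here is roughly what the slurp and spew operations look like when written by hand with core Perl only. This is a sketch, not File::Slurp's actual implementation, and the names slurp_file and spew_file are hypothetical. With the module installed, read_file and write_file replace each of these subroutines with a single call.

```perl
use strict;
use warnings;

# Hypothetical helper, roughly what a scalar-context read_file call does.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;              # undef input separator: read to end of file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper for the opposite direction: "spew" a string to a file.
sub spew_file {
    my ($path, $content) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Usage (the file name is only an example):
#   spew_file('notes.txt', "line 1\nline 2\n");
#   my $text = slurp_file('notes.txt');
```

Using local on $/ restores the normal line-by-line behaviour as soon as the subroutine returns, so other reads in the program are not affected.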
Platform-Specific Perl
October 2nd, 2006
Because Perl is an interpreted language, Perl scripts can generally be run unmodified on any platform. But there are situations where the differences between platforms make it necessary to test what platform you are running on and act accordingly. Read the rest of this entry »
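As a small sketch of the kind of test involved, Perl's built-in $^O variable holds the name of the operating system the interpreter was built for (for example “MSWin32” or “linux”). The helper list_command and the choice of commands are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: picks a directory-listing command for the given OS
# name (defaults to the platform Perl is running on).
sub list_command {
    my ($os) = @_ ? @_ : ($^O);
    return $os eq 'MSWin32' ? 'dir' : 'ls';
}

print "Running on $^O, would list files with: ", list_command(), "\n";
```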
Ignorance is Bliss – non-memorizing parentheses
April 20th, 2006
One of regular expressions’ most useful features is memorization (also called capturing). To do this, just put parentheses around part of your expression and the part of the string it matches will be memorized:
my($name) = /hello, (\w+)/;
In this example, we look in $_ for the word “hello” followed by a comma, space, and a word. Since the word, \w+, has parentheses around it, the part of the string that it matches gets memorized. In this example, we are assigning the return value of the regular expression match to $name. So if $_ contains “hello, world” then $name gets “world” – very convenient.
But parentheses also do other things besides memorize their contents, and this feature can become annoying. Here’s an example. Read the rest of this entry »
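To illustrate the difference in a few lines (this is my own minimal sketch, not necessarily the example from the full entry): parentheses used only for grouping still memorize, which shifts the position of the value you actually wanted, while (?:...) groups without memorizing.

```perl
use strict;
use warnings;

my $text = 'hello, world';

# Capturing parentheses: the match returns what (\w+) matched.
my ($name) = $text =~ /hello, (\w+)/;

# Parentheses added for alternation capture too, so the name is now second.
my ($greeting, $name2) = $text =~ /(hello|goodbye), (\w+)/;

# (?:...) groups without capturing, so the name is the only capture again.
my ($name3) = $text =~ /(?:hello|goodbye), (\w+)/;

print "$name $greeting $name2 $name3\n";    # world hello world world
```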
Dates in Perl: Hawaiian Vacation Planning
January 5th, 2006
Since we’re starting a new year, let’s look at handling dates in Perl. Let’s say the user enters a date and you want to check whether it falls within a particular range of start and end dates.
In particular, let’s say you want to go to Hawaii and your kids are in school for the spring semester from January 9 through June 2. Your travel agent gives you a list of possible dates when you can go to Hawaii really cheaply, and you want to know which ones conflict with your kids’ school schedule so you can include the budget for a babysitter in the cost of the trip.
Read the rest of this entry »
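One possible way to approach the range check, sketched with the core Time::Local module: convert each date to epoch seconds and compare numbers. The helper names, the YYYY-MM-DD input format, and the year 2006 are all assumptions made for this example.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical helper: turns "YYYY-MM-DD" into epoch seconds at local noon.
sub date_to_epoch {
    my ($date) = @_;
    my ($y, $m, $d) = $date =~ /^(\d{4})-(\d\d)-(\d\d)$/
        or die "Bad date '$date' (expected YYYY-MM-DD)\n";
    return timelocal(0, 0, 12, $d, $m - 1, $y);    # months are 0-based
}

# Hypothetical helper: true if $date lies within [$start, $end], inclusive.
sub in_range {
    my ($date, $start, $end) = @_;
    my $t = date_to_epoch($date);
    return $t >= date_to_epoch($start) && $t <= date_to_epoch($end);
}

# Assumed school term for the example: January 9 through June 2, 2006.
for my $trip (qw(2006-01-05 2006-03-15 2006-06-02 2006-07-04)) {
    my $conflict = in_range($trip, '2006-01-09', '2006-06-02');
    print "$trip: ", ($conflict ? 'needs a babysitter' : 'kids are free'), "\n";
}
```

Using noon rather than midnight keeps the comparison away from daylight-saving transitions, which usually happen in the early hours.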
Finding the Largest File in a Directory
February 22nd, 2005
Here’s an easy way to find the largest file in a directory.
