Regular Expressions to Parse Data Files
Regular expressions are the best way to parse text in Perl. Combined with the hash data structure, they make it easy to build an in-memory structure from data read in from a file.
For example, say you have a file consisting of name/value pairs. Each line is of the form "Name: Value" where a colon separates the two. In addition, there might be extra spaces at the beginning or end of the line, or between the colon and the value (or no space at all). Finally, blank lines or lines beginning with a # sign (comments) are to be skipped. If the same name occurs twice, that is an error. We’ll load the data into a hash.
Here’s one way to do this in Perl:
my %data;
while(<>) {
    next if /^\s*($|#)/;
    my($name, $value) = /^\s*(.+?):\s*(.+?)\s*$/
        or die "Syntax error: $_";
    die "Duplicate entry for $name found: $_"
        if exists $data{$name};
    $data{$name} = $value;
}
Now, let’s take this line by line:
my %data;
Initialize the data structure that we will populate.
while(<>) {
Read the input one line at a time into $_. The empty <> operator reads from the files named on the command line (in @ARGV), or from STDIN if there are no command line arguments.
next if /^\s*($|#)/;
This regular expression means:
- Beginning of line
- Zero or more whitespace characters (spaces, tabs, and so on)
- Either end of line, or a "#" character
If this pattern matches, we skip the rest of the loop and read the next line ("next" in Perl is like "continue" in C, Java, and related languages). Note: While the end of line or # character is memorized, we don’t do anything with the value. We’re only using the parentheses for giving a list of alternate patterns to search for, and don’t care about what text it matched.
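If you would rather not memorize text you never use, Perl's non-capturing group (?:...) groups the alternatives without saving the match. A minimal sketch, using made-up sample lines in place of a real input file:

```perl
use strict;
use warnings;

# Made-up sample lines standing in for a data file.
my @lines = (
    "# a comment\n",
    "   \n",
    "\n",
    "Name: Value\n",
    "  # indented comment\n",
);

# (?:...) groups "end of line" and "#" as alternatives without capturing.
my @kept = grep { !/^\s*(?:$|#)/ } @lines;

print @kept;    # the blank and comment lines are filtered out
```

The skip test behaves the same as the version with plain parentheses; the only difference is that nothing is stored in $1.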
my($name, $value) = /^\s*(.+?):\s*(.+?)\s*$/
    or die "Syntax error: $_";
This is the important part. The regular expression means:
- Beginning of line
- Zero or more whitespace characters
- Memorize one or more characters up to a colon
- A colon
- Zero or more whitespace characters
- Memorize one or more characters up to any trailing whitespace at the end of the line
- Zero or more whitespace characters
- End of the line
The /.+?/ pattern is "non-greedy." It matches the same characters as /.+/ (one or more of any character except \n), but it takes as few of them as it can while still letting the rest of the pattern match. For example, the first one is followed by a colon, so it stops at the first colon on the line, whereas a greedy /.+/ would run on to the last colon. Otherwise, you’d have to do something awkward like /[^:]+/ which means "one or more characters that are not a colon."
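The difference only shows up when a line contains more than one colon. Here is a small sketch comparing the two forms on a made-up input line (the variable names are just for illustration):

```perl
use strict;
use warnings;

# Made-up sample line whose value contains a second colon.
my $line = "url: http://example.com\n";

# Non-greedy name: stops at the first colon.
my ($lazy_name, $lazy_value) = $line =~ /^\s*(.+?):\s*(.+?)\s*$/;

# Greedy name: grabs as much as it can, so it stops at the last colon.
my ($greedy_name, $greedy_value) = $line =~ /^\s*(.+):\s*(.+?)\s*$/;

print "non-greedy: name='$lazy_name' value='$lazy_value'\n";
print "greedy:     name='$greedy_name' value='$greedy_value'\n";
```

With the non-greedy pattern the name is "url" and the value is the whole address; with the greedy one the name swallows everything up to the colon after "http".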
The two memorized parts are returned as a list, since it is being called in a list context. They are assigned to the variables $name and $value. You could also access them using the variables $1 and $2.
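The "or die" part will be triggered only if the pattern fails to match. It will display the line ($_) that it was looking at. Since $_ still ends with its newline character, we don’t need "\n" in the message. That trailing newline also keeps Perl from appending the script name and line number to the message.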
die "Duplicate entry for $name found: $_"
    if exists $data{$name};
Here, we check if we have already seen that name. If so, exit the program and print a useful error message.
$data{$name} = $value;
Now we store the information in the hash. Note that if we were to leave out the second "die" statement, and a duplicate entry were found, Perl would silently replace the earlier value with the new one. This is probably not a good idea, so we have the "die" statement. A more advanced solution might be to have the value of the hash be a reference to a list containing the actual values. This might be done with "push(@{$data{$name}}, $value)" instead of the above line. But that’s a topic for another day.
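For the curious, here is a rough sketch of that list-valued variant. It assumes duplicates should be collected rather than rejected, and it reads from a made-up in-memory sample instead of a real file so it can be run as is:

```perl
use strict;
use warnings;

# Made-up sample data; "color" appears twice on purpose.
my $input = <<'END';
# sample settings
color: red
size: large
color: blue
END

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my %data;
while (<$fh>) {
    next if /^\s*(?:$|#)/;
    my ($name, $value) = /^\s*(.+?):\s*(.+?)\s*$/
        or die "Syntax error: $_";
    # Each hash value is a reference to a list of all values seen.
    push @{ $data{$name} }, $value;
}
close $fh;

for my $name (sort keys %data) {
    print "$name: ", join(', ', @{ $data{$name} }), "\n";
}
```

Perl creates the list reference automatically the first time a name is pushed to, so no separate initialization is needed.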
}
End of the loop. Next you would perform some kind of operation on the hash %data to make use of the information from the file.
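As one possible way to put the pieces together, the sketch below wraps the loop in a subroutine and then walks the resulting hash. The name parse_pairs and the sample input are hypothetical, chosen only for this illustration:

```perl
use strict;
use warnings;

# parse_pairs is a hypothetical helper: it takes an open filehandle
# and returns a reference to a hash of name/value pairs.
sub parse_pairs {
    my ($fh) = @_;
    my %data;
    local $_;    # keep the caller's $_ untouched
    while (<$fh>) {
        next if /^\s*(?:$|#)/;
        my ($name, $value) = /^\s*(.+?):\s*(.+?)\s*$/
            or die "Syntax error: $_";
        die "Duplicate entry for $name found: $_"
            if exists $data{$name};
        $data{$name} = $value;
    }
    return \%data;
}

# Made-up sample input with a comment, a blank line, and uneven spacing.
my $input = <<'END';
# sample settings
host: localhost
  port:8080

user:    alice
END

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my $data = parse_pairs($fh);
close $fh;

for my $name (sort keys %$data) {
    print "$name = $data->{$name}\n";
}
```

To read a real file instead, open it by name and pass that filehandle to the subroutine.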

I have a text file with records in the form
xxxxxxxxxxxxxxxxx
and I would like to insert the end for each one. The current approach I am attempting is along these lines;
while ( )
{
    if ($_ =~ s/(.+)()/g )
    {
        print (OUT "$2$3") ;
    }
}
print "All done\n";
As this doesn’t seem to work can you suggest an alternative?
Thanks
Comment by Benjamin — April 12, 2007 @ 1:17 pm
Ah this form does not seem to like angle brackets. How can I post text containing them?
Comment by Benjamin — April 12, 2007 @ 1:19 pm
You can replace the < character with &lt; and > with &gt; to display angle brackets. Otherwise your angle brackets are being interpreted as an attempt at HTML. Also try using <pre> and </pre> tags around your quoted code so the indentation is preserved.
Comment by William Ward — April 17, 2007 @ 1:03 pm
Thank you, but I managed to get that bit working with a bit of tinkering.
I have another question though.
I have a text file of about 100,000 lines that I am trying to run a series of regexes over, and I am using Tie::File to feed everything into an array first. It works fine if I only give it 10,000 lines or so, but with anything over that it seems to insert a couple of hundred extra lines somewhere near the end and messes up the data. Here is a sample of the code and data:-
#########
AUTHOR
Aagaard, L. and Schmid, M. and Warburton, P. and Jenuwein, T.
TITLE
Mitotic phosphorylation of SUV39H1, a novel component of active centromeres, coincides with transient accumulation at mammalian centromeres
JOURNAL
J Cell Sci
VOLUME
113
ISSUE
Pt 5
##########
tie @array, 'Tie::File', \*FILE, recsep => '\n';
for (@array)
{
    s/\nAUTHOR\n/&\n<AUTHOR>\n/g; #AUTHOR
    print "AUTHOR done...\n";
    s/\nTITLE\n/\n<TITLE>\n/g; #TITLE
    print "TITLE done...\n";
    s/\nJOURNAL\n/\n<JOURNAL>\n/g; #JOURNAL
    print "JOURNAL done...\n";
}
untie @array;
There are about 2500 records in the above format in one text file and I want to turn each of the main headings into XML tags. Not every record uses all of the headings, which is why I’m trying to combine this method to open the tags and the method in my first post to close them (which works if I manually insert open tags with Textpad). With each ” + data” on the same line, here is my method from my first post:-
($line =~ s/<([A-Z_]+)>([^<]+)/<$1>$2<\/$1>/g )
Comment by Benjamin — April 19, 2007 @ 7:27 am
**The last line should read “with each HEADING+data on the same line, here is my </CLOSE> method from my first post.”
Comment by Benjamin — April 19, 2007 @ 7:29 am
I’ve never used Tie::File - it sounds interesting.
Why not just use a traditional approach of opening the file, iterating over each line, and doing your substitutions on each one?
Or you could use the input record separator $/ to have it load one whole record at a time instead of one line at a time…
Comment by William Ward — April 19, 2007 @ 11:02 am
How do you run multiple (or a series of) regexes over a text file when using a
while ( $line = <$IN>)
{
    $line =~ s///g;
    print (OUT $line);
}
Or is there another better method?
Comment by Benjamin — April 21, 2007 @ 11:24 am
Yeah that’s probably the best way to do it. I’d let Perl use $_ as the default variable for something like that though…
while(<IN>)
{
    s/old/new/g;
    print OUT;
}
Comment by William Ward — April 21, 2007 @ 11:47 pm
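To run a series of substitutions, one simple approach is to list them one after another inside the loop, since each s/// works on whatever the previous one left in $_. A sketch under some assumptions: the headings sit alone on their own lines as in the sample data above, and in-memory filehandles stand in for the real IN and OUT files so the example can be run directly:

```perl
use strict;
use warnings;

# Two headings and their data, taken from the sample record above.
my $input = <<'END';
AUTHOR
Aagaard, L. and Schmid, M. and Warburton, P. and Jenuwein, T.
JOURNAL
J Cell Sci
END

my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";

while (<$in>) {
    # Each substitution is applied in turn to the current line in $_.
    s/^AUTHOR$/<AUTHOR>/;
    s/^TITLE$/<TITLE>/;
    s/^JOURNAL$/<JOURNAL>/;
    print {$out} $_;
}

close $in;
close $out;

print $output;
```

Because $ matches just before the trailing newline, each line keeps its line ending after the substitution.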
Thanks, that works much better.
I’ve hit another snag though. After running a number of successful regexes on the file, I am now trying to replace multiple newlines (\n\n\n) with </RECORD>:\n, but I can’t seem to change more than one \n at a time. I have tried s/\n\n\n//g, s/\n{3,}//g, and others, but have had no luck.
Comment by Benjamin — April 22, 2007 @ 9:09 am
That’s trickier. The way you’re reading line-by-line won’t let you do that. You might want to try experimenting with $/ (see the perlvar docs) to read in chunks other than line-by-line. Paragraph mode ($/ = "") might help you, for example.
Comment by William Ward — April 22, 2007 @ 11:42 am
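Here is a rough sketch of the paragraph-mode idea. It assumes records are separated by one or more blank lines and that a closing </RECORD> line should replace each gap; the second record is made up for illustration, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Two sample records separated by blank lines (second one is made up).
my $input = <<'END';
<VOLUME>113</VOLUME>
<ISSUE>Pt 5</ISSUE>


<VOLUME>114</VOLUME>
<ISSUE>Pt 1</ISSUE>
END

open my $in, '<', \$input or die "Cannot open in-memory file: $!";

my $output = '';
{
    # Paragraph mode: each read returns one blank-line-separated chunk.
    local $/ = "";
    while (my $record = <$in>) {
        chomp $record;    # in paragraph mode this removes all trailing newlines
        $output .= "$record\n</RECORD>\n";
    }
}
close $in;

print $output;
```

Setting $/ with local inside a block restores normal line-by-line reading as soon as the block ends.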