Sample Exercises
Regular Expressions: Analyze Mail Folder
Write a program to analyze the contents of an e-mail folder. Most mail programs, at least on Unix, store e-mail in a text file as follows:
Each message starts with a line called the "envelope" which consists of the word "From", then a space, then the e-mail address of the sender and the date it was sent. Note: this line does not contain the colon (:) character that the mail header lines have!
After this line comes the header of the message. Each line of the header consists of a header name, followed by a colon (:) and an optional space, then the value for that line. The header includes such things as "To:", "Cc:", "From:", "Subject:", and "Date:". Note that the "From:" line here is distinct from the "From " line which is the envelope. Some header lines may occur more than once per message and/or span multiple lines (e.g., "Received:"). However, for this exercise the headers you need to worry about ("From:" and "Subject:") do not.
Next comes a blank line, to indicate the end of the header, and then the text of the message, until the next message's envelope line comes along. (Note: to make it easier to find the envelope line, the text should not contain any lines starting with "From " (the word "From" followed by a space); most mail software automatically modifies such lines by adding a ">" character so as to avoid confusion with the next message's envelope line. So if a line starts with "From " you can safely assume it's the envelope of the next message.)
After another blank line, the next message starts, with envelope, header, and body as above. This repeats until the end of the file is reached.
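The layout described above can be recognized one line at a time with a small amount of state. The following is a minimal sketch, not the course's solution: the helper name make_classifier, the labels it returns, and the sample lines (including the envelope date) are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that labels each line it is given
# as 'envelope', 'header', 'blank' (the line that ends the header), or 'body'.
sub make_classifier {
    my $in_header = 0;
    return sub {
        my ($line) = @_;
        if ($line =~ /^From /) {        # envelope: "From" plus a space, no colon
            $in_header = 1;
            return 'envelope';
        }
        if ($in_header) {
            if ($line =~ /^\s*$/) {     # first blank line ends the header
                $in_header = 0;
                return 'blank';
            }
            return 'header';
        }
        return 'body';
    };
}

# Made-up sample lines, for illustration only.
my @sample = (
    'From wrw@bayview.com Mon Jan  1 00:00:00 2001',
    'From: wrw@bayview.com',
    'Subject: Hello',
    '',
    'Body text',
    '>From here on the line is quoted, so it is not an envelope',
);
my $classify = make_classifier();
my @kinds = map { $classify->($_) } @sample;
print join(',', @kinds), "\n";   # envelope,header,header,blank,body,body
```

Because the closure keeps only a single flag, nothing about earlier lines has to be stored, which fits the requirement to look at one line at a time.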
Your program is to be run with the name of the file on the command line, e.g., "perl mailrpt.plx mailboxname". Use the <> operator to read the file. (Note: if no mailbox filename is given, <> will read from STDIN instead.) You can download the sample file from this site (see the Auxiliary Files area below) or use any standard-format mail folder for testing.
Your program will read the file from start to finish (do not store the entire file in memory at once, since it could be very large; just look at one line at a time), and look at each message in turn. Only look at the envelope and the "From:" and "Subject:" header lines (you may assume there is only one of each of these per message). Count the following statistics and issue a report after the entire file has been read:
- Number of messages in the file (i.e., number of envelope lines seen).
- Number of lines in the file.
- Average message length in lines (includes envelope, header, and body). (Hint: this is simply a function of the previous two numbers.)
- Number of messages from each unique e-mail address (from the "From:" line).
- Number of messages with each unique subject (from the "Subject:" line).
For example, the output might look something like this:
Found 5 messages on 255 lines.
Average message length is 51 lines.
Summary of messages by e-mail address:
wrw@bayview.com: 4 messages
president@whitehouse.gov: 1 messages
Summary of messages by subject:
Ideas for new budget: 3 messages
Project: 2 messages
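One way to produce a report of this shape, once the counts are known, is sketched below. It is only an illustration under assumptions: the function name format_report is hypothetical, the average is rounded to a whole number, and entries are sorted by descending count (neither choice is required by the exercise). The per-subject section would follow the same pattern and is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the report lines from the totals and a
# hash reference mapping each e-mail address to its message count.
sub format_report {
    my ($messages, $lines, $by_addr) = @_;
    my $avg = $messages ? $lines / $messages : 0;   # avoid dividing by zero
    my @out = (
        "Found $messages messages on $lines lines.",
        sprintf('Average message length is %.0f lines.', $avg),
        'Summary of messages by e-mail address:',
    );
    # Assumption: highest count first, ties broken alphabetically.
    my @addrs = sort { $by_addr->{$b} <=> $by_addr->{$a} or $a cmp $b }
                keys %$by_addr;
    push @out, sprintf('%s: %d messages', $_, $by_addr->{$_}) for @addrs;
    return @out;
}

# Counts taken from the sample output above.
my @report = format_report(5, 255, {
    'wrw@bayview.com'          => 4,
    'president@whitehouse.gov' => 1,
});
print "$_\n" for @report;
```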
When looking at the "From:" line, consider only the e-mail address and ignore any "real name" portion. An e-mail address is of the format "user@domain", where "user" and "domain" may contain letters, numbers, underscores, hyphens (-), or periods (.). In other words, the following "From:" lines should be considered the same:
From: wrw@bayview.com
From: William R Ward <wrw@bayview.com>
From: wrw@bayview.com (William R. Ward)
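The address rule above maps fairly directly onto a character class. The sketch below is one possible approach rather than the official solution; the function name extract_address is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the user@domain part of a "From:" header
# line, or undef if the line is not a "From:" line or has no address.
sub extract_address {
    my ($line) = @_;
    return undef unless $line =~ /^From:/;
    # \w covers letters, digits, and underscore; '.' and '-' are added
    # to match the characters the exercise allows on either side of '@'.
    my ($addr) = $line =~ /([\w.-]+\@[\w.-]+)/;
    return $addr;
}

for my $line (
    'From: wrw@bayview.com',
    'From: William R Ward <wrw@bayview.com>',
    'From: wrw@bayview.com (William R. Ward)',
) {
    print extract_address($line), "\n";   # wrw@bayview.com each time
}
```

Searching for the address anywhere on the line, instead of anchoring it right after "From:", is what lets the same pattern handle all three forms.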
When people forward or reply to e-mail messages their mail software often modifies the "Subject:" line. When your program analyzes these messages, it should ignore those modifications. Specifically, remove "(fwd)" anywhere on the line, or "Re:" at the beginning, as well as any whitespace at the beginning or end of the line, before counting it. Do this in a case-insensitive manner. Your program should count all of these "Subject:" lines as equivalent:
Subject: Perl training class
Subject: Perl training class
Subject: Perl training class (fwd)
Subject: (FWD)Perl training class
Subject: Re: Perl training class (fwd)
Subject: RE:Perl training class
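The clean-up rules for "Subject:" lines can be applied as a short series of substitutions. This is a minimal sketch with a hypothetical function name, normalize_subject; it removes only a single leading "Re:", as the rules state, and does not try to handle repeated prefixes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned-up subject text, or undef if
# the line is not a "Subject:" header line.
sub normalize_subject {
    my ($line) = @_;
    return undef unless $line =~ /^Subject:(.*)$/;
    my $subject = $1;
    $subject =~ s/\(fwd\)//gi;      # drop "(fwd)" anywhere, any case
    $subject =~ s/^\s*Re://i;       # drop one leading "Re:", any case
    $subject =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
    return $subject;
}

my @subjects = map { normalize_subject($_) } (
    'Subject: Perl training class',
    'Subject: Perl training class (fwd)',
    'Subject: (FWD)Perl training class',
    'Subject: Re: Perl training class (fwd)',
    'Subject: RE:Perl training class',
);
print "$_\n" for @subjects;   # "Perl training class" five times
```

Removing "(fwd)" before "Re:" matters: a subject such as "(fwd) Re: ..." only has "Re:" at the beginning once the "(fwd)" is gone.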
Auxiliary File for Analyze Mail Folder
Solution for Analyze Mail Folder
Copyright © 1995-2008 William R. Ward dba Bay View Training. All Rights Reserved. “Bay View Training”, “Bay View Consulting Services”, “Bay View Software”, the sailboat logo, and the domain name “bayview.com” are trademarks and/or service marks of William R. Ward dba Bay View Training. For more information, contact training@bayview.com
