<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Regular Expressions to Parse Data Files</title>
	<atom:link href="http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/</link>
	<description>Bay View Consulting Services, Inc.</description>
	<pubDate>Thu, 20 Nov 2008 15:35:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: William Ward</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6775</link>
		<dc:creator>William Ward</dc:creator>
		<pubDate>Sun, 22 Apr 2007 19:42:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6775</guid>
		<description>That's trickier.  The way you're reading line-by-line won't let you do that.  You might want to try experimenting with $/ (see the perlvar docs) to read in chunks other than line-by-line.  Paragraph mode ($/="") might help you, for example.</description>
		<content:encoded><![CDATA[<p>That&#8217;s trickier.  The way you&#8217;re reading line-by-line won&#8217;t let you do that.  You might want to try experimenting with $/ (see the perlvar docs) to read in chunks other than line-by-line.  Paragraph mode ($/="") might help you, for example.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6770</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Sun, 22 Apr 2007 17:09:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6770</guid>
		<description>Thanks, that works much better. 

I've hit another snag, though. After running a number of successful regexes on the file, I am now trying to replace multiple newlines (\n\n\n) with &#60;/RECORD&#62;\n but I can't seem to change more than one \n at a time. I have tried s/\n\n\n//g, s/\n{3,}//g, and others, but have had no luck.</description>
		<content:encoded><![CDATA[<p>Thanks, that works much better. </p>
<p>I&#8217;ve hit another snag, though. After running a number of successful regexes on the file, I am now trying to replace multiple newlines (\n\n\n) with &lt;/RECORD&gt;\n but I can&#8217;t seem to change more than one \n at a time. I have tried s/\n\n\n//g, s/\n{3,}//g, and others, but have had no luck.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Ward</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6744</link>
		<dc:creator>William Ward</dc:creator>
		<pubDate>Sun, 22 Apr 2007 07:47:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6744</guid>
		<description>Yeah, that's probably the best way to do it.  I'd let Perl use $_ as the default variable for something like that, though...

while(&#60;IN&#62;)
{
s/old/new/g;
print OUT;
}</description>
		<content:encoded><![CDATA[<p>Yeah, that&#8217;s probably the best way to do it.  I&#8217;d let Perl use $_ as the default variable for something like that, though&#8230;</p>
<p>while(&lt;IN&gt;)<br />
{<br />
s/old/new/g;<br />
print OUT;<br />
}</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6722</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Sat, 21 Apr 2007 19:24:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6722</guid>
		<description>How do you run multiple (or a series of) regexes over a text file when using a loop like this:

while ( $line = &#60;$IN&#62;)
{
$line =~ s///g;
print (OUT $line);
}

Or is there a better method?</description>
		<content:encoded><![CDATA[<p>How do you run multiple (or a series of) regexes over a text file when using a loop like this:</p>
<p>while ( $line = &lt;$IN&gt;)<br />
{<br />
$line =~ s///g;<br />
print (OUT $line);<br />
}</p>
<p>Or is there a better method?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Ward</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6638</link>
		<dc:creator>William Ward</dc:creator>
		<pubDate>Thu, 19 Apr 2007 19:02:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6638</guid>
		<description>I've never used Tie::File - it sounds interesting.
Why not just use a traditional approach of opening the file, iterating over each line, and doing your substitutions on each one?
Or you could use the input record separator $/ to have it load one whole record at a time instead of one line at a time...</description>
		<content:encoded><![CDATA[<p>I&#8217;ve never used Tie::File - it sounds interesting.<br />
Why not just use a traditional approach of opening the file, iterating over each line, and doing your substitutions on each one?<br />
Or you could use the input record separator $/ to have it load one whole record at a time instead of one line at a time&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6632</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Thu, 19 Apr 2007 15:29:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6632</guid>
		<description>**The last line should read "With each HEADING+data on the same line, here is my &#60;/CLOSE&#62; method from my first post".</description>
		<content:encoded><![CDATA[<p>**The last line should read &#8220;With each HEADING+data on the same line, here is my &lt;/CLOSE&gt; method from my first post&#8221;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6631</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Thu, 19 Apr 2007 15:27:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6631</guid>
		<description>Thank you, but I managed to get that bit working with a bit of tinkering.

I have another question though.

I have a text file of about 100,000 lines that I am trying to run a series of regexes over, and I am using Tie::File to feed everything into an array first. It works fine if I only give it 10,000 lines or so, but with anything over that it seems to insert a couple of hundred extra lines somewhere near the end and mess up the data. Here is a sample of the code and data:

#########
AUTHOR
Aagaard, L. and Schmid, M. and Warburton, P. and Jenuwein, T.
TITLE
Mitotic phosphorylation of SUV39H1, a novel component of active centromeres, coincides with transient accumulation at mammalian centromeres
JOURNAL
J Cell Sci
VOLUME
113
ISSUE
Pt 5
##########

        tie @array, 'Tie::File', \*FILE, recsep =&#62; '\n';
        for (@array)
                {

		s/\nAUTHOR\n/&#38;\n&#60;AUTHOR&#62;\n/g;     #AUTHOR
                print "AUTHOR done...\n";
		s/\nTITLE\n/\n&#60;TITLE&#62;\n/g;	#TITLE
		print "TITLE done...\n";
		s/\nJOURNAL\n/\n&#60;JOURNAL&#62;\n/g;	#JOURNAL
		print "JOURNAL done...\n";
		}
        untie @array;


There are about 2500 records in the above format in one text file and I want to turn each of the main headings into XML tags. Not every record uses all of the headings, which is why I'm trying to combine this method to open the tags and the method in my first post to close them (which works if I manually insert open tags with Textpad). With each HEADING+data on the same line, here is my &#60;/CLOSE&#62; method from my first post:

($line =~ s/&#60;([A-Z_]+)&#62;([^&#60;]+)/&#60;$1&#62;$2&#60;\/$1&#62;/g )</description>
		<content:encoded><![CDATA[<p>Thank you, but I managed to get that bit working with a bit of tinkering.</p>
<p>I have another question though.</p>
<p>I have a text file of about 100,000 lines that I am trying to run a series of regexes over, and I am using Tie::File to feed everything into an array first. It works fine if I only give it 10,000 lines or so, but with anything over that it seems to insert a couple of hundred extra lines somewhere near the end and mess up the data. Here is a sample of the code and data:</p>
<p>#########<br />
AUTHOR<br />
Aagaard, L. and Schmid, M. and Warburton, P. and Jenuwein, T.<br />
TITLE<br />
Mitotic phosphorylation of SUV39H1, a novel component of active centromeres, coincides with transient accumulation at mammalian centromeres<br />
JOURNAL<br />
J Cell Sci<br />
VOLUME<br />
113<br />
ISSUE<br />
Pt 5<br />
##########</p>
<p>        tie @array, 'Tie::File', \*FILE, recsep =&gt; '\n';<br />
        for (@array)<br />
                {</p>
<p>		s/\nAUTHOR\n/&amp;\n&lt;AUTHOR&gt;\n/g;     #AUTHOR<br />
                print "AUTHOR done...\n";<br />
		s/\nTITLE\n/\n&lt;TITLE&gt;\n/g;	#TITLE<br />
		print "TITLE done...\n";<br />
		s/\nJOURNAL\n/\n&lt;JOURNAL&gt;\n/g;	#JOURNAL<br />
		print "JOURNAL done...\n";<br />
		}<br />
        untie @array;</p>
<p>There are about 2500 records in the above format in one text file and I want to turn each of the main headings into XML tags. Not every record uses all of the headings, which is why I&#8217;m trying to combine this method to open the tags and the method in my first post to close them (which works if I manually insert open tags with Textpad). With each HEADING+data on the same line, here is my &lt;/CLOSE&gt; method from my first post:</p>
<p>($line =~ s/&lt;([A-Z_]+)&gt;([^&lt;]+)/&lt;$1&gt;$2&lt;\/$1&gt;/g )</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Ward</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6572</link>
		<dc:creator>William Ward</dc:creator>
		<pubDate>Tue, 17 Apr 2007 21:03:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6572</guid>
		<description>You can replace the &#60; character with &#38;lt; and &#62; with &#38;gt; to display angle brackets.  Otherwise your angle brackets are being interpreted as an attempt at HTML.  Also try using &#60;pre&#62; and &#60;/pre&#62; tags around your quoted code so the indentation is preserved.</description>
		<content:encoded><![CDATA[<p>You can replace the &lt; character with &amp;lt; and &gt; with &amp;gt; to display angle brackets.  Otherwise your angle brackets are being interpreted as an attempt at HTML.  Also try using &lt;pre&gt; and &lt;/pre&gt; tags around your quoted code so the indentation is preserved.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6423</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Thu, 12 Apr 2007 21:19:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6423</guid>
		<description>Ah this form does not seem to like angle brackets. How can I post text containing them? </description>
		<content:encoded><![CDATA[<p>Ah this form does not seem to like angle brackets. How can I post text containing them?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benjamin</title>
		<link>http://www.bayview.com/blog/2004/03/25/regular-expressions-to-parse-data-files/#comment-6422</link>
		<dc:creator>Benjamin</dc:creator>
		<pubDate>Thu, 12 Apr 2007 21:17:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.bayview.com/blog/?p=15#comment-6422</guid>
		<description>I have a text file with records in the form

xxxxxxxxxxxxxxxxx

and I would like to insert the end  for each one. The current approach I am attempting is along these lines;

while  (  )
{
       if ($_ =~ s/(.+)()/g )    
       {
         print (OUT "$2$3") ;
       }
}
print "All done\n";

As this doesn't seem to work can you suggest an alternative?

Thanks</description>
		<content:encoded><![CDATA[<p>I have a text file with records in the form</p>
<p>xxxxxxxxxxxxxxxxx</p>
<p>and I would like to insert the end  for each one. The current approach I am attempting is along these lines;</p>
<p>while  (  )<br />
{<br />
       if ($_ =~ s/(.+)()/g )<br />
       {<br />
         print (OUT "$2$3") ;<br />
       }<br />
}<br />
print "All done\n";</p>
<p>As this doesn&#8217;t seem to work can you suggest an alternative?</p>
<p>Thanks</p>
]]></content:encoded>
	</item>
</channel>
</rss>
