[CentOS] perl code to remove newlines

Thu Dec 30 16:11:03 UTC 2010
Bowie Bailey <Bowie_Bailey at BUC.com>

On 12/30/2010 10:24 AM, ken wrote:
> On 12/30/2010 09:18 AM Bowie Bailey wrote:
>> On 12/30/2010 8:19 AM, ken wrote:
>>> Given an HTML file which looks like this:
>>>
>>> --------- begin snippet ---------
>>> <HTML
>>> ><HEAD
>>> ><TITLE
>>> >We've Lied to You…</TITLE
>>> ><META
>>> NAME="GENERATOR"
>>> CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
>>> REL="HOME"
>>> TITLE="Maximum RPM"
>>> HREF="index.html"><LINK
>>> REL="UP"
>>> TITLE="Using RPM to Verify Installed Packages"
>>> HREF="ch-rpm-verify.html"><LINK
>>> ...
>>> --------- end snippet ---------
>>>
>>> I'm coding some perl to make it look something like this:
>>>
>>> --------- begin snippet ---------
>>> <html>
>>> <head>
>>> <title>We've Lied to You…</title>
>>>
>>> <meta name="generator" content="Modular DocBook HTML Stylesheet Version
>>> 1.79">
>>>
>>> <link rel="HOME" title="Maximum RPM" href="index.html">
>>>
>>> <link rel="UP" title="Using RPM to Verify Installed Packages"
>>> href="ch-rpm-verify.html">
>>>
>>> <link ....
>>> --------- end snippet ---------
>>>
>>> I've hit a wall trying to remove all the newlines.  I've tried it
>>> several ways... here's just one:
>>>
>>> --------- begin snippet ---------
>>> while (<$in>)
>>> {
>>>     s/<(\w*\W)/<\L$1/g;		# Downcase XXX in "<XXX".
>>>     s/<\/(\w*\W)/<\/\L$1/g;	# Downcase XXX in "</XXX".
>>>     if(/^>/)			# if this line starts with '>'
>>>     {				# then
>>> 	$curr = tell $in;	# Note current file position,
>>> 	seek $in, $prev, 0;	# go back to previous line,
>>> 	chomp;			# remove its trailing newline char,
>>> 	seek $in, $curr, 0;	# and reset position to current line.
>>>     }
>>>     else
>>>     {
>>> 	$curr = tell $in;	# Note current file position,
>>> 	seek $in, $prev, 0;	# go back to previous line
>>> 	s/\n/ /; 		# Append a space,
>>> 	chop;			# and then chomp.
>>> 	seek $in, $curr, 0;	# and reset position to current line.
>>>     }
>>>     print;
>>>     print $out;
>>>     $prev = tell $in;		# Location of previous line.
>>> }
>>> --------- end snippet ---------
>>>
>>> When I cat the output file, it looks like this:
>>>
>>> --------- begin snippet ---------
>>> GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
>>> Lied to
>>> You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
>>> DocBook HTML Stylesheet Version
>>> 1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
>>> RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
>>> RPM to Verify Installed
>>> Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
>>> --------- end snippet ---------
>>>
>>> The output I should say *is* all on one line, not line-wrapped the way
>>> you see it above.  I have a hunch as to why there are the
>>> "GLOB(0x9fd587c)" thingies everywhere the newlines or spaces (' ')
>>> should be.  If some expert here could explain them, that would be really
>>> good.  More importantly though would be some instruction as to how to
>>> remove the newlines without creating all the GLOB(...) garbage.  Might I
>>> have to rewrite the script so to open the file in binary mode... or what?
>> So you are trying to remove all of the newlines inside the tags?
>>
>> I would approach it from the other direction.  Remove ALL of the
>> newlines and then add back the ones you want.
>>
>> Something like this (untested):
>>
>> $irs = $/;
>> $/ = undef;
>> $html = <$in>;
>> $/ = $irs;
>>
>> $html =~ s/\n/ /g;                 # Replace all newlines with spaces
>> $html =~ s/(<\w+)/\n$1/g;  # Add a newline before all begin tags
>> print $html . "\n";
>>
>> This pulls in the whole file before it starts processing, but as long as
>> it is not ridiculously huge, this should not be a problem.
> Some files this script would need to process could very well be
> ridiculously huge, which is why I chose to process line-by-line.
>
> Secondly, yes, I was already using the general strategy of taking out
> the newlines (where they're misplaced) and then putting them back in
> (where they should be).  It was probably difficult to discern that just
> from the code.
>
> Thanks for your reply, but it doesn't really address the problem.

In that case, how about this?

$html = undef;
while (<$in>)
    {
    chomp;
    $html .= " " . $_;               # Add the new line to what we already have
    $html =~ s/^\s+//;               # Get rid of any leading spaces
    $html =~ s/(<\/?\w*\W)/\L$1/g;   # Lowercase tags
    $html =~ s/(?<=.)(<\w+)/\n$1/g;  # Add in needed newlines
    while ($html =~ /\n/)
        {
        $html =~ s/^(.*?\n)//;
        print $1;                    # Print completed lines
        }
    }
print "$html\n";                     # Print whatever is left over at the end
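
For anyone who wants to try the loop above in isolation, here is a minimal sketch of the same idea wrapped in a sub and run against in-memory filehandles. The sub name reflow_html, the sample input, and the explicit $out handle are my assumptions for illustration; the regexes are the ones from the loop above. Like the loop, it only holds the current unfinished line in memory, so file size should not matter much.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions, not from the
# thread): read HTML line by line from $in, write one tag per line to $out.
sub reflow_html {
    my ($in, $out) = @_;
    my $html = '';
    while (my $line = <$in>) {
        chomp $line;
        $html .= ' ' . $line;               # Join with what we already have
        $html =~ s/^\s+//;                  # Drop leading whitespace
        $html =~ s/(<\/?\w*\W)/\L$1/g;      # Lowercase tag names
        $html =~ s/(?<=.)(<\w+)/\n$1/g;     # Newline before each opening tag
        while ($html =~ s/^(.*?\n)//) {     # Peel off completed lines
            print {$out} $1;
        }
    }
    print {$out} "$html\n";                 # Whatever is left at the end
}

# Demo on a cut-down version of the sample, using in-memory filehandles.
my $input = <<'END';
<HTML
><HEAD
><TITLE
>Hi</TITLE
><META
NAME="GENERATOR"
CONTENT="x"><LINK
REL="HOME"
HREF="index.html">
END

open my $in,  '<', \$input      or die "cannot read sample: $!";
open my $out, '>', \my $result  or die "cannot open buffer: $!";
reflow_html($in, $out);
close $out;
print $result;
```

Two things to note. Because each input line is joined with a space, tags that were split before the '>' come out as "<html >" rather than "<html>"; a further substitution would be needed to tidy that. Also, the output handle is written as print {$out} ... on purpose: a bare "print $out;" does not write $_ to the handle, it prints the stringified handle itself, which appears to be where the GLOB(0x9fd587c) text in the original output came from.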

-- 
Bowie