[CentOS] perl code to remove newlines

Thu Dec 30 14:18:21 UTC 2010

On 12/30/2010 8:19 AM, ken wrote:
> Given an HTML file which looks like this:
>
> --------- begin snippet ---------
> <HTML
> ><HEAD
> ><TITLE
> >We've Lied to You…</TITLE
> ><META
> NAME="GENERATOR"
> CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
> REL="HOME"
> TITLE="Maximum RPM"
> HREF="index.html"><LINK
> REL="UP"
> TITLE="Using RPM to Verify Installed Packages"
> HREF="ch-rpm-verify.html"><LINK
> ...
> --------- end snippet ---------
>
> I'm coding some perl to make it look something like this:
>
> --------- begin snippet ---------
> <html>
> <head>
> <title>We've Lied to You…</title>
>
> <meta name="generator" content="Modular DocBook HTML Stylesheet Version
> 1.79">
>
> <link rel="HOME" title="Maximum RPM" href="index.html">
>
> <link rel="UP" title="Using RPM to Verify Installed Packages"
> href="ch-rpm-verify.html">
>
> <link ....
> --------- end snippet ---------
>
> I've hit a wall trying to remove all the newlines.  I've tried it
> several ways... here's just one:
>
> --------- begin snippet ---------
> while (<$in>)
> {
>     s/<(\w*\W)/<\L$1/g;		# Downcase XXX in "<XXX".
>     s/<\/(\w*\W)/<\/\L$1/g;	# Downcase XXX in "</XXX".
>     if(/^>/)			# if this line starts with '>'
>     {				# then
> 	$curr = tell $in;	# Note current file position,
> 	seek $in, $prev, 0;	# go back to previous line,
> 	chomp;			# remove its trailing newline char,
> 	seek $in, $curr, 0;	# and reset position to current line.
>     }
>     else
>     {
> 	$curr = tell $in;	# Note current file position,
> 	seek $in, $prev, 0;	# go back to previous line
> 	s/\n/ /; 		# Append a space,
> 	chop;			# and then chomp.
> 	seek $in, $curr, 0;	# and reset position to current line.
>     }
>     print;
>     print $out;
>     $prev = tell $in;		# Location of previous line.
> }
> --------- end snippet ---------
>
> When I cat the output file, it looks like this:
>
> --------- begin snippet ---------
> GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
> Lied to
> You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
> DocBook HTML Stylesheet Version
> 1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
> RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
> RPM to Verify Installed
> Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
> --------- end snippet ---------
>
> The output I should say *is* all on one line, not line-wrapped the way
> you see it above.  I have a hunch as to why there are the
> "GLOB(0x9fd587c)" thingies everywhere the newlines or spaces (' ')
> should be.  If some expert here could explain them, that would be really
> good.  More importantly though would be some instruction as to how to
> remove the newlines without creating all the GLOB(...) garbage.  Might I
> have to rewrite the script so to open the file in binary mode... or what?

So you are trying to remove all of the newlines inside the tags?

I would approach it from the other direction.  Remove ALL of the
newlines and then add back the ones you want.

Something like this (untested):

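# Slurp the whole file: with $/ undefined, <$in> returns everything at once.
my $irs = $/;
$/ = undef;
my $html = <$in>;
$/ = $irs;                  # Restore the input record separator

$html =~ s/\n/ /g;          # Replace all newlines with spaces
$html =~ s/(<\w+)/\n$1/g;   # Add a newline before all begin tags
print $html . "\n";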

This pulls the whole file into memory before it starts processing, but as
long as the file is not ridiculously huge, that should not be a problem.
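
For anyone who wants something to paste and run, here is a minimal,
self-contained sketch of the same remove-everything-then-add-back idea.
The helper name reflow_html, the sample string, and the in-memory
filehandle are assumptions made for illustration and are not part of the
original script. The extra substitution that drops the space left in
front of each '>' is also an addition, and it assumes the text contains
no literal " >" that should be preserved.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): collapse every
# newline, then put one newline back in front of each opening tag.
sub reflow_html {
    my ($html) = @_;
    $html =~ s/\n/ /g;           # Replace all newlines with spaces
    $html =~ s/\s+>/>/g;         # Assumption: remove the space left before '>'
    $html =~ s/(<\w+)/\n$1/g;    # Add a newline before each begin tag
    $html =~ s/^\n//;            # Drop the newline added before the first tag
    $html =~ s/\s+\z//;          # Trim trailing whitespace
    return $html . "\n";
}

# Sample input modelled on the snippet in the question.
my $sample = "<HTML\n><HEAD\n><TITLE\n>We've Lied to You</TITLE\n"
           . "><META\nNAME=\"GENERATOR\"\n>\n";

# Read from an in-memory filehandle so the sketch needs no file on disk.
open my $in, '<', \$sample or die "cannot open in-memory file: $!";

# 'local $/' undefines the record separator only inside this block,
# so it is restored automatically once the slurp is done.
my $html = do { local $/; <$in> };
close $in;

print reflow_html($html);
```

Using 'local $/' inside a do block does the same job as saving and
restoring $/ by hand, with no risk of forgetting the restore. Close tags
such as </TITLE> are left where they are because the pattern <\w+ only
matches a '<' that is followed directly by a word character.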

-- 
Bowie