[CentOS] perl code to remove newlines

Thu Dec 30 13:19:00 UTC 2010
ken <gebser at mousecar.com>

Given an HTML file which looks like this:

--------- begin snippet ---------
<HTML
><HEAD
><TITLE
>We've Lied to You…</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="Maximum RPM"
HREF="index.html"><LINK
REL="UP"
TITLE="Using RPM to Verify Installed Packages"
HREF="ch-rpm-verify.html"><LINK
...
--------- end snippet ---------

I'm coding some perl to make it look something like this:

--------- begin snippet ---------
<html>
<head>
<title>We've Lied to You…</title>

<meta name="generator" content="Modular DocBook HTML Stylesheet Version
1.79">

<link rel="HOME" title="Maximum RPM" href="index.html">

<link rel="UP" title="Using RPM to Verify Installed Packages"
href="ch-rpm-verify.html">

<link ....
--------- end snippet ---------

I've hit a wall trying to remove all the newlines.  I've tried it
several ways... here's just one:

--------- begin snippet ---------
while (<$in>)
{
    s/<(\w*\W)/<\L$1/g;		# Downcase XXX in "<XXX".
    s/<\/(\w*\W)/<\/\L$1/g;	# Downcase XXX in "</XXX".
    if(/^>/)			# if this line starts with '>'
    {				# then
	$curr = tell $in;	# Note current file position,
	seek $in, $prev, 0;	# go back to previous line,
	chomp;			# remove its trailing newline char,
	seek $in, $curr, 0;	# and reset position to current line.
    }
    else
    {
	$curr = tell $in;	# Note current file position,
	seek $in, $prev, 0;	# go back to previous line
	s/\n/ /; 		# Append a space,
	chop;			# and then chomp.
	seek $in, $curr, 0;	# and reset position to current line.
    }
    print;
    print $out;
    $prev = tell $in;		# Location of previous line.
}
--------- end snippet ---------

When I cat the output file, it looks like this:

--------- begin snippet ---------
GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
Lied to
You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
DocBook HTML Stylesheet Version
1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
RPM to Verify Installed
Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
--------- end snippet ---------

I should say that the output *is* all on one line, not line-wrapped the
way you see it above.  I have a hunch as to why the "GLOB(0x9fd587c)"
thingies appear everywhere the newlines or spaces (' ') should be, but
if some expert here could explain them, that would be really good.  More
important, though, would be some instruction on how to remove the
newlines without creating all the GLOB(...) garbage.  Might I have to
rewrite the script so as to open the file in binary mode... or what?
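
A likely explanation for the GLOB(...) strings, offered as a sketch
rather than a definitive diagnosis: in "print $out;" there is no list
after the handle, so Perl takes $out as the thing to print and writes
its stringified form to the currently selected handle (normally STDOUT).
The demo below reproduces this. The in-memory handles and the names
$fake_stdout, $stdout_buf and $file_buf are assumptions for
illustration only, since the script's open calls are not shown.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: two in-memory "files" so the effect is visible.
my ($stdout_buf, $file_buf) = ('', '');
open my $fake_stdout, '>', \$stdout_buf or die "open: $!";
open my $out,         '>', \$file_buf   or die "open: $!";

my $old = select $fake_stdout;  # make $fake_stdout the default output handle
$_ = "<html";

print $out;       # no list given: prints the handle itself, "GLOB(0x...)",
                  # to the default handle
print {$out} $_;  # block form: prints $_ to the handle held in $out

select $old;      # restore the previous default handle
close $fake_stdout;
close $out;

# $stdout_buf now holds "GLOB(0x...)" and $file_buf holds "<html".
```

If that is what is happening, then binary mode is probably not the
issue; writing "print {$out} $_;" (or "print $out $_;") should make the
GLOB(...) text go away.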


Maximum thanks for your assistance.
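
On removing the newlines themselves: chomp, chop and s/// act on $_ (the
line just read), not on the file, and seek/tell only move the read
position, so seeking back cannot change a line that has already been
printed. One possible alternative, sketched here under the assumption
that the file is small enough to hold in memory, is to work on the whole
text as a single string. The helper name tidy_tags is hypothetical, and
the sketch downcases tag names only, leaving attributes as they are.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole HTML text as one string.
sub tidy_tags {
    my ($html) = @_;
    $html =~ s/\n>/>/g;                # a line starting with '>' joins the previous line
    $html =~ s/\n(?!\z)/ /g;           # any other newline, except a final one, becomes a space
    $html =~ s/<(\/?)(\w+)/<$1\L$2/g;  # downcase XXX in "<XXX" and "</XXX"
    $html =~ s/></>\n</g;              # put a tag that directly follows another on its own line
    return $html;
}

# Small made-up sample in the same shape as the input shown earlier.
my $sample = join "\n",
    '<HTML',
    '><HEAD',
    '><TITLE',
    '>Hi</TITLE',
    '><LINK',
    'REL="HOME"',
    'HREF="index.html">',
    '';

print tidy_tags($sample);
```

With real files, the whole input can be read in one go by setting
"local $/;" before "my $html = <$in>;", and the result written with
"print {$out} tidy_tags($html);".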