perl code to remove newlines

ken

30 Dec 2010

1:19 p.m.

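Given an HTML file which looks like this:

--------- begin snippet ---------
<HTML
><HEAD
><TITLE
>We've Lied to You…</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="Maximum RPM"
HREF="index.html"><LINK
REL="UP"
TITLE="Using RPM to Verify Installed Packages"
HREF="ch-rpm-verify.html"><LINK
...
--------- end snippet ---------

I'm coding some perl to make it look something like this:

--------- begin snippet ---------
<html>
<head>
<title>We've Lied to You…</title>
<meta name="generator" content="Modular DocBook HTML Stylesheet Version 1.79">
<link rel="HOME" title="Maximum RPM" href="index.html">
<link rel="UP" title="Using RPM to Verify Installed Packages" href="ch-rpm-verify.html">
<link ....
--------- end snippet ---------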

I've hit a wall trying to remove all the newlines. I've tried it several ways... here's just one:

--------- begin snippet ---------
while (<$in>) {
    s/<(\w*\W)/<\L$1/g;        # Downcase XXX in "<XXX".
    s/<\/(\w*\W)/<\/\L$1/g;    # Downcase XXX in "</XXX".
    if(/^>/)                   # if this line starts with '>'
    {                          # then
        $curr = tell $in;      # Note current file position,
        seek $in, $prev, 0;    # go back to previous line,
        chomp;                 # remove its trailing newline char,
        seek $in, $curr, 0;    # and reset position to current line.
    }
    else
    {
        $curr = tell $in;      # Note current file position,
        seek $in, $prev, 0;    # go back to previous line
        s/\n/ /;               # Append a space,
        chop;                  # and then chomp.
        seek $in, $curr, 0;    # and reset position to current line.
    }
    print;
    print $out;
    $prev = tell $in;          # Location of previous line.
}
--------- end snippet ---------

When I cat the output file, it looks like this:

--------- begin snippet ---------
GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've Lied to You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using RPM to Verify Installed Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
--------- end snippet ---------

The output, I should say, *is* all on one line, not line-wrapped the way you see it above. I have a hunch as to why the "GLOB(0x9fd587c)" thingies appear everywhere the newlines or spaces (' ') should be. If some expert here could explain them, that would be really good. More important, though, would be some instruction on how to remove the newlines without creating all the GLOB(...) garbage. Might I have to rewrite the script so as to open the file in binary mode... or what?

Maximum thanks for your assistance.
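
A note on the GLOB(0x...) strings, offered as a sketch rather than a full diagnosis of the script: in Perl, `print $out;` does not write `$_` to the handle `$out`. With a scalar handle and no list, Perl takes `$out` as the thing to print, so the stringified handle goes to the currently selected handle, the same place the bare `print;` sends `$_`. Writing `$_` to the handle needs an explicit list, as in `print {$out} $_;`. Note also that `chomp`, `chop` and `s///` act on `$_` (the current line), and a `seek` on `$in` does not change `$_`. The sub name `show_print_forms` and the in-memory handles below are illustrative assumptions standing in for the real files.

```perl
use strict;
use warnings;

# Hypothetical demo helper: returns what each form of print wrote.
sub show_print_forms {
    my ($line) = @_;
    my ($selected_text, $out_text) = ('', '');

    # In-memory handles stand in for STDOUT and the output file.
    open my $selected, '>', \$selected_text or die "open: $!";
    open my $out,      '>', \$out_text      or die "open: $!";

    my $previous = select $selected;   # make $selected the default handle
    local $_ = $line;

    print $out;         # prints the handle itself ("GLOB(0x...)") to $selected
    print {$out} $_;    # prints $_ to $out, which is what was intended

    select $previous;
    close $selected;
    close $out;
    return ($selected_text, $out_text);
}

my ($selected_text, $out_text) = show_print_forms("<html");
print "selected handle got: $selected_text\n";
print "output handle got:   $out_text\n";
```

In the loop from the original post, replacing the `print; print $out;` pair with a single `print {$out} $_;` should remove the GLOB(...) text; it does not by itself decide where the newlines go.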

Bowie Bailey

30 Dec

2:18 p.m.

On 12/30/2010 8:19 AM, ken wrote:

...

So you are trying to remove all of the newlines inside the tags?

I would approach it from the other direction. Remove ALL of the newlines and then add back the ones you want.

Something like this (untested):

$irs = $/;
$/ = undef;
$html = <$in>;
$/ = $irs;

$html =~ s/\n/ /g;          # Replace all newlines with spaces
$html =~ s/(<\w+)/\n$1/g;   # Add a newline before all begin tags
print $html . "\n";

This pulls in the whole file before it starts processing, but as long as it is not ridiculously huge, this should not be a problem.
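
The slurp approach above can be wrapped so that `$/` is restored automatically. This is a minimal sketch using the same two substitutions; `local $/` inside a `do` block is a common idiom for the save, undef, restore sequence. The sub name `slurp_and_reflow` is hypothetical, and the in-memory handle in the usage lines stands in for the real input file.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $in at once and return the reflowed text.
sub slurp_and_reflow {
    my ($in) = @_;
    my $html = do { local $/; <$in> };   # $/ is undef only inside this block
    $html = '' unless defined $html;     # empty input
    $html =~ s/\n/ /g;                   # Replace all newlines with spaces
    $html =~ s/(<\w+)/\n$1/g;            # Add a newline before all begin tags
    return $html . "\n";
}

my $sample = "<HTML\n><HEAD\n><TITLE\n>We've Lied to You</TITLE\n>";
open my $in, '<', \$sample or die "open: $!";
print slurp_and_reflow($in);
close $in;
```

As written, the result starts with a blank line and keeps the space before each `>`; both follow from the two regexes and would need further cleanup.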

-- Bowie

ken

3:24 p.m.

On 12/30/2010 09:18 AM Bowie Bailey wrote:

...

Some files this script would need to process could very well be ridiculously huge, which is why I chose to process line by line.

Secondly, yes, I was already using the general strategy of taking out the newlines (where they're misplaced) and then putting them back in (where they should be). It was probably difficult to discern that just from the code.

Thanks for your reply, but it doesn't really address the problem.

John Doe

4:01 p.m.

From: ken gebser@mousecar.com

...

Not really an answer but why not use an html beautifier...? http://www.w3.org/People/Raggett/tidy/

$ cat $FILE | tr "\n" " " | sed 's/ *></>\n</g'
<HTML>
<HEAD>
<TITLE >We've Lied to You…</TITLE>
<META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.79">
<LINK REL="HOME" TITLE="Maximum RPM" HREF="index.html">
<LINK REL="UP" TITLE="Using RPM to Verify Installed Packages" HREF="ch-rpm-verify.html">
<LINK...

cpolish＠surewest.net

7:44 p.m.

John Doe wrote:

...

$ cat $FILE | tr "\n" " " | sed 's/ *></>\n</g'

I was yearning for someone to chime in with that! sed is clearly the best & most straightforward way to do this task.

I can't help myself - there's a "useless use of cat":

$ < $FILE tr "\n" " " | sed 's/ *></>\n</g'

http://www.partmaps.org/era/unix/award.html#uucaletter

-- "Pedantic, I?" -- Alexei Sayle

ken

9:59 p.m.

On 12/30/2010 11:01 AM John Doe wrote:

...

Thanks. Cool program. But for some reason it mangles the file in a few ways. So I can't use it.

Bowie Bailey

4:11 p.m.

On 12/30/2010 10:24 AM, ken wrote:

...

In that case, how about this?

$html = undef;
while (<$in>) {
    chomp;
    $html .= " " . $_;                  # Add the new line to what we already have
    $html =~ s/^\s+//;                  # Get rid of any leading spaces
    $html =~ s/(<\/?\w*\W)/\L$1/g;      # Lowercase tags
    $html =~ s/(?<=.)(<\w+)/\n$1/g;     # Add in needed newlines
    while ($html =~ /\n/) {
        $html =~ s/^(.*?\n)//;
        print $1;                       # Print completed lines
    }
}
print "$html\n";                        # Print whatever is left over at the end

-- Bowie

Sean

7:20 p.m.

Not sure exactly what you are trying to do, but Tie::File might be worth a look if you haven't done so already.

Sean

ken wrote:

...

Jerry McAllister

10:41 p.m.

On Thu, Dec 30, 2010 at 08:19:00AM -0500, ken wrote:

It isn't perl, but does 'tr' exist in CentOS (it does in FreeBSD)? It would do it.

////jerry

...

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

Bart Schaefer

31 Dec

3:52 a.m.

On Thu, Dec 30, 2010 at 5:19 AM, ken gebser@mousecar.com wrote:

...

--------- begin snippet ---------
while (<$in>) {
    s/<(\w*\W)/<\L$1/g;        # Downcase XXX in "<XXX".
    s/<\/(\w*\W)/<\/\L$1/g;    # Downcase XXX in "</XXX".

    chomp;                     # Always remove the newline
    unless (/<html/) {         # Not on first line, so

Bart Schaefer

3:59 a.m.

(Drat, keyboard glitch caused that to be sent before I was finished.)

On Thu, Dec 30, 2010 at 5:19 AM, ken gebser@mousecar.com wrote:

...

--------- begin snippet ---------
while (<$in>) {
    s/<(\w*\W)/<\L$1/g;        # Downcase XXX in "<XXX".
    s/<\/(\w*\W)/<\/\L$1/g;    # Downcase XXX in "</XXX".

    chomp;                     # Always remove the newline
    unless (/<html/) {         # Not on first line, so insert a newline
                               # whenever this line does not begin with >
        s/^(^[>])/\n$1/;
    }
}

That's it, except for an END block to print a final newline. If there are blank lines in the input that you want to retain, you'll need a little more to avoid having them swallowed.

Bart Schaefer

4:01 a.m.

Oops again, typo:

...

s/^(^[>])/\n$1/;

Should be s/^([^>])/\n$1/
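
Putting the thread's ideas together, a line-by-line filter might look like the sketch below. It is not any poster's exact method: it joins each line to the previous one (nothing before a leading `>`, a space otherwise), lowercases tag names only, and starts a new output line before each begin tag. It assumes input shaped like the snippet in the original post, leaves attribute names untouched, and uses a hypothetical sub name, `reflow_tags`; the in-memory handle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: stream $in to $out, holding only one line in memory.
sub reflow_tags {
    my ($in, $out) = @_;
    my $first = 1;
    while (my $line = <$in>) {
        chomp $line;
        # Separator between the previous line and this one.
        my $sep = $first            ? ''     # nothing before the first line
                : $line =~ /^>/     ? ''     # '>' closes the previous line's tag
                : $line =~ /^<\w/   ? "\n"   # a begin tag starts a new line
                :                     ' ';   # otherwise join with a space
        $first = 0;
        $line =~ s/<(\/?\w+)/<\L$1/g;        # Lowercase tag names
        $line =~ s/(?<=.)<(?=\w)/\n</g;      # Newline before mid-line begin tags
        print {$out} $sep, $line;
    }
    print {$out} "\n" unless $first;         # Final newline
}

my $sample = join "\n", '<HTML', '><HEAD', '><TITLE', ">We've Lied to You</TITLE",
    '><META', 'NAME="GENERATOR"', 'CONTENT="x"><LINK', 'REL="HOME">', '';
open my $in, '<', \$sample or die "open: $!";
reflow_tags($in, \*STDOUT);
close $in;
```

Because each line is printed as soon as it is read, memory use stays bounded by the longest input line, which matters for the very large files mentioned earlier in the thread.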
