On 12/30/2010 8:19 AM, ken wrote:
> Given an HTML file which looks like this:
>
> --------- begin snippet ---------
> <HTML
>> <HEAD
>> <TITLE
>> We've Lied to You…</TITLE
>> <META
> NAME="GENERATOR"
> CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
> REL="HOME"
> TITLE="Maximum RPM"
> HREF="index.html"><LINK
> REL="UP"
> TITLE="Using RPM to Verify Installed Packages"
> HREF="ch-rpm-verify.html"><LINK
> ...
> --------- end snippet ---------
>
> I'm coding some perl to make it look something like this:
>
> --------- begin snippet ---------
> <html>
> <head>
> <title>We've Lied to You…</title>
>
> <meta name="generator" content="Modular DocBook HTML Stylesheet Version
> 1.79">
>
> <link rel="HOME" title="Maximum RPM" href="index.html">
>
> <line rel="UP" title="Using RPM to Verify Installed Packages"
> href="ch-rpm-verify.html">
>
> <link ....
> --------- end snippet ---------
>
> I've hit a wall trying to remove all the newlines. I've tried it
> several ways... here's just one:
>
> --------- begin snippet ---------
> while (<$in>)
> {
>     s/<(\w*\W)/<\L$1/g;      # Downcase XXX in "<XXX".
>     s/<\/(\w*\W)/<\/\L$1/g;  # Downcase XXX in "</XXX".
>     if(/^>/)                 # if this line starts with '>'
>     {                        # then
>         $curr = tell $in;    # Note current file position,
>         seek $in, $prev, 0;  # go back to previous line,
>         chomp;               # remove its trailing newline char,
>         seek $in, $curr, 0;  # and reset position to current line.
>     }
>     else
>     {
>         $curr = tell $in;    # Note current file position,
>         seek $in, $prev, 0;  # go back to previous line
>         s/\n/ /;             # Append a space,
>         chop;                # and then chomp.
>         seek $in, $curr, 0;  # and reset position to current line.
>     }
>     print;
>     print $out;
>     $prev = tell $in;        # Location of previous line.
> }
> --------- end snippet ---------
>
> When I cat the output file, it looks like this:
>
> --------- begin snippet ---------
> GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
> Lied to
> You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
> DocBook HTML Stylesheet Version
> 1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
> RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
> RPM to Verify Installed
> Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
> --------- end snippet ---------
>
> The output I should say *is* all on one line, not line-wrapped the way
> you see it above. I have a hunch as to why there are the
> "GLOB(0x9fd587c)" thingies everywhere the newlines or spaces (' ')
> should be. If some expert here could explain them, that would be really
> good. More importantly though would be some instruction as to how to
> remove the newlines without creating all the GLOB(...) garbage. Might I
> have to rewrite the script so to open the file in binary mode... or what?

So you are trying to remove all of the newlines inside the tags?

I would approach it from the other direction. Remove ALL of the
newlines and then add back the ones you want.

Something like this (untested):

    $irs = $/;
    $/ = undef;
    $html = <$in>;
    $/ = $irs;

    $html =~ s/\n/ /g;         # Replace all newlines with spaces
    $html =~ s/(<\w+)/\n$1/g;  # Add a newline before all begin tags

    print $html . "\n";

This pulls in the whole file before it starts processing, but as long as
it is not ridiculously huge, this should not be a problem.

-- 
Bowie
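
For anyone who wants to try the slurp-then-reinsert idea, here is a minimal, self-contained sketch of it. It is an illustration under assumptions, not a tested fix for the original script: the `tidy_html` helper and the `$sample` input are hypothetical names invented for this example, in-memory filehandles stand in for the real `$in` and `$out`, and the extra substitutions (dropping the space before `>` and lowercasing tag names) are additions beyond the two regexes in the reply. It also writes through `print {$out} ...`, because a bare `print $out;` prints the stringified handle (the `GLOB(0x...)` text) to the currently selected handle rather than printing `$_` to `$out`.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the real HTML file.
my $sample = qq{<HTML\n><HEAD\n><TITLE\n>We've Lied to You</TITLE\n><META\nNAME="GENERATOR"\nCONTENT="Modular DocBook"><LINK\nREL="HOME"\nHREF="index.html">\n};

# Hypothetical helper: join everything onto one line, then put a
# newline back in front of each opening tag.
sub tidy_html {
    my ($html) = @_;
    $html =~ s/\s*\n\s*/ /g;           # replace every newline with a single space
    $html =~ s/\s+>/>/g;               # drop whitespace left before '>' (also affects a literal " >" in text)
    $html =~ s/<(\/?)(\w+)/<$1\L$2/g;  # lowercase tag names in "<XXX" and "</XXX"
    $html =~ s/(<\w+)/\n$1/g;          # add a newline before each opening tag
    $html =~ s/^\s+//;                 # trim the leading newline just added
    $html =~ s/\s+$//;                 # trim trailing whitespace
    return "$html\n";
}

# Read the whole input at once; "local $/" restores the record
# separator automatically when the do-block ends.
open my $in, '<', \$sample or die "cannot open in-memory input: $!";
my $html = do { local $/; <$in> };
close $in;

# Write to a lexical handle. The braces make it unambiguous that $out
# is the handle and the following expression is the data.
open my $out, '>', \my $result or die "cannot open in-memory output: $!";
print {$out} tidy_html($html);
close $out;

print $result;
```

With real files, the two `open` calls would take file names instead of scalar references; the rest should stay the same. Attribute names are left in their original case here, so lowercasing them as in the desired output would need a further substitution.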