On 12/30/2010 10:24 AM, ken wrote:
> On 12/30/2010 09:18 AM Bowie Bailey wrote:
>> On 12/30/2010 8:19 AM, ken wrote:
>>> Given an HTML file which looks like this:
>>>
>>> --------- begin snippet ---------
>>> <HTML
>>>> <HEAD
>>>> <TITLE
>>>> We've Lied to You…</TITLE
>>>> <META
>>> NAME="GENERATOR"
>>> CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
>>> REL="HOME"
>>> TITLE="Maximum RPM"
>>> HREF="index.html"><LINK
>>> REL="UP"
>>> TITLE="Using RPM to Verify Installed Packages"
>>> HREF="ch-rpm-verify.html"><LINK
>>> ...
>>> --------- end snippet ---------
>>>
>>> I'm coding some perl to make it look something like this:
>>>
>>> --------- begin snippet ---------
>>> <html>
>>> <head>
>>> <title>We've Lied to You…</title>
>>>
>>> <meta name="generator" content="Modular DocBook HTML Stylesheet Version
>>> 1.79">
>>>
>>> <link rel="HOME" title="Maximum RPM" href="index.html">
>>>
>>> <line rel="UP" title="Using RPM to Verify Installed Packages"
>>> href="ch-rpm-verify.html">
>>>
>>> <link ....
>>> --------- end snippet ---------
>>>
>>> I've hit a wall trying to remove all the newlines. I've tried it
>>> several ways... here's just one:
>>>
>>> --------- begin snippet ---------
>>> while (<$in>)
>>> {
>>>     s/<(\w*\W)/<\L$1/g;       # Downcase XXX in "<XXX".
>>>     s/<\/(\w*\W)/<\/\L$1/g;   # Downcase XXX in "</XXX".
>>>     if(/^>/)                  # if this line starts with '>'
>>>     {                         # then
>>>         $curr = tell $in;     # Note current file position,
>>>         seek $in, $prev, 0;   # go back to previous line,
>>>         chomp;                # remove its trailing newline char,
>>>         seek $in, $curr, 0;   # and reset position to current line.
>>>     }
>>>     else
>>>     {
>>>         $curr = tell $in;     # Note current file position,
>>>         seek $in, $prev, 0;   # go back to previous line
>>>         s/\n/ /;              # Append a space,
>>>         chop;                 # and then chomp.
>>>         seek $in, $curr, 0;   # and reset position to current line.
>>>     }
>>>     print;
>>>     print $out;
>>>     $prev = tell $in;         # Location of previous line.
>>> }
>>> --------- end snippet ---------
>>>
>>> When I cat the output file, it looks like this:
>>>
>>> --------- begin snippet ---------
>>> GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
>>> Lied to
>>> You…</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
>>> DocBook HTML Stylesheet Version
>>> 1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
>>> RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
>>> RPM to Verify Installed
>>> Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
>>> --------- end snippet ---------
>>>
>>> The output I should say *is* all on one line, not line-wrapped the way
>>> you see it above. I have a hunch as to why there are the
>>> "GLOB(0x9fd587c)" thingies everywhere the newlines or spaces (' ')
>>> should be. If some expert here could explain them, that would be really
>>> good. More importantly though would be some instruction as to how to
>>> remove the newlines without creating all the GLOB(...) garbage. Might I
>>> have to rewrite the script so to open the file in binary mode... or what?
>> So you are trying to remove all of the newlines inside the tags?
>>
>> I would approach it from the other direction. Remove ALL of the
>> newlines and then add back the ones you want.
>>
>> Something like this (untested):
>>
>> $irs = $/;
>> $/ = undef;
>> $html = <$in>;
>> $/ = $irs;
>>
>> $html =~ s/\n/ /g;          # Replace all newlines with spaces
>> $html =~ s/(<\w+)/\n$1/g;   # Add a newline before all begin tags
>> print $html . "\n";
>>
>> This pulls in the whole file before it starts processing, but as long as
>> it is not ridiculously huge, this should not be a problem.
> Some file this script would need to process could very well be
> ridiculously huge, which is why I chose to process line-by-line.
>
> Secondly, yes, I was already using the general strategy of taking out
> the newlines (where they're misplaced) and then putting them back in
> (where they should be). It was probably difficult to discern that just
> from the code.
>
> Thanks for your reply, but it doesn't really address the problem.

In that case, how about this?

$html = undef;
while (<$in>) {
    chomp;
    $html .= " " . $_;                  # Add the new line to what we already have
    $html =~ s/^\s+//;                  # Get rid of any leading spaces
    $html =~ s/(<\/?\w*\W)/\L$1/g;      # Lowercase tags
    $html =~ s/(?<=.)(<\w+)/\n$1/g;     # Add in needed newlines
    while ($html =~ /\n/) {
        $html =~ s/^(.*?\n)//;
        print $1;                       # Print completed lines
    }
}
print "$html\n";    # Print whatever is left over at the end

-- 
Bowie
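
For anyone who wants to try the line-by-line idea end to end, here is a minimal, self-contained sketch of it. It is a variant, not the exact code above: the sub name join_tags, the in-memory sample string, and the choice to skip the joining space when a line starts with '>' are all assumptions made for illustration. It also writes with print {$out} ..., because a bare "print $out;" with a lexical handle and no list prints the handle itself (which stringifies as GLOB(0x...)) rather than printing $_ to that handle, and that is the likely source of the GLOB text in the original output. Attribute names are left in their original case, as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative, not from the thread).
# Reads $in one line at a time, joins tag pieces that were split across
# lines, lowercases tag names, and writes one opening tag per line to $out.
sub join_tags {
    my ($in, $out) = @_;
    my $buf = '';
    while (my $line = <$in>) {
        chomp $line;
        # A line that starts with '>' closes the previous tag, so it is
        # appended without a separating space; anything else gets one.
        $buf .= ($buf eq '' || $line =~ /^>/) ? $line : " $line";
        $buf =~ s/(<\/?\w+)(?=\W)/\L$1/g;   # lowercase complete tag names
        $buf =~ s/(?<=.)(<\w)/\n$1/g;       # newline before each opening tag
        # Flush every completed line; only the unfinished tail stays buffered.
        while ($buf =~ s/^(.*?\n)//) {
            print {$out} $1;                # braces make the handle explicit
        }
    }
    print {$out} "$buf\n" if length $buf;   # whatever is left at end of input
}

# Demo on a shortened in-memory copy of the sample input (an assumption for
# illustration); a real script would open the actual input and output files.
my $sample = "<HTML\n><HEAD\n><TITLE\n>We've Lied to You</TITLE\n"
           . "><META\nNAME=\"GENERATOR\"\n>\n";
open my $in, '<', \$sample or die "cannot open in-memory input: $!";
join_tags($in, \*STDOUT);
close $in;
```

Because only the unfinished tail of the current tag is kept in the buffer, memory use stays small even for very large files, which was the objection to slurping the whole file.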