I'm trying to wget some very specific files off a web page, but some of the paths are relative (e.g. ../path/to/file) rather than absolute (e.g. http://direct/path/to/file). When wget gets to those links, it fails to fetch them. Is there a switch in wget (CentOS 5, latest wget package) that lets it follow these relative paths? I tried some of the options here (http://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html), but none of them worked, and I'm hoping someone here can point me in the right direction.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Rogelio
Sent: November 22, 2007 12:08 PM
To: CentOS mailing list
Subject: [CentOS] wget'ing files relative paths?

> I'm trying to wget some very specific files off a web page, but some of the paths are relative paths (e.g. ../path/to/file) rather than absolute (e.g. http://direct/path/to/file ). Obviously, when wget gets to that part, it craps out...

I am a little unsure what you are trying to do. Are you mirroring a certain section of a website, and the relative paths are causing problems? That would be pretty strange, because I am pretty sure that I have done that before and not had any problems (just using wget -m http://hostname/path/I/care/about/file.html).

Or do you have a list of URLs you are trying to grab (say, in a file), some of which have only relative paths? Something like this:

http://direct/path/to/file1
http://direct/path/to/file2
http://direct/path/to/file3
http://direct/path/to/file4
../../file5
../../file6

If possible, it would be great if you could show the exact command line that you are using, along with the exact error message.

> Is there a switch in wget (in CentOS 5 - latest wget package) that lets me maintain this session? I tried some of the options here ( http://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html ), but it's not working, and I'm hoping someone here might point me in the right direction.

Mike
On 11/22/07, mike.redan@bell.ca wrote:
"I am a little unsure what you are trying to do. Are you mirroring a certain section of a website, and the relative paths are causing problems? That would be pretty strange, because I am pretty sure that I have done that before and not had any problems (just using wget -m http://hostname/path/I/care/about/file.html)."
Basically, I'd like to run this command:
wget (options to grab all the mp3s) http://www.2600.com/offthehook/1988/1088.html
e.g.
wget -r -l3 -H -t1 -nd -N -np -A .mp3 -erobots=off http://www.2600.com/offthehook/1988/1088.html

It's not working the way it does on other websites (e.g. democracynow.org). I suspect that's because, when I look through the page source (which I do with "lynx -source"), the mp3 files are referenced as ../../path/to/file.mp3.

I could just write a bash script to pick through the source, piece together a real path to each mp3 file, and then wget that, but I was hoping to figure out the switch that would let me use wget properly.
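To make "piece together a real path" concrete, here is a minimal sketch of resolving a ../-style path against the URL of the page it appears on. The function name resolve_url is hypothetical (it is not a wget feature), and path/to/file.mp3 is the same placeholder used above, not a real file on the site. It only handles leading "../" segments and does not guard against climbing above the host name.

```shell
#!/bin/sh
# resolve_url BASE REL
# Hypothetical helper: turn a relative path found in a page into an
# absolute URL, given the URL of the page (BASE).
resolve_url() {
    base=$1
    rel=$2

    # Already absolute: pass it through unchanged.
    case $rel in
        http://*|https://*) printf '%s\n' "$rel"; return 0 ;;
    esac

    # Strip the file name from the base, keeping its directory.
    dir=${base%/*}

    # Each leading "../" climbs one directory level.
    while [ "${rel#../}" != "$rel" ]; do
        rel=${rel#../}
        dir=${dir%/*}
    done

    printf '%s/%s\n' "$dir" "$rel"
}

resolve_url http://www.2600.com/offthehook/1988/1088.html ../../path/to/file.mp3
# prints: http://www.2600.com/path/to/file.mp3
```

The resulting absolute URLs can be collected in a file, one per line, and handed to wget with -i.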
For what it's worth, I finally figured out my wget problem.
I used Lynx to grab the source of 2600's archive page, grep'd out the URLs in the pull-down menus, sed'd these URL fragments into real URLs, and redirected the result into a file (OTH) that wget could read with -i, so I could get the MP3s.
lynx -source http://www.2600.com/offthehook/archive_ra.html | grep /offthehook | sed 's_">.*__g' | sed 's_ __g' | sed 's_<optionvalue=".._http://www.2600.com_g' | sed 's_<option selected value=".._http://www.2600.com_g' | sed 's_<optionselectedvalue=".._http://www.2600.com_g' | sed 's_\t__g' > OTH
wget -r -l1 -t1 -nd -N -A.mp3 -erobots=off -i OTH
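For anyone adapting this, the chain of sed calls can probably be collapsed into one expression. The sketch below is untested against the live site: the sample <option> lines are my guess at the markup (inferred from the sed patterns above, with a made-up second entry), and extract_urls is a hypothetical name. It assumes one <option> per line and a value that starts with "..", and it runs on canned input so it can be tried without network access.

```shell
#!/bin/sh
# extract_urls: read HTML on stdin, print one absolute URL per matching
# <option> line. Hypothetical helper; assumes one <option> per line and
# a value attribute that begins with "..".
extract_urls() {
    grep '/offthehook' |
    sed -n 's_.*<option[^>]*value="\.\.\([^"]*\)".*_http://www.2600.com\1_p'
}

# Sample input: assumed markup, not copied from the real page.
extract_urls <<'EOF'
<select name="archive">
<option value="../offthehook/1988/1088.html">October 1988</option>
<option selected value="../offthehook/1988/1188.html">November 1988</option>
</select>
EOF
# prints:
# http://www.2600.com/offthehook/1988/1088.html
# http://www.2600.com/offthehook/1988/1188.html
```

With the real page, the input would come from "lynx -source" as above and the output would be redirected to OTH for the wget -i step.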