Hello
I have a large list of URLs (from a database, generated automatically during tests) that I want to download using several wget processes at the same time. With our internal web servers, this will be a lot faster than downloading the pages one at a time with a single process.
So I create 20 pipes in my script with `mkfifo` and connect the read end of each one to a new wget process for that fifo. The write end of each pipe is then connected to my script, with shell commands like `exec 18>>fifo_file_name`
Then my script outputs, in a loop, one line with a URL to each of the pipes, in turn, and then starts over again with the first pipe until there are no more URLs from the database client.
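In outline, the relevant part looks something like this (the file names, descriptor numbers and URL source below are simplified stand-ins; the real script is attached):

    # create the pipes, start a wget reader on each one, and keep a
    # write-end descriptor open in the script
    fd=10
    for i in $(seq 1 20); do
        mkfifo "fifo_$i"
        wget -i "fifo_$i" &              # wget reads its URL list from the fifo
        eval "exec $fd>>fifo_$i"         # eval because the fd number is in a variable
        fd=$((fd + 1))
    done

    # deal the URLs out round-robin, one line to each pipe in turn
    fd=10
    while read -r url; do                # url_list.txt stands in for the database client
        eval "printf '%s\n' \"\$url\" >&$fd"
        fd=$((fd + 1)); [ "$fd" -gt 29 ] && fd=10
    done < url_list.txt

    # close all the write ends so each wget can see EOF, then wait
    fd=10
    while [ "$fd" -le 29 ]; do
        eval "exec $fd>&-"
        fd=$((fd + 1))
    done
    wait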
Much to my dismay I find that there is no concurrent / parallel download with the child `wget` processes, and that for some strange reason only one wget process can download pages at a time, and after that process completes, another one can begin.
My script does feed *all* the pipes with data, one line to each pipe in turn, and has written to and closed all the pipes before the first child process has even finished downloading.
Do you know why my child processes wait for each other in turn like this before they start reading from their fifos and downloading?
I figure it must be something about the pipes, because if I use regular files instead (and reverse the order: first write the URLs, then start wget to read them) then the child processes run in parallel as expected. The child processes also run in parallel if I open the write ends of the pipes first and then start the wget processes on the read ends.
They even ran in parallel with my pipes once, but I saw them run like this only that one time in all my attempts. I do not know what was special about that attempt; it happened at the beginning of the day, and the computers were not restarted or logged off overnight.
The pipes are created and deleted on every run, with mkfifo and rm.
Is there something special about fifos that makes them run in sequence if I open the read end first?
My script is attached here, I believe it is nicely formatted and clear enough.
Thank you, Timothy Madden
This really belongs on a shell list rather than the centos list, but:
On Fri, Nov 25, 2011 at 1:05 PM, Timothy Madden terminatorul@gmail.com wrote:
So I create 20 pipes in my script with `mkfifo` and connect the read end of each one to a new wget process for that fifo. The write end of each pipe is then connected to my script, with shell commands like `exec 18>>fifo_file_name`
Then my script outputs, in a loop, one line with a URL to each of the pipes, in turn, and then starts over again with the first pipe until there are no more URLs from the database client.
Much to my dismay I find that there is no concurrent / parallel download with the child `wget` processes, and that for some strange reason only one wget process can download pages at a time, and after that process completes, another one can begin.
I believe the problem is with creating all the fifos and their readers first and then creating the writers.
What happens is that you create wget #1, which has some file descriptors associated with both it and the parent shell.
Next you create wget #2, which (because it was forked from the parent shell) shares all the file descriptors the shell had open for wget #1, including the write end of wget #1's fifo. Repeat for all the rest of the wgets. By the time you have created the last one, each of them holds copies of the write-end descriptors for every fifo that was set up ahead of it.
Thus, even though you write to the fifo for wget #2 and close it from the parent shell, it doesn't actually see EOF and begin processing the input until the corresponding descriptor shared by wget #1 is closed when wget #1 exits. wget #3 then doesn't see EOF until #2 exits (#3 would have waited for #1, too, except #1 is already gone by then). Then #4 waits for #3, etc.
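You can see the effect with a stripped-down example, using cat in place of wget and only two pipes (everything here is made up, and it assumes your shell passes such descriptors to background jobs, which is evidently what it does):

    mkfifo f1 f2

    cat f1 > out1 &   # reader #1; the shell holds no fifo writers yet
    exec 3>f1         # the shell opens the write end of f1 ...
    cat f2 > out2 &   # ... so reader #2 is forked with fd 3 (a writer on f1) open
    exec 4>f2

    echo one >&3
    exec 3>&-         # the shell closes its writer on f1, but reader #1 still
                      # gets no EOF: reader #2's inherited fd 3 keeps f1 open
    echo two >&4
    exec 4>&-         # only now does reader #2 finish, and its exit finally
    wait              # releases the stray fd 3 so reader #1 can terminate too
    rm f1 f2

Reader #1 only completes after reader #2 has exited, for exactly the reason above.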
So you're either going to need to do a lot more clever descriptor wrangling to make sure wget #1 is not holding open any descriptors visible to wget #2, or you're going to have to use a simpler concurrency scheme that doesn't rely on having all those fifos opened ahead of time.
The child processes also run in parallel if I open the write ends of the pipes first and then start the wget processes on the read ends.
Probably you inadvertently resolved the shared open descriptor problem by whatever change you made to the script to invert that ordering.
On Fri, Nov 25, 2011 at 6:34 PM, Bart Schaefer barton.schaefer@gmail.com wrote:
Next you create wget #2, which (because it was forked from the parent shell) shares all the file descriptors the shell had open for wget #1, including the write end of wget #1's fifo. Repeat for all the rest of the wgets. By the time you have created the last one, each of them holds copies of the write-end descriptors for every fifo that was set up ahead of it.
Thus, even though you write to the fifo for wget #2 and close it from the parent shell, it doesn't actually see EOF and begin processing the input until the corresponding descriptor shared by wget #1 is closed when wget #1 exits.
I wrote that backwards. Actually I think the *last* one (#20) exits first, and then #19, and so on down to #1 ... but the descriptor management issue is the same.
On 26.11.2011 07:41, Bart Schaefer wrote:
On Fri, Nov 25, 2011 at 6:34 PM, Bart Schaefer barton.schaefer@gmail.com wrote:
Next you create wget #2, which (because it was forked from the parent shell) shares all the file descriptors the shell had open for wget #1, including the write end of wget #1's fifo. Repeat for all the rest of the wgets. By the time you have created the last one, each of them holds copies of the write-end descriptors for every fifo that was set up ahead of it.
Thus, even though you write to the fifo for wget #2 and close it from the parent shell, it doesn't actually see EOF and begin processing the input until the corresponding descriptor shared by wget #1 is closed when wget #1 exits.
I wrote that backwards. Actually I think the *last* one (#20) exits first, and then #19, and so on down to #1 ... but the descriptor management issue is the same.
Wow! You guys are so great!
That is exactly how my script behaves. The POSIX (actually SUS) text on the matter says that child processes may or may not inherit such descriptors, depending on the implementation, and that a portable script should close any descriptors it does not want them to inherit. Somehow I did not pay too much attention to the issue and just assumed I did not need to worry about it.
But you are right: in my loop every new wget process inherits the file descriptors my script opened for the previous children (for their pipes, to be precise), and so it keeps those files open even after my script closes its own copies of the descriptors. And I thought I was the toughest programmer ever ...!
The solution was simple, with no file descriptor wrangling: just use two loops, one to start all the wget processes and connect them to their pipes, and another to open all those write-end fds. This way the wgets have nothing special to inherit.
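Roughly, and again with made-up names and descriptor numbers, the fixed part now looks like this:

    # loop 1: create every fifo and start every wget before the script
    # opens any write end, so the children have no fifo writers to inherit
    for i in $(seq 1 20); do
        mkfifo "fifo_$i"
        wget -i "fifo_$i" &
    done

    # loop 2: only now open the write ends in the parent script
    fd=10
    for i in $(seq 1 20); do
        eval "exec $fd>>fifo_$i"
        fd=$((fd + 1))
    done

The URLs are then dealt out and the descriptors closed as before, and each wget sees EOF as soon as the script closes its own copy of the write end.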
Sorry about the wrong group. After asking in a Unix shell group without much success, I suspected it must be CentOS doing something strange with the FIFOs at the system level.
Thank you, Timothy Madden