[CentOS] Socket behavior change from 6.5 to 6.6

Fri Jan 16 18:21:07 UTC 2015
Warren Young <wyml at etr-usa.com>

A couple more thoughts...

On Jan 16, 2015, at 10:42 AM, Warren Young <wyml at etr-usa.com> wrote:

> On Jan 15, 2015, at 11:40 AM, Glenn Eychaner <geychaner at mac.com> wrote:
> 
>> When the DOS box exits, crashes, or is rebooted, it fails to shut down the
>> socket properly.
> 
> Yes, that’s what happens when you use an OS that doesn’t implement sockets in kernel space: there is no program still running that can send the RST packet for the dead socket.

That said, your Linux/Python side code shouldn’t be relying on the RST anyway.  A power blip that unceremoniously reboots the DOS box will also skip the RST.  That happens with *all* TCP stacks, even in-kernel ones.

True war story, seen on devices from multiple vendors: 

The setup: An embedded system has a TCP listener.  Some network problem [*] causes packet loss for an extended period, causing an established peer to time out and drop its conn.  The packet loss also prevents the RST/FIN from getting to the embedded device, so it thinks it’s still connected.  Because the embedded device’s programmer is counting every processor cycle, he has written it to handle only a single TCP connection at a time.

The result: The embedded box is now unreachable until boots on the ground walk over and power-cycle it.

The fix: Make the embedded TCP listener either a) allow multiple TCP connections; or b) drop the prior TCP conn when a new one comes in.

The lesson: If your TCP/IP program was easy to write, it isn’t robust.  You’ve missed *something*.


[*] It could be a misconfiguration, broken cable, firmware update, power-cycled wiring closet, etc.
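
Option (b) from the fix above might look something like the following in Python.  This is only a sketch: the class name SingleClientListener and the method accept_replacing are made up for illustration, and a real embedded device would do the equivalent in whatever language and TCP stack it actually uses.

```python
import socket


class SingleClientListener:
    """Hypothetical listener that keeps at most one client connection.

    A newly accepted connection replaces the previous one, so a stale
    connection whose peer vanished cannot lock new clients out.
    """

    def __init__(self, host="127.0.0.1", port=0):
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))  # port 0 lets the OS pick a free port
        self.listener.listen(1)
        self.current = None  # the single active connection, if any

    @property
    def port(self):
        return self.listener.getsockname()[1]

    def accept_replacing(self):
        """Block until a client connects, then drop any prior connection."""
        conn, addr = self.listener.accept()
        if self.current is not None:
            # The old peer may be long gone without our ever seeing FIN/RST.
            self.current.close()
        self.current = conn
        return conn, addr
```

A single-threaded server would still need select() or similar to notice the new connection while it is busy reading from the old one; the sketch leaves that part out.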

> The correct fix is to change the DOS app to use an ephemeral port number.

That also fixes the “missing RST” problem I’ve described above.  If by some bad bit of luck the DOS box happens to pick the same ephemeral port number after a reboot that it was using before, it will get an RST.  The DOS app will then retry, causing the DOS TCP stack to pick a different ephemeral port, so it will succeed.
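
The DOS app obviously isn’t written in Python, but the connect-and-retry pattern can be sketched in Python for illustration.  The function name connect_with_retry and its parameters are assumptions of mine, not anything from the real app.  The point is simply that the client never binds to a fixed local port, so each new socket gets whatever source port the stack chooses.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, delay=1.0, timeout=5.0):
    """Connect from an OS-chosen (ephemeral) source port, retrying on failure.

    No source_address is passed, so the local port is never fixed; each
    attempt uses a fresh socket and lets the stack choose the source port.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # covers refused, reset and timed-out attempts
            last_error = exc
            time.sleep(delay)
    raise last_error
```

After a successful connect, sock.getsockname()[1] shows which local port the stack picked.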

A different fix is to exploit the real-time nature of video camera imagery: if your Python app goes more than a second without receiving an image frame, it can presume that the DOS box has disappeared again, and drop its conn.  By the time the DOS box reboots, TIME_WAIT may have expired, so the DOS box might reconnect without a problem.
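
A rough sketch of that timeout on the Python side, using the socket’s own receive timeout.  I don’t know how your frames are delimited, so this treats any received bytes as proof of life and hands them to a callback; receive_frames, handle_data and the one-second default are all placeholders.

```python
import socket

FRAME_TIMEOUT = 1.0  # seconds of silence before the peer is presumed gone


def receive_frames(conn, handle_data, timeout=FRAME_TIMEOUT, bufsize=65536):
    """Read from conn until the peer goes quiet, closes, or errors out.

    Always closes conn before returning, and returns the reason as one of
    "timeout", "closed" or "error".
    """
    conn.settimeout(timeout)
    try:
        while True:
            try:
                data = conn.recv(bufsize)
            except socket.timeout:
                return "timeout"  # nothing arrived within the window
            except OSError:
                return "error"    # e.g. the connection was reset
            if not data:
                return "closed"   # orderly shutdown by the peer
            handle_data(data)
    finally:
        conn.close()
```

Whatever the return value, the caller can go straight back to accept() and wait for the DOS box to reconnect.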

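You may wish to reduce tcp_fin_timeout, per http://goo.gl/zQCzqK, but be aware that on Linux this sysctl sets how long an orphaned socket lingers in FIN_WAIT_2.  The TIME_WAIT interval itself is fixed at 60 seconds in the kernel, so this alone may not ensure that TIME_WAIT expires before the DOS box reboots.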