[CentOS] Socket behavior change from 6.5 to 6.6

Thu Jan 15 18:40:08 UTC 2015

I will try to explain this as best I can. I have two computers: one a
Supermicro X10SAE running CentOS 6, the other a very old DOS box.[*] The DOS
box runs a CCD camera, sending images via Ethernet to the X10SAE.  Thus, the
X10SAE runs a Python server on port 5700 (a socket which binds to 5700 and
listens, and then accepts a connection from the DOS box; nothing fancy).[**]
The DOS box connects to the server and sends images.  This all works great,
except:

When the DOS box exits, crashes, or is rebooted, it fails to shut down the
socket properly. Under CentOS 6.5, once the DOS box came back up and attempted
to reconnect, the originally accepted server-side socket would (after a couple
of connection attempts from the DOS box) see a 0-length recv and close,
allowing the server to accept a new connection and resume receiving images.

Under CentOS 6.6, the server never sees the 0-length recv. The DOS box flails
away attempting to reconnect forever, and the server never seems to get any
indication that the DOS box is attempting to reconnect.
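
For reference, the server side described above boils down to roughly the
following. This is only a minimal sketch using plain blocking sockets, not the
asyncore-based code actually in use; the names make_listener and serve_once
are made up for illustration, and only port 5700 comes from the real setup.

```python
import socket

def make_listener(port=5700, host=""):
    """Bind and listen on the image port (5700 in the setup described)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR is an assumption here; it only lets the server rebind quickly.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(1)
    return s

def serve_once(listener, on_data):
    """Accept one connection and read until recv() returns 0 bytes."""
    conn, peer = listener.accept()
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # 0-length recv: the connection has been closed.
                # This is the event that no longer shows up under 6.6.
                return peer
            on_data(chunk)
    finally:
        conn.close()
```

The server simply calls serve_once() in a loop, so it cannot accept the
reconnecting DOS box until the old connection reports that 0-length recv.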

Possibly relevant facts:
- The DOS box uses the same local port (1025) every time it tries to connect.
  It does not use a random ephemeral port.
- The exact same code was tested on a CentOS 6.5 box and a CentOS 6.6 box,
  resulting in the described behavior. The boxes were identical clones except
  for the O/S upgrade.
- The Python interpreter was not changed during the upgrade, because I run
  this code using my own 2.7.2 install. However, both glibc and the kernel
  were upgraded as part of the O/S upgrade.
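
To imitate the fixed source port from a Linux test client, something like the
following sketch could be used. The function name is hypothetical, and setting
SO_REUSEADDR on the client is my assumption; the DOS TCP stack may behave
differently, and this does not by itself reproduce the unclean shutdown.

```python
import socket

def connect_from_fixed_port(server_addr, local_port=1025, local_host=""):
    """Connect to server_addr using a fixed local (source) port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Assumption: allow rebinding the same local port on repeated runs.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Binding before connect() pins the source port instead of letting
        # the kernel pick an ephemeral one.
        s.bind((local_host, local_port))
        s.connect(server_addr)
    except Exception:
        s.close()
        raise
    return s
```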

My only theory is that this has something to do with non-ephemeral ports and
socket reuse, but I'm not sure what. It is entirely possible that some
low-level socket option default has changed between 6.5 and 6.6, and I
wouldn't know it. It is also possible that I have been relying on unsupported
behavior this whole time, and that the current behavior is actually correct.
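
One way to check the "changed default" theory would be to dump a few socket
options and TCP sysctls on both boxes and diff the output. The sketch below
does that; the particular options and sysctl names listed are only my guesses
at what might be relevant, not a known cause, and the function names are
invented for illustration.

```python
import socket

def socket_defaults():
    """Return default values of a few options on a fresh TCP socket."""
    names = ["SO_REUSEADDR", "SO_KEEPALIVE", "SO_RCVBUF", "SO_SNDBUF"]
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return dict((n, s.getsockopt(socket.SOL_SOCKET, getattr(socket, n)))
                    for n in names)
    finally:
        s.close()

def tcp_sysctls(names=("tcp_tw_reuse", "tcp_tw_recycle", "tcp_timestamps",
                       "tcp_keepalive_time", "tcp_retries2")):
    """Read selected /proc/sys/net/ipv4 entries; None if one is missing."""
    out = {}
    for n in names:
        try:
            with open("/proc/sys/net/ipv4/" + n) as f:
                out[n] = f.read().strip()
        except IOError:
            out[n] = None
    return out

if __name__ == "__main__":
    for k, v in sorted(socket_defaults().items()):
        print("%s = %s" % (k, v))
    for k, v in sorted(tcp_sysctls().items()):
        print("%s = %s" % (k, v))
```

Running this on the 6.5 and 6.6 clones and comparing the two outputs would at
least rule the listed settings in or out.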

Does anyone have any insight they can offer?

[*] Hardware is not an issue; in fact, I have two identical systems, each of
which has one X10SAE and three DOS boxes.  But the problem can be boiled down
to a single pair.
[**] I'm actually using an asyncore.dispatcher to do the bind/listen, and then
handing the accept()ed socket to an asynchat handler. I also put a trap on
socket.recv() to be sure that I'm not swallowing the 0-length recv by accident.

-G.
--
Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory