[CentOS] Socket behavior change from 6.5 to 6.6

Wed Jan 21 16:49:21 UTC 2015
Glenn Eychaner <geychaner at mac.com>

I'd like to thank everyone for their replies and advice. I'm sorry it took so
long for me to respond; I took a long weekend after a long shift. Some
remaining questions can be found in the final section of this posting. The
summary (I hope I have all of this correct):

Problem:
A DOS box (client) connects to a Linux box (server) using the same local port
(1025) on the client each time. The client sends data which the server reads;
the server is passive and does not write any data. If the client crashes and
fails to close the connection properly, then under CentOS 6.5 the server's
stale accepted socket receives a 0-length recv(), allowing a "clean" reconnect;
under 6.6 it does not, and the client retries the reconnect endlessly without
success.
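
For reference, here is a minimal sketch of how I understand the server's read
path; the port number, the one-client-at-a-time structure, and the names are
my own illustration, not the actual program. A return of 0 from recv() is the
normal "peer has closed" indication, which is what the stale socket used to
report under 6.5:

    /* Minimal sketch of a passive TCP server that treats recv() == 0 as
     * "client went away".  SERVER_PORT and the single-client loop are
     * illustrative assumptions, not the real program. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SERVER_PORT 5000   /* assumption: the real port isn't in the post */

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(SERVER_PORT);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 1);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);     /* one client at a time */
            if (cfd < 0)
                continue;
            char buf[512];
            for (;;) {
                ssize_t n = recv(cfd, buf, sizeof buf, 0);
                if (n > 0) {
                    /* process the data the client sent */
                } else if (n == 0) {
                    /* peer closed: the "0-length recv()" that let the old
                     * socket be cleaned up and a "clean" reconnect happen */
                    break;
                } else {
                    /* error, e.g. ECONNRESET: also treat as a disconnect */
                    break;
                }
            }
            close(cfd);
        }
    }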

Diagnosis:
Because the client is connecting using the same port every time, the server
sees the same 5-tuple each time. At that point, the reconnection should fail
until the old socket on the server is closed, and the previous behavior of
receiving a 0-length recv() on the old server socket is unsupported and
unreliable. Until the update to CentOS 6.6 'broke' the existing functionality,
I had never looked deeply into the connection between the client and the
server; it 'just worked', so I left it alone. Once it did break, I realized
that because the client was connecting on the same port every time, the
whole setup might have been relying on unsupported behavior.

My workaround:
I unfortunately had to implement an emergency workaround before receiving any
replies. Fortunately, the client also sends status messages to the same
computer (but a different server program) over a serial-port side-channel
(well, it's more complicated than that, but anyway). I set up a listener for a
"failed connection" status message that signal()s the server program to close
all client connections (but not the bound dispatchers), thereby forcing all
clients to reconnect. It's a cheat and a cheesy hack, but it works.
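
In outline, the signal-driven part of it looks something like the sketch
below; the signal number, the fd bookkeeping, and the names are illustrative
stand-ins rather than the real code. The handler only sets a flag, and the
main loop does the actual closing of accepted sockets while the bound,
listening sockets stay open:

    /* Sketch of "close all client connections on demand".  SIGUSR1,
     * client_fds[] and MAX_CLIENTS are illustrative assumptions. */
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_CLIENTS 8

    static volatile sig_atomic_t drop_clients = 0;
    static int client_fds[MAX_CLIENTS];       /* -1 marks an unused slot */

    static void on_sigusr1(int sig)
    {
        (void)sig;
        drop_clients = 1;       /* defer the real work to the main loop */
    }

    void install_drop_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigusr1;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);
    }

    /* Called from the server's main loop: closes every accepted client
     * socket (forcing the clients to reconnect) but leaves the bound,
     * listening sockets alone. */
    void drop_clients_if_requested(void)
    {
        if (!drop_clients)
            return;
        drop_clients = 0;
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (client_fds[i] != -1) {
                close(client_fds[i]);
                client_fds[i] = -1;
            }
        }
    }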

Other diagnostics:
One test I intend to run in a couple of weeks (next opportunity) is to boot
the CentOS 6.6 box with the older kernel, in order to find out whether the
behavior change is in the kernel or in the libraries.

Correct solutions:
1) Client port: The client should be connecting from a random, ephemeral port
like a well-behaved client instead of from a fixed port, which is what I had
suspected. I don't know whether this can be changed (the client uses a really
dumb binary TCP driver).
2) Protocol change: The server never writes to the socket in the existing
protocol, and can therefore never find out that the connection is dead.
Writing to the socket would reveal this. But what happens if the server writes
to the socket, and the client never reads? (We do, as it happens, have access
to the client software, so the protocol can be fixed eventually. But I'm still
curious as to the answer.) A probe write of this sort, along with the socket
options from 3) and 4), is sketched after this list.
3) Several people suggested using SO_REUSEADDR and/or an SO_LINGER of zero to
drop the socket out of TIME_WAIT, but does the socket enter TIME_WAIT as soon
as the client crashes? I didn't think so, but I may be wrong.
4) Several people suggested SO_KEEPALIVE, but with the default settings the
keepalive probes only start after hours of idle time unless you change the
kernel parameters via procfs and/or sysctl, and when the client crashes I need
recovery right away, not hours down the road. Time here is literally worth
roughly a dollar per second.
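
Purely as illustration of 2) through 4), a sketch of what those knobs look
like on Linux follows; the function names, timing values, and the one-byte
probe are my own assumptions, and the probe in particular would need the
protocol change from 2) before it could be used:

    /* Sketch of the socket options discussed above, applied to a listening
     * socket `lfd` or an accepted client socket `cfd`.  Values are
     * arbitrary examples. */
    #include <errno.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */
    #include <sys/socket.h>
    #include <sys/types.h>

    /* 3) SO_REUSEADDR goes on the listening socket, before bind(), so a
     *    server restart isn't blocked by a socket stuck in TIME_WAIT. */
    void set_reuseaddr(int lfd)
    {
        int on = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
    }

    /* 3) A zero SO_LINGER makes close() send RST and skip TIME_WAIT. */
    void set_linger_zero(int cfd)
    {
        struct linger lg = { .l_onoff = 1, .l_linger = 0 };
        setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
    }

    /* 4) SO_KEEPALIVE; Linux also exposes per-socket timing, so the
     *    hours-long system defaults need not apply to this one socket. */
    void set_fast_keepalive(int cfd)
    {
        int on = 1, idle = 30, intvl = 10, cnt = 3;   /* example values */
        setsockopt(cfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
        setsockopt(cfd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle);
        setsockopt(cfd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
        setsockopt(cfd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt);
    }

    /* 2) A periodic probe write; MSG_NOSIGNAL turns SIGPIPE into EPIPE.
     *    Returns -1 once the stack has decided the peer is gone. */
    int probe_client(int cfd)
    {
        static const char ping = 0;          /* hypothetical 1-byte probe */
        ssize_t n = send(cfd, &ping, 1, MSG_NOSIGNAL);
        if (n < 0 && (errno == EPIPE || errno == ECONNRESET ||
                      errno == ETIMEDOUT))
            return -1;                       /* connection is dead */
        return 0;                            /* queued, or would block */
    }

(For what it's worth, my understanding is that a first write toward a peer
that has silently died just lands in the send buffer and appears to succeed;
errors such as ETIMEDOUT or EPIPE only show up after the retransmission timers
give up, so even a write-based check isn't instantaneous.)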

Anyway, thanks for the discussion and helpful links. At one time I knew all
this stuff, but it has been 20 years since I had to dig into the TCP protocol
this deeply.

-G.
--
Glenn Eychaner (geychaner at lco.cl)
Telescope Systems Programmer, Las Campanas Observatory