[CentOS] OpenSwan Drop Out Issue

Tue Feb 9 15:04:43 UTC 2016
John Cenile <jcenile1983 at gmail.com>

Hello,

I'm cross-posting this from the OpenSwan mailing list, in case someone here
can help.

We have two sites connected via OpenSwan 2.6.32-9 on CentOS 5, sharing 6
/24 subnets each (so 12 in total).

The problem is that, seemingly at random, whether in the middle of the day
or the middle of the night (so I don't believe it's traffic related),
certain routes (and sometimes all of them) drop. They usually recover
after a few minutes, but that is still long enough for our monitoring to
detect downtime.

The configuration we have on each device is:

conn site-a
        keyingtries=0
        keylife=1h
        ikelifetime=8h
        left=1.1.1.1
        right=2.2.2.2
        leftsubnets={x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24}
        rightsubnets={x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24,x.x.x.x/24}
        pfs=yes
        auto=start
        authby=secret
        dpddelay=30
        dpdtimeout=120
        dpdaction=hold
        phase2alg=aes256-sha1;modp1536
        phase2=esp
        ike=aes256-sha1;modp1536

It's mirrored exactly the same on the other side.
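
For context on how many tunnels this configuration implies, here is a
minimal sketch. It assumes that leftsubnets/rightsubnets are expanded into
one connection instance per left/right pair, and that instances are named
"conn/IxJ" with 1-based indices (this is inferred from names such as
"site-a/5x5" in the logs, not taken from the pluto source). The function
name expand_subnets and the 10.x.x.x subnets are placeholders, since the
real subnets are redacted above.

```python
from itertools import product


def expand_subnets(conn, left_subnets, right_subnets):
    """Return (instance_name, left, right) for every left/right subnet pair.

    Assumption: instance names follow the "conn/IxJ" pattern, 1-based.
    """
    pairs = []
    for (i, left), (j, right) in product(
        enumerate(left_subnets, 1), enumerate(right_subnets, 1)
    ):
        pairs.append((f"{conn}/{i}x{j}", left, right))
    return pairs


# Placeholder subnets standing in for the redacted x.x.x.x/24 entries.
left = [f"10.1.{n}.0/24" for n in range(1, 7)]
right = [f"10.2.{n}.0/24" for n in range(1, 7)]

tunnels = expand_subnets("site-a", left, right)
print(len(tunnels), "subnet pairs, from", tunnels[0][0], "to", tunnels[-1][0])
```

Under that assumption, 6 subnets on each side give 36 subnet pairs, each
with its own Phase 2 (Quick Mode) negotiation to keep alive and rekey.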

I have tried changing the dead peer detection timeout to something high (5
minutes), and removing it completely (which I believe defaults it to 30
seconds), neither of which made any difference.

I can't see any obvious errors in the logs; however, the most recent
dropout produced the following messages around the same time:

Feb 10 00:53:09 site-b-vpn pluto[30584]: "site-a/5x5" #39: max number of retransmissions (2) reached STATE_QUICK_I1
Feb 10 00:53:09 site-b-vpn pluto[30584]: "site-a/5x5" #39: starting keying attempt 2 of an unlimited number
Feb 10 00:53:09 site-b-vpn pluto[30584]: "site-a/5x5" #95: initiating Quick Mode PSK+ENCRYPT+TUNNEL+PFS+UP+IKEv2ALLOW+SAREFTRACK to replace #39 {using isakmp#52 msgid:119495de proposal=AES(12)_256-SHA1(2)_160 pfsgroup=OAKLEY_GROUP_MODP1536}

and also

Feb 10 00:52:25 site-a-vpn pluto[2414]: "site-b/6x6" #1: ignoring Delete SA payload: PROTO_IPSEC_ESP SA(0xde58eea3) not found (maybe expired)
Feb 10 00:52:25 site-a-vpn pluto[2414]: "site-b/6x6" #1: received and ignored informational message
Feb 10 00:52:25 site-a-vpn pluto[2414]: "site-b/6x6" #1: ignoring Delete SA payload: PROTO_IPSEC_ESP SA(0xa5298d7d) not found (maybe expired)
Feb 10 00:52:25 site-a-vpn pluto[2414]: "site-b/6x6" #1: received and ignored informational message
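
To line up events from both gateways, a small script along these lines
could merge the two logs into one timeline. This is only a sketch: the
regular expression is based on the pluto lines quoted above (with any mail
line-wrapping undone, so each entry is on one line), the function names
parse_pluto_line and merge_events are made up for illustration, and the
year is assumed to be 2016 because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
# Feb 10 00:53:09 site-b-vpn pluto[30584]: "site-a/5x5" #39: <message>
LOG_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) pluto\[(?P<pid>\d+)\]: '
    r'"(?P<conn>[^"]+)" #(?P<state>\d+): (?P<msg>.*)$'
)


def parse_pluto_line(line):
    """Return a dict of fields for one pluto log line, or None if no match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def merge_events(lines, year=2016):
    """Parse lines from both hosts and return the events sorted by time."""
    events = [e for e in map(parse_pluto_line, lines) if e]
    return sorted(
        events,
        key=lambda e: datetime.strptime(f"{year} {e['ts']}", "%Y %b %d %H:%M:%S"),
    )


sample = [
    'Feb 10 00:53:09 site-b-vpn pluto[30584]: "site-a/5x5" #39: '
    'max number of retransmissions (2) reached STATE_QUICK_I1',
    'Feb 10 00:52:25 site-a-vpn pluto[2414]: "site-b/6x6" #1: '
    'received and ignored informational message',
]

for event in merge_events(sample):
    print(event["ts"], event["host"], event["conn"], event["msg"])
```

Sorted this way, the ignored Delete SA messages on site-a-vpn (00:52:25)
appear before the Quick Mode retransmission failure on site-b-vpn
(00:53:09).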

Before we move to another solution, does anyone have any suggestions on
what the problem might be? Running a constant ping between the two hosts
doesn't drop *any* packets (even when the IPsec connection itself drops
out).

Thanks in advance.