[CentOS] strange memory issues with CentOS 6.2 on VPS

Sun Feb 26 18:59:07 UTC 2012
Tomas Vondra <tv at fuzzy.cz>

Hi all,

today we encountered some quite strange memory allocation issues on
one of our VPS instances running CentOS 6.2. So far I've been unable to
determine what's causing it - hopefully someone here will know what's up.

The VPS is a "small" machine - just 512MB of RAM, 1 CPU, running CentOS 6.2
with the current kernel (2.6.32-220.4.2.el6.x86_64, but I've tried
2.6.32-71.el6.x86_64 too). There's a fairly common stack installed -
apache, php, postgresql, postfix, dovecot, memcached and ssh. Basically
nothing exotic, everything from official repos (except for postfix and
postgresql). The machine is not heavily used.

More detailed logs than those posted here are available at:

  http://pastebin.com/vYxRUyUX

We've been hitting some I/O utilization issues (caused by other VPS
instances on the same hw), so we've migrated to different physical hw.
After the migration, the VPS started failing because of memory allocation
issues - the services fail either at startup or when processing
requests - although there's enough free memory:

[root@vps audit]# free
         total       used       free     shared    buffers     cached
Mem:    502728     294224     208504          0      18604     163608
-/+ buffers/cache: 112012     390716
Swap:        0          0          0

i.e. about 200 MB of free memory, but apache fails when forking a child
process (and children exit with segfaults):

  [16:49:51 2012] [error] (12)Cannot allocate memory: fork: Unable to
                          fork new process
  [16:51:17 2012] [notice] child pid 2577 exit signal Segmentation
                           fault (11)

or when processing requests:

  [26 16:30:16 2012] [error] [client 66.249.72.1] PHP Fatal error:  Out
  of memory (allocated 262144) (tried to allocate 523800 bytes) in
  Unknown on line 0

The memory_limit in PHP is set to 32MB, so that's not the cause. Similar
issues happen with PostgreSQL:

  16:42:01 CET pid=2504 db=xxxxxx-drupal user=xxxxxx FATAL:  out of
               memory
  16:42:01 CET pid=2504 db=xxxxxx-drupal user=xxxxxx DETAIL:  Failed on
               request of size 2488.
  16:42:01 CET pid=2438 db= user= LOG:  could not fork new process for
               connection: Cannot allocate memory
  16:42:01 CET pid=2438 db= user= 4f4a5247.986:21 LOG:  could not fork
               new process for connection: cannot allocate memory

I have absolutely no clue what's causing this or how to fix it. According
to free/vmstat there's about 200MB of free RAM all the time, so I have
no idea why the allocation calls fail.
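
One thing free and vmstat do not show is the kernel's commit accounting,
which /proc/meminfo exposes as CommitLimit and Committed_AS. A minimal
sketch for comparing the two; the function name commit_headroom_kb and
its optional file argument are hypothetical, used only for illustration:

```shell
# Print how many kB can still be committed before the kernel's commit
# limit is reached, i.e. CommitLimit - Committed_AS.
# The optional argument is a meminfo-format file; default is /proc/meminfo.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used  = $2 }
         END              { print limit - used }' "${1:-/proc/meminfo}"
}
```

Calling commit_headroom_kb without arguments reads the live values. A
result close to zero would suggest that the commit limit, rather than
free RAM, is what the failing allocations are running into.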

What makes it even more puzzling is that after adding a swapfile, all
the issues suddenly disappear, although the swapfile is not used at all
... and then it's not possible to disable it again, because swapoff
itself fails with a memory allocation error.


  # dd if=/dev/zero of=swap.img bs=1024 count=409600
  # mkswap swap.img
  # swapon swap.img

  ... now the services are starting fine ...

  # swapon -s

    Filename             Type        Size    Used    Priority
    /root/swap.img       file        399992  0       -1

  # free
           total     used     free   shared  buffers   cached
    Mem:  503412   294192   209220        0    11740    99980
    -/+ buffers/cache: 182472   320940
    Swap: 399992        0   399992

  # swapoff swap.img
    swapoff: swap.img: swapoff failed: Cannot allocate memory

Any ideas what might cause this?

The fact that I hadn't noticed these issues before the migration is
probably due to a swap file - I added it manually during maintenance
and forgot to remove it afterwards, but it disappeared when the machine
was rebooted during the migration.

SELinux is enabled, but I doubt it's causing the issues - there's
nothing in the audit logs except for a record that there was a
segfault. Nothing suspicious.

Otherwise it's just a standard CentOS install; the only things I had to
tune a bit were the kernel limits (in sysctl.conf) related to shared
memory (because of the database). Currently there's

  kernel.shmmax = 68719476736
  kernel.shmall = 134217728
  vm.swappiness = 0
  vm.overcommit_memory = 2

which should be fine IMHO ... any ideas?
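
For reference, a rough sketch of the arithmetic the kernel documentation
describes for vm.overcommit_memory = 2: the commit limit is swap plus
vm.overcommit_ratio percent of RAM. The ratio of 50 is an assumption (it
is the kernel default and is not set in the sysctl.conf excerpt above),
the function name commit_limit_kb is made up for this example, and huge
pages are ignored.

```shell
# Approximate CommitLimit in kB under vm.overcommit_memory = 2:
#   CommitLimit = swap + RAM * overcommit_ratio / 100
# Arguments: RAM in kB, swap in kB, ratio in percent (assumed default: 50).
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=${3:-50}
    echo $(( $swap_kb + $ram_kb * $ratio / 100 ))
}

commit_limit_kb 502728 0        # totals from the first free output, no swap
commit_limit_kb 503412 399992   # totals from the free output with the swapfile
```

With these inputs the sketch prints 251364 and 651698, i.e. roughly
245 MB of committable memory without swap versus about 636 MB with the
swapfile. If that limit is what the allocations are hitting, it would be
consistent with failures while free still reports about 200 MB free, and
with the problems disappearing once swap is added even though it is
never used.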

regards
Tomáš