Begin forwarded message:

From: Deepak Shetty <dpkshetty@gmail.com>
Subject: Re: [Gluster-devel] OOM issue in openstack Cinder - GlusterFS CI env
Date: 26 February 2015 05:24:48 GMT
To: Niels de Vos <ndevos@redhat.com>
Cc: Gluster Devel <gluster-devel@gluster.org>

A quick update

   By doing more experiments, we were able to figure that the failures are happening only on the rax (rackspace) cloud VMs, but not on the hpcloud VMs
Also the OOM issue happened on rax VMs even when gluster was not used as the storage (which means the cinder default storage uses LVMs)

So right now its seems like a combination of rax & centos7 is causing the fail, making centos7 CI job unsuitable for openstack tempest runs.
Since openstack-infra jobs can be schedule across any cloud provider, we cannot use centos7 as the distro, until the rax+centos7 oom issue is resolved

Given the strict cinder community guidelines for having a CI job in place, and openstack infra is having other issues with F20 and F21 based VMs,
(which are under debug as I was told on -infra IRC by infra core members ) the only option left is to use ubuntu as the base OS for running
cinder-gluster CI job, which we are in the process of evaluating and working on.

For full details refer to thread @
http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861

For a quick summary, see this mail of the above thread @
http://article.gmane.org/gmane.comp.cloud.openstack.devel/47287

thanx,
deepak

On Mon, Feb 23, 2015 at 6:01 PM, Deepak Shetty <dpkshetty@gmail.com> wrote:

On Sun, Feb 22, 2015 at 8:46 PM, Niels de Vos <ndevos@redhat.com> wrote:
Hi Deepak and others,

On Sat, Feb 21, 2015 at 10:00:18PM +0530, Deepak Shetty wrote:
> Hi All,
> I am looking for some help from glusterfs side for the Out of Memory
> (OOM) issue
> we are seeing when using GlusterFS as a storage backend for openstack
> Cinder (block storage service)

This seems to be using the FUSE mount points for the VM images, right?

Yes, no libgfapi.

> openstack has a upstream CI env managed by openstack infra team, where
> we added a new job that creates a devstack env (openstack all in one for
> newbies) and configures the block service (Cinder) with GlusterFS as
> storage backend. Once setup, the CI job runs openstack tempest (Integration
> test suite of openstack) that does API level testing of the whole openstack
> env.
>
> As part of that testing, ~1.5 to 2 hours into the run, the tempest job
> (VM) hits OOM and the kernel oom-killer kills the process with the max
> memory to reduce memory pressure.
>
> The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the
> storage backend for openstack Cinder
>
> The openstack-dev thread @
> *http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861
> <http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861>* has
> details including links to the logs captured from the tempest jobs.

http://fungi.yuggoth.org/tmp/logs2.tar seems to be the most useful.

> Per the openstack infra folks, they have other jobs based off CentOS7 that
> doesn't hit this issue, the only change we are adding is configuring cinder
> with glusterfs when this happens, so right now glusterfs is in the
> spotlight for causing this.

Well, there definitely is something wrong on the Gluster side. After the
'blackout' between 17:47 - 18:17, this is logged in
logs/screen-c-vol.txt.gz:

mount -t glusterfs 127.0.0.1:/vol1 /opt/stack/data/cinder/mnt/f4c6ad7e3bba4ad1195b3b538efab64a" returned: 32

32 as exit code from "mount" means "mount failure".

Yeah, but thats the time when things went bad in the system leading to OOM
So can't we safely assume that system was in a bad state, hence mount failure...

We actually need to look before 17:47 and at/around 17:47 to figure what was running that caused the
blackout to happen.

I do see some errors I see in the brick logs and warnings in the mount logs
but I Don't think they are anyway related to or cause of creatign OOM.

The mountpoint would log details in
logs/glusterfs/opt-stack-data-cinder-mnt-f4c6ad7e3bba4ad1195b3b538efab64a.log.
At 18:00 in that file, there is this message:

0-vol1-client-0: server 104.130.30.179:49152 has not responded in the last 42 seconds, disconnecting.

Its a one-brick volume, so this error is pretty fatal. This error does
not cause an unmount, but the mountpoint will return errors. Subsequent
mount tries will fail, because the mountpoint is not unmounted yet.
Tests to see if the target directory can be used to mount a volume, will
fail, causing the mount process to return error 32.

Thats in the middle of the blackout, so disconnections are ok, i feel.

A little later, the (still) existing fuse mount client manages to
connect to the brick again:

0-vol1-client-0: changing port to 49152 (from 0)

The log of the brick does not contain anything useful for this time :-/

> I am looking thru the logs trying to co-relate syslog, dstat, tempest info
> to figure the state of the system and what was happening at and before the
> OOM to get any clues, but wanted to start this thread in gluster-devel to
> see if others can pitch in with their ideas to accelerate the debug and
> help root cause.

I have not yet found any pointer to what could have taken place between
17:47-18:17. It seems that network I/O stopped (ping timeout in FUSE).

During that tiem from the tempest logs i could figure that among other tests
VolumesBackupV1Test was running, which reads from Cinder- Gluster backend
and write to swift service, which will add good IO traffic.

One of the few logs that I came across which contains some messages
during this 'blackout', is logs/screen-n-cond.txt.gz. But I assume these
messages are triggered by the underlying network issues.

You mention that there was a OOM? Which logs contain some details about
that? Also, could you explain what test (workload) was running in the
VM(s?) on the FUSE mountpoint?

Look at logs/syslog.txt (thats /var/log/messages), grep for oom-killer
so at 18:17 when it happend it killed nova-compute and mysqld as they were the
most memory hoggers. But in talking with openstack infra folks, they indicated
that they have similar CI jobs where mysqld doesn't get killed, so adding gluster to
the mix is definitely the reason for reaching OOM state.

There is really no application running workload here. Most of the IO is generated as part
of oepnstack orchestration tests, liek backup (which reads from cinder and writes to swift), clone
(create a snap and then copy it), create/delete cinder volumes (which are files in glusterfs) etc

From what i understood in talking with the openstack infra folks they referred me to
the dstat logs sayign the kernel buffer cache is going as low as double digit M and system is in
70-90 % cpu wait time, which to them indicates that the IO of the system has gone bonkers

Below is the exact thing from the chat log (for the 1st re-create / logs) :

Feb 20 21:46:28 <sdague>    deepakcs: you are at 70% wait time at the end of that

Feb 20 21:46:37 <sdague>    so your io system is just gone bonkers

Feb 20 21:47:14 <fungi>    sdague: that would explain why the console login prompt and ssh daemon both stopped working, and the df loop in had going in my second ssh session hung around the same time
Feb 20 21:47:26 <sdague>    yeh, dstat even says it's skipping ticks there
Feb 20 21:47:29 <sdague>    for that reason

Feb 20 21:47:48 <fungi>    likely complete i/o starvation for an extended period at around that timeframe
Feb 20 21:48:05 <fungi>    that would also definitely cause jenkins to give up on the worker if it persisted for very long at all

Feb 20 21:48:09 <sdague>    yeh, cached memory is down to double digit M

Feb 20 21:49:19 <fungi>    because that's 00% of what it entails, at least for the hard problems
Feb 20 21:49:21 <sdague>    deepakcs: so, honestly, what it means to me is that glusterfs is may be too inefficient to function in this environment
Feb 20 21:49:25 <fungi>    er, 99%
Feb 20 21:49:34 <sdague>    because it's kind of a constrained environment

Feb 20 21:51:18 <deepakcs>    sdague, skipping ticks means ?

Feb 20 21:51:32 <sdague>    deepakcs: dstat runs once a second
Feb 20 21:51:38 <sdague>    unless the system gets under preasure
Feb 20 21:51:54 <sdague>    then it starts skipping it's runs to prevent things from getting worse
Feb 20 21:52:01 <sdague>    i.e. to not be part of the problem

Feb 20 21:53:24 <dansmith>    deepakcs: dstat is a well-understood thing, so you can probably figure out all you need to know from the docs

Feb 20 21:53:41 <dansmith>    deepakcs: it's run in the jobs so we can identify issues like this where the system is overly taxed and unable to actually run the tests
Feb 20 21:53:57 <deepakcs>    dansmith, agreed, just trying to understand system debug better :)

Feb 20 21:54:39 <dansmith>    deepakcs: http://linux.die.net/man/1/dstat
Feb 20 21:54:54 <fungi>    deepakcs: i have another manual run started just now, with the devstack-gate timeout cranked up to 5 hours just to make sure we capture the full job duration

Feb 20 21:55:37 <deepakcs>    dansmith, thanks :)
Feb 20 21:55:41 <fungi>    deepakcs: i'll keep you posted in here or follow up to the e-mail thread if you're gone

Feb 20 21:57:42 <deepakcs>    sdague, dansmith what i am not clear is that mysqld was the one with the max mem taken, which is why oom killer killed it, so howz gluster connected to creating oom here ?
Feb 20 21:58:19 <dansmith>    deepakcs: buffer cache can create huge amounts of memory pressure
Feb 20 21:58:34 <dansmith>    deepakcs: the largest memory user gets the boot just because it's the most likely to reduce the pressure

Feb 20 21:58:48 <dansmith>    deepakcs: there is a good data point here that mysql doesn't get killed in the other jobs

Feb 20 21:59:20 <deepakcs>    dansmith, ok, i looked up the glusterfsd (the process doing the IO to the disk) and its max 1.7G ram
Feb 20 21:59:21 <dansmith>    deepakcs: nor do I know what's going on in your job
Feb 20 21:59:31 <dansmith>    deepakcs: that's user memory
Feb 20 21:59:49 <deepakcs>    dansmith, yes and glusterfs is a user space FS entirely
Feb 20 22:00:03 <dansmith>    deepakcs: is it writing with O_DIRECT?

Feb 20 22:00:38 <deepakcs>    dansmith, the logs doesn't have glusterfs logs, so cant tell, but we don't configure it w/ o_direct by default, so it shouldn't
Feb 20 22:00:55 <dansmith>    deepakcs: so if not, every write it makes generates buffer cache load in the kernel

Feb 20 22:01:24 <dansmith>    deepakcs: regardless, this is really not the right forum to discuss this level of system debug I think
Feb 20 22:01:35 <deepakcs>    dansmith, ok
Feb 20 22:01:59 <dansmith>    deepakcs: the stats runs from your job and other non-failing jobs should be helpful, and it's clear that something in your job is eating all the resources

In logs/screen-dstat.txt.gz there seems to be a process called
"cinder-backup" that is running around that time. From the details of
dstat, I guess that this process causes the reading of files from the
Gluster Volume. When the system becomes responsive again, dstat has 11G
as "dsk/total read", it jumped from ~400M in 30 minutes. If this amount
of data is not flushed but kept in memory, I don't doubt that an OOM
could be triggered.

You do or don't doubt ? - thats confusing :)

Yes, I am also doubting that IO traffic generated by doing backup tests is the only scenario
where most IO is generated (from Cinder-GlusterFS perspective). I looked at all the tests
that openstack Cinder has, and this was the only intensive test.

Given that Backup was running before the time system went OOM, it seems liek a good
one to doubt upon.

Plan is to disable Backup tests and see if we still hit the OOM issue.

Do you know where this backup is written to? If we can setup a
reproducer environment that emulates the workload, things would become
much easier for debugging :)

In the current setup, backup source is GlusterFS and target is swift (configured to
store its data in /opt/stack/swift/ mounted on 2nd disk)

One of the 1st things I would try, is to configure the kernel to flush
dirty memory more aggressively ('sysctl' values for 'vm.*' options). An
other thing that may help to identify the memory usage, is capturing the
output of 'slabtop --once' in a 1-minute interval.

Unfortunately, openstack infra doesn't provide that kind of job control. The idea is to have a uniform
env for all storage backends and just change the specific backend that one wants to test, so that the runs are
comparable agaisnt a common base. Atleast that my understanding.

But yes there are knobs given to tweak/tune the kind of tests you wanna run, which is what I plan
to exploit and turn off Backup tests and see if it helps keep system in sane state.

Thanks Neils for your time and analysis, it really helped.

I m hoping now with the backup tests disabled, we won't hit, will update here once openstack infra folks co-operate
to run the CI job manually for us!

thanx,
deepak

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several

petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift