Wondering if anyone in the CentOS Community has the time/interest to investigate this CentOS + Rackspace issue?
Regards and best wishes,
Justin Clift
Begin forwarded message:
From: Deepak Shetty dpkshetty@gmail.com
Subject: Re: [Gluster-devel] OOM issue in openstack Cinder - GlusterFS CI env
Date: 26 February 2015 05:24:48 GMT
To: Niels de Vos ndevos@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
A quick update
By doing more experiments, we were able to figure out that the failures happen only on the rax (Rackspace) cloud VMs, not on the hpcloud VMs. The OOM issue also happened on rax VMs even when Gluster was not used as the storage backend (i.e. with Cinder's default LVM storage).
So right now it seems that the combination of rax and CentOS 7 is causing the failures, making the CentOS 7 CI job unsuitable for openstack tempest runs. Since openstack-infra jobs can be scheduled across any cloud provider, we cannot use CentOS 7 as the distro until the rax + CentOS 7 OOM issue is resolved.
Given the strict Cinder community guidelines for having a CI job in place, and since openstack infra is having other issues with F20- and F21-based VMs (which are under debug, as I was told on the -infra IRC channel by infra core members), the only option left is to use Ubuntu as the base OS for the cinder-gluster CI job, which we are in the process of evaluating and working on.
For full details refer to thread @ http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861
For a quick summary, see this mail of the above thread @ http://article.gmane.org/gmane.comp.cloud.openstack.devel/47287
thanx, deepak
On Mon, Feb 23, 2015 at 6:01 PM, Deepak Shetty dpkshetty@gmail.com wrote:
On Sun, Feb 22, 2015 at 8:46 PM, Niels de Vos ndevos@redhat.com wrote: Hi Deepak and others,
On Sat, Feb 21, 2015 at 10:00:18PM +0530, Deepak Shetty wrote:
Hi All, I am looking for some help from the glusterfs side for the Out of Memory (OOM) issue we are seeing when using GlusterFS as a storage backend for openstack Cinder (the block storage service).
This seems to be using the FUSE mount points for the VM images, right?
Yes, no libgfapi.
openstack has an upstream CI env managed by the openstack infra team, where
we added a new job that creates a devstack env (openstack all in one for newbies) and configures the block service (Cinder) with GlusterFS as storage backend. Once setup, the CI job runs openstack tempest (Integration test suite of openstack) that does API level testing of the whole openstack env.
As part of that testing, ~1.5 to 2 hours into the run, the tempest job
(VM) hits OOM and the kernel oom-killer kills the process with the max memory to reduce memory pressure.
The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the
storage backend for openstack Cinder
The openstack-dev thread at http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861 has details, including links to the logs captured from the tempest jobs.
http://fungi.yuggoth.org/tmp/logs2.tar seems to be the most useful.
Per the openstack infra folks, they have other jobs based on CentOS 7 that don't hit this issue; the only change we are adding is configuring Cinder with GlusterFS when this happens, so right now GlusterFS is in the spotlight as the cause.
Well, there definitely is something wrong on the Gluster side. After the 'blackout' between 17:47 - 18:17, this is logged in logs/screen-c-vol.txt.gz:
"mount -t glusterfs 127.0.0.1:/vol1 /opt/stack/data/cinder/mnt/f4c6ad7e3bba4ad1195b3b538efab64a" returned: 32
32 as exit code from "mount" means "mount failure".
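For quick reference while scanning these CI logs, the mount(8) exit statuses can be decoded with a tiny helper; `mount_err` is a made-up function name, and the code meanings come from the RETURN CODES section of the mount man page:

```shell
# Decode mount(8) exit statuses (per the RETURN CODES section of the
# mount man page) when scanning CI logs like the "returned: 32" above.
mount_err() {
    case "$1" in
        0)  echo "success" ;;
        1)  echo "incorrect invocation or permissions" ;;
        2)  echo "system error (out of memory, cannot fork, ...)" ;;
        16) echo "problems writing or locking /etc/mtab" ;;
        32) echo "mount failure" ;;
        64) echo "some mount succeeded" ;;
        *)  echo "unknown code $1" ;;
    esac
}

mount_err 32    # prints: mount failure
```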
Yeah, but that's the time when things went bad in the system, leading to the OOM. So can't we safely assume the system was in a bad state, hence the mount failure?
We actually need to look before 17:47, and at/around 17:47, to figure out what was running that caused the blackout.
I do see some errors in the brick logs and warnings in the mount logs, but I don't think they are in any way related to, or the cause of, the OOM.
The mountpoint would log details in logs/glusterfs/opt-stack-data-cinder-mnt-f4c6ad7e3bba4ad1195b3b538efab64a.log. At 18:00 in that file, there is this message:
0-vol1-client-0: server 104.130.30.179:49152 has not responded in the last 42 seconds, disconnecting.
It's a one-brick volume, so this error is pretty fatal. The error does not cause an unmount, but the mountpoint will return errors. Subsequent mount attempts will fail, because the mountpoint has not been unmounted yet. Tests to see if the target directory can be used to mount a volume will fail, causing the mount process to return error 32.
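This failure mode (a disconnected FUSE mount still occupying the mountpoint, so every retry returns 32) can be worked around defensively. A sketch, with the volume and path values taken from the logs above; `ensure_mounted` is a hypothetical helper, not something Cinder provides:

```shell
# Hypothetical defensive mount for a stale GlusterFS FUSE mountpoint.
# Volume and path values are taken from the CI logs in this thread.
VOL="127.0.0.1:/vol1"
MNT="/opt/stack/data/cinder/mnt/f4c6ad7e3bba4ad1195b3b538efab64a"

ensure_mounted() {
    vol="$1"; mnt="$2"
    if mountpoint -q "$mnt"; then
        # After a ping timeout the mount still exists but operations on
        # it fail (typically ENOTCONN); stat is a cheap health probe.
        if stat "$mnt" >/dev/null 2>&1; then
            return 0                     # healthy, nothing to do
        fi
        umount -l "$mnt"                 # lazily detach the stale mount
    fi
    mount -t glusterfs "$vol" "$mnt"     # exit 32 here is a real failure
}

# Usage: ensure_mounted "$VOL" "$MNT"
```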
That's in the middle of the blackout, so disconnections are expected, I feel.
A little later, the (still) existing fuse mount client manages to connect to the brick again:
0-vol1-client-0: changing port to 49152 (from 0)
The log of the brick does not contain anything useful for this time :-/
I am looking through the logs, trying to correlate syslog, dstat, and tempest info to figure out the state of the system and what was happening at and before the OOM to get any clues. But I wanted to start this thread on gluster-devel to see if others can pitch in with ideas, to accelerate the debugging and help find the root cause.
I have not yet found any pointer to what could have taken place between 17:47-18:17. It seems that network I/O stopped (ping timeout in FUSE).
During that time, from the tempest logs, I could see that among other tests VolumesBackupV1Test was running, which reads from the Cinder GlusterFS backend and writes to the Swift service; that would add significant IO traffic.
One of the few logs that I came across which contains some messages during this 'blackout', is logs/screen-n-cond.txt.gz. But I assume these messages are triggered by the underlying network issues.
You mention that there was a OOM? Which logs contain some details about that? Also, could you explain what test (workload) was running in the VM(s?) on the FUSE mountpoint?
Look at logs/syslog.txt (that's /var/log/messages) and grep for oom-killer. At 18:17, when it happened, it killed nova-compute and mysqld, as they were the biggest memory hogs. But in talking with the openstack infra folks, they indicated that they have similar CI jobs where mysqld doesn't get killed, so adding Gluster to the mix is definitely the reason for reaching the OOM state.
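For reference, a sketch of the grep this involves; `oom_events` is a made-up helper name, and the log path follows the CI tarball layout mentioned above:

```shell
# Hypothetical helper: list oom-killer invocations and their victims
# from a saved syslog (logs/syslog.txt in the CI tarball is a copy of
# /var/log/messages). The kernel logs "<proc> invoked oom-killer" at
# the start of each event and "Killed process ..." for the victim.
oom_events() {
    grep -nE 'invoked oom-killer|Killed process' "$1"
}

# Usage against the captured logs:
#   oom_events logs/syslog.txt
```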
There is really no application workload running here. Most of the IO is generated as part of openstack orchestration tests, like backup (which reads from Cinder and writes to Swift), clone (create a snapshot and then copy it), and create/delete of Cinder volumes (which are files on GlusterFS), etc.
From what I understood in talking with the openstack infra folks, they referred me to the dstat logs, saying the kernel buffer cache is going as low as double-digit MB and the system is at 70-90% CPU wait time, which to them indicates that the IO on the system has gone bonkers.
Below is the exact excerpt from the chat log (for the 1st re-create / logs):
Feb 20 21:46:28 <sdague> deepakcs: you are at 70% wait time at the end of that
Feb 20 21:46:37 <sdague> so your io system is just gone bonkers
Feb 20 21:47:14 <fungi> sdague: that would explain why the console login prompt and ssh daemon both stopped working, and the df loop i had going in my second ssh session hung around the same time
Feb 20 21:47:26 <sdague> yeh, dstat even says it's skipping ticks there
Feb 20 21:47:29 <sdague> for that reason
Feb 20 21:47:48 <fungi> likely complete i/o starvation for an extended period at around that timeframe
Feb 20 21:48:05 <fungi> that would also definitely cause jenkins to give up on the worker if it persisted for very long at all
Feb 20 21:48:09 <sdague> yeh, cached memory is down to double digit M
Feb 20 21:49:19 <fungi> because that's 00% of what it entails, at least for the hard problems
Feb 20 21:49:21 <sdague> deepakcs: so, honestly, what it means to me is that glusterfs may be too inefficient to function in this environment
Feb 20 21:49:25 <fungi> er, 99%
Feb 20 21:49:34 <sdague> because it's kind of a constrained environment
Feb 20 21:51:18 <deepakcs> sdague, skipping ticks means ?
Feb 20 21:51:32 <sdague> deepakcs: dstat runs once a second
Feb 20 21:51:38 <sdague> unless the system gets under pressure
Feb 20 21:51:54 <sdague> then it starts skipping its runs to prevent things from getting worse
Feb 20 21:52:01 <sdague> i.e. to not be part of the problem
Feb 20 21:53:24 <dansmith> deepakcs: dstat is a well-understood thing, so you can probably figure out all you need to know from the docs
Feb 20 21:53:41 <dansmith> deepakcs: it's run in the jobs so we can identify issues like this where the system is overly taxed and unable to actually run the tests
Feb 20 21:53:57 <deepakcs> dansmith, agreed, just trying to understand system debug better :)
Feb 20 21:54:39 <dansmith> deepakcs: http://linux.die.net/man/1/dstat
Feb 20 21:54:54 <fungi> deepakcs: i have another manual run started just now, with the devstack-gate timeout cranked up to 5 hours just to make sure we capture the full job duration
Feb 20 21:55:37 <deepakcs> dansmith, thanks :)
Feb 20 21:55:41 <fungi> deepakcs: i'll keep you posted in here or follow up to the e-mail thread if you're gone
Feb 20 21:57:42 <deepakcs> sdague, dansmith what i am not clear on is that mysqld was the one with the max mem taken, which is why the oom killer killed it, so howz gluster connected to creating oom here?
Feb 20 21:58:19 <dansmith> deepakcs: buffer cache can create huge amounts of memory pressure
Feb 20 21:58:34 <dansmith> deepakcs: the largest memory user gets the boot just because it's the most likely to reduce the pressure
Feb 20 21:58:48 <dansmith> deepakcs: there is a good data point here that mysql doesn't get killed in the other jobs
Feb 20 21:59:20 <deepakcs> dansmith, ok, i looked up the glusterfsd (the process doing the IO to the disk) and its max 1.7G ram
Feb 20 21:59:21 <dansmith> deepakcs: nor do I know what's going on in your job
Feb 20 21:59:31 <dansmith> deepakcs: that's user memory
Feb 20 21:59:49 <deepakcs> dansmith, yes and glusterfs is a user space FS entirely
Feb 20 22:00:03 <dansmith> deepakcs: is it writing with O_DIRECT?
Feb 20 22:00:38 <deepakcs> dansmith, the logs doesn't have glusterfs logs, so cant tell, but we don't configure it w/ o_direct by default, so it shouldn't
Feb 20 22:00:55 <dansmith> deepakcs: so if not, every write it makes generates buffer cache load in the kernel
Feb 20 22:01:24 <dansmith> deepakcs: regardless, this is really not the right forum to discuss this level of system debug I think
Feb 20 22:01:35 <deepakcs> dansmith, ok
Feb 20 22:01:59 <dansmith> deepakcs: the stats runs from your job and other non-failing jobs should be helpful, and it's clear that something in your job is eating all the resources
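The buffered-vs-O_DIRECT distinction dansmith raises can be seen directly with dd: without `oflag=direct`, every write is staged in the kernel page cache first. A sketch with scratch files (O_DIRECT fails on filesystems that don't support it, which is why the second command is guarded):

```shell
# Buffered write: 64 MB lands in the kernel page cache before being
# flushed to disk, which is the memory pressure being discussed.
dd if=/dev/zero of=/tmp/buffered.img bs=1M count=64 2>/dev/null

# Direct write: bypasses the page cache entirely. Needs aligned I/O and
# filesystem support, so it is guarded here.
dd if=/dev/zero of=/tmp/direct.img bs=1M count=64 oflag=direct 2>/dev/null \
    || echo "O_DIRECT not supported on this filesystem"
```

On the Gluster side, direct I/O behaviour is controlled via mount and volume options (for example the FUSE mount option `direct-io-mode`); as noted in the chat, Cinder's GlusterFS setup does not enable it by default.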
In logs/screen-dstat.txt.gz there seems to be a process called "cinder-backup" that is running around that time. From the details of dstat, I guess that this process causes the reading of files from the Gluster Volume. When the system becomes responsive again, dstat has 11G as "dsk/total read", it jumped from ~400M in 30 minutes. If this amount of data is not flushed but kept in memory, I don't doubt that an OOM could be triggered.
You do or don't doubt? That's confusing :)
Yes, I also suspect the IO traffic generated by the backup tests; that is the only scenario (from a Cinder-GlusterFS perspective) where this much IO is generated. I looked at all the tests openstack Cinder has, and this was the only IO-intensive one.
Given that the backup test was running just before the system went OOM, it seems like a good suspect.
Plan is to disable Backup tests and see if we still hit the OOM issue.
Do you know where this backup is written to? If we can setup a reproducer environment that emulates the workload, things would become much easier for debugging :)
In the current setup, backup source is GlusterFS and target is swift (configured to store its data in /opt/stack/swift/ mounted on 2nd disk)
One of the 1st things I would try, is to configure the kernel to flush dirty memory more aggressively ('sysctl' values for 'vm.*' options). An other thing that may help to identify the memory usage, is capturing the output of 'slabtop --once' in a 1-minute interval.
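A sketch of both suggestions; the sysctl values shown are illustrative, not tuned recommendations, and `sample_slab` is a made-up helper name:

```shell
# Current writeback tuning (readable without root):
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio

# To flush dirty memory more aggressively, lower these (illustrative
# values; needs root, hence commented out):
#   sysctl -w vm.dirty_background_ratio=5
#   sysctl -w vm.dirty_ratio=10

# Capture 'slabtop --once' output at a 1-minute interval to watch for
# kernel-memory growth; $1 is the number of samples to take.
sample_slab() {
    i=0
    while [ "$i" -lt "$1" ]; do
        slabtop --once >> slabtop.log
        sleep 60
        i=$((i + 1))
    done
}
# e.g. sample_slab 120   # two hours of samples
```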
Unfortunately, openstack infra doesn't provide that kind of job control. The idea is to have a uniform env for all storage backends and just change the specific backend one wants to test, so that the runs are comparable against a common base. At least, that's my understanding.
But yes, there are knobs to tweak/tune the kinds of tests you want to run, which is what I plan to use to turn off the backup tests and see if that helps keep the system in a sane state.
Thanks, Niels, for your time and analysis; it really helped.
I'm hoping that with the backup tests disabled we won't hit it; I will update here once the openstack infra folks help us run the CI job manually!
thanx, deepak
Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
-- GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
On 27 February 2015 at 10:48, Justin Clift justin@gluster.org wrote:
Wondering if anyone in the CentOS Community has the time/interest to investigate this CentOS + Rackspace issue?
Regards and best wishes,
Justin Clift
So, going through the issue, it looks like something particular to the Rax instances. I don't have access to those, so I am not sure if they use a custom kernel or something similar. [I am also not clear whether Rax is Xen and HP is a different VM technology.] Any ideas on where I could get access to compare a setup that was causing problems with one that wasn't?
On 27 Feb 2015, at 20:26, Stephen John Smoogen smooge@gmail.com wrote:
On 27 February 2015 at 10:48, Justin Clift justin@gluster.org wrote:
Wondering if anyone in the CentOS Community has the time/interest to investigate this CentOS + Rackspace issue?
So, going through the issue, it looks like something particular to the Rax instances. I don't have access to those, so I am not sure if they use a custom kernel or something similar. [I am also not clear whether Rax is Xen and HP is a different VM technology.] Any ideas on where I could get access to compare a setup that was causing problems with one that wasn't?
Deepak, Stephen is interested in investigating the problems with CentOS + Rackspace.
Are you able to get him the access he needs for investigation? :)
Regards and best wishes,
Justin Clift
-- GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift