[CentOS] High load averages with latest kernel and USB drives?

Tue Nov 17 18:46:37 UTC 2009
Benjamin Smith <lists at benjamindsmith.com>

See comments below... 

On Tuesday 17 November 2009 07:52:01 Todd Denniston wrote:
> Benjamin Smith wrote, On 11/16/2009 10:56 PM:
> > I have a 1TB USB drive plugged into a USB2 port that I use to back up the
> > production drives (which are SCSI). It's working fine, but while doing
> > backups (hourly) the load average on the server shoots up from the normal
> > 0.5 - 1.5 or so up to a high between 10 and 30. Strangely, even though
> > the "load is high" the server is completely responsive, even the USB
> > drives being accessed are!
> >
> > Backup script is really simple, run via cron, pretty much just:
> >
> > #! /bin/sh
> > hour=`date +%k`;
> > pg_dump <options> mydatabase > /media/backups/mydatabase.$hour.pgsql;
> >
> > where /media/backups is the mount point for the USB drive.
> >
> > Using top to diagnose, nothing seems to be particularly high! IoWait
> > seems reasonable (10-30%) and CPUs are 0.5%, Idle is 70-90%. Even
> > accessing the USB partition while the load is "high" is responsive!
> >
> > I'm guessing that something changed in how load average is counted?
> >
> > Server Stats:
> > 	Late model 8-way Xeon, SuperMicro brand.
> > 	CentOS 4.x  / 64 (all updates applied, booted after last kernel update)
> > 	Kernel 2.6.9-89.0.16.ELsmp
> > 	4 GB ECC RAM
> > 	300 GB SCSI HDD.
> > 	Standard Apache/PHP, Postgres 8.4.
> >
> > Any idea how to revert to the old load average tracking behavior short of
> > using a stale and potentially insecure kernel?



> Are you saying that when you were running a previous kernel the same
>  operations with the same devices did not have the high load? 

Correct! 

>  Which
>  specific kernels worked as desired (if someone is going to bisect the
>  problem they need a start point)?

kernel-smp-devel-2.6.9-89.0.15.EL (I always keep my machines updated on at 
least a weekly schedule). 

> Are there other processes on the machine that are waiting to use the db
>  while the dump is occurring? 

No. Database is actually on a different machine and backups are being done over 
the network. 

>  How many postgres processes are waiting for
>  the dump to finish (it has been a while since I ran postgres so I don't
>  recall how it deals with queries during a dump)?

One - the one performing the backup. Postgres uses MVCC so pg_dump doesn't 
block any other connections from continuing/finishing. 

> As workarounds perhaps asking the kernel to schedule in a specific way
>  might help, i.e.: #1 set the backup on a particular set of processors,
> #  replace the pg_dump line above with
> taskset -c 3-4 pg_dump <options> mydatabase > \
> 	/media/backups/mydatabase.$hour.pgsql;

There are 8 cores on the machine, none of which are reporting more than 5% 
load. That's what has me perplexed. When I run top, I see a max of about 30% 
user. Everything else is zero. When I run the backup script to a non-USB 
drive, the load average is completely normal (below 0.50, often below 0.10). 
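
One way to test the guess that the number comes from blocked I/O rather than 
CPU: on Linux the load average counts tasks in uninterruptible sleep (state 
D) as well as runnable ones, so counting D-state processes during a backup 
should show whether the USB writes are what inflate it. A minimal sketch 
only; count_dstate is a name made up for illustration, and the sample lines 
are canned, not captured from this server: 

```shell
#!/bin/sh
# count_dstate is a hypothetical helper name, not part of any package.
# Reads "ps -eo stat,pid,comm" style output on stdin (header line first)
# and prints how many processes are in uninterruptible sleep (state D).
count_dstate() {
    awk 'NR > 1 && $1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Canned sample standing in for real ps output (invented values):
printf '%s\n' \
    'STAT   PID COMMAND' \
    'Ss       1 init' \
    'D     2201 pdflush' \
    'D<    2387 usb-storage' \
    'R+    9001 ps' | count_dstate
# prints 2
```

During a real backup the live equivalent would be 
"ps -eo stat,pid,comm | count_dstate", compared against the first field of 
/proc/loadavg. If the two rise together, the load figure is reflecting 
processes waiting on the USB disk, not CPU contention. 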

> #2 set the usb-storage on a particular set of processors,
> # Note USBSTORPID= line prototyped on CentOS 5 machine not 4.
> USBSTORPID=`ps aux | grep '[u]sb-storage' | head -n 1 | awk '{print $2}'`
> taskset -p -c 3-4 $USBSTORPID
> #you might even go back and reduce the processor list
> #to just 3 or 4 instead of both.

Could you explain to me what this should accomplish? I'm curious as to why you 
went this route... 

> #3 don't update atime
> # (should at worst be a minor thing, and you say that
> # the usb mounted file system is responsive,
> # but perhaps it would help some.)
> mount -oremount,noatime /media/backups/

Already mounted noatime... here's the mount line in the backup script: 
# mount -o rw,noatime -t ext3 /dev/sdc1 /home/backup/localdb/
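
To confirm the option actually took effect rather than trusting the script, 
the live option list can be read from /proc/mounts. A rough sketch; 
has_mount_option is a name invented here, and the sample line is canned to 
mirror the mount command above, not copied from the server: 

```shell
#!/bin/sh
# has_mount_option is a hypothetical helper name.
# Usage: has_mount_option MOUNTPOINT OPTION < /proc/mounts
# Exits 0 if MOUNTPOINT is listed with OPTION, 1 otherwise.
has_mount_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Canned line in /proc/mounts format (device, mount point, type, options):
sample='/dev/sdc1 /home/backup/localdb ext3 rw,noatime 0 0'
if printf '%s\n' "$sample" | has_mount_option /home/backup/localdb noatime; then
    echo "noatime is active"
else
    echo "noatime is NOT active"
fi
```

On the real machine the check would read 
"has_mount_option /home/backup/localdb noatime < /proc/mounts". Note that 
/proc/mounts lists the mount point without a trailing slash. 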

