Hello everyone.
I'm wondering what other people's experiences are WRT systems becoming unresponsive (unable to ssh in, etc) for brief periods of time when a large amount of IO is being performed. It's really starting to cause a problem for us. We're on Dell PowerEdge 1955 blades - but this same issue has caused us problems on PE1950, PE1850, PE1750 servers.
We're running CentOS 4.5 right now. I know CentOS 5 includes ionice and more IO scheduler/elevator selections like deadline, etc. Perhaps that would fix this issue. We're running the latest PERC firmware.
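(For reference, this is roughly how I understand the ionice knob to work on a CentOS 5 / 2.6.18-era kernel - the job name and priority values below are placeholders:)

  # ionice only has an effect under the cfq elevator; run a bulk job
  # at idle priority so interactive IO (sshd and friends) wins:
  ionice -c3 some_bulk_job

  # or keep it best-effort but drop its priority (0 = highest, 7 = lowest):
  ionice -c2 -n7 some_bulk_job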
The specific issue I'm referring to at this point is on a system running mysql. All mysql data files are on a netapp filer but mysql's tmp directory is on local disk. Whenever a lot of temp tables are created (and thus written and deleted from local disk quickly) we can't even log in to the machine - and our monitoring system gets all freaked out and we get lots of pages, etc... FYI this is two disks with hardware raid 1.
Is it just me? Or is this specific to Dell systems, or is this just the state of the Linux kernel these days? Is there some magical patch I can apply to make this issue go away :)
Thanks in advance for any insight into this issue.
Antonio
Antonio Varni wrote:
Hello everyone.
I'm wondering what other people's experiences are WRT systems becoming unresponsive (unable to ssh in, etc) for brief periods of time when a large amount of IO is being performed. It's really starting to cause a problem for us. We're on Dell PowerEdge 1955 blades - but this same issue has caused us problems on PE1950, PE1850, PE1750 servers.
We're running CentOS 4.5 right now. I know CentOS 5 includes ionice and more IO scheduler/elevator selections like deadline, etc. Perhaps that would fix this issue. We're running the latest PERC firmware.
The specific issue I'm referring to at this point is on a system running mysql. All mysql data files are on a netapp filer but mysql's tmp directory is on local disk. Whenever a lot of temp tables are created (and thus written and deleted from local disk quickly) we can't even log in to the machine - and our monitoring system gets all freaked out and we get lots of pages, etc... FYI this is two disks with hardware raid 1.
Is it just me? Or is this specific to Dell systems, or is this just the state of the Linux kernel these days? Is there some magical patch I can apply to make this issue go away :)
Thanks in advance for any insight into this issue.
Yes, IO starvation can occur under heavy load.
Don't put database temp tables on system disks (or data tables for that matter).
How much memory do you have in this box and how big does the temp directory usage get?
The reason I ask is that you could create a tmpfs and have MySQL use that; just make sure you have enough memory that you can spare X (whatever your temp table usage is) for an in-memory filesystem.
You would also notice a dramatic speed increase in MySQL.
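A rough sketch of what that could look like - the 2g size and the mount point are placeholders; measure your real peak first (e.g. watch du -sh on the current tmpdir under load):

  # RAM-backed filesystem for MySQL's temp tables; size it to the
  # measured peak plus some headroom
  mkdir -p /var/lib/mysql-tmp
  mount -t tmpfs -o size=2g,mode=0750 tmpfs /var/lib/mysql-tmp
  chown mysql:mysql /var/lib/mysql-tmp

  # make it permanent in /etc/fstab:
  #   tmpfs  /var/lib/mysql-tmp  tmpfs  size=2g,mode=0750  0 0

  # then point mysqld at it in /etc/my.cnf under [mysqld] and restart:
  #   tmpdir = /var/lib/mysql-tmp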
-Ross
Ross S. W. Walker wrote:
Yes, IO starvation can occur under heavy load.
But it should stall the process needing to write, not everything.
Don't put database temp tables on system disks (or data tables for that matter).
How much memory do you have in this box and how big does the temp directory usage get?
The reason I ask is that you could create a tmpfs and have MySQL use that; just make sure you have enough memory that you can spare X (whatever your temp table usage is) for an in-memory filesystem.
You would also notice a dramatic speed increase in MySQL.
I'm not sure the needed temp table space is predictable. I've seen mysql be pretty dumb about how it does a select that joins several tables.
Les Mikesell wrote:
Ross S. W. Walker wrote:
Yes, IO starvation can occur under heavy load.
But it should stall the process needing to write, not everything.
There is only 1 disk though and if that disk is busy writing it can't read.
It would be nice if disk manufacturers made full duplex disks with 1 set of heads for writing and another set of heads for reading, but they don't :-(
Don't put database temp tables on system disks (or data tables for that matter).
How much memory do you have in this box and how big does the temp directory usage get?
The reason I ask is that you could create a tmpfs and have MySQL use that; just make sure you have enough memory that you can spare X (whatever your temp table usage is) for an in-memory filesystem.
You would also notice a dramatic speed increase in MySQL.
I'm not sure the needed temp table space is predictable. I've seen mysql be pretty dumb about how it does a select that joins several tables.
True, but take an average and that should suffice; MySQL will wait if temp space fills up.
-Ross
Ross S. W. Walker wrote:
Ross S. W. Walker wrote:
Yes, IO starvation can occur under heavy load.
But it should stall the process needing to write, not everything.
There is only 1 disk though and if that disk is busy writing it can't read.
But if it weren't for the accumulation of stuff in the raid card queue, other processes should get an equal shot fairly quickly. I think that raid card just treats draining the whole queue as one operation.
You would also notice a dramatic speed increase in MySQL.
I'm not sure the needed temp table space is predictable. I've seen mysql be pretty dumb about how it does a select that joins several tables.
True, but take an average and that should suffice; MySQL will wait if temp space fills up.
What does that mean if it needs more than the available space to complete a single operation - like a multi-table join that decides to copy the whole tables to temp files?
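For what it's worth, you can at least measure how often that happens and spot the worst offenders ahead of time; the join below is a made-up example:

  # how often mysqld has spilled implicit temp tables to disk vs. memory
  mysql -e "SHOW STATUS LIKE 'Created_tmp%'"

  # EXPLAIN a suspect query: "Using temporary" in the Extra column means
  # it materializes a temp table (on disk once it outgrows tmp_table_size)
  mysql -e "EXPLAIN SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id GROUP BY c.name"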
----- Original Message -----
From: "Antonio Varni" avarni@estalea.com
To: centos@centos.org
Sent: Friday, November 16, 2007 9:06:52 AM (GMT+1000) Australia/Brisbane
Subject: [CentOS] IO causing major performance issues
Hello everyone.
I'm wondering what other people's experiences are WRT systems becoming unresponsive (unable to ssh in, etc) for brief periods of time when a large amount of IO is being performed. It's really starting to cause a problem for us. We're on Dell PowerEdge 1955 blades - but this same issue has caused us problems on PE1950, PE1850, PE1750 servers.
We're running CentOS 4.5 right now. I know CentOS 5 includes ionice and more IO scheduler/elevator selections like deadline, etc. Perhaps that would fix this issue. We're running the latest PERC firmware.
The specific issue I'm referring to at this point is on a system running mysql. All mysql data files are on a netapp filer but mysql's tmp directory is on local disk. Whenever a lot of temp tables are created (and thus written and deleted from local disk quickly) we can't even log in to the machine - and our monitoring system gets all freaked out and we get lots of pages, etc... FYI this is two disks with hardware raid 1.
Is it just me? Or is this specific to Dell systems, or is this just the state of the Linux kernel these days? Is there some magical patch I can apply to make this issue go away :)
Thanks in advance for any insight into this issue.
Antonio
redhat@mckerrs.net wrote:
Antonio Varni wrote:
Hello everyone.
I'm wondering what other people's experiences are WRT systems becoming unresponsive (unable to ssh in, etc) for brief periods of time when a large amount of IO is being performed. It's really starting to cause a problem for us. We're on Dell PowerEdge 1955 blades - but this same issue has caused us problems on PE1950, PE1850, PE1750 servers.
We're running CentOS 4.5 right now. I know CentOS 5 includes ionice and more IO scheduler/elevator selections like deadline, etc. Perhaps that would fix this issue. We're running the latest PERC firmware.
The specific issue I'm referring to at this point is on a system running mysql. All mysql data files are on a netapp filer but mysql's tmp directory is on local disk. Whenever a lot of temp tables are created (and thus written and deleted from local disk quickly) we can't even log in to the machine - and our monitoring system gets all freaked out and we get lots of pages, etc... FYI this is two disks with hardware raid 1.
Is it just me? Or is this specific to Dell systems, or is this just the state of the Linux kernel these days? Is there some magical patch I can apply to make this issue go away :)
Thanks in advance for any insight into this issue.
Antonio
I have noticed similar behaviour on all sorts of Linuxes (in particular, ssh into the box is really slow when it's doing IO) and wondered why, but never really thought about investigating any further.
Unfortunately, I do a lot of work with Solaris, and the funny thing is that I have *never* seen a Solaris kernel exhibit this sort of behaviour. Even when it is installed on normal IDE/SATA disks. And, in fact, even when installed on the exact same hardware.
Now I'm curious... especially given that I'm right in the middle of pushing to get rid of Solaris in favour of RHEL.
It really depends what the system is doing, what services you are running and how you have it configured.
You had Solaris installed, what services was it running?
You had Linux installed, what services was it running?
Database temp tables and logs can generate an enormous amount of IO, which can swamp the filesystems of any system.
I have seen it on Windows and Linux, so I don't see why Solaris would be any different.
You could always try a different scheduler to see if that helps, for instance if you are using 'cfq' try 'deadline'.
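On a CentOS 5 / 2.6.18-era kernel that's a runtime switch per disk; CentOS 4's 2.6.9 kernel may only let you set it at boot. sda below is a placeholder:

  # show the available elevators; the bracketed one is active
  cat /sys/block/sda/queue/scheduler
  # noop anticipatory deadline [cfq]

  # switch this disk to deadline on the fly
  echo deadline > /sys/block/sda/queue/scheduler

  # or set it for all disks at boot: append to the kernel line
  # in /boot/grub/grub.conf:
  #   elevator=deadline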
-Ross
Of course IO can swamp the filesystem. My point is that the kernel should at least give enough time slices to the other processes (like sshd) so we can still log in. It's not asking a lot of the kernel - just letting us log in via ssh, really.
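One speculative thing to test for exactly this symptom (it hasn't come up in this thread, so treat it as a guess): on 2.6 kernels a big burst of dirty-page writeback can starve everything else, and lowering the writeback thresholds makes flushing start earlier and in smaller chunks:

  # current thresholds, as a percentage of RAM
  cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio

  # flush earlier and in smaller bursts; 10 and 5 are starting points
  # to tune, not gospel
  echo 10 > /proc/sys/vm/dirty_ratio
  echo 5 > /proc/sys/vm/dirty_background_ratio

  # persist across reboots in /etc/sysctl.conf:
  #   vm.dirty_ratio = 10
  #   vm.dirty_background_ratio = 5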
redhat@mckerrs.net wrote:
The specific issue I'm referring to at this point is on a system running mysql. All mysql data files are on a netapp filer but mysql's tmp directory is on local disk. Whenever a lot of temp tables are created (and thus written and deleted from local disk quickly) we can't even log in to the machine - and our monitoring system gets all freaked out and we get lots of pages, etc... FYI this is two disks with hardware raid 1.
Is it just me? Or is this specific to Dell systems, or is this just the state of the Linux kernel these days? Is there some magical patch I can apply to make this issue go away :)
Does the Dell have a raid controller? I saw something like this long ago on a Dell with a raid card that appeared to queue up thousands of operations, then hit some kind of high-water mark and stayed busy (basically locking the system) for several minutes while it caught up. It seemed pretty fast as long as you never completely filled its queue... These days I mostly run software raid1.