Hello,
We have a long-running problem with our NetApp filers. When we connect a server to the filer, sequential read performance is ~70MB/s. But once we run a database on the server, sequential read performance drops to ~11MB/s. That is happening with two servers; one is running Oracle, the other MySQL. During the speed tests the database load is very light (less than 1MB/s of reads and writes). During the tests the NetApp was serving only these two servers.
What could be causing such a slowdown and the overall slow read performance? Writes are fast, >100MB/s.
NetApp support is claiming that such performance is normal. Somehow I do not believe that a 2007 model should deliver such 20th-century performance levels. :)
RHEL4, fully updated, 64-bit, connected to a NetApp FAS3040 via SAN using QLogic 2Gb/s FC adapters. We tried the drivers from both Red Hat and QLogic (for NetApp). Changing FC parameters (such as queue depth) did not help either.
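For anyone wanting to reproduce the kind of sequential-read test described above, here is a minimal sketch in Python. It reads a scratch file it creates itself, so as written it measures the local filesystem and page cache rather than the filer; pointing `path` at a file on the NetApp LUN (an assumption about your layout) would make it a real test.

```python
import os
import tempfile
import time

# Create 16 MiB of scratch data; for a real test, point `path` at a
# file (or the block device) on the NetApp LUN instead.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))

# Time sequential 1 MiB reads through the file, dd-style.
t0 = time.monotonic()
total = 0
with open(path, "rb") as f:
    while chunk := f.read(1024 * 1024):
        total += len(chunk)
elapsed = time.monotonic() - t0

print(f"read {total} bytes at {total / elapsed / 1e6:.1f} MB/s")
os.remove(path)
```

On a freshly written file this mostly measures the page cache; dropping caches (or reading a device directly) is needed to see the storage itself.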
Regards,
Mindaugas
Mindaugas Riauba wrote:
NetApp support is claiming that such performance is normal. Somehow I do not believe that a 2007 model should deliver such 20th-century performance levels. :)
How many disks, what RPM are they running at, what I/O block size is being used and what protocol (NFS/iSCSI/FC) is being used?
Checking an Oracle DB I used to run, it averaged a 7kB I/O size with spikes to 59kB. For a 10k RPM disk, a 7kB I/O size means roughly 800 kBytes/second per spindle before latency starts to become an issue, depending on the controller type. Really high-end controllers can go up to about 1,312 kB/s instead of 800 kB/s. The array reports Oracle using an average of 985 kBytes/second with spikes to 28 MBytes/second.
A MySQL DB I used to run averaged a 41kB I/O size with spikes up to 333kB. For a 10k RPM disk, a 41kB I/O size works out to roughly 4,500 kBytes/second. The array reports MySQL using an average of 3,200 kBytes/second with spikes to 34.1 MBytes/second.
The array throughput numbers include benefit from the disk cache, while the raw spindle performance assumes no cache (or "worst case" performance).
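The spindle arithmetic above can be sketched in a few lines. The ~115 IOPS figure for a 10k RPM spindle is my assumption, back-derived from the 800 kB/s at a 7kB I/O size quoted above; it is not a number from the thread itself.

```python
# Back-of-the-envelope single-spindle throughput: throughput = avg I/O size * IOPS.
# The 115 IOPS default is an assumed figure for a 10k RPM disk, chosen to
# match the 800 kB/s-at-7kB example quoted in the thread.
def spindle_throughput_kb(io_size_kb: float, iops: float = 115) -> float:
    """Sustainable throughput of one spindle in kB/s, ignoring cache."""
    return io_size_kb * iops

print(spindle_throughput_kb(7))   # ~805 kB/s, matching the Oracle example
print(spindle_throughput_kb(41))  # ~4715 kB/s, close to the MySQL example
```

The point of the arithmetic: at small I/O sizes, per-spindle throughput is IOPS-bound, so ~11MB/s across a handful of spindles is not implausible.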
Both of those are connected via Fibre Channel, so performance will be quite a bit higher than that of NFS or iSCSI.
So the numbers you're seeing could be perfectly reasonable, as NetApp suggests, depending on the exact workload and your array configuration. For the workload you should look to the array for statistics; I'm not too familiar with NetApp arrays, but I assume they offer a wide range of statistics. Hopefully I/O size is among them, as that is the most critical factor in determining throughput.
The array running the above databases is a system with 40 10k RPM disks and the data evenly distributed across all spindles for maximum performance. The array also hosts about 25 other systems.
NetApps certainly aren't the fastest thing in the west but given your performance levels it sounds like you don't have many disks connected and are limited by the disks rather than the controller(s).
Most low end arrays don't offer the level of visibility that the enterprise ones do.
On that note I'm getting a new 150TB array in today, pretty excited about that. 3PAR T400 virtualized storage system.
nate
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Mindaugas Riauba
Sent: Thursday, November 06, 2008 3:45 AM
To: nahant-list@redhat.com; CentOS@centos.org
Subject: [CentOS] Painfully slow NetApp with databas
We have a long-running problem with our NetApp filers. When we connect a server to the filer, sequential read performance is ~70MB/s. But once we run a database on the server, sequential read performance drops to ~11MB/s.
There are many factors which impact database I/O performance. You are comparing two probably dissimilar I/O paths. Configuration differences (disk, controller, connection, kernel parameters) may be part of the problem.
Physical vs. logical sequence, i.e. disk head movement, is particularly significant for sequential access. Throughput increases of 200% to 500% are not uncommon after packing (dump/restore) a large table.
The database translates and, in theory, optimizes SQL statements, generating query plans. These should be examined to determine whether they are doing what is expected.
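As a minimal illustration of examining a query plan, here is a sketch using SQLite from the Python standard library; SQLite stands in for whatever EXPLAIN facility Oracle or MySQL provides, and the table and index names are made up for the example.

```python
import sqlite3

# Hypothetical schema, purely for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, total REAL)")
con.execute("CREATE INDEX idx_customer ON orders (customer)")

# EXPLAIN QUERY PLAN shows whether the planner uses the index or falls
# back to a full table scan -- the kind of check suggested above.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer = ?", (42,)
).fetchall()
for row in plan:
    print(row)  # the detail column should mention idx_customer
```

A plan that scans the whole table when an index lookup was expected is exactly the kind of surprise this check is meant to catch.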
That is happening with two servers; one is running Oracle, the other MySQL. During the speed tests the database load is very light (less than 1MB/s of reads and writes). During the tests the NetApp was serving only these two servers.
What could be causing such a slowdown and the overall slow read performance? Writes are fast, >100MB/s.
Database performance data is another good source of diagnostic info. It can help pinpoint query plan problems.
This article may help you put all the pieces together. It is one of the best I have found:
http://www.miracleas.com/BAARF/oow2000_same.pdf
The author is Juan Loaiza from Oracle.
Hope some of this helps.
david
On Thu, 2008-11-06 at 16:32 -0500, David Ross wrote:
<snip>
There are many factors which impact database I/O performance. You are comparing two probably dissimilar I/O paths. Configuration differences (disk, controller, connection, kernel parameters) may be part of the problem.
Physical vs. logical sequence, i.e. disk head movement, is particularly significant for sequential access. Throughput increases of 200% to 500% are not uncommon after packing (dump/restore) a large table.
The database translates and, in theory, optimizes SQL statements, generating query plans. These should be examined to determine whether they are doing what is expected.
That is happening with two servers; one is running Oracle, the other MySQL. During the speed tests the database load is very light (less than 1MB/s of reads and writes). During the tests the NetApp was serving only these two servers.
What could be causing such a slowdown and the overall slow read performance? Writes are fast, >100MB/s.
Database performance data is another good source of diagnostic info. It can help pinpoint query plan problems.
This article may help you put all the pieces together. It is one of the best I have found:
http://www.miracleas.com/BAARF/oow2000_same.pdf
The author is Juan Loaiza from Oracle.
Hope some of this helps.
I don't know if what I'm about to add is too dated, but here goes. I'm sure much of it still applies, because the basic issues and concepts in play remain unchanged.
What I'm unsure of is whether any of this is related to the problem being seen.
Back in the early days of SQL databases, a lot of folks had no understanding of the logical/physical interactions. I helped many folks improve performance (during my contracting/consulting days) just by examining these two components in conjunction with access patterns.
The critical components were the most frequent access/update patterns versus the physical organization of the primary table(s) and their _primary_ indexes, or hashes. Do DBMSs still use those things? Regardless, even with hash tables, these relationships are still important. So was the selection of a proper hashing algorithm.
Further, keeping the underlying DBs' physical file organization in an optimal sequence for the "critical" access profiles was often overlooked. As updates are done, especially adds/deletes of entries, the physical organization quickly deteriorates with regard to the "critical" access profile(s). It used to be that "holes" would be left in the primary data space, resulting in constantly growing files. IIRC, most major DBMSs have addressed this by tracking and re-using these areas. However, that tends to further "corrode" the desired physical sequence. Over time, the effect may be very noticeable.
On a very active, heavily loaded DB, this can occur quite quickly. The immediate effects could be ameliorated a bit by having sufficient "overflow" areas - "empty spaces" at the end that received new records when using indexes or hash tables. During reorganization the overflow contents would be merged into the appropriate areas and the "overflow" was again pristine.
Periodic "reorganization", a.k.a. "reload", can be used to keep performance within acceptable limits.
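The hole-punching and reorganization cycle described above can be demonstrated with SQLite from the Python standard library, where VACUUM plays the role of the periodic "reload"; the table and sizes here are invented for the sketch.

```python
import os
import sqlite3
import tempfile

# Build a scratch database with ~10 MB of rows.
path = tempfile.mkstemp(suffix=".db")[1]
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)", ((i, "x" * 500) for i in range(20000))
)
con.commit()

# Deleting every other row punches "holes" in the data space; the file
# does not shrink, the freed pages just go on a free list.
con.execute("DELETE FROM t WHERE id % 2 = 0")
con.commit()
before = os.path.getsize(path)

# VACUUM rebuilds the file, packing the surviving rows back into
# contiguous, key-ordered pages - the "reorganization" step.
con.execute("VACUUM")
after = os.path.getsize(path)

print(before, after)  # the file shrinks after the rebuild
con.close()
os.remove(path)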
If the problem system has not been properly organized for the activity profiles encountered, or if it has had lots of updates without being "reorganized", that may be causing the problem being seen.
In later implementations, these ill effects are somewhat reduced by transaction queues that attempt to optimize the sequence of activities. However, unless the DBMS is running on a "raw" partition, where it knows the physical location of items, this may not be as effective as possible.
Further, with the advent of these added "in memory" operations, there is a risk of "thrashing" the VM. With heavy use of buffering, caching, etc., even the fantastic speed of today's equipment may suffer enough slowdown to be noticeable, apart from the physical disk activity. This is exacerbated by the physical organization issues mentioned above.
I can't say if any of this is your problem or not. But barring discovery of any system/physical problems, I suspect this might be useful.
On a last note, I'll give an example of potential gains. Keep in mind this was long ago and far away (in my mind) on much slower equipment. The application was written in C (a fair amount by myself), so I know the code was fast. Speed has always been my thing. A fellow with relatively little experience was in charge of the database(s).
My inquiry application, fairly complex but code-efficient as far as I knew, was taking over two minutes (about 2:30 IIRC) to fill a screen (a text screen at that). The other _much_ simpler applications ran much faster. But those applications were "peripheral" to the main task, for which I was responsible.
I entered into discussion with the database person, educated him a bit on what I thought was happening, and suggested some changes based on the items discussed above. He made the changes and we ran some tests.
All the "peripheral" applications ran just as fast or faster.
Over the next several days of hitting it every way we could think of, as hard as we could, the _maximum_ wait was 9 seconds. The median came in at 6-7 seconds.
I don't know what capabilities are provided to administrators these days, but if tools that give sufficient information and control are available, maybe there is a (partial?) solution hidden in there.
david
<snip sig stuff>
HTH
The critical components were the most frequent access/update patterns versus the physical organization of the primary table(s) and their _primary_ indexes, or hashes. Do DBMSs still use those things?
Primary indexes have a lot more to do with I/O than most people think. You really have to understand how a database physically stores the rows and columns. The biggest index cost is clustered indexes. Among other things, those types of indexes take longer to maintain. Why? Because the database reorders all of the data to match the index order, from the first key to the last. Regular (non-clustered) indexes do not do so. Heap tables, so to speak, just slap the data down and don't care what order it is in. Foreign key relations add to this as well.
This can be verified by digging deep into your DB documentation. Then you can actually execute queries to pull the raw data pages from the database.
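As a concrete illustration of how index choice changes what a query physically touches, here is a hedged sketch using SQLite from the Python standard library (its WITHOUT ROWID tables are its form of clustered, key-ordered storage; the schema here is invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A plain (heap-like) table plus a two-column index.
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
con.execute("CREATE INDEX idx_ab ON t (a, b)")

# Because idx_ab contains every column the query needs, SQLite can answer
# it from the index b-tree alone, never reading the table's pages.
plan = con.execute("EXPLAIN QUERY PLAN SELECT b FROM t WHERE a = 1").fetchall()
detail = " ".join(str(r[-1]) for r in plan)
print(detail)  # mentions "COVERING INDEX": the table itself is not touched
```

The same idea explains the clustered-index trade-off above: keeping data in index order costs on writes, but queries that follow that order touch far fewer pages.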
Bill, you're the first one I have heard say this in some time.
Among other things, is the DB running in memory? Is the DB making temp tables in memory and then dropping them?