I did something stupid when setting up a server which is now live. I think I'm stuck.
I set up swap on RAID1, which is decidedly non-optimal. I'd prefer two independent swap partitions at the same priority.
Here is the setup:
/dev/sda1 + /dev/sdb1 -> md0 (raid1)
md0 -> VolGroup00
VolGroup00 -> VG00LV00(/), VG00LV01(/home), VG00LV02(swap)
I've gotten as far as:
Turning off swap and removing LV02.
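For the record, that step amounted to something like the following (reconstructing from the LV names above; the exact device path is my assumption):

=====
# disable the swap LV, then remove it from the volume group
swapoff /dev/VolGroup00/VG00LV02
lvremove /dev/VolGroup00/VG00LV02
=====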
Next step would be to shrink VG00. But it looks like you can only shrink a volume group by removing entire PVs. I only have 1 PV.
So I think I'm stuck.
Any ideas?
Thanks, Steve
On Tue, Sep 12, 2006 at 02:41:33PM -0500, Steve Bergman enlightened us:
I did something stupid when setting up a server which is now live. I think I'm stuck.
I set up swap on RAID1, which is decidedly non-optimal. I'd prefer two independent swap partitions at the same priority.
Here is the setup:
/dev/sda1 + /dev/sdb1 -> md0 (raid1)
md0 -> VolGroup00
VolGroup00 -> VG00LV00(/), VG00LV01(/home), VG00LV02(swap)
I've gotten as far as:
Turning off swap and removing LV02.
Next step would be to shrink VG00. But it looks like you can only shrink a volume group by removing entire PVs. I only have 1 PV.
So I think I'm stuck.
Any ideas?
Yes, leave it as a mirrored partition, so that if a disk holding swapped-out memory dies, your server does not panic.
Matt
On Tue, 2006-09-12 at 15:48 -0400, Matt Hyclak wrote:
Yes, leave it as a mirrored partition, so that if a disk holding swapped-out memory dies, your server does not panic.
Aha! It's been a while since I set this up and now that you mention it, I think I did consider that at the time and set things up the way I did for that reason. Stupid being a conserved quantity, I'm simply being stupid now rather than back then.
Seriously, Thanks!
You're right.
-Steve
On Tue, 2006-09-12 at 15:48 -0400, Matt Hyclak wrote:
Yes, leave it as a mirrored partition, so that if a disk holding swapped-out memory dies, your server does not panic.
Yep, that's pretty sensible, especially if you have fast disks. With older, slower disks, and relatively little memory it becomes a trade-off that is worth consideration.
-- Daniel
On Tue, 2006-09-12 at 22:22 +0200, Daniel de Kok wrote:
Yes, leave it as a mirrored partition, so that if a disk holding swapped-out memory dies, your server does not panic.
Yep, that's pretty sensible, especially if you have fast disks. With older, slower disks, and relatively little memory it becomes a trade-off that is worth consideration.
If the disks are SCSI, the writes happen pretty much in parallel, and you only read one copy when you read. It should only slow you down if you are using two IDE drives on the same controller for the mirror, in which case you have to wait for one write to complete before the other starts.
On Tue, 2006-09-12 at 15:25 -0500, Les Mikesell wrote:
If the disks are SCSI, the writes happen pretty much in parallel, and you only read one copy when you read. It should only slow you down if you are using two IDE drives on the same controller for the mirror, in which case you have to wait for one write to complete before the other starts.
They are SATA. My original thinking today was that performance would be better if I had two swap partitions (/dev/sda2 and /dev/sdb2) set at the same priority, so that *different* pages could be swapped simultaneously to/from the two drives. But as Matt pointed out, that would be less robust.
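What I had in mind would have been just two lines in /etc/fstab, something like this (a sketch; the partition names are the hypothetical ones above):

=====
# two swap areas at equal priority; the kernel interleaves
# pages across devices that share the same pri value
/dev/sda2   swap   swap   pri=1   0 0
/dev/sdb2   swap   swap   pri=1   0 0
=====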
-Steve
Steve Bergman wrote:
On Tue, 2006-09-12 at 15:25 -0500, Les Mikesell wrote:
If the disks are SCSI, the writes happen pretty much in parallel, and you only read one copy when you read. It should only slow you down if you are using two IDE drives on the same controller for the mirror, in which case you have to wait for one write to complete before the other starts.
They are SATA. My original thinking today was that performance would be better if I had two swap partitions (/dev/sda2 and /dev/sdb2) set at the same priority, so that *different* pages could be swapped simultaneously to/from the two drives. But as Matt pointed out, that would be less robust.
The best performance is if you don't swap at all. Avoid swapping; don't rely on "fast swap". There's no such thing as "fast swap" ;-)
Anyhow, theoretically, when you read from RAID1 you read different pages from different drives in parallel. That's why RAID1 theoretically has twice the read speed of a single drive. Writing to RAID1 is theoretically the same speed as writing to a single drive.

Of course, as Les pointed out, it all depends on the actual hardware. If the hardware isn't capable of doing reads/writes in parallel (for example, two IDE drives on the same controller), you get slower performance. Historically, SCSI with its command queuing was very good at doing things in parallel to several devices (even though it was a chain-type bus), hence its popularity in high-performance applications. Some (not all) SATA devices come with support for queuing (they call it "native command queuing"), however I've no idea if Linux supports it or not.
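A quick way to sanity-check whether your hardware can do parallel I/O at all is to read from both raw disks at once, bypassing md entirely. Something like this rough sketch (it only reads, so it's safe on a live array, though it will hammer the disks while it runs):

=====
# read 512MB from each raw disk simultaneously; if the bus/controller
# handles parallel I/O, each should finish in roughly the single-drive time
time dd if=/dev/sda of=/dev/null bs=1M count=512 &
time dd if=/dev/sdb of=/dev/null bs=1M count=512 &
wait
=====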
On Wed, 2006-09-13 at 06:23 -0500, Aleksandar Milivojevic wrote:
The best performance is if you don't swap at all. Avoid swapping; don't rely on "fast swap". There's no such thing as "fast swap" ;-)
Anyhow, theoretically, when you read from RAID1 you read different pages from different drives in parallel. That's why RAID1 theoretically has twice the read speed of a single drive. Writing to RAID1 is theoretically the same speed as writing to a single drive.
Hi Aleksandar,
Yes, you are right. I had forgotten about that. Just having a blonde day yesterday, I guess.
Since swapin performance is more important than swapout, I really do have the best of both worlds the way things are set up.
When it comes to swap, I'm a big believer that swap is a good thing. I've been reviewing this discussion on lkml:
http://kerneltrap.org/node/3000/13875
My thinking is squarely in the Andrew Morton camp:
------
"""Swapout is good. It frees up unused memory. I run my desktop machines at swappiness=100."""
and
"""My point is that decreasing the tendency of the kernel to swap stuff out is wrong. You really don't want hundreds of megabytes of BloatyApp's untouched memory floating about in the machine. Get it out on the disk, use the memory for something useful."""
------
Though if the maximum latency must be kept below a certain value, I can see where one might want to completely avoid swap.
Of course, if a machine has *so* much memory that it *can't* use all of it (i.e., the memory used by apps plus the total amount of data read from disk is <= physical memory), that would indeed be optimal.
What I have is:
2 Pentium 4 Xeons, 3.2GHz, Hyperthreaded, 2MB L2 per processor
4GB physical memory
2 250GB SATA drives on separate SATA channels
running 40 Gnome desktops + 100 instances of a character-based point-of-sale and accounting app, plus a Samba file server, a lightly loaded database server, and a lightly loaded intranet web server/Ruby on Rails app server.
Typically, I have about 300MB in swap with very little swapin occurring, and about 700MB cache and 50MB buffers.
One employee has told me that a major incentive for other employees to switch from their Windows desktops to a Linux desktop (via XDMCP) is the greatly improved speed, so I'm interested in keeping things optimal as the load increases.
I'm considering running the swappiness value higher. But I do start seeing some significant swapin at swappiness=100. (Typically 0-50 pages/sec averaged over 10 minutes, according to sysstat/sar.)
That was what got me to thinking about how I had set up the swap.
However, even at 50 p/s, that only represents about 200kB/s of I/O, which seems pretty trivial.
I'm going to try a full day at swappiness=100 today and see how things look. I'll be NX'd into the machine myself doing some development work, which is about the best metric I can come up with. It's so very hard to come up with good solid metrics when it comes to this kind of tuning.
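For the record, the experiment itself is just this (the sysctl reverts at reboot unless it also goes into /etc/sysctl.conf):

=====
cat /proc/sys/vm/swappiness       # check the current value
sysctl -w vm.swappiness=100       # raise it for the running kernel
sar -B 600                        # paging stats, one report per 10 minutes
=====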
-Steve
Quoting Steve Bergman steve@rueb.com:
When it comes to swap, I'm a big believer that swap is a good thing.
Yes, swap is a good thing, as long as it is used only to swap out never or almost-never used data. On the other hand, if an app needs 1 gig of memory, and it really crunches on that data, relying on swap to give it 1 gig of space is a bad idea. It'll kill the machine.
On Wed, 2006-09-13 at 10:43 -0500, Aleksandar Milivojevic wrote:
Quoting Steve Bergman steve@rueb.com:
When it comes to swap, I'm a big believer that swap is a good thing.
Yes, swap is a good thing, as long as it is used only to swap out never or almost-never used data. On the other hand, if an app needs 1 gig of memory, and it really crunches on that data, relying on swap to give it 1 gig of space is a bad idea. It'll kill the machine.
Yes. But we should all keep in mind that swap was originally not for performance, but to allow (e.g.) PCs running UNIX System III or V (and other *IXs as they appeared) to support multiple users (usually dedicated to one or two apps) when hardware was very expensive. A "reasonably configured" node might have a 286/386/486/586 CPU, a 100MB disk or two, and 640K of RAM (later 1MB). Depending on the time frame, maybe 4 or 5 thousand dollars? Add a couple of serial I/O cards and some printers and terminals (Wyse 50/55/60, e.g.) and you had "state of the art" capability. This was about the time that 4GLs were coming into vogue and relational databases with SQL interfaces were just starting to supplant proprietary stuff. "Canned" business applications were beginning to make their appearance.
As the cost structure and technological capabilities changed, the utility of "swap" (in terms of dollars) decreased but the people using it did not see it that way. If one wants to promote *IX (any flavor) across the widest possible potential user base, then one must continue to support swap for those whose $s matter more than latency. But the tuning ability is needed for those to whom latency is more important.
The problem in the current discussions, IMO, is that the tunability of the original *IX (rife with high and low water marks for cache, swap activity, text, data, I/O of various types, ...) could satisfy the needs of admins willing to learn it, but was considered too "complex" by the community at large (that is, those who paid for it all - business) because they had a hard time finding the expertise and paying the price demanded by that expertise. Plus, it needed real data (performance statistics) to be useful. This meant no instant decisions *or* a high risk that several iterations would be needed to get it right. This meant even more cost. If the platform was changed, the cycle repeated.
That complexity has been replaced by (apparent) simplicity and so now people gripe because it is not skewed to their personal needs and they have insufficient tools to bias their system in the direction needed.
So you have those saying "...interactivity should never be sacrificed..." and those saying "... but look what you gained that you didn't notice". Both are right in their environments.
Whenever you have highly technical skilled people working in environments defined by the bean-counters, you'll hit situations like this. Iteratively.
For that reason, I tend to both ignore and discount all such discussions. Cost structure is such now that those who desire minimal latency should have it with zero-swap (if they are satisfied with the risk) and those more concerned with hardware costs and reduced risk should have the swap they want.
But in no case can we ever go back to the complexity that used to provide the almost infinite tunability that allowed all needs to be satisfied (reasonably so) because the folks controlling the money would not tolerate it. The needed expertise would be both missing and too costly in the view of business (they'll go offshore until the salaries rise sufficiently there too).
A better solution potentially lies in the proper application of "pre-configurations" that exist already: "workstation", "server", ... The problem is that they are not "tuned" sufficiently in the direction needed for the implied use of the particular "pre-configuration". And if they were more aggressively tuned, some customers would be dissatisfied because it was too aggressive while others thought it not aggressive enough.
And others would just plain apply the wrong tool to the job (use a "server" setup for a workstation, ...).
IMHO -- Bill
On Wed, 2006-09-13 at 13:06 -0400, William L. Maltby wrote:
The problem in the current discussions, IMO, is that the tunability of the original *IX (rife with high and low water marks for cache, swap activity, text, data, I/O of various types, ...) could satisfy the needs of admins willing to learn it, but was considered too "complex" by the community at large (that is, those who paid for it all - business) because they had a hard time finding the expertise and paying the price demanded by that expertise.
I've always been rather skeptical of the claim that smart admins could analyze their "workload" and tune the system based upon their keen understanding of what the appropriate watermarks should be.
What is the "workload" for a multi-user desktop server? What is the proper high watermark for a system when it's 12:30pm CST and everyone is at lunch, except for Charlie, who decided to stay late to run his daily reports. Or maybe Charlie was sick that day, and Angela ran them, right at 5pm. (Oops! I almost forgot!)
Of course, Veronica always does End Of Day at 5PM. And she complains that something was wrong with the system "yesterday".
Hank and his crew usually come in at 8. But he was under a deadline and had his guys working on reports in OpenOffice from 7am on so that his department could meet the deadline at noon. (Tony prefers Gnumeric. Tina refuses to work in anything but Microsoft Word via X-Over Office. And yes, I recommended against that but the General Manager over-ruled me.)
It's not that admins aren't smart enough, these days.
It's that it's just plain silly to think that a human being could tune for these things.
There is no such thing as a "workload" to be tuned for. Every time I see that word, I have to laugh. Because it doesn't exist.
Perhaps on a large enough system, an admin can reasonably treat a workload as a statistical entity, ala thermodynamics.
But CS equations are never going to be as neat as thermodynamic ones. So it just means that when the hammer falls, it's just going to be that much more impossible to deal with.
The system really needs to tune itself dynamically.
I know that you are saying that we can't go back to the days of manual tuning. And I agree. But for different reasons, I think.
It's not that admins aren't smart enough, these days.
It's that they never were...
-Steve
On Wed, 2006-09-13 at 19:27 -0500, Steve Bergman wrote:
On Wed, 2006-09-13 at 13:06 -0400, William L. Maltby wrote:
<snip some good personal opinions>
It's not that admins aren't smart enough, these days.
It's that it's just plain silly to think that a human being could tune for these things.
There is no such thing as a "workload" to be tuned for. Every time I see that word, I have to laugh. Because it doesn't exist.
As to "workload", I respectfully disagree, based on the below. I am glad that you can enjoy the laugh. There's not enough in the world.
I could raise the same sort of objections for automotive "tuning" that you raise for OS tuning. You may respond that there are tangibles that can be measured. And so it is in a computer. And as in a computer system, precision is lacking in how/when those variables are applied. If a car runs different circuits, runs on days when ambient conditions vary, track conditions change during the race, multiple drivers (ala ALMS), ...
Yet, folks successfully tune automobiles today and have done so for over a century. But not every "mechanic" is capable of producing the desired results alone, although the mechanic may be capable of rebuilding the car totally.
And the one who can "tune" the car may be very poor at rebuilding it. Or even of applying the tuning principles by turning a wrench.
As with anything that has variable conditions and/or intangibles that must be considered (such as necessarily ill-defined workload traits) or has imprecise available or predictive metrics (like incomplete definition of every possible performance related activity, load and timing), the problem is not finding a solution.
The task is to properly formulate a problem that is solvable and applies to the intended environment. Think "subsets". Of all possible problems, of all possible solutions, of available expertise, of available manpower, of available money, ...
Although their task was relatively simple (essentially, they defined and solved their problem in isolation), that is what folks who developed the vm sub-system did. They achieved "success" only because no one else can find a better "problem definition" that allows solution for an audience any broader than the current audience *unless* a higher level of required expertise and expense is to be tolerated (not likely in this cheap-ass-only world we now inhabit). Better results could be achieved for a well-defined subset of that audience. And *that* is why the pointless debates continue. Each speaks to a "local" environment.
I did tuning successfully for many years. How do I know I was successful? Because folks kept paying me money, based on word-of-mouth, to come and help them make their system "run better". Was almost always able to do so. But in some cases I had to suggest upgrades instead because, after typical interviews and analysis, it was obvious their system was under-powered for the load vs. performance vs. time-frame they desired.
There was a willingness to dedicate oneself to the necessary hard-work and study required to understand the tools available, the OS, the user "profile", etc. And the environment then endorsed that concept: "craftsmanship" I guess.
In today's "quick-fix-only, lowest-possible-cost, instant-response-required" world, it may not be possible.
Perhaps on a large enough system, an admin can reasonably treat a workload as a statistical entity, ala thermodynamics.
On a large enough system, there is no debate. Cost is justified.
But CS equations are never going to be as neat as thermodynamic ones. So it just means that when the hammer falls, it's just going to be that much more impossible to deal with.
There is a valid point: just give up, roll over and spend more money on more hardware. It's cheaper than developing/obtaining/maintaining seldom-used expertise. And since only business drives this, their parameters are the determining factors.
The system really needs to tune itself dynamically.
And so you believe that it will be as good as, or better, at predicting that some external "random" event will occur that needs "this" particular piece of cache? Or buffer? Theoretically, a heuristic algorithm could be developed that does better. But that ultimately comes down to just redefining the problem to be solvable with a changed set of parameters. The same thing a "human" tuner would do. But it would do it "cheaper" and so biz will be happy.
I know that you are saying that we can't go back to the days of manual tuning. And I agree. But for different reasons, I think.
Yes, I think reasons are different. Apparently, from your comments, it is because you see the problem as not able to be defined. I see it as being due to an environment where all things are driven by cost and there is no need or regard for certain "craftsmanship".
It's not that admins aren't smart enough, these days.
It's that they never were...
Only so you see from where I come: I started working on UNIX in 1978. Doing computer "stuff" since 1969 (including school). I disagree with the "smart enough" assertion. I believe it is likely the old "80/20" rule (or minor variation thereof) applies. And it would not be a "smart enough" issue, it would be an "exposure and need" issue. 80% never needed and were not exposed to...
-Steve
<snip sig stuff>
-- Bill
On Thu, 2006-09-14 at 09:58 -0400, William L. Maltby wrote:
And so you believe that it will be as good as, or better, at predicting that some external "random" event will occur that needs "this" particular piece of cache? Or buffer? Theoretically, a heuristic algorithm could be developed that does better. But that ultimately comes down to just redefining the problem to be solvable with a changed set of parameters. The same thing a "human" tuner would do. But it would do it "cheaper" and so biz will be happy.
I think that the above paragraph shows where we really agree. This is like chess. The machine has the ability to gather the relevant info and come to a decision a hundred times a second. No admin could ever be so dynamic. Humans are good at having a "feel" for the game. An intuition about the right thing to do. But the machine has gotten good enough that it can beat most human chess players. It can even beat the Grand Masters sometimes.
Is this because (human) Grand Masters are less competent today?
No. It is because the machine has gotten faster and smarter.
Your Unix experience predates mine. I didn't start until 1987. But I've much experience with the old 3B2's, and Unix/386 3.1+. And Xenix (LOL) and SCO Unix, Unixware and the other old school RTFM Unixes.
I'm thinking about tuning the buffer cache (NBUF). Let's see, IIRC, the general rule of thumb was to start at about 10%-25% of memory. Each NBUF took up 1k, I believe. Of course, then you had to set the buffer headers (NHBUF). It was recommended to start those out at about 1/4 of NBUF. But they had to be (were recommended to be?) a power of 2. Each NHBUF was, I believe, 64 bytes.
These values were set at kernel link time and never changed.
This is all from memory. I didn't Google or Wikipedia for any of it since it's so much fun to try to dredge up obsolete and useless info out of my own brain. ;-)
But that's really my point. This info, interesting as it was at the time, *is* quite useless and obsolete.
I'm looking at my desktop system right now. It has about 70% of physical memory devoted to disk buffers and cache. I start OpenOffice and that number decreases.
I would *never* have set NBUF that high! And rightly so. Unixes back then were so stupid in comparison to today's Linux that it would have been suicide. Because as soon as someone started OpenOffice^WWordPerfect, the system would have thrashed to a halt. And I *couldn't* retune and relink the kernel every time someone started or stopped an app.
Certainly, a combination of the machine's speed at doing things the "mechanical" way, and human knowledge and intuition would be optimal.
But if Linux has fewer control knobs, I see that as a good thing. They're not as much needed. And in fact, if they existed, would do more harm than good.
Every admin "knows" what the "best" policies are. No two agree, of course, but that doesn't stop them from "knowing" it.
OK. Maybe it depends on your "workload". But I see plenty of people telling each other to set swappiness to 0 or 10 or 90 or 100 to reduce latency. No one ever seems to recommend leaving it at 60. 60 is just bad, I guess.
The only hard data that I have ever seen anyone present, indicated that 60 was close, but a bit high for their "workload".
Just read over that thread from April 2004 that I linked. You can see that there is more going on than just different "workloads". There are fundamental differences in the way even kernel devs think about swap. These can be roughly divided into the "swap is good" and "swap is bad" camps. They can't both be right.
Anyway, Linux does have a few knobs. (No pun intended.) ;-)
For an interesting look at what the RHEL4 (two year old kernel) has, see:
http://people.redhat.com/nhorman/papers/rhel4_vm.pdf#search=%22vm%20rhel4%20pdf%22
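And for a quick inventory of what's actually tunable on a running box, something like:

=====
# list the VM knobs the running kernel exposes
sysctl -a | grep '^vm\.'
=====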
So, in conclusion, I will say that a *truly* well studied admin, armed with today's tools, including the kernel's automatic mechanisms, can do better than the automatic mechanisms alone. The average admin is likely to make things worse.
Both will probably do better than the smartest admin was able to do with old school Unix and its panoply of (rather static) tunables.
Do we kinda agree on that?
Oh, and my original post might have come across as a bit more confrontational than intended. If it did, I apologize. Confrontational is counterproductive. ;-)
On Thu, 2006-09-14 at 10:11 -0500, Steve Bergman wrote:
On Thu, 2006-09-14 at 09:58 -0400, William L. Maltby wrote:
And so you believe that it will be as good as, or better, at predicting that some external "random" event will occur that needs "this" particular piece of cache? Or buffer? Theoretically, a heuristic algorithm could be developed that does better. But that ultimately comes down to just redefining the problem to be solvable with a changed set of parameters. The same thing a "human" tuner would do. But it would do it "cheaper" and so biz will be happy.
I think that the above paragraph shows where we really agree. This is like chess. The machine has the ability to gather the relevant info and come to a decision a hundred times a second. No admin could ever be so dynamic. Humans are good at having a "feel" for the game. An intuition about the right thing to do. But the machine has gotten good enough that it can beat most human chess players. It can even beat the Grand Masters sometimes.
Is this because (human) Grand Masters are less competent today?
No. It is because the machine has gotten faster and smarter.
What the admin can bring to the game, *if* the developers chose to implement it, would be a "foreknowledge". When a new system is set up, admin could tell machine, e.g., lots of print activity, little serial terminal activity, lots of HTTP, ... time frames, etc. This would "seed" the heuristics to achieve "optimal" a bit faster (or slower if admin is in error). Ditto if admin knew a big change was coming.
Your Unix experience predates mine. I didn't start until 1987. But I've much experience with the old 3B2's, and Unix/386 3.1+. And Xenix (LOL) and SCO Unix, Unixware and the other old school RTFM Unixes.
Ditto. I would now regale you with stories about 3B2, 3B20, 3B10, *86 etc. implementations in a large (200 person) development group and other useless trivia. But that requires a libation or two and we are so far OT now.
As to "pre-date", that matters not. I'm sure I've forgotten almost everything from that era but for what I used a lot.
It's the human stuff I remember most. Like screaming my lungs out at an ANSI committee chairman who did not appreciate my working Thanksgiving holidays on an EDI application installation on a 3B2 at home while he ate his turkey. Later learned he messed up his specs to me and was covering his ass.
I'm thinking about tuning the buffer cache (NBUF). Let's see, IIRC, the
<snip>
This is all from memory. I didn't Google or Wikipedia for any of it since it's so much fun to try to dredge up obsolete and useless info out of my own brain. ;-)
Not fun for me! I've 2 active brain cells left and I try to use them for stuff I need *now*. When the DEC 11/70 was replaced by our 3B, I forgot everything about the DEC except for UNIX-related stuff. (PWB 6/7, followed by UNIX SYS III, let me adapt easily to UNIX System IV. Not a lot of folks are aware there was such a version. It was "native" on the 3B20, 3B10(?) and 3B2 initially.)
But that's really my point. This info, interesting as it was at the time, *is* quite useless and obsolete.
I'm looking at my desktop system right now. It has about 70% of physical memory devoted to disk buffers and cache. I start OpenOffice and that number decreases.
I would *never* have set NHBUF that high! And rightly so. Unixes back then were so stupid in comparison to today's Linux that it would have been suicide. Because as soon as someone started OpenOffice^WWordPerfect, the system would have thrashed to a halt. And I *couldn't retune and relink the kernel every time someone started or stopped an app.
Keep in mind that due to equipment cost, most *IX systems of the day were *not* single user, and "terminals" were "dumb", not PCs. So a single tuning on a "server" might provide benefit for a large number of users (PCs were still too expensive to justify one on each desk). And *that* fact is what made the expense of tuning worthwhile at the time.
As to "every time someone started...", I know you are being facetious. The difficulty began as soon as management realized "the system's running great! I bet we can add <nameyourapp> without any problem". And they would decide that yesterday was an appropriate time to do it... without planning, consideration, care for your home life, ...
<snip>
Every admin "knows" what the "best" policies are. No two agree, of course, but that doesn't stop them from "knowing" it.
OK. Maybe it depends on your "workload". But I see plenty of people telling each other to set swappiness to 0 or 10 or 90 or 100 to reduce latency. No one ever seems to recommend leaving it at 60. 60 is just bad, I guess.
A natural result of "design by committee". You must see that "60" was a "compromise"?
<snip>
... There are fundamental differences in the way even kernel devs think about swap. These can be roughly divided into the "swap is good" and "swap is bad" camps. They can't both be right.
Huh? Didn't you listen to anything I said? ;-) They are *both* right for the environment in which they *think* they exist. They have defined the problem appropriately, for their environment and needs, and see a solution that addresses that problem! :-) We'll presume they are not being myopic here.
For those that are totally interactive, have a restricted set of typical apps and have $$ for memory, 0% swap may be correct. You can surmise the other examples I would mention.
<snip>
So, in conclusion, I will say that a *truly* well studied admin, armed with today's tools, including the kernel's automatic mechanisms, can do better than the automatic mechanisms alone. The average admin is likely to make things worse.
Both will probably do better than the smartest admin was able to do with old school Unix and its panoply of (rather static) tunables.
Do we kinda agree on that?
Yes.
Oh, and my original post might have come across as a bit more confrontational than intended. If it did, I apologize. Confrontational is counterproductive. ;-)
I did not see that. I saw a good discussion, albeit OT, and replied in a friendly (I hope) vein.
I'll stop here. You can never tell which of the thousands of OT threads here will be actually deemed such (or when they will be so designated) and cause severe chastisement by the management.
<snip sig stuff>
Enjoy! -- Bill
"""I'll stop here. You can never tell which of the thousands of OT threads here will be actually deemed such (or when they will be so designated) and cause severe chastisement by the management."""
I suspect that as long as we can avoid "My dad can beat up your dad" posts, we'll be OK. Sincerest apologies to those who have to skip over this addition to an OT thread. :-)
If I understand correctly, you feel that kernel developers should add some rather high level knobs, allowing admins to tell the system what kind of system it is.
My systems are considered servers. But they are, these days, really desktops. They do accounting. It's a server function. But from an admin standpoint, the resources are devoted to XDMCP Gnome sessions, doing Evolution, Thunderbird, Firefox, xpdf, acroread... and Counterpoint Business Accounting and Point of Sale.
Consequently, I feel that I admin desktop systems.
So, does that make a difference? Obviously, 40 individual Linux boxes are going to require a different tuning technique than 40 systems running via XDMCP.
But if we decide that adding these knobs would be a fantastic idea, there is still the question of who is going to do it. I'm not anywhere near up to the task.
-Steve
On Fri, 2006-09-15 at 11:51 -0500, Steve Bergman wrote:
<snip>
If I understand correctly, you feel that kernel developers should add some rather high level knobs, allowing admins to tell the system what kind of system it is.
I did not intend that. I was talking about my view of why things were/are the way they are and I see (reviewing the 3 posts left in my sent folder) that there are statements that would make it seem I endorse that solution. But I do not. I mentioned it as observations, such as
<quote> If one wants to promote *IX (any flavor) across the widest possible potential user base, then one must continue to support swap for those whose $s matter more than latency. But the tuning ability is needed for those to whom latency is more important. </quote>
And in subsequent posts I mentioned/responded without expending the effort to constantly repeat "If one wanted to ... then". For me, it was just consideration of some (possibly) relevant things that cause folks to keep reviving the discussions or make the discussions irrelevant.
My real position lay in my original statement to the effect "... why I tend to discount and ignore" these sort of discussions. From there, I *think* we were predominately discussing why the threads constantly reappear on swap vs. no swap... and things that were tangentially related.
My systems are considered servers. But they are, these days, really desktops. They do accounting. It's a server function. But from an admin standpoint, the resources are devoted to XDMCP Gnome sessions, doing Evolution, Thunderbird, Firefox, xpdf, acroread... and Counterpoint Business Accounting and Point of Sale.
I've not kept up with definitions. I would certainly view as a server any unit that had the majority of its load involved in serving multiple users with a small (or large, I guess) set of common functions in a networked environment and a dedicated user doing admin functions or being the source of only a very small % of load.
Consequently, I feel that I admin desktop systems.
So, does that make a difference? Obviously, 40 individual Linux boxes are going to require a different tuning technique than 40 systems running via XDMCP.
But if we decide that adding these knobs would be a fantastic idea, there is still the question of who is going to do it. I'm not anywhere near up to the task.
First the short answer: no one is going to do it. It's counterproductive.
Other than my "... why I ignore...", my predominant theme was that $$ and business drive this decision making. Unlike days of yore, the costs are no longer high enough that competent technical input is needed for management to make an effective business decision about what the equipment configuration should be. It's 90%+ "off the shelf" with pre-built boxes and "canned" applications. Not enough power? NP. Spend an extra $50 and it'll run like a scalded dog. Got that working but that caused net bog? NP. Get more "switches", do fiber-optics, finer subnetting ... There may be a one-time labor-intensive cost bump.
The undertones of my replies (I hope) and my thinking (for certain) is that the technical issues that folks like us worry about no longer matter to anyone *but* us. Science, then technology, then industry and lastly business (viewed as a system) are always successful in reducing all complexity (eventually) to "apparent" simplicity. Folks like us are a temporary annoyance on the road to business nirvana re. technical issues being a significant influence on cost.
And now that they have "us" doing much of the work for free (FOSS), they've removed a major part of intellectual cost from *their* cost-basis (which any cognizant being will immediately recognize as a "cost transfer", not a "reduction"). So "knobs" and any additional tunability would be a step back towards the stone age in their POV. Why? Because their cost would have to rise due to needed increased expertise (i.e. increased intellectual cost).
<snip *my* rant: hope the above only *bordered* on ranting>
-Steve
<snip sig stuff>
You can see that I should quit this thread. ... Done.
Bill
On Thu, 2006-09-14 at 10:11 -0500, Steve Bergman wrote:
So, in conclusion, I will say that a *truly* well studied admin, armed with today's tools, including the kernel's automatic mechanisms, can do better than the automatic mechanisms alone. The average admin is likely to make things worse.
Both will probably do better than the smartest admin was able to do with old school Unix and its panoply of (rather static) tunables.
Do we kinda agree on that?
The problem comes when you do something unusual and the machine adapts to it even though that pattern won't repeat. As an example, think about what running a backup once a day does to your self-tuning buffer cache...
OK. Now I'm a bit confused. Raid 1 read performance is not what I expected.
CentOS 4.4 2.6.9-42.0.2.ELsmp
=====
[root@hagar ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      244035264 blocks [2/2] [UU]
=====
=====
[root@hagar scsi]# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: SEAGATE  Model: DAT DAT72-052   Rev: A16E
  Type:   Sequential-Access               ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: Maxtor 7L250S0  Rev: BACE   (This is /dev/sda)
  Type:   Direct-Access                   ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: Maxtor 7L250S0  Rev: BACE   (This is /dev/sdb)
  Type:   Direct-Access                   ANSI SCSI revision: 05
=====
=====
[root@hagar ~]# hdparm -t /dev/sda2 /dev/sdb2

/dev/sda2:
 Timing buffered disk reads:  154 MB in  3.01 seconds = 51.15 MB/sec

/dev/sdb2:
 Timing buffered disk reads:  162 MB in  3.03 seconds = 53.47 MB/sec
=====
Then I run this script:
=====
# flush the cache
dd if=/dev/md1 bs=32M count=64 of=/dev/null

# sync the data
sync

# Run two read operations, on different parts of /dev/md1, simultaneously.
# This reads a total of 1GB of data.
time dd if=/dev/md1 bs=4k count=131072 of=/dev/null &
time dd if=/dev/md1 skip=262144 bs=4k count=131072 of=/dev/null &
=====
The results show about 58MB/sec transferred, which is about the same as hdparm is showing for each drive individually.
Running the same thing, but reading the whole 1GB using one dd process in the foreground gives identical results.
Why am I not seeing higher numbers?
Thanks, Steve
Steve, show your math on how you calculated 58MB/sec from two timed processes. On an array on one of my servers, I show that the time approximately doubles when running two processes instead of each by themselves. That still yields about the same throughput.
Also, try running a benchmark like bonnie++:
http://www.coker.com.au/bonnie++/
On Wed, 2006-09-13 at 08:42 -0700, Kirk Bocek wrote:
Steve, show your math on how you calculated 58MB/sec from two timed processes. On an array on one of my servers, I show that the time approximately doubles when running two processes instead of each by themselves. That still yields about the same throughput.
OK.
131072*4k = 524,288k per dd process.
There are two of these processes running simultaneously, for a total 1,048,576k of data.
1,048,576k / 1024k/M = 1024MB
The dd's came back with wall clock times of 17.1 seconds and 18.85 seconds. So the entire operation took a total of 18.85 seconds to read 1024MB of data.
1024MB /18.85sec = 54.3MB/sec
Thanks, Steve
Yep, that sucks. Now what does bonnie++ say?
On Wed, 2006-09-13 at 09:03 -0700, Kirk Bocek wrote:
Yep, that sucks. Now what does bonnie++ say?
I'll have to wait until tonight to run that. Too disruptive during the day. I tried to run that the other day and got an immediate call from the client that the system had "locked up". It was actually just extremely slow. I thought that, as of 2.6, writes weren't supposed to starve reads anymore. But they definitely do with the current CentOS kernel at least.
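If anyone wants to watch it happen, keeping vmstat running in another terminal during the write phase makes the starvation pretty obvious (a sketch):

=====
# one sample per second; watch 'b' (processes blocked on I/O),
# 'bo' (blocks written out) and 'wa' (CPU time stuck in I/O wait)
vmstat 1
=====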
-Steve
Bad admin! Doing this on a production system... Tsk, tsk!
;)
On Wed, 2006-09-13 at 09:36 -0700, Kirk Bocek wrote:
Bad admin! Doing this on a production system... Tsk, tsk!
;)
Yeah, I know. But I'm still extremely disappointed that heavy writes kill the system in RedHat's Flagship Enterprise product.
What if it were a regular app that did heavy writes? It's certainly possible.
And yes, I know we're not supposed to say "RedHat" here, but if RH has a problem with fair use of their trademarks that's their freaking problem, not mine.
;-)
We, the users of CentOS, can say RedHat all we want. RedHat, RedHat, RedHat!
Only the proprietors of CentOS have to walk a line while proprietorizing their product. :)
On the topic of writes starving reads in 2.6.9-42.0.2.ELsmp #1 SMP. It really *does* lock the box up, for all practical purposes. It's not really locked, but any normal user would conclude that the system was frozen up.
Interestingly enough, I cannot reproduce the writes starving reads problem on my desktop Ubuntu Dapper box running "2.6.15-26-k7 #1 SMP PREEMPT".
This is a UP box without RAID, but with LVM2. It's an AMD64 4000+, running 32 bit.
The main differences I see are:
UP vs SMP
No RAID
Different kernel version
Preempt
Ubuntu uses the Anticipatory scheduler whereas CentOS uses CFQ.
My understanding is that CFQ is supposed to be *better* than Anticipatory for this kind of thing.
Vanilla is moving to CFQ in 2.6.18, I believe. Also, the next release of Ubuntu is moving to CFQ.
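For reference, the elevator can be checked (and, on kernels that support it, switched) per device at runtime; older kernels may only honor the boot parameter. A sketch:

=====
cat /sys/block/sda/queue/scheduler        # e.g. "noop anticipatory deadline [cfq]"
echo deadline > /sys/block/sda/queue/scheduler   # runtime switch, where supported
# otherwise, system-wide at boot:  kernel ... elevator=cfq
=====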
I tried moving the CentOS kernel over to my Ubuntu box for testing, but it is having problems finding the volume groups.
So, does Linux software Raid1 just suck? Or CFQ? Or something else?
CentOS's behavior under heavy writes on this box is bad enough to be considered a serious bug in my opinion.
Thanks, Steve
Steve Bergman wrote:
On the topic of writes starving reads in 2.6.9-42.0.2.ELsmp #1 SMP. It really *does* lock the box up, for all practical purposes. It's not really locked, but any normal user would conclude that the system was frozen up.
What exactly are you doing that's causing this condition? I've got a 4-core opteron box that I've just installed 4.4 + updates on and would like to see if I can duplicate the problem.
Cheers,
On Wed, 2006-09-13 at 14:29 -0400, chrism@imntv.com wrote:
What exactly are you doing that's causing this condition? I've got a 4-core opteron box that I've just installed 4.4 + updates on and would like to see if I can duplicate the problem.
bonnie++ -f
Actually, I'm switching to /home/steve and, as root:
bonnie++ -f -u steve
It's a 4GB machine, so bonnie runs with an 8GB dataset.
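(The full invocation amounts to something like the sketch below; -s pins the dataset size explicitly, in MB, and -n 0 skips the small-file create/stat/delete tests:)

=====
# write 8GB (twice RAM) of test data to steve's home directory,
# running as user steve; -f skips the slow per-char tests
bonnie++ -f -u steve -d /home/steve -s 8192 -n 0
=====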
I'm installing 4.4 on my desktop box now for testing without raid or lvm.
I have an FC5 box at the office which does not exhibit the problem with the 2.6.17 kernel using raid1+lvm2. But the raid is currently running in degraded mode. I'll be syncing up the other drive tonight and will test again.
-Steve
Steve Bergman wrote:
On Wed, 2006-09-13 at 14:29 -0400, chrism@imntv.com wrote:
What exactly are you doing that's causing this condition? I've got a 4-core opteron box that I've just installed 4.4 + updates on and would like to see if I can duplicate the problem.
bonnie++ -f
Actually, I'm switching to /home/steve and, as root:
bonnie++ -f -u steve
It's a 4GB machine, so bonnie runs with an 8GB dataset.
I'm installing 4.4 on my desktop box now for testing without raid or lvm.
I have an FC5 box at the office which does not exhibit the problem with the 2.6.17 kernel using raid1+lvm2. But the raid is currently running in degraded mode. I'll be syncing up the other drive tonight and will test again.
I was using emacs in another window while this was running in another console:
Dual Opteron 275 + 2gb RAM + 3Ware9550SX + RAID 10 (8 500gig Barracudas)
Not very pretty but....
[ritz@localhost ~]$ ./bonnie++ -f
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
localhost.locald 4G           170662  74 64484  20           157369  21 504.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
localhost.localdomain,4G,,,170662,74,64484,20,,,157369,21,504.8,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
On Wed, 2006-09-13 at 13:15 -0500, Steve Bergman wrote:
Interestingly enough, I cannot reproduce the writes starving reads problem on my desktop Ubuntu Dapper box running "2.6.15-26-k7 #1 SMP PREEMPT".
OK. I have been able to test on a freshly installed CentOS 4.4 box without RAID or LVM and I *do* see a great performance degradation when bonnie++ is "writing intelligently". (I'm not sure it is as bad as what I'm seeing on the raid1+lvm2 box, but it is definitely in the ballpark.)
This box is using a plain old PATA drive.
So, it's not RAID or LVM2 causing the problem. The fact that the FC5 box, which is an old celeron 2GHz box with 512MB, does not exhibit the problem indicates that it is not the CFQ scheduler causing the problem.
So does 2.6.9 just suck for heavy write IO?
-Steve
Steve Bergman wrote:
On Wed, 2006-09-13 at 13:15 -0500, Steve Bergman wrote:
Interestingly enough, I cannot reproduce the writes starving reads problem on my desktop Ubuntu Dapper box running "2.6.15-26-k7 #1 SMP PREEMPT".
OK. I have been able to test on a freshly installed CentOS 4.4 box without RAID or LVM and I *do* see a great performance degradation when bonnie++ is "writing intelligently". (I'm not sure it is as bad as what I'm seeing on the raid1+lvm2 box, but it is definitely in the ballpark.)
This box is using a plain old PATA drive.
So, it's not RAID or LVM2 causing the problem. The fact that the FC5 box, which is an old celeron 2GHz box with 512MB, does not exhibit the problem indicates that it is not the CFQ scheduler causing the problem.
So does 2.6.9 just suck for heavy write IO?
Well, RHEL5 will have at least a 2.6.17 kernel; that's what was included in public beta 1. If you have any concerns about the future RHEL5, now is the right time to download the beta, test it out, and complain. So if the FC5 box works fine, RHEL5 should work fine too. It would be nice if they decided to switch to 2.6.18 in beta 2 and the final release for some extra features (like NCQ and hot-plug support for SATA). However, given that 2.6.18 is still at the "release candidate" stage, the feature set for RHEL5 is more or less frozen, and the clock is ticking fast, it is most likely that RHEL5 will use 2.6.17. 2.6.18 might be just a couple of months (weeks?) too late.
Steve Bergman spake the following on 9/13/2006 8:21 AM:
OK. Now I'm a bit confused. Raid 1 read performance is not what I expected.
<snip benchmark details>
Why am I not seeing higher numbers?
Thanks, Steve
Maybe the DAT tape drive is slowing the bus down. I seem to remember that the slowest device on the chain set the rest of the chain to the same speed. You could try putting it on another channel. YMMV.
On Wed, 2006-09-13 at 08:58 -0700, Scott Silva wrote:
Maybe the DAT tape drive is slowing the bus down. I seem to remember that the slowest device on the chain set the rest of the chain to the same speed. You could try putting it on another channel. YMMV.
Yes. They are all on channel 00. But note that each device is on a separate SCSI *host*.
-Steve
Steve Bergman spake the following on 9/13/2006 9:03 AM:
On Wed, 2006-09-13 at 08:58 -0700, Scott Silva wrote:
Maybe the DAT tape drive is slowing the bus down. I seem to remember that the slowest device on the chain set the rest of the chain to the same speed. You could try putting it on another channel. YMMV.
Yes. They are all on channel 00. But note that each device is on a separate SCSI *host*.
-Steve
Sorry... I didn't thoroughly read before I replied.
Does the md driver stripe Raid 1 reads? I know my 3ware driver does. But that's 3ware and not md.
If not, then that's the answer to Steve's question.
Kirk Bocek
Scott Silva wrote:
Steve Bergman spake the following on 9/13/2006 8:21 AM:
OK. Now I'm a bit confused. Raid 1 read performance is not what I expected.
md does not stripe reads in RAID 1, as can be clearly seen on my RAID 1 setup on two IDE drives:
[root@balrog ~]# hdparm -t /dev/hde

/dev/hde:
 Timing buffered disk reads:  136 MB in  3.01 seconds = 45.17 MB/sec

[root@balrog ~]# hdparm -t /dev/hdg

/dev/hdg:
 Timing buffered disk reads:  126 MB in  3.03 seconds = 41.52 MB/sec

[root@balrog ~]# hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  124 MB in  3.01 seconds = 41.24 MB/sec
Quoting Steve Bergman steve@rueb.com:
OK. Now I'm a bit confused. Raid 1 read performance is not what I expected.
Note that hdparm reports the raw speed of the drive. In the real world, you will at least have the overhead of the file system, and possibly also the overhead of the LVM and md device drivers (if you use them).
libata in current stable kernels doesn't support NCQ. Support for NCQ will be added in 2.6.18 (currently at the rc7 level). Unless Red Hat bumps the kernel version in the final release of RHEL5 or backports NCQ to 2.6.17 (I wouldn't bet on backporting; there were some major changes in libata), you are not going to see it in the forthcoming RHEL5 either (beta 1 uses a 2.6.17 kernel).
If you run only a single process, the md device driver will read from the disks in round-robin fashion. You can even observe this visually if your hard drives have separate LEDs: only one of them will be active at any point in time during a sequential read test. It's not smart enough to stripe reads on RAID1. I'm not sure whether this is due to the lack of NCQ support, or how much (if at all) NCQ will help once support is added to the Linux kernel. I didn't have any spare SCSI system to test how things work there.
However, if you run two processes, md driver will do reads from different drives in parallel. Again, you can observe this visually if your drives have individual LEDs (both will be lit).
I've run a couple of benchmarks (using bonnie++) that also show this numerically. For the "one drive" test I simply detached the second disk from the mirror, so the overhead of the drivers (md+lvm+ext3) is about the same. The numbers for the "two processes" tests are per process (multiply by two to get total throughput).
Test                                  seq write (kB/s)   seq read (kB/s)
=======================================================================
raid1, single process                            37716             47330
raid1, two processes (each)                      16570             31572
degraded raid1, single process                   39076             47627
degraded raid1, two processes (each)             16368              6759
Writing to single drive (degraded RAID1 in this case) is a bit faster than writing to RAID1, since there's no need to wait for data to be written to both drives.
If two processes are writing to the same disk, it's about the same. Note that 16.5MB/s is per process (the total for the disk is 33MB/s). So we have 33MB/s vs. 37MB/s. I'd expect a bigger difference due to all the extra disk seeks, so this result is actually very good.
You can see effect of md driver not striping reads on RAID1. Almost the same speed (47MB/s) for RAID1 and degraded-RAID1 (single drive) case.
On the other hand, if there are two processes reading in parallel, each is able to read 31.5 MB/s, which totals to 63MB/s. Much better. It's still not double the speed (those two processes are fighting for system resources after all, not only disks but also CPU time).
Two processes reading from degraded RAID1 clearly sucks. Total throughput drops to around 13MB/s. I've no good explanation for such a low number (it is less than half of the write throughput).
On Tue, 2006-09-12 at 14:41 -0500, Steve Bergman wrote:
Next step would be to shrink VG00. But it looks like you can only shrink a volume group by removing entire PVs. I only have 1 PV.
You can shrink a PV with pvresize. You will have to adjust the partition size afterwards. More information can be found in pvresize(8).
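The rough sequence would look something like this (an untested sketch; the sizes here are placeholders, and you want verified backups before any of it):

=====
# 1. shrink the LVM physical volume inside the array
pvresize --setphysicalvolumesize 230G /dev/md0
# 2. shrink the md device itself (--size is per-device, in KiB)
mdadm --grow /dev/md0 --size=241172480
# 3. shrink sda1/sdb1 and create sda2/sdb2 with fdisk, mkswap both,
#    and add them to /etc/fstab at equal priority
=====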
-- Daniel