Hi,
what would be a good value for
/proc/sys/vm/min_free_kbytes
on a dual-CPU EM64T system with 8 GB of memory. The kernel is 2.6.9-42.0.3.ELsmp. I know that the default might be a bit low. Unfortunately, the documentation is a bit weak in this area.
We are experiencing responsiveness problems (and higher than expected load) when the system is under combined memory+network+disk-IO stress.
Thanks
Martin

Please CC me on replies, as I am only getting the digest.
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Hi,
On Thu, 7 Dec 2006 08:03:54 -0800 (PST) Martin Knoblauch spamtrap@knobisoft.de wrote:
what would be a good value for
/proc/sys/vm/min_free_kbytes
on a dual-CPU EM64T system with 8 GB of memory. The kernel is 2.6.9-42.0.3.ELsmp. I know that the default might be a bit low. Unfortunately, the documentation is a bit weak in this area.
We are experiencing responsiveness problems (and higher than expected load) when the system is under combined memory+network+disk-IO stress.
Before changing this tunable, it is important to realize what it is for. The free area is primarily useful to give the kernel enough memory to act in situations where the swapper suddenly has to do a lot of work. In other situations, you can make things worse, since you are effectively giving less memory for applications. The following thread may be interesting in this respect:
http://lists.centos.org/pipermail/centos/2006-August/thread.html#68306
Use the usual suspects to nail down the bottleneck. It may well be the case that the system has too little disk bandwidth, too little CPU power, or too little memory to handle its load. But there's no way to tell without more information. Only touch min_free_kbytes if analysis proves that e.g. kswapd has trouble paging out memory in a timely fashion. A good value to start with is 4096 (the value is in kilobytes) if that is higher than your current min_free_kbytes, though I have seen others set it much higher.
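For reference, the tunable can be inspected and changed at runtime like this (a config sketch; the runtime change is lost on reboot, and writing it requires root):

```shell
# Read the current watermark; the value is in kilobytes
cat /proc/sys/vm/min_free_kbytes

# Raise it for the running kernel (as root); not persistent across reboots
sysctl -w vm.min_free_kbytes=4096
```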
-- Daniel
Martin Knoblauch wrote:
We are experiencing responsiveness problems (and higher than expected load) when the system is under combined memory+network+disk-IO stress.
^^^^^^^^^^^^^^^
First, I'd check the paging with `vmstat 5` ... if you see excessive SI (swap in/second), you need more physical memory, no amount of dinking with vm parameters can change this.
If you're not seeing excessive paging, I'd be inclined to monitor the disk IO with `iostat -x 5`... if the avgqu-sz and/or await on any device is high, you need to balance your disk IO across more physical devices and/or more channels. await = 500 means physical disk IO requests are taking an average of 500 ms (0.5 seconds) to satisfy. If many processes are waiting for disk IO, you'll see high load factors even though CPU usage is fairly low.
iostat is in the yum package sysstat (not installed by default in most configs); vmstat is in procps (generally installed by default). With both of these commands, ignore the first output; that's the system average since reboot, and generally meaningless. The second and successive outputs are at the intervals specified (5 seconds in my examples above).
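If you want to flag swap-in automatically, the si column can be picked out of vmstat's output with awk. The sample below is made-up output purely for illustration, but the same one-liner works when piped from a real `vmstat 5` (an analogous filter on iostat's await column works the same way):

```shell
# Hypothetical captured vmstat output; field 7 is si (swap-in per interval)
vmstat_sample='procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0  20480  81920  10240 409600    0    0    12     8  100   200  5  2 90  3
 0  2  20480  40960  10240 409600  512    0  2048   512  900  1500 10  5 40 45'

# Skip the two header lines, then print any interval with non-zero swap-in
echo "$vmstat_sample" | awk 'NR > 2 && $7 > 0 { print "swap-in detected: si=" $7 }'
```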
On our database servers, which experience very high disk IO loads, we often use 4 separate RAIDs... / and the other normal system volumes are partitions on a RAID1 (typically 2 x 36GB 15k SCSI or SAS), then the database itself will be spread across 3 volumes /u10 /u11 /u12, which are each RAID 1+0 built from 4 x 72GB 15k SCSI/SAS or FC SAN volumes. We'll always use RAID controllers with battery-protected hardware write-back cache for the database volumes, as this hugely accelerates 'commits'. Note, we don't use MySQL, and I have no idea if it's capable of taking advantage of configurations like this, but PostgreSQL and Oracle certainly are. The database administrators will spend hours poring over IO logs and database statistics in order to better optimize the distribution of tables and indices across the available tablespaces.
Under these sorts of heavy concurrent random access patterns, SATA and software RAID just don't cut it, regardless of how good their sequential benchmarks may be.
Please CC me on replies, as I am only getting the digest.
spamtrap@knobisoft.de ??!? no thanks.
On Thu, 07 Dec 2006 10:14:25 -0800 John R Pierce pierce@hogranch.com wrote:
Please CC me on replies, as I am only getting the digest.
spamtrap@knobisoft.de ??!? no thanks.
Mmm, then I am probably spamtrapped by now :/.
-- Daniel
John R Pierce wrote:
certainly are. The database administrators will spend hours poring over IO logs and database statistics in order to better optimize the distribution of tables and indices across the available tablespaces.
Didn't realise Oracle was that primitive. One should just balance all the tablespaces out over multiple volumes and/or controllers and add table partitioning as needed. The transaction filesystem is a striped filesystem over the same RAID volumes.
Then if you need more I/O bandwidth you just add more controllers and disks. No need to dick around with "put that table there and that here"; that is just counterproductive.
With a proper setup you would then engage all the disk arms you have available for the I/Os you need, and you basically remove hotspot management.
Morten Torstensen wrote:
John R Pierce wrote:
certainly are. The database administrators will spend hours poring over IO logs and database statistics in order to better optimize the distribution of tables and indices across the available tablespaces.
Didn't realise Oracle was that primitive. One should just balance all the tablespaces out over multiple volumes and/or controllers and add table partitioning as needed. The transaction filesystem is a striped filesystem over the same RAID volumes.
Then if you need more I/O bandwidth you just add more controllers and disks.
That's the shotgun approach, yes.
In our case, our production systems are a very large, very complex realtime Oracle database running on large-scale Sun enterprise hardware on bigiron EMC storage, using dozens and dozens of RAID10 logical volumes, as you do NOT want a single 10TB volume, sorry. By hand optimizing the tablespace layouts of the application's tables and indices, which have very specific access patterns, we can get double the throughput of the blind 'just stripe the universe' approach. Since we're dealing with $millions worth of servers at each production factory, tossing more iron at the problem isn't always the best solution.

BTW, we've found Oracle 10's new 'self optimizer' to do a far worse job of query optimization than what we have been able to hand tune out of Oracle 9, so we're not upgrading until this changes.
We're embarking on a pilot project to evaluate a smaller-scale version of this manufacturing execution system on Linux + Postgres, as there are smaller installations which can't justify the cost of Oracle.
John R Pierce wrote:
Morten Torstensen wrote:
John R Pierce wrote:
certainly are. The database administrators will spend hours poring over IO logs and database statistics in order to better optimize the distribution of tables and indices across the available tablespaces.
Didn't realise Oracle was that primitive. One should just balance all the tablespaces out over multiple volumes and/or controllers and add table partitioning as needed. The transaction filesystem is a striped filesystem over the same RAID volumes.
Then if you need more I/O bandwidth you just add more controllers and disks.
That's the shotgun approach, yes. In our case, our production systems are a very large, very complex realtime Oracle database running on large-scale Sun enterprise hardware on bigiron EMC storage, using dozens and dozens of RAID10 logical volumes, as you do NOT want a single 10TB volume, sorry. By hand optimizing the tablespace layouts of the application's tables and indices, which have very specific access patterns, we can get double the throughput of the blind 'just stripe the universe' approach. Since we're dealing with $millions worth of servers at each production factory, tossing more iron at the problem isn't always the best solution. BTW, we've found Oracle 10's new 'self optimizer' to do a far worse job of query optimization than what we have been able to hand tune out of Oracle 9, so we're not upgrading until this changes.
We're embarking on a pilot project to evaluate a smaller-scale version of this manufacturing execution system on Linux + Postgres, as there are smaller installations which can't justify the cost of Oracle.
Just out of curiosity John, are you allowed to give us some hints about what your system does? If you are posting on the CentOS list I presume you are running CentOS, rather than "a similar upstream product". Also I'd love to know what you mean by "you do NOT want to have a single 10TB volume" - are you referring to performance or single-point-of-failure issues?
Regards,
MrKiwi
Just out of curiosity John, are you allowed to give us some hints about what your system does? If you are posting on the CentOS list I presume you are running CentOS, rather than "a similar upstream product". Also I'd love to know what you mean by "you do NOT want to have a single 10TB volume" - are you referring to performance or single-point-of-failure issues?
we're drifting pretty far off topic for this list, I was going to let this thread drop... but...
I work for a big 'widget' maker. We develop and pilot our complex little 'widgets' in the States, then volume production is done in our large-scale plants overseas. My department is the core development group for both the core database and the middleware used in our in-house process tracking system.
I use CentOS in my group's development lab, mostly for prototyping our messaging/middleware servers... We've validated our databases on RHEL/CentOS, and smaller design center operations who don't need the big Sun/EMC approach are starting to deploy with Oracle on RHEL on Opteron servers instead of the traditional Solaris/Sun platforms.
Re: the single 10TB volume, I think it's both too many eggs in one basket and performance. I'm not in operations; I'm the systems guy in the development group, so a lot of what is decided in production I hear only second hand. I've been told that the big SANs they use don't do as well with very large volume groups as they do with more, smaller ones... at the big sites, the SAN is something like ~1000 72GB 15K rpm FC drives. I believe Operations uses Veritas VxVM & VxFS on the big Suns (E20k, etc), and it may have performance issues with multi-terabyte single file systems, too.

Our databases do a LOT of disk writes from synchronous commits; much of the reads are cached (these servers have a LOT of RAM). The way our database is organized, there's a nested set of tablespace groups that specific tables are bucketed in, so we can run quite nicely with 3 RAIDs on a smaller system and 36 RAIDs on a really big system. This makes the space allocation fairly flexible, yet avoids spindle collisions during commonly used complex joins and so forth.
John R Pierce wrote:
in our case, our production systems are a very large very complex realtime oracle database running on large scale Sun enterprise hardware on bigiron EMC storage, using dozens and dozens of raid10 logical volumes as you do NOT want to have a single 10TB volume, sorry. by hand optimizing the tablespace layouts of the applications tables and indicies, which have very specific access patterns, we can get double the throughput of the blind 'just stripe the universe' approach. Since
Oh, you would not have a single 10TB volume of course; and it is still a little quaint to hand optimize tablespaces. That is something we did in the 90s. In our own tests on multi-TB databases and modern SAN systems, the shotgun approach beat hand optimization every time. And we are talking about people with dozens of years of SQL optimization behind them. Intelligent I/O prefetch adaptation, intelligent and dynamic access plans... performance in the RDBMS world is changing, and the old rules are changing with it.
Morten Torstensen wrote:
John R Pierce wrote:
in our case, our production systems are a very large very complex realtime oracle database running on large scale Sun enterprise hardware on bigiron EMC storage, using dozens and dozens of raid10 logical volumes as you do NOT want to have a single 10TB volume, sorry. by hand optimizing the tablespace layouts of the applications tables and indicies, which have very specific access patterns, we can get double the throughput of the blind 'just stripe the universe' approach. Since
Oh, you would not have a single 10TB volume of course; and it is still a little quaint to hand optimize tablespaces. That is something we did in the 90s. In our own tests on multi-TB databases and modern SAN systems, the shotgun approach beat hand optimization every time. And we are talking about people with dozens of years of SQL optimization behind them. Intelligent I/O prefetch adaptation, intelligent and dynamic access plans... performance in the RDBMS world is changing, and the old rules are changing with it.
I'm not sure I'm following the language correctly... Morten and John, you seem to be at opposite ends of the opinion scale about hand optimising tablespaces?
Maybe you have different definitions of "hand optimising tablespaces"?
Morten - Are your systems similar enough to Johns to make a valid comparison about the merits of the shotgun approach vs "hand optimising tablespaces"?
For the benefit of the rest of us DBAs, I understand "hand optimising tablespaces" to mean choosing where a table (or a partition of a table) will physically live, and where its indexes will live, so that roughly the sum over spindles 1..n of (reads/s + writes/s)^2 is as low as possible (i.e. the load is balanced across all spindles). I have never had a db span anything bigger than a redundant fibre channel array controller (yep, a bit outdated now), but if it did, then balance across n controllers as well. I imagine you also decide if/which tables or indexes are pinned in memory?
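As a toy illustration of that balancing metric (the per-spindle IOPS figures below are invented, and both layouts carry the same total load of 985 IOPS), a skewed layout scores much worse under the sum-of-squares measure than an even one:

```shell
# Invented per-spindle load figures (reads/s + writes/s), 4 spindles each
balanced="245 247 246 247"
skewed="120 130 125 610"

# Sum of squared per-spindle loads; lower means better balanced
sumsq() { echo "$1" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i * $i; print s }'; }

echo "balanced layout: $(sumsq "$balanced")"
echo "skewed layout:   $(sumsq "$skewed")"
```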
Thanks to both of you for the info so far :)
Regards,
MrKiwi
Quoting Martin Knoblauch spamtrap@knobisoft.de:
Hi,
what would be a good value for
/proc/sys/vm/min_free_kbytes
on a dual-CPU EM64T system with 8 GB of memory. The kernel is 2.6.9-42.0.3.ELsmp. I know that the default might be a bit low. Unfortunately, the documentation is a bit weak in this area.
We are experiencing responsiveness problems (and higher than expected load) when the system is under combined memory+network+disk-IO stress.
Actually, under similar load I experienced ext3 file system corruption (RHEL4/CentOS4 kernels). In my case it was memory+cpu+network. All I was doing was running a simple Perl script. For details check out:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=202955
Apparently, Red Hat is considering increasing the default value of min_free_kbytes in update 5.
Increasing it to 8192 seems to do a good job of preventing the Linux kernel from denying memory to itself (and destroying my valuable data). Well, at least in my case.
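In case it helps anyone else, the change can be made persistent across reboots via /etc/sysctl.conf (a config sketch; append the line only once, as root):

```shell
# Persist the setting, then reload all sysctl settings from /etc/sysctl.conf
echo "vm.min_free_kbytes = 8192" >> /etc/sysctl.conf
sysctl -p
```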