[CentOS] 4.4/64-bit Supermicro/ Nvidia RAID [thanks]

Sat Dec 9 06:19:30 UTC 2006
John R Pierce <pierce at hogranch.com>

 > I can accept faster in certain cases but if you say HUGELY faster, I 
would like to see some numbers.

Ok, first, a specific case that actually came up at my work in the last
week...

We've got a middleware messaging application we developed that we'll call a 
'republisher'.  It receives a stream of 'event' messages from an upstream 
server and forwards them to a number of downstream servers that have 
registered ('subscribed') with it.   It queues separately for each downstream 
subscriber, so if one is down it will hold event messages for it.  Not 
all servers want all 'topics', so we only send each server the specific 
event types it's interested in.   The republisher writes the incoming event 
stream to a number of subscriber queue files, as well as a series of journal 
files, to track all this state.  There is a queue for each incoming 'topic', 
and its entries aren't cleared until the last subscriber on that topic has 
confirmed delivery.
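
To make that bookkeeping concrete, here is a hypothetical sketch of the 
per-topic queue.  None of these names or types come from our actual code; 
they only illustrate the rule that an entry stays queued until the last 
subscriber on its topic confirms delivery:

    /* Hypothetical sketch, not the real republisher's data structures. */
    #include <stddef.h>

    #define MAX_SUBSCRIBERS 32

    struct queued_event {
        long                 seqno;                    /* position in the journal */
        size_t               len;                      /* payload length */
        char                *payload;                  /* the event message itself */
        int                  pending[MAX_SUBSCRIBERS]; /* 1 = delivery not yet confirmed */
        struct queued_event *next;
    };

    struct topic_queue {
        const char          *topic;                    /* e.g. a manufacturing event type */
        int                  nsubscribers;             /* downstream servers on this topic */
        struct queued_event *head, *tail;              /* oldest .. newest unconfirmed event */
    };

    /* An entry may be dropped only once no subscriber is still pending on it. */
    static int fully_confirmed(const struct queued_event *e, int nsubs)
    {
        for (int i = 0; i < nsubs; i++)
            if (e->pending[i])
                return 0;
        return 1;
    }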

{Before someone screams my-favorite-MQ-server, let me just say: we HAD a 
commercial messaging system doing this, and our production operations 
staff is fed up with real-time problems that turn into multinational vendor 
finger-pointing, so we have developed our own to replace it.}

On a typical dual-Xeon Linux server running CentOS 4.4, with a simple 
direct-connect disk, this republisher can easily handle 1000 
messages/second using plain write().       However, if this process is 
busy humming away under the production workload of 60-80 messages/sec, 
and the power is pulled or the server crashes (this has happened exactly 
once so far, at a Thailand manufacturing facility, due to operator 
error), it loses 2000+ manufacturing events that the downstream servers 
can't easily recover; that was data sitting in Linux's disk cache that hadn't 
yet been committed to disk.   So the obvious solution is to call fsync() 
on the various files after each 'event' has been processed, to ensure 
each event is durably on disk before it is acknowledged.
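
A minimal sketch of what that fix looks like, assuming one file descriptor 
per queue/journal file (the file name and variable names below are made up 
for illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int journal_fd = open("subscriber.queue", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (journal_fd < 0) { perror("open"); return 1; }

        const char *event = "EVENT 12345: unit passed station 7\n";

        /* write() alone only puts the record in Linux's page cache... */
        if (write(journal_fd, event, strlen(event)) < 0) { perror("write"); return 1; }

        /* ...fsync() blocks until the drive (or the controller's protected
         * cache) has the data, so a crash right after this can't lose it. */
        if (fsync(journal_fd) < 0) { perror("fsync"); return 1; }

        close(journal_fd);
        return 0;
    }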

However, if this republisher does an fsync() after each event, it slows 
to around 50/second on a direct-connect disk.  If it's run on a similar 
server with a RAID controller that has battery-protected writeback cache 
enabled, it can easily do 700-800/second.   We need 100+/second, and 
prefer 200/second, to have margin for catching up after data interruptions.
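
Those numbers are easy to reproduce with a rough append-and-fsync 
micro-benchmark along these lines (file name, record size, and run length 
are arbitrary); on our hardware this same pattern is what produced the 
~50/sec versus 700-800/sec figures above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char record[256];
        memset(record, 'x', sizeof record);

        int n = 0;
        time_t start = time(NULL);
        while (time(NULL) - start < 10) {        /* run for ~10 seconds */
            if (write(fd, record, sizeof record) < 0) { perror("write"); break; }
            if (fsync(fd) < 0) { perror("fsync"); break; }  /* one synchronous commit per record */
            n++;
        }

        printf("%d fsync'd records in ~10s (~%d/sec)\n", n, n / 10);
        close(fd);
        return 0;
    }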


Now, everything I've described above is a rather unusual application... 
so let me present a far more common scenario.

Relational database management servers, like Oracle or PostgreSQL.     When 
the RDBMS does a 'commit' at the transaction's END;, the server HAS to fsync 
its buffers to disk to maintain data integrity.     With a writeback-cache 
disk controller, the controller can acknowledge the writes as soon 
as the data is in its cache, then it can write that data to disk at its 
leisure.      With software RAID, the server has to wait until ALL 
drives in the RAID set have completed their seeks and the physical writes 
to the disk.   In a write-intensive database, where most of the read 
data is already cached in memory, this is a HUGE performance hit.
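
As a toy illustration of where the time goes (this is NOT how md actually 
schedules I/O), think of a mirrored commit as having to write and fsync 
two member 'drives' before it can return; a battery-backed writeback 
controller gets to acknowledge as soon as the record lands in its cache, 
hiding both seeks.  Two ordinary files stand in for the two mirror members 
here:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int commit_to_mirror(int fd0, int fd1, const char *rec, size_t len)
    {
        /* Both members must accept the write... */
        if (write(fd0, rec, len) < 0 || write(fd1, rec, len) < 0)
            return -1;
        /* ...and both must report it stable before COMMIT can return.
         * A battery-backed writeback controller would acknowledge here
         * as soon as the record reached its cache. */
        if (fsync(fd0) < 0 || fsync(fd1) < 0)
            return -1;
        return 0;
    }

    int main(void)
    {
        int fd0 = open("member0.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int fd1 = open("member1.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd0 < 0 || fd1 < 0) { perror("open"); return 1; }

        const char *rec = "COMMIT txn 42\n";
        if (commit_to_mirror(fd0, fd1, rec, strlen(rec)) < 0) { perror("commit"); return 1; }

        close(fd0);
        close(fd1);
        return 0;
    }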