Hi All,
I've been asked to set up a 3D render farm at our office. At the start it will contain about 8 nodes, but it should be built with growth in mind. The setup I had in mind is as follows: all the data is already stored on a StorNext SAN filesystem (Quantum); this would be mounted on a CentOS server through fiber optics, which in turn shares the filesystem over NFS to all the render nodes (also CentOS).
Now we've estimated that the average file sent to each node will be about 90MB, so that's roughly the speed (90MB/s) I'd like the connection to each node to average. I know that gigabit Ethernet should be able to do that (testing with iperf confirms it), but testing the speed to already existing NFS shares gives me a 55MB/s maximum. As I'm not familiar with performance tuning for network shares, I was wondering if anybody here is and could give me some info on this? I also thought of giving all the nodes 2x 1GbE ports and putting those in a bond; will this do any good, or do I have to look at the NFS server side first?
thanks,
Wessel
Hi :)
On Mon, Mar 7, 2011 at 12:12 PM, wessel van der aart <wessel@postoffice.nl> wrote:
Hi All,
I've been asked to set up a 3D render farm at our office. At the start it will contain about 8 nodes, but it should be built with growth in mind. The setup I had in mind is as follows: all the data is already stored on a StorNext SAN filesystem (Quantum); this would be mounted on a CentOS server through fiber optics, which in turn shares the filesystem over NFS to all the render nodes (also CentOS).
From what I can read, you have only one NFS server and a separate StorNext MDC. Is this correct?
Now we've estimated that the average file sent to each node will be about 90MB, so that's roughly the speed (90MB/s) I'd like the connection to each node to average. I know that gigabit Ethernet should be able to do that (testing with iperf confirms it), but testing the speed to already existing NFS shares gives me a 55MB/s maximum. As I'm not familiar with performance tuning for network shares, I was wondering if anybody here is and could give me some info on this? I also thought of giving all the nodes 2x 1GbE ports and putting those in a bond; will this do any good, or do I have to look at the NFS server side first?
Things to check would be:
- Hardware:
  * RAM and cores on the NFS server
  * # of GigE & FC ports
  * PCI technology you're using: PCIe, PCI-X, ...
  * PCI lanes & bandwidth you're using up
  * if you are sharing PCI buses between different PCI boards (FC and GigE): you should NEVER do this. If you have to share a PCI bus, share it between two PCI devices which are the same. That is, you can share a PCI bus between 2 GigE cards or between 2 FC cards, but never mix the devices.
  * cabling
  * switch configuration
  * RAID configuration
  * cache configuration on the RAID controller. Cache mirroring gives you more protection, but less performance.
- Software:
  * check the NFS config. There are some interesting tips if you google around; see the sketch below for a starting point.
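Just as an illustration of the kind of knobs people usually start with (not a recommendation; the export path, subnet, hostname and values below are made up, and the right settings depend on your hardware and workload):

# /etc/exports on the NFS server; 'async' trades safety for speed,
# see the rest of this thread before using it
/export/stornext  192.168.10.0/24(rw,async,no_subtree_check)

# /etc/sysconfig/nfs on the server: more nfsd threads for many clients
RPCNFSDCOUNT=32

# on each render node: NFSv3 over TCP with 32K transfer sizes
mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768,hard,intr \
    nfsserver:/export/stornext /mnt/renderdata

Jumbo frames on the storage network (MTU 9000 end to end) are also worth testing if your switches support them.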
HTH
Rafa
On Mar 7, 2011, at 6:12 AM, wessel van der aart <wessel@postoffice.nl> wrote:
Hi All,
I've been asked to set up a 3D render farm at our office. At the start it will contain about 8 nodes, but it should be built with growth in mind. The setup I had in mind is as follows: all the data is already stored on a StorNext SAN filesystem (Quantum); this would be mounted on a CentOS server through fiber optics, which in turn shares the filesystem over NFS to all the render nodes (also CentOS).
Now we've estimated that the average file sent to each node will be about 90MB, so that's roughly the speed (90MB/s) I'd like the connection to each node to average. I know that gigabit Ethernet should be able to do that (testing with iperf confirms it), but testing the speed to already existing NFS shares gives me a 55MB/s maximum. As I'm not familiar with performance tuning for network shares, I was wondering if anybody here is and could give me some info on this? I also thought of giving all the nodes 2x 1GbE ports and putting those in a bond; will this do any good, or do I have to look at the NFS server side first?
1GbE can do 115MB/s at 64K+ IO sizes, but at 4K IO size (NFS) 55MB/s is about it.
If you need each node to be able to read 90-100MB/s, you would need to set up a cluster filesystem using iSCSI or FC, and make sure the cluster filesystem can handle large block/cluster sizes like 64K, or that the application can issue large IOs and the scheduler does a good job of coalescing these (the VFS layer breaks them into 4K chunks) into large IOs.
It's the latency of each small IO that is killing you.
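You can see the effect for yourself with dd against an NFS mount, something along these lines (the file and mount point here are just placeholders):

$ dd if=/mnt/nfs/bigfile of=/dev/null bs=4k iflag=direct   # 4K reads: each one pays a full round trip, latency-bound
$ dd if=/mnt/nfs/bigfile of=/dev/null bs=1M iflag=direct   # 1M reads: far fewer round trips, much closer to wire speed

iflag=direct bypasses the client page cache, so every read really does cross the wire.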
-Ross
On Mon, 7 Mar 2011, Ross Walker wrote:
1GbE can do 115MB/s at 64K+ IO sizes, but at 4K IO size (NFS) 55MB/s is about it.
If you need each node to be able to read 90-100MB/s, you would need to set up a cluster filesystem using iSCSI or FC, and make sure the cluster filesystem can handle large block/cluster sizes like 64K, or that the application can issue large IOs and the scheduler does a good job of coalescing these (the VFS layer breaks them into 4K chunks) into large IOs.
It's the latency of each small IO that is killing you.
I'm not necessarily convinced it's quite that bad (here are some figures from default NFSv3 mounts under CentOS 5.5, with jumbo frames, rsize=32768,wsize=32768).
$ sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
[I verified that it'd finished when it thought it had]
10485760000 bytes (10 GB) copied, 133.06 seconds, 78.8 MB/s
umount, mount (to clear any cache):
$ dd if=testfile of=/dev/null bs=1M
10485760000 bytes (10 GB) copied, 109.638 seconds, 95.6 MB/s
This machine only has a double-bonded gig interface so with four clients all hammering at the same time, this gives:
$ dd if=/scratch/testfile of=/dev/null bs=1M
10485760000 bytes (10 GB) copied, 189.64 seconds, 55.3 MB/s
So with four clients (on single gig) and one server with two gig interfaces you're getting an aggregate rate of 220Mbytes/sec. Sounds pretty reasonable to me!
If you want safe writes (sync), *then* latency kills you.
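On the bonding question: there's nothing exotic needed for that on CentOS, it's roughly the following (device names, addresses and the mode are only examples; 802.3ad needs the switch configured for it):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.10.5
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=802.3ad miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1 with DEVICE=eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

Bear in mind that with most bonding modes a single client/server pair still only uses one link, so bonding the server helps aggregate throughput across nodes more than the speed any one node sees.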
jh
On Mar 7, 2011, at 9:55 AM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
On Mon, 7 Mar 2011, Ross Walker wrote:
1GbE can do 115MB/s at 64K+ IO sizes, but at 4K IO size (NFS) 55MB/s is about it.
If you need each node to be able to read 90-100MB/s, you would need to set up a cluster filesystem using iSCSI or FC, and make sure the cluster filesystem can handle large block/cluster sizes like 64K, or that the application can issue large IOs and the scheduler does a good job of coalescing these (the VFS layer breaks them into 4K chunks) into large IOs.
It's the latency of each small IO that is killing you.
I'm not necessarily convinced it's quite that bad (here are some figures from default NFSv3 mounts under CentOS 5.5, with jumbo frames, rsize=32768,wsize=32768).
$ sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
[I verified that it'd finished when it thought it had]
10485760000 bytes (10 GB) copied, 133.06 seconds, 78.8 MB/s
umount, mount (to clear any cache):
$ dd if=testfile of=/dev/null bs=1M
10485760000 bytes (10 GB) copied, 109.638 seconds, 95.6 MB/s
This machine only has a double-bonded gig interface so with four clients all hammering at the same time, this gives:
$ dd if=/scratch/testfile of=/dev/null bs=1M
10485760000 bytes (10 GB) copied, 189.64 seconds, 55.3 MB/s
So with four clients (on single gig) and one server with two gig interfaces you're getting an aggregate rate of 220Mbytes/sec. Sounds pretty reasonable to me!
If you want safe writes (sync), *then* latency kills you.
The OP wanted 90MB/s per node and we have no clue whether the application he is using is capable of driving 1MB block sizes.
Why wouldn't you want safe writes? Is that like saying, "...and if you care about your data"?
-Ross
On 3/8/11 8:32 AM, Ross Walker wrote:
Why wouldn't you want safe writes? Is that like saying, "...and if you care about your data"?
You don't fsync every write on a local disk. Why demand it over NFS where the server is probably less likely to crash than the writing node? That's like saying you don't care about speed - or you can afford a 10x faster array just for the once-in-several-years you might see a crash.
On Mar 8, 2011, at 9:48 AM, Les Mikesell <lesmikesell@gmail.com> wrote:
On 3/8/11 8:32 AM, Ross Walker wrote:
Why wouldn't you want safe writes? Is that like saying, "...and if you care about your data"?
You don't fsync every write on a local disk. Why demand it over NFS where the server is probably less likely to crash than the writing node? That's like saying you don't care about speed - or you can afford a 10x faster array just for the once-in-several-years you might see a crash.
Well, on my local disk I don't cache the data of tens or hundreds of clients, and a server can have a memory fault and oops just as easily as any client.
Also, I believe it doesn't sync every single write (unless the client mounts with 'sync', which is only for special cases and not what I am talking about), only when the client issues a sync or when the file is closed. The client is free to use async IO if it wants, but the server SHOULD respect the client's wishes for synchronous IO.
If you set the server export to 'async', then all IO is async whether the client wants it or not.
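For what it's worth, that's just the difference between these two kinds of export line (the path and subnet are examples):

/data  10.0.0.0/24(rw,sync,no_subtree_check)    # server commits writes to stable storage before replying
/data  10.0.0.0/24(rw,async,no_subtree_check)   # server may reply before the data is on disk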
-Ross
On Tue, 8 Mar 2011, Ross Walker wrote:
Well, on my local disk I don't cache the data of tens or hundreds of clients, and a server can have a memory fault and oops just as easily as any client.
Also, I believe it doesn't sync every single write (unless the client mounts with 'sync', which is only for special cases and not what I am talking about), only when the client issues a sync or when the file is closed. The client is free to use async IO if it wants, but the server SHOULD respect the client's wishes for synchronous IO.
If you set the server export to 'async', then all IO is async whether the client wants it or not.
I think you're right that this is how it should work, I'm just not entirely sure that's actually generally the case (whether that's because typical applications try to do sync writes or if it's for other reasons, I don't know).
Figures for just changing the server to sync, everything else identical. Client does not have 'sync' set as a mount option. Both attached to the same gigabit switch (so favouring sync as far as you reasonably could with gigabit):
sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
async: 78.8MB/sec
sync:  65.4MB/sec
That seems like a big enough performance hit to me to at least consider the merits of running async.
That said, running dd with oflag=direct appears to bring the performance up to async levels:
oflag=direct with sync nfs export:  81.5 MB/s
oflag=direct with async nfs export: 87.4 MB/s
But if you've not got control over how your application writes out to disk, that's no help.
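For reference, the direct-IO runs were just the same dd with oflag=direct added, i.e. something like:

$ sync;time (dd if=/dev/zero of=testfile oflag=direct bs=1M count=10000;sync)

oflag=direct opens the file O_DIRECT, so each 1MB write goes straight to the server rather than through the client's page cache.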
jh
Thanks for all the responses; this really gives me a good idea of what to pay attention to. The software we're using to distribute our renders is RoyalRender; I'm not sure if any optimization is possible there, I'll check it out. So far it seems that the option of using NFS stands or falls with the use of sync. Does anyone here use NFS without sync in production? Does data corrupt often? All the data sent from the nodes can be reproduced, so I would think an error is acceptable if it happens once a month or so. Are there any other options more suitable in this situation? I thought about GFS with iSCSI, but I'm not sure if that will work if the filesystem to be shared already exists in production.
Thanks, Wessel
On Tue, 8 Mar 2011 17:25:03 +0000 (GMT), John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
On Tue, 8 Mar 2011, Ross Walker wrote:
Well, on my local disk I don't cache the data of tens or hundreds of clients, and a server can have a memory fault and oops just as easily as any client.
Also, I believe it doesn't sync every single write (unless the client mounts with 'sync', which is only for special cases and not what I am talking about), only when the client issues a sync or when the file is closed. The client is free to use async IO if it wants, but the server SHOULD respect the client's wishes for synchronous IO.
If you set the server export to 'async', then all IO is async whether the client wants it or not.
I think you're right that this is how it should work, I'm just not entirely sure that's actually generally the case (whether that's because typical applications try to do sync writes or if it's for other reasons, I don't know).
Figures for just changing the server to sync, everything else identical. Client does not have 'sync' set as a mount option. Both attached to the same gigabit switch (so favouring sync as far as you reasonably could with gigabit):
sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
async: 78.8MB/sec
sync:  65.4MB/sec
That seems like a big enough performance hit to me to at least consider the merits of running async.
That said, running dd with oflag=direct appears to bring the performance up to async levels:
oflag=direct with sync nfs export:  81.5 MB/s
oflag=direct with async nfs export: 87.4 MB/s
But if you've not got control over how your application writes out to disk, that's no help.
jh
On 3/8/2011 3:14 PM, wessel van der aart wrote:
The software we're using to distribute our renders is RoyalRender; I'm not sure if any optimization is possible there, I'll check it out. So far it seems that the option of using NFS stands or falls with the use of sync. Does anyone here use NFS without sync in production? Does data corrupt often?
The difference between sync and async isn't whether the data will be corrupted or lost; it is whether the client writing it knows that each write completed. Unless the client has some reasonable way to respond to a failed write, it's not going to make a difference in practice.
All the data sent from the nodes can be reproduced, so I would think an error is acceptable if it happens once a month or so.
How often does your NFS server crash? And would it matter whether the writing software knew about the exact write that failed immediately, or only found out a few seconds later? If you are transferring money between two accounts it matters, but with rendering you'd probably redo the file anyway.
On Tue, 8 Mar 2011, wessel van der aart wrote:
Does anyone here use NFS without sync in production? Does data corrupt often?
Yes, I use it. If you had an NFS server that regularly died due to hardware faults, or kernel panics, then I wouldn't consider using it.
All the data sent from the nodes can be reproduced, so I would think an error is acceptable if it happens once a month or so.
Exactly.
Are there any other options more suitable in this situation? I thought about GFS with iSCSI, but I'm not sure if that will work if the filesystem to be shared already exists in production.
Personally I'd start with NFS first, and then prove that it's not up to the job. If it is, it's a whole lot easier than any other option.
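Proving it either way doesn't take much: kick off a big streaming read on every node at once and look at the aggregate, roughly along these lines (node names and the path are placeholders):

# streaming read from all render nodes in parallel
for node in render01 render02 render03 render04 render05 render06 render07 render08; do
    ssh $node "dd if=/mnt/renderdata/testfile of=/dev/null bs=1M" &
done
wait

Use a different file per node (or iflag=direct) if you want to take the server's page cache out of the picture.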
jh
On Mar 8, 2011, at 12:25 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
On Tue, 8 Mar 2011, Ross Walker wrote:
Well, on my local disk I don't cache the data of tens or hundreds of clients, and a server can have a memory fault and oops just as easily as any client.
Also, I believe it doesn't sync every single write (unless the client mounts with 'sync', which is only for special cases and not what I am talking about), only when the client issues a sync or when the file is closed. The client is free to use async IO if it wants, but the server SHOULD respect the client's wishes for synchronous IO.
If you set the server export to 'async', then all IO is async whether the client wants it or not.
I think you're right that this is how it should work, I'm just not entirely sure that's actually generally the case (whether that's because typical applications try to do sync writes or if it's for other reasons, I don't know).
As always YMMV, but on the whole it's how it works.
ESX is an exception; it does O_FSYNC on each write because it needs to know for certain that each one completed.
Figures for just changing the server to sync, everything else identical. Client does not have 'sync' set as a mount option. Both attached to the same gigabit switch (so favouring sync as far as you reasonably could with gigabit):
sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
async: 78.8MB/sec
sync:  65.4MB/sec
That seems like a big enough performance hit to me to at least consider the merits of running async.
Yes, disabling the safety feature will make it run faster. Just as disabling the safety on a gun will make it faster in a draw.
That said, running dd with oflag=direct appears to bring the performance up to async levels:
oflag=direct with sync nfs export:  81.5 MB/s
oflag=direct with async nfs export: 87.4 MB/s
But if you've not got control over how your application writes out to disk, that's no help.
Most apps unfortunately don't allow one to configure how they handle IO reads/writes, so you're stuck with how they behave.
A good sized battery backed write-back cache will often negate the O_FSYNC penalty.
-Ross
On Wed, 9 Mar 2011, Ross Walker wrote:
On Mar 8, 2011, at 12:25 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
I think you're right that this is how it should work, I'm just not entirely sure that's actually generally the case (whether that's because typical applications try to do sync writes or if it's for other reasons, I don't know).
As always YMMV, but on the whole it's how it works.
ESX is an exception; it does O_FSYNC on each write because it needs to know for certain that each one completed.
But what I was saying is that most applications benefit from async, which suggests most applications do noticeable amounts of sync writes. It doesn't overly matter if they *can* choose to write async if hardly any of them do.
sync;time (dd if=/dev/zero of=testfile bs=1M count=10000;sync)
async: 78.8MB/sec
sync:  65.4MB/sec
That seems like a big enough performance hit to me to at least consider the merits of running async.
Yes, disabling the safety feature will make it run faster. Just as disabling the safety on a gun will make it faster in a draw.
And if you're happy with that (in the case of this render farm, a fault at the NFS level is non-fatal) then there's no problem. I don't have a safety on my water pistol.
That said, running dd with oflag=direct appears to bring the performance up to async levels:
oflag=direct with sync nfs export:  81.5 MB/s
oflag=direct with async nfs export: 87.4 MB/s
But if you've not got control over how your application writes out to disk, that's no help.
Most apps unfortunately don't allow one to configure how they handle IO reads/writes, so you're stuck with how they behave.
A good sized battery backed write-back cache will often negate the O_FSYNC penalty.
All those figures *were* with a 256Mbyte battery backed write-back cache. It's really not hard to make those figures look a whole lot more skewed in favour of async...
jh
On Tue, 8 Mar 2011, Ross Walker wrote:
The OP wanted 90MB/s per node and we have no clue whether the application he is using is capable of driving 1MB block sizes.
I thought he wanted 90MB/s reads per node (and I've demonstrated that's doable with NFS). The only reason I'm not showing it with four clients is because that machine only has two GigE interfaces, so it's not going to happen with any protocol. That also showed ~80MB/s writes per node.
Why wouldn't you want safe writes? Is that like saying, "...and if you care about your data"?
The absolute definition of safe here is quite important. In the event of a power loss, and a failure of the UPS, quite possibly also followed by a failure of the RAID battery, you'll get data loss, as some writes won't be committed to disk despite the client thinking they are.
Now personally, I'll gladly accept that restriction in many situations where performance is critical.
jh
On Mar 8, 2011, at 12:02 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
The absolute definition of safe here is quite important. In the event of a power loss, and a failure of the UPS, quite possibly also followed by a failure of the RAID battery, you'll get data loss, as some writes won't be committed to disk despite the client thinking they are.
Don't forget about kernel panics and accidentally pulled plugs...
-Ross
On Wed, 9 Mar 2011, Ross Walker wrote:
On Mar 8, 2011, at 12:02 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
The absolute definition of safe here is quite important. In the event of a power loss, and a failure of the UPS, quite possibly also followed by a failure of the RAID battery, you'll get data loss, as some writes won't be committed to disk despite the client thinking they are.
Don't forget about kernel panics and accidentally pulled plugs...
Sure, but the kernel's always in a position to screw you over. While you're being negative, include bad memory on the RAID card, and then your life becomes really interesting.
jh
On Mar 9, 2011, at 8:44 AM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
On Wed, 9 Mar 2011, Ross Walker wrote:
On Mar 8, 2011, at 12:02 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
The absolute definition of safe here is quite important. In the event of a power loss, and a failure of the UPS, quite possibly also followed by a failure of the RAID battery, you'll get data loss, as some writes won't be committed to disk despite the client thinking they are.
Don't forget about kernel panics and accidentally pulled plugs...
Sure, but the kernel's always in a position to screw you over. While you're being negative, include bad memory on the RAID card, and then your life becomes really interesting.
Life is full of risks; one of course has to prioritize these from likely to unlikely and determine whether mitigation of the likely risks is necessary. I have personally experienced kernel panics after a kernel upgrade, so I put that down as likely, but I have yet to experience RAID write-back corruption, so I put that down as unlikely. But you never know.
-Ross
John Hodrien wrote:
On Wed, 9 Mar 2011, Ross Walker wrote:
On Mar 8, 2011, at 12:02 PM, John Hodrien <J.H.Hodrien@leeds.ac.uk> wrote:
The absolute definition of safe here is quite important. In the event of a power loss, and a failure of the UPS, quite possibly also followed by a failure of the RAID battery, you'll get data loss, as some writes won't be committed to disk despite the client thinking they are.
Don't forget about kernel panics and accidentally pulled plugs...
Sure, but the kernel's always in a position to screw you over. While you're being negative, include bad memory on the RAID card, and then your life becomes really interesting.
Hey, you forgot a failing connection on the backplane for the RAID....
mark "they were in slot 5 & 6, now they're in 14 and 15...."