Since someone previously complained about my lack of definitions and expansion ...
On Fri, 2005-07-08 at 17:43 -0500, Bryan J. Smith wrote:
Shared memory systems (especially NUMA)
NUMA = Non-Uniform Memory Access (also expanded as Non-Uniform Memory Architecture)
Most platforms have adopted it, including Opteron, Power, SPARC, etc., although commodity PowerPC and Intel systems have not (not even Itanium 2). There are proprietary, extremely costly Xeon and Itanium systems that use NUMA.
GbE
Gigabit Ethernet -- 1Gbps = 1000Mbps ~ 100MBps = 0.1GBps with typical 8/10 data-level encoding.
That's before including the overhead of layer-2 frame, let alone typical layer-3 packet and layer-4 transport parsing.
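As a quick sanity check on the conversion above, here is a minimal Python sketch of the rule of thumb of roughly 10 line bits per payload byte. The function name and the 10-bits-per-byte default are illustrative assumptions, and framing/protocol overhead is ignored entirely.

```python
def approx_mbytes_per_sec(line_rate_gbps, line_bits_per_byte=10):
    """Rule-of-thumb payload rate in MBps for an 8/10-encoded serial link.

    Assumes 10 bits on the wire per data byte (an approximation) and
    ignores all layer-2/3/4 overhead.
    """
    return line_rate_gbps * 1000 / line_bits_per_byte


# The same rule applies to any link quoted with 8/10 encoding.
for name, gbps in [("1Gbps link", 1), ("2Gbps link", 2), ("4Gbps link", 4)]:
    print(f"{name}: ~{approx_mbytes_per_sec(gbps):.0f} MBps")
```

Real throughput will come in lower once frame, packet and transport overhead are counted.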
or FC interconnected.
Fibre Channel, which is a storage-oriented networking stack. Typical speeds for Fibre Channel Arbitrated Loop (FC-AL) are 2-4Gbps = 2000-4000Mbps ~ 200-400MBps (0.2-0.4GBps) with typical 8/10 data-level encoding.
The reality is that a dual-2.8GHz Xeon MCH
Memory Controller Hub (MCH) aka Front Side Bus (slang/Bottleneck) (FSB). All processors contend for the same GTL logic "bus" to the MCH, same with all memory and all I/O -- in a literal "hub" type architecture (_all_ nodes receive from a single transmitting node).
On Itanium2, Intel calls this Scalable Node Architecture (SNA), which is not true at all.
Bandwidth of the latest AGTL+ is up to 8.4GBps in DDR2/PCIe implementations, although it can be widened to 16.8GBps. But remember, that is _shared_ by _all_ CPU, memory _and_ I/O -- and only one can talk to another at a time because of the MCH. Even if a proprietary NUMA solution is used, _all_ I/O _still_ goes through that MCH (to reach the ICH chips).
NUMA/UPA
Ultra Port Architecture (UPA), which is Sun's crossbar "switch" interconnect for UltraSPARC I and II. Most RISC (including Power, but typically not PowerPC) platforms use a "switch" instead of a "hub" -- including EV6 (Alpha 21264, Athlon "32-bit"), UPA and others. This allows the UPA "port" to connect to a variety of system "nodes," and up to even 128 "nodes" -- to 1-2+GBps per "node" in the "partial mesh." Performance is typically 1GBps per UPA "port," with 2 processors typical in a daughtercard with local memory (hence NUMA).
The "Fireplane" is an advancement of the UPA for UltraSPARC III and IV which increases performance to 4.8GBps per "node."
Opteron, by comparison, has direct 6.4GBps for DDR memory _plus_ up to 3 HyperTransport links of 8.0GBps each (6.4GBps in previous versions -- 1 for 100 series, 2 for 200, 3 for 800).
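To put the shared-hub versus per-node numbers side by side, here is a rough Python sketch using only the peak figures quoted in this post. The names are made up for illustration, the FSB split is an optimistic even split among CPUs only (memory and I/O share the same bus), and summing link rates gives a theoretical ceiling, not a sustained figure.

```python
FSB_SHARED_GBPS = 8.4    # AGTL+ figure quoted above, shared by everything
OPTERON_MEM_GBPS = 6.4   # direct DDR memory bandwidth per Opteron
HT_LINK_GBPS = 8.0       # per HyperTransport link
HT_LINKS = {"100": 1, "200": 2, "800": 3}  # links per series, as listed above


def opteron_peak_gbps(series):
    """Theoretical per-CPU peak: local memory plus all HyperTransport links."""
    return OPTERON_MEM_GBPS + HT_LINKS[series] * HT_LINK_GBPS


def fsb_share_gbps(cpus):
    """Best-case even split of the shared bus among CPUs (ignores I/O traffic)."""
    return FSB_SHARED_GBPS / cpus


print(f"Dual Xeon, per CPU (best case): {fsb_share_gbps(2):.1f} GBps")
for series in ("100", "200", "800"):
    print(f"Opteron {series} series, per CPU peak: {opteron_peak_gbps(series):.1f} GBps")
```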
sub-0.5GBps for the FC-AL on even PCI-X 2.0.
PCI-X 1.0 is up to 133MHz @ 64-bit = 1.0GBps (1 slot configuration). PCI-X 2.0 is up to 266MHz @ 64-bit = 2.0GBps (2 slot configuration).
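The PCI-X figures are just clock times bus width. A minimal Python sketch of that arithmetic, assuming one transfer per quoted clock (the function name is illustrative); it gives roughly 1.06 and 2.13GBps, which I round down to 1.0 and 2.0 above.

```python
def parallel_bus_peak_gbps(clock_mhz, width_bits):
    """Theoretical peak of a parallel bus in GBps (decimal units).

    Assumes one transfer per clock at the quoted rate and no protocol overhead.
    """
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * bytes_per_transfer / 1e9


for name, mhz in [("PCI-X 1.0", 133), ("PCI-X 2.0", 266)]:
    print(f"{name}: {parallel_bus_peak_gbps(mhz, 64):.2f} GBps peak")
```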
Real "end-to-end" performance is typically much lower. E.g., Intel has yet to reach 0.8GBps for InfiniBand over PCI-X, which is the "lowest overhead" of a communication protocol for clusters. FC introduces far more, and GbE (~0.1GBps before overhead) even more. They do _not_ bother with 10GbE (~1GBps before overhead) on PCI-X (but use custom 600-1200MHz XScale microcontrollers with direct 10GbE interfaces short of the PHY).
PCIe is supposed to address this, with up to 4GBps bi-directional in a 16-lane configuration, but the MCH+ICH design is proving unable to break a sustained 1GBps in many cases because it is a peripheral interconnect, not a system interconnect. Graphics cards are typically just shunting to/from system memory, and do it without accessing the CPU, and Host Bus Adapters (HBA) do the same for FC and GbE. I.e., there's _no_way_ to talk to the CPU through the MCH+ICH kludge at those speeds, so the "processing is localized."
HTX InfiniBand (Infiniband directly on the HT).
HyperTransport eXtension (HTX) is a system (not peripheral) interconnect that allows clustering of Opteron with _native_ HyperTransport signaling. InfiniBand is capable of 1.8GBps end-to-end -- and that's before figuring the fact that _each_ Opteron can have _multiple_ HyperTransport connections. This is a _commodity_ design, whereas the few, capable Intel Xeon/Itanium2 clusters are quite proprietary and extremely expensive (like SGI's Altix).
fab lines as their SCSI equivalents, with the same vibration specs and MTBF numbers. They are not "commodity" [S]ATA drives, of
Mean Time Between Failures (MTBF)
Commodity Disk: 400,000 hours (50,000 starts, 8 hours operation)
Enterprise Disk: 1,400,000 hours (24x7 operation)
It has _nothing_ to do with interface. 40, 80, 120, 160, 200, 250, 300, 320, 400GB drives are "commodity disk" designs, 9, 18, 36, 73, 146GB are "enterprise disk" designs. The former have 3-8x the vibration (less precise alignment) at 5,400-7,200 RPM than the latter at 10,000-15,000 RPM. There are SCSI drives coming off "commodity disk" lines (although they might test to higher tolerances, less vibration), and SATA drives coming off "enterprise disk" lines.
Until recently, most materials required commodity disk to be operating at 40C or less, whereas Enterprise is 55C. Newer commodity disks can take 60C operating temps -- hence the return to 3-5 year warranties -- but the specs are still only for 8x5 operation -- some vendors (Hitachi-IBM) only rate 14x5 as "worst case usage," with the warranty voided beyond that.
Some vendors are introducing "near-line disk," which are commodity disks that test to higher tolerances, and they are rated as 24x7 "network managed" -- i.e., the system powers them down (non-24x7 operation, just the system is), and spins them back up on occasion (you should _never_ let a commodity disk sit, which is why they are _poor_ for "off-line" backup).
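For anyone who wants to see what the MTBF ratings above mean per year, here is a hedged Python sketch that converts MTBF into an annualized failure rate. It assumes the usual constant-failure-rate (exponential) model, and it assumes "8 hours operation" means 8 powered hours per day, every day; both are my assumptions for illustration, not vendor figures.

```python
import math


def annualized_failure_rate(mtbf_hours, powered_hours_per_year):
    """Probability of failure within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-powered_hours_per_year / mtbf_hours)


# Each drive class is evaluated only at the duty cycle its rating assumes.
commodity = annualized_failure_rate(400_000, 8 * 365)     # 8 hours/day (assumed)
enterprise = annualized_failure_rate(1_400_000, 24 * 365)  # 24x7

print(f"Commodity at 8 hours/day: {commodity:.2%} per year")
print(f"Enterprise at 24x7:       {enterprise:.2%} per year")
```

The two ratings assume very different duty cycles, so the raw MTBF hours are not directly comparable, and the commodity rating says nothing about 24x7 use.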
Anyhoo, the point is that I can build an ASIC that can interface into 4-16 PHY (physical interface) chips that are "point-to-point" (host to drive electronics) and drive them _directly_ and _independently_. That's what intelligent SATA Host Adapters do -- especially with "enterprise" 10,000 RPM SATA devices -- switch fabric + low-latency. SCSI is a "shared bus" or "multiple shared buses" of parallel design -- fine for yesteryear, but massive overhead and cost for today. The next move is "Serial Attached SCSI" (SAS), which will replace parallel SCSI, using a combination of "point-to-point" ASICs in a more distributed/managed design.