On Jul 1, 2019, at 10:10 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On 2019-07-01 10:01, Warren Young wrote:
On Jul 1, 2019, at 8:26 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
RAID function, which boils down to simple, short, easy to debug well program.
I didn't intend to start software vs hardware RAID flame war
Where is this flame war you speak of? I’m over here having a reasonable discussion. I’ll continue being reasonable, if that’s all right with you. :)
Now, commenting with all due respect to famous person who Warren Young definitely is.
Since when? I’m not even Internet Famous.
RAID firmware will be harder to debug than Linux software RAID, if only because of easier-to-use tools.
I myself debug neither firmware (or "microcode", speaking the language as it was some 30 years ago)
There is a big distinction between those two terms; they are not equivalent terms from different points in history. I had a big digression explaining the difference, but I’ve cut it as entirely off-topic.
It suffices to say that with hardware RAID, you’re almost certainly talking about firmware, not microcode, not just today, but also 30 years ago. Microcode is a much lower level thing than what happens at the user-facing product level of RAID controllers.
In both cases it is someone else who does the debugging.
If it takes three times as much developer time to debug a RAID card firmware as it does to debug Linux MD RAID, and the latter has to be debugged only once instead of multiple times as the hardware RAID firmware is reinvented again and again, which one do you suppose ends up with more bugs?
You are speaking as the person who routinely debugs Linux components.
I have enough work fixing my own bugs that I rarely find time to fix others’ bugs. But yes, it does happen once in a while.
- Linux kernel itself, which is huge;
…under which your hardware RAID card’s driver runs, making it even more huge than it was before that driver was added.
You can’t zero out the Linux kernel code base size when talking about hardware RAID. It’s not like the card sits there and runs in a purely isolated environment.
It is a testament to how well-debugged the Linux kernel is that your hardware RAID card runs so well!
All of the above can potentially panic kernel (as they all run in kernel context), so they all affect reliability of software RAID, not only the chunk of software doing software RAID function.
When the kernel panics, what do you suppose happens to the hardware RAID card? Does it keep doing useful work, and if so, for how long?
What’s more likely these days: a kernel panic or an unwanted hardware restart? And when that happens, which is more likely to fail, a hardware RAID without BBU/NV storage or a software RAID designed to be always-consistent?
I’m stripping away your hardware RAID’s advantage in NV storage to keep things equal in cost: my on-board SATA ports for your stripped-down hardware RAID card. You probably still paid more, but I’ll give you that, since you’re using non-commodity hardware.
Now that they’re on even footing, which one is more reliable?
hardware RAID "firmware" program being small and logically simple
You’ve made an unwarranted assumption.
I just did a blind web search and found this page:
https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361...
…on which we find that the RAID firmware for the card is 4.1 MB, compressed.
Now, that’s considered a small file these days, but realize that there are no 1024 px² icon files in there, no massive XML libraries, no language internationalization files, no high-level language runtimes… It’s just millions of low-level highly-optimized CPU instructions.
From experience, I’d expect it to take something like 5-10 person-years to reproduce that much code.
That’s far from being “small and logically simple.”
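To put a rough number on "millions," here is a back-of-the-envelope sketch. Only the 4.1 MB figure comes from the page above. The compression ratio and the bytes-per-instruction value are my assumptions, picked for illustration, and the sketch treats the whole image as code, which real firmware won't quite be.

```python
# Rough instruction-count estimate for the firmware image.
# Only COMPRESSED_MB comes from the download page; the other two
# constants are assumptions made purely for illustration.
COMPRESSED_MB = 4.1          # compressed download size quoted above
ASSUMED_COMPRESSION = 2.0    # hypothetical ratio for compiled machine code
ASSUMED_BYTES_PER_INSN = 4   # assumes a fixed-width 32-bit instruction encoding


def estimated_instructions(compressed_mb, ratio, bytes_per_insn):
    """Estimate instruction count, treating the whole image as code."""
    uncompressed_bytes = compressed_mb * 1_000_000 * ratio
    return round(uncompressed_bytes / bytes_per_insn)


estimate = estimated_instructions(
    COMPRESSED_MB, ASSUMED_COMPRESSION, ASSUMED_BYTES_PER_INSN
)
print(f"roughly {estimate:,} instructions under these assumptions")
```

Change the assumed constants however you like; as long as they stay plausible, the result stays in the millions.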
it usually runs on RISC architecture CPU, and introduce bugs programming for RISC architecture IMHO is more difficult than when programming for i386 and amd64 architectures.
I don’t think I’ve seen any such study, and if I did, I’d expect it to only be talking about assembly language programming.
Above that level, you’re talking about high-level language compilers, and I don’t think the underlying CPU architecture has anything to do with the error rates in programs written in high-level languages.
I’d expect RAID firmware to be written in C, not assembly language, which means the CPU has little or nothing to do with programmer error rates.
Thought experiment: does Linux have fewer bugs on ARM than on x86_64?
I even doubt that you can dig up a study showing that assembly language programming on CISC is significantly more error-prone than RISC programming in the first place. My experience says that error rates in programs are largely a function of the number of lines of code, and that puts RISC at a severe disadvantage. For ARM vs x86, the instruction ratio is roughly 3:1 for equivalent user-facing functionality.
There are many good reasons why the error rate in programs should be so strongly governed by lines of code:
1. More LoC is more chances for typos and logic errors.
2. More LoC means a smaller proportion of the solution fits on the screen at once, hiding information from the programmer. Out of sight, out of mind.
3. More LoC takes more time to compose and type, so a programmer writing fewer LoC has more time to debug and test, all else being equal.
This is also why almost no one writes in assembly any more, and those who do rarely write *just* in assembly.
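The lines-of-code argument above can be sketched as simple arithmetic. This assumes defects scale linearly with code size at the same rate on both architectures. The defect density and the baseline line count are hypothetical placeholders, not measurements; only the 3:1 ratio comes from my estimate above.

```python
# Sketch of "error count is largely a function of lines of code".
# Assumption: the same defect density applies to both architectures.
ASSUMED_DEFECTS_PER_KLOC = 5   # hypothetical density, not a measured value
CISC_LINES = 10_000            # hypothetical size of an x86 assembly program
RISC_TO_CISC_RATIO = 3         # the rough 3:1 ARM-to-x86 ratio claimed above


def expected_defects(lines, defects_per_kloc):
    """Expected defect count under a linear defects-per-KLoC model."""
    return lines * defects_per_kloc / 1000


risc_lines = CISC_LINES * RISC_TO_CISC_RATIO
cisc_defects = expected_defects(CISC_LINES, ASSUMED_DEFECTS_PER_KLOC)
risc_defects = expected_defects(risc_lines, ASSUMED_DEFECTS_PER_KLOC)

print(f"CISC: {cisc_defects:.0f} expected defects")
print(f"RISC: {risc_defects:.0f} expected defects")
print(f"ratio: {risc_defects / cisc_defects:.1f}x")
```

Under a linear model the assumed density cancels out, so the expected defect ratio simply follows the line-count ratio.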