On Tue, Jan 6, 2015 at 6:12 PM, Les Mikesell wrote:
I've had a few systems with a lot of RAM and very busy filesystems come up with filesystem errors that took a manual 'fsck -y' after what should have been a clean reboot. This is particularly annoying on remote systems where I have to talk someone else through the recovery.
Is there some time limit on the cache write with a 'reboot' (no options) command or is ext4 that fragile?
I'd say there's no limit on the amount of time the kernel waits until the blocks have been written to disk; writeback is driven by these parameters:
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
i.e., if the dirty data cached in RAM is older than 30 s (vm.dirty_expire_centisecs) or larger than 10% of available RAM (vm.dirty_background_ratio), the kernel will try to flush it to disk. Depending on how much data needs to be flushed at poweroff/reboot time, this could have a significant effect on the time taken.
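To make those numbers concrete, here is a rough sketch that converts the settings above into seconds and bytes. The function name and the available_bytes argument are made up for illustration. The kernel's own idea of "available" memory is more involved than a single figure, so treat the results as estimates only.

```python
# Values copied from the sysctl output above; on a live system they
# could be read from the files under /proc/sys/vm/ instead.
VM_SETTINGS = {
    "dirty_background_bytes": 0,
    "dirty_background_ratio": 10,
    "dirty_bytes": 0,
    "dirty_expire_centisecs": 3000,
    "dirty_ratio": 20,
    "dirty_writeback_centisecs": 500,
}


def writeback_thresholds(settings, available_bytes):
    """Translate vm.dirty_* settings into seconds and bytes (estimate).

    available_bytes is a hypothetical figure supplied by the caller,
    not the value the kernel computes internally.
    """
    # A non-zero *_bytes setting takes the place of the matching *_ratio.
    background = settings["dirty_background_bytes"] or (
        available_bytes * settings["dirty_background_ratio"] // 100
    )
    blocking = settings["dirty_bytes"] or (
        available_bytes * settings["dirty_ratio"] // 100
    )
    return {
        # Centiseconds to seconds.
        "expire_seconds": settings["dirty_expire_centisecs"] / 100,
        "flusher_interval_seconds": settings["dirty_writeback_centisecs"] / 100,
        # Amount of dirty data at which background flushing starts.
        "background_flush_bytes": background,
        # Amount of dirty data at which writers are made to flush themselves.
        "blocking_flush_bytes": blocking,
    }
```

Calling writeback_thresholds(VM_SETTINGS, 192 * 1024**3) treats all 192 GiB as available, which is an upper bound. With that input the background threshold comes out at roughly 19 GiB, so a busy box can be holding a lot of unwritten data when the reboot starts.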
Regarding systems with lots of RAM, I've never seen such a behaviour on a few 192 GB RAM servers I administer. Granted, your system could be tuned in a different way or have some other configuration.
TBH I'm not confident enough to give a definitive answer re the data not being totally flushed before reboot. I'd investigate:
- Whether this happens on every reboot or just on some.
- Whether your RAM is OK (the FS errors could come from that!).
- Whether your disks/SAN are caching writes. (Maybe they are, and the OS thinks the data has been flushed to disk when it hasn't.)
- Filesystem mount options that might interfere (nobarrier, commit, data...)
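For the mount options in the last item, something like this could list them per filesystem. It is only a sketch: the function name is made up, and it assumes input in /proc/mounts format (device, mountpoint, fstype, options, dump, pass). Note that ext4 normally reports data=ordered by default, so an entry in the result is something to review, not necessarily a problem.

```python
def flush_related_options(mounts_text, fstypes=("ext3", "ext4")):
    """Return {mountpoint: [options]} for options that affect flushing.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    findings = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and filesystems we are not interested in.
        if len(fields) < 4 or fields[2] not in fstypes:
            continue
        flagged = [
            opt
            for opt in fields[3].split(",")
            if opt in ("nobarrier", "barrier=0")
            or opt.startswith(("commit=", "data="))
        ]
        if flagged:
            findings[fields[1]] = flagged
    return findings
```

On a live system the input would be open("/proc/mounts").read().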
HTH
~f