Yesterday I was copying a few dump files (backups), about 1GB in size, from a Samba share on my CentOS 3.4 box to my Windows machine, when the CentOS box stopped responding.
The HDD LED was on and after I connected a monitor all I could see was this error message over and over again:
"usb-uhci.c: host controller halted, trying to restart"
I wasn't able to log in or anything, so I saw no other solution than pressing the reset button.
When it rebooted it forced a disk check, but was unable to mount /dev/hde2 (which usually mounts on /var!), and pretty much nothing works without /var. The error message mount gave was something like "Unable to mount /dev/hde2: invalid argument".
The files I was copying are located on /dev/md0 (RAID 0 over 4 disks) and that still worked fine; / is mounted on /dev/hde1 and that also worked as expected.
I removed /var from /etc/fstab and restored a backup of /var to the directory /var, and then I could boot normally again.
Trying to save what was on /dev/hde2, I ran e2fsck -p /dev/hde2, which corrected a ton of errors (and deleted a lot of data). I then mounted /dev/hde2 on another folder and restored my backups, so I only lost a few hours of data. After I added /var to /etc/fstab again, everything worked as normal.
My question is: what happened, and what can I do to avoid this again? If /dev/hde2 had been a RAID 1, would it have rebuilt? Should I move /var to a RAID 1?
I have copied large files like this before without problems. Was it just bad luck, or should I expect it to happen again?
/dev/hde is attached to a cheap IDE Ultra ATA 133 PCI controller card (Silicon Image) that has worked flawlessly for about a year. Can that be broken? Right now it seems OK again. I have a replacement for it, but I would rather not replace it if it isn't necessary.
Any suggestions are appreciated.
Best regards,
Ulrik
PS. Try backups! You won't regret it :)
On Wed, 2005-08-03 at 10:23 +0200, Ulrik S. Kofod wrote:
> Yesterday I was copying a few dump files (backups), about 1GB in size, from a Samba share on my CentOS 3.4 box to my Windows machine, when the CentOS box stopped responding.
> The HDD LED was on and after I connected a monitor all I could see was this error message over and over again:
> "usb-uhci.c: host controller halted, trying to restart"
> I wasn't able to log in or anything, so I saw no other solution than pressing the reset button.
> When it rebooted it forced a disk check, but was unable to mount /dev/hde2 (which usually mounts on /var!), and pretty much nothing works without /var. The error message mount gave was something like "Unable to mount /dev/hde2: invalid argument".
Hard to see what that has to do with usb-uhci - hde is apparently on an IDE controller and usb-uhci is USB. Could be a memory or MB issue. I'd try running memtest86+ and monitoring the logs for errors. Checking all disks with SMART is also indicated.
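For what it's worth, the "monitor logs for errors" step could be roughed out in a few lines of Python. This is only a sketch under assumptions: the first pattern is the message quoted in this thread, the `hde` pattern is a generic guess at what disk-related kernel lines might contain, and the names `PATTERNS`, `count_log_hits` and `scan_file` are made up for illustration.

```python
import re
from collections import Counter

# Patterns to watch for. "uhci_halted" is the message quoted in this thread;
# "hde_mention" is an ASSUMED, generic pattern that flags any line naming hde.
PATTERNS = {
    "uhci_halted": re.compile(r"usb-uhci\.c: host controller halted"),
    "hde_mention": re.compile(r"\bhde\d*\b"),
}

def count_log_hits(lines, patterns=PATTERNS):
    """Count how many lines match each named pattern."""
    hits = Counter()
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits[name] += 1
    return hits

def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return count_log_hits(handle)

# Example use (the path is an assumption about where syslog writes):
#   print(scan_file("/var/log/messages"))
```

Run from cron with the output mailed or copied to another machine, something like this might leave a trace even if /var is lost again.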
> The files I was copying are located on /dev/md0 (RAID 0 over 4 disks) and that still worked fine; / is mounted on /dev/hde1 and that also worked as expected.
> I removed /var from /etc/fstab and restored a backup of /var to the directory /var, and then I could boot normally again.
> Trying to save what was on /dev/hde2, I ran e2fsck -p /dev/hde2, which corrected a ton of errors (and deleted a lot of data). I then mounted /dev/hde2 on another folder and restored my backups, so I only lost a few hours of data. After I added /var to /etc/fstab again, everything worked as normal.
> My question is: what happened, and what can I do to avoid this again? If /dev/hde2 had been a RAID 1, would it have rebuilt? Should I move /var to a RAID 1?
Wouldn't hurt, but why not do the whole system on RAID 1 if you're going to that trouble?
> I have copied large files like this before without problems. Was it just bad luck, or should I expect it to happen again?
Again - I'd suspect some underlying hardware problems.
> /dev/hde is attached to a cheap IDE Ultra ATA 133 PCI controller card (Silicon Image) that has worked flawlessly for about a year. Can that be broken? Right now it seems OK again. I have a replacement for it, but I would rather not replace it if it isn't necessary.
I'd guess disks, memory, and MB before the controller - emphasis on GUESS.
> PS. Try backups! You won't regret it :)
Follow your own advice! :-)
Good luck, Phil
Phil Schaffner said:
> On Wed, 2005-08-03 at 10:23 +0200, Ulrik S. Kofod wrote:
>> The HDD LED was on and after I connected a monitor all I could see was this error message over and over again:
>> "usb-uhci.c: host controller halted, trying to restart"
> Hard to see what that has to do with usb-uhci - hde is apparently on an IDE controller and usb-uhci is USB. Could be a memory or MB issue. I'd try running memtest86+ and monitoring the logs for errors. Checking all disks with SMART is also indicated.
After I had fixed all the errors and everything was running again, I tried copying the files again and the same thing happened. It wiped the disks, gave the same error message on the screen, and the box was pretty much a brick again. But this time I was prepared: I had copied everything to a smaller temp box that could take over if this one failed again.
There are no USB devices connected to that box; the only USB device involved was the HDD I was copying to, but that was connected to the Windows box. I find it hard to believe that an HDD connected to a USB port on a Windows box can cause the above error and wipe the disks on my Linux box via a Samba share.
I did add 512MB of memory to the Linux box not that long ago, but I have been running memtest86+ for over 24 hours without any errors reported at all.
I mounted the disk on another computer to have a look at the logs, but the logs were destroyed and I obviously don't have a backup of the log from the time of the crash.
SMART status on all HDDs is GOOD. I have run a test utility on all the HDDs and they were all good.
I'm not sure how I can test the motherboard. Any suggestions? It is an Asus A7V266-C.
A friend of mine told me that computers with AMD processors can have problems if power save is enabled in the BIOS (as I had), something about the north bridge being unavailable. Could this be the problem, and if so, why did it first occur now? It can't be just bad luck or coincidence when it happens two days in a row.
Now I have disabled the USB controller and power save in the BIOS and hope that helps, but I'm not very comfortable not knowing what actually caused this.
I also tried to see if there have been any updates to uhci, but I'm not sure I'm looking in the right place; yum.log doesn't tell much. Would uhci be part of a kernel update?
Now I'm running a burn-in test (lucifer) on the storage to see if that can provoke it to fail again.
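To illustrate the idea behind such a storage burn-in, here is a minimal write-then-read-back check in Python. It is not lucifer and makes no claim about how that tool works; it is a stand-in sketch, and the names `write_and_verify` and `burn_in` plus all parameters are hypothetical. Note that the read-back may be served from the page cache rather than the disk unless the file is larger than RAM.

```python
import hashlib
import os

def write_and_verify(path, chunks=16, chunk_size=1024 * 1024):
    """Write `chunks` blocks of random data, read them back, compare SHA-256.

    Returns True when the data read back matches what was written.
    """
    written = hashlib.sha256()
    with open(path, "wb") as out:
        for _ in range(chunks):
            chunk = os.urandom(chunk_size)
            written.update(chunk)
            out.write(chunk)
        out.flush()
        os.fsync(out.fileno())  # ask the OS to push the data to the device
    read_back = hashlib.sha256()
    with open(path, "rb") as src:
        # Read in chunk_size pieces until an empty bytes object signals EOF.
        for chunk in iter(lambda: src.read(chunk_size), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()

def burn_in(path, passes=3, chunks=16, chunk_size=1024 * 1024):
    """Repeat the check `passes` times and return the number of failed passes."""
    failures = 0
    for _ in range(passes):
        if not write_and_verify(path, chunks=chunks, chunk_size=chunk_size):
            failures += 1
    if os.path.exists(path):
        os.remove(path)  # clean up the scratch file
    return failures
```

Pointing `path` at a scratch file on the suspect filesystem and raising `chunks` well above the defaults would be closer to the 1GB copies that triggered the crash.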
> I'd guess disks, memory, and MB before the controller - emphasis on GUESS.
Your guess is very much appreciated and without doubt better than mine, as I really can't see what could be causing the problem.
Best regards,
Ulrik