Hi,
You might have noticed the server has been offline for a day. First, I want to apologise for that. It was partly my fault and partly a problem caused by a piece of software. The machine is now back up and running normally.
A few system files were lost due to the problem (which caused filesystem corruption). These have since been restored from backup. No other files were damaged.
Server Problems / Status
In case you want to know what happened:
I recently upgraded my kernel (as part of upgrading to Debian etch). I noticed some performance issues and saw that DMA was switched off on the hard disk. I re-enabled it, but the next day it was off again. I checked the system logs and saw that early in the morning something had gone wrong and the disk had reported problems (lost interrupts / errors).
My first thought was that the hdparm options I was using were causing some incompatibility, so I turned some of them off. It still happened. I grepped the logs again, and then realised the problem was happening at exactly the same time each day: 4am.
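The grep that gave the game away was roughly the following (log path, hostname, and exact message text here are illustrative; the real kernel messages were along the lines of "lost interrupt"):

```shell
#!/bin/sh
# Build a small sample syslog so this is self-contained; on the real
# server you would point grep at /var/log/syslog or /var/log/kern.log.
cat > /tmp/sample-syslog <<'EOF'
Mar 10 04:00:12 mini kernel: hda: lost interrupt
Mar 10 04:00:14 mini kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 11 04:00:09 mini kernel: hda: lost interrupt
Mar 12 04:00:11 mini kernel: hda: lost interrupt
EOF

# Pull out just the timestamps of the disk errors; the repeating
# "04:00" hour is what points at a daily cron job rather than flaky hardware.
grep 'lost interrupt' /tmp/sample-syslog | awk '{print $3}'
```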
I then thought it could have been the SMART monitoring, but that wasn't scheduled for 4am. Then I realised.
The backup script I use does a few tasks. One (which is not actually needed) is to dump the system configuration to a file, which then gets backed up. It uses the command "hwinfo" for this. It turns out that "hwinfo", as it probes the system, causes some MAJOR problems with hard disk communication on my PPC Mac mini. I guess this wasn't happening with an earlier kernel or an older version of hwinfo, which is why it has only recently shown up.
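Structurally, the offending step looked something like this (filenames are illustrative, not the script's real ones). The fix was simply to drop the hwinfo call; if you want something similar, a passive dump that only reads from uname and /proc doesn't poke the hardware at all:

```shell
#!/bin/sh
# Sketch of the relevant step of the backup script (names illustrative).
DUMP=/tmp/system-config-dump.txt

# The original, problematic step: hwinfo actively probes devices,
# which on this PPC Mac mini upset the IDE controller mid-backup.
#hwinfo --all > "$DUMP"

# A passive replacement: record kernel and CPU info without probing.
uname -a > "$DUMP"
[ -r /proc/cpuinfo ] && cat /proc/cpuinfo >> "$DUMP"

# ... the rest of the backup (tar/rsync of the dump and data) follows ...
```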
I of course disabled/removed this step from the backup script. A few minutes later I was playing with Apache and noticed that one of its config files was unreadable. Uh oh.
At this point, I should have killed any tasks with open files, remounted the filesystem read-only, and run fsck on it. But no... instead I typed "shutdown -rF" to force an fsck on the next boot, not even thinking about the fact that all of this happens well before sshd is loaded. Doh. But it was too late.
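For anyone wondering, the -F flag to shutdown on Debian works by dropping a flag file, /forcefsck, in the root of the filesystem; the boot scripts see it and force a full fsck before most services (including sshd) come up. A sketch of the mechanism, using a scratch directory rather than the real root so it is safe to run anywhere:

```shell
#!/bin/sh
# "shutdown -rF" creates /forcefsck; the boot-time check scripts test
# for it and force a full fsck. Using a scratch dir instead of /.
FAKEROOT=$(mktemp -d)

touch "$FAKEROOT/forcefsck"   # this is what shutdown -F does to the real /

# What the boot-time check roughly amounts to:
if [ -f "$FAKEROOT/forcefsck" ]; then
    echo "would run: fsck -f on the root filesystem"
fi
```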
My server should be able to netboot to a recovery console, but some problem prevented this. I had to wait until the next morning for an engineer to get a keyboard and monitor on the machine, and instruct them to get it past the filesystem check/fix.
Once the machine was up and running, I used a combination of find, md5sum, sort, and diff to compare the server filesystem with the most recent backups. I also noticed the filesystem corruption had been around for a few days, as some files had not been backed up successfully for a while. Luckily I have incremental backups, and the only corrupt/missing files were a few system configuration ones, which I replaced from a backup taken before the problem occurred.
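The comparison boiled down to generating a sorted checksum list on each side and diffing the two. A minimal sketch, using two scratch directories in place of the live filesystem and the restored backup:

```shell
#!/bin/sh
# Sketch of the checksum comparison, with scratch dirs standing in for
# the live filesystem and the backup tree.
LIVE=$(mktemp -d); BACKUP=$(mktemp -d)
echo "ok"      > "$LIVE/good.conf";  echo "ok"   > "$BACKUP/good.conf"
echo "garbage" > "$LIVE/httpd.conf"; echo "real" > "$BACKUP/httpd.conf"

# Checksum every file relative to its tree so the paths line up, sort,
# and diff: any differing line is a corrupt, changed, or missing file.
( cd "$LIVE"   && find . -type f -exec md5sum {} \; | sort ) > /tmp/live.sums
( cd "$BACKUP" && find . -type f -exec md5sum {} \; | sort ) > /tmp/backup.sums
diff /tmp/live.sums /tmp/backup.sums || true
```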
Anyway, we are back online, and this problem should not happen again.