Server Problems / Status

All news regarding the website or the forum will be posted here. This will include website updates, features and anything else considered suitable. Only Forum Moderators may make new posts here, but all registered users may reply.

Moderators: XtC, BuZz, menace

BuZz
Site Admin
Posts: 569
Joined: Mon Jun 10, 2002 12:52 pm
Location: Didcot, Oxfordshire

Server Problems / Status

Post by BuZz »

Hi,

You might have noticed the server has been offline for a day. First I want to apologise for that. It was partly my fault and partly a problem caused by a piece of software. The machine is now back up and running normally.

A few system files were lost due to the problem (which caused filesystem corruption). These have since been restored from backup. No other files have been damaged.

asle
Posts: 208
Joined: Fri Mar 07, 2003 11:28 pm
Location: France

Post by asle »

I _did_ notice :). Thanks for putting all this back online that soon.

Cya
Sylvain
Sylvain "Asle" Chipaux

Post by BuZz »

In case you want to know what happened.

I recently upgraded my kernel (as part of upgrading to Debian etch). I noticed some performance issues, and saw that DMA was switched off on the hard disk. I re-enabled it, and the next day it was off again. I checked the system logs and saw that early in the morning something went wrong and the disk reported problems (lost interrupt / errors).

My first thought was that the hdparm options I was using were causing some incompatibility, so I turned some off. It still happened. I grep'd the logs again, and then realised the problem was happening at exactly the same time each day: 4am.
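For anyone wanting to spot this kind of pattern without eyeballing grep output, here is a rough Python sketch that counts matching log lines by time of day. It assumes the classic syslog timestamp prefix, and the sample lines (including the "hda" device name and hostname) are made up for illustration, not copied from the real logs.

```python
import re
from collections import Counter

# Matches the classic syslog timestamp prefix, e.g. "Mar  5 04:00:12 ".
# This format is an assumption; adjust the pattern for your own logs.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):(\d{2}):\d{2}\s")


def error_times(lines, keyword):
    """Count lines containing `keyword`, grouped by hour and minute (HH:MM)."""
    counts = Counter()
    for line in lines:
        if keyword not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[f"{match.group(1)}:{match.group(2)}"] += 1
    return counts


# Made-up sample lines for illustration only; not taken from the real logs.
sample = [
    "Mar  5 04:00:12 server kernel: hda: lost interrupt",
    "Mar  5 09:13:40 server sshd[812]: session opened",
    "Mar  6 04:00:09 server kernel: hda: lost interrupt",
    "Mar  7 04:00:15 server kernel: hda: lost interrupt",
]

# Most frequent times first; a single dominant entry points at a scheduled job.
for time_of_day, count in error_times(sample, "lost interrupt").most_common():
    print(time_of_day, count)
```

If one time of day dominates the output, a cron job or other scheduled task is the obvious suspect.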

I then thought it could have been the SMART monitoring, but that wasn't configured for 4am. Then I realised.

The backup script I use does a few tasks. One (which is not actually needed) is dumping the system configuration to a file (which then gets backed up). It uses the command "hwinfo" for this. Turns out that "hwinfo", as it probes the system, causes some MAJOR problems with the hard disk communication on my PPC Mac mini. I guess this was not happening with an earlier kernel or an older version of hwinfo, which is why it's only recently shown up.

I of course disabled/removed this step from the backup script. A few minutes later I was playing with apache, and noticed that one of its config files was unreadable. uh oh.

At this point, I should have killed any task with open files, remounted the filesystem read-only and run an fsck on it. But no.. instead I typed "shutdown -rF" to do a fsck on next boot, not even thinking about the fact that all this happens well before "sshd" is loaded. Doh. But it was too late.

My server should be able to netboot to a recovery console, however, some problem stopped this. I had to wait until the next morning for an engineer to get a keyboard and monitor on the machine and instruct them to get the machine past the filesystem check/fix.

Once the machine was up and running, I used a combination of find/md5sum/sort and diff to compare the server filesystem with the most recent backups. I also noticed the filesystem corruption had been around for a few days, as some files had not successfully been backed up for a while. Luckily I have incremental backups, and the only corrupt/missing files were a few system configuration ones, which I replaced from a backup made before the problem occurred.
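In case it helps anyone doing a similar check, the same idea as the find/md5sum/sort/diff comparison can be sketched in Python. This is not the exact procedure used on the server, just a minimal illustration; the function names and the file names in the demo are hypothetical.

```python
import hashlib
import os
import tempfile


def checksum_tree(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                md5 = hashlib.md5()
                with open(path, "rb") as handle:
                    # Read in chunks so large files do not fill memory.
                    for chunk in iter(lambda: handle.read(65536), b""):
                        md5.update(chunk)
                digests[rel] = md5.hexdigest()
            except OSError:
                # Unreadable files are recorded rather than skipped.
                digests[rel] = None
    return digests


def compare_trees(live_root, backup_root):
    """Return (missing, unreadable, changed) relative paths, each sorted."""
    live = checksum_tree(live_root)
    backup = checksum_tree(backup_root)
    missing = sorted(p for p in backup if p not in live)
    unreadable = sorted(p for p in live if live[p] is None)
    changed = sorted(
        p for p in backup
        if live.get(p) is not None
        and backup[p] is not None
        and live[p] != backup[p]
    )
    return missing, unreadable, changed


if __name__ == "__main__":
    # Throwaway directories stand in for the server and its backup.
    with tempfile.TemporaryDirectory() as live, \
            tempfile.TemporaryDirectory() as backup:
        for root, text in ((live, "Listen 80\n"), (backup, "Listen 8080\n")):
            with open(os.path.join(root, "ports.conf"), "w") as handle:
                handle.write(text)
        with open(os.path.join(backup, "only-in-backup.conf"), "w") as handle:
            handle.write("x\n")
        print(compare_trees(live, backup))
```

Note that "changed" also lists files that were legitimately edited since the backup was taken, so the result still needs a human to look through it.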

Anyway, we are back on, and this problem should happen again.

XtC
Posts: 628
Joined: Wed Jun 12, 2002 6:26 pm
Location: Rossendale, England

Post by XtC »

Ey up!
BuZz wrote:Anyway, we are back on, and this problem should happen again.
I bloody hope not! :shock:
Cheers!

Post by BuZz »

LOL. trust you to spot my typo..

;-)
