Server Crash NT4.0 and Restored

Posted at 12:43:59 AM in Hardware (8) | Read count: 2258

What a great way to start the holidays.  Just as everyone was wrapping up to leave for the holidays, we discovered that the server drive had crashed.  All the diagnostics pointed to the drive being the issue.

The server is a Dell PowerEdge 2450 using the integrated RAID controller.  The indication to the user was that files couldn't be read and programs aborted.  However, on the server console the errors were write-behind cache issues: $Mft couldn't be written to and some data may have been lost.  The same error was displayed for several folders and files, all on that one drive.  The OS on this server is NT4.0.

The OS is installed on a striped array of 2 drives, 9.1Gig each, giving a total of 18Gig.  RAID 0 is not a configuration that should be used on the OS drive, since losing either drive loses the whole array.  The array was partitioned with 2Gigs for the OS, 14Gigs for Exchange Server files (which were no longer being used) and a Dell utility partition.  Thankfully, there was nothing wrong with the drives in that set.  The data drive, the one that failed, was a single partition mounted in the RAID as a volume with 146Gig of storage.  All the drives were u160 SCSI-2 hot swappable drives.
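
As a rough illustration of why striping is risky for an OS drive, here is a small Python sketch.  The function names and the 5% per-drive failure figure are my own assumptions for the example, not measurements from this server.  It only shows that a striped set gives you the combined capacity, but fails if any one member fails.

```python
def raid0_capacity_gb(drive_sizes_gb):
    """Usable capacity of a striped set: every member contributes
    as much as the smallest drive."""
    return len(drive_sizes_gb) * min(drive_sizes_gb)


def raid0_failure_probability(per_drive_failure_prob, drive_count):
    """Chance the array is lost, assuming drives fail independently.
    The array survives only if every drive survives."""
    return 1 - (1 - per_drive_failure_prob) ** drive_count


if __name__ == "__main__":
    # Two 9.1Gig drives, as in the OS array described above.
    print(raid0_capacity_gb([9.1, 9.1]))
    # Hypothetical 5% chance of a single drive failing in some period.
    print(raid0_failure_probability(0.05, 2))
```

With two drives the chance of losing the array is nearly double the chance of losing a single drive, and there is no redundancy to fall back on.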

The backups are performed by BackupPC, which has been in operation for about 5 to 6 years.  It has performed flawlessly, but I've never had to restore a whole drive from it before.  I used Acronis V9 workstation to make a bare metal image of the OS drive and all its partitions.  I also tried to back up the defective drive, just in case it was NT that was causing the problems, but Acronis couldn't back it up either.

Once all the data was backed up, I pulled the integrated RAID controller plug off the motherboard and took a look at the drives in Acronis again.  All the drives were uninitialized.  I restored the bare metal backup to drive 0, which was a 9.1Gig HD.  Acronis restored the OS partition as it was and shrank the partition used for Exchange without any problems.  I was able to boot back into NT4.0 without any errors, but I still didn't have a data drive.

The new drive I purchased wasn't recognized by NT, but the SCSI controller recognized it.  When I did a data verification, it "red screened" right away, indicating that the media wasn't any good.  I tried a drive that we had on hand that was not marked as being bad and found the same problem when I did the media test.  I was left with only the original HD that was bad to begin with.  When I did the media test in the SCSI controller interface, it only reported 3 bad spots on the drive.  NT also recognized the drive, so I went ahead and formatted it and started the restore.  This drive will have to be replaced, but appears to be usable for now.

The new drive was purchased from ServerSupply.com.  I ordered it late on 12/22/2010 (Wednesday) and was told that it wouldn't arrive until Monday, even with the overnight delivery I requested.  However, it showed up on 12/23/2010 at 11:00am, which I thought was pretty good service.  I have submitted an RMA and will follow up with their service on that item.  I found them on pricewatch.com, but called anyway because I needed the item to be delivered to a location that was not registered with the credit card.  They said it wouldn't be a problem as long as the ship-to location was a business.

It took 2 hours to get a bare metal backup, and then I pulled the plug on the controller and restored the OS.  I spent the next 6 hours trying to get the system to take the drive back without restoring a backup and couldn't do it.  Then it took 2 hours to format the replacement disk and 2 hours to get the restore of the data drive going.  The automatic backups had started for all the PCs in the office, which caused a lot of problems getting the restore to go.

BackupPC never shows an xfer PID for a restore the way it does for backups.  I kept checking the status, and since no xfer PID was showing I thought the restore wasn't running.  When I checked the BackupPC server, rsync was eating a lot of CPU, which is usually an indication that a backup is running, so I checked the restored drive and it was filling up.  The restore operation took over 7 hours.  It restored 55G of data.
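
Since the status page gave no sign of life, watching the target drive fill up was the only real progress indicator.  The Python sketch below shows one way to do that check from a machine that can see the drive (for example over a mapped share), plus the average rate implied by the numbers above.  The path, the polling interval and the function names are assumptions made up for this example, and I am treating 55G as 55 x 1024^3 bytes and the run time as exactly 7 hours, so the rate is only a ballpark.

```python
import shutil
import time


def used_bytes(path):
    """Bytes currently used on the volume that holds `path`."""
    return shutil.disk_usage(path).used


def average_rate_mb_per_s(bytes_transferred, seconds):
    """Average transfer rate in MB/s (1 MB = 1024 * 1024 bytes)."""
    return bytes_transferred / seconds / (1024 * 1024)


def watch_restore(path, interval_s=60, samples=5, sleep=time.sleep):
    """Poll disk usage and return the growth in bytes per interval.
    Steady positive growth suggests the restore is still writing."""
    previous = used_bytes(path)
    growth = []
    for _ in range(samples):
        sleep(interval_s)
        current = used_bytes(path)
        growth.append(current - previous)
        previous = current
    return growth


if __name__ == "__main__":
    # Rough average for the restore described above: 55G in about 7 hours.
    rate = average_rate_mb_per_s(55 * 1024**3, 7 * 3600)
    print(f"average restore rate: about {rate:.1f} MB/s")
    # "." is a placeholder; point it at the drive being restored.
    print(watch_restore(".", interval_s=5, samples=3))
```

This does not tell you what BackupPC is doing internally.  It only confirms that something is still writing to the drive, which is the same thing I was checking by hand.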

I should have selected the current incremental backup for the restore, as it would have brought everything up to date in one pass.  Instead, I restored the full backup and then applied the incremental backup, and the incremental is taking just as long to restore.

I was really pleased that Acronis backed up the RAID and allowed me to restore it to a single SCSI drive, and that the system booted.  I have tried this on HP servers, and Acronis can't recognize the RAID on them.  They have a bare metal implementation for HP servers, but it requires installing Acronis on the OS.  That becomes a problem when restoring the system, because you need to install the OS and then install Acronis in order to restore the bare metal image, which isn't really bare metal.

I approach every restore with a lot of trepidation.  It's bad enough that the data is lost, but if the restore doesn't work, then the problems really begin.  I once restored a system that didn't have a bare metal backup; all they had was a day-old SQL backup.  I had to install everything, recreate all the users and prepare the SQL database correctly, then restore the data.  It was 36 hours of work, but on Monday morning the system was back online and I was a mess.  I never want to do that again.

Written by Leonard Rogers on Friday, December 24, 2010 | Comments (0)

