CRM SQL Box – Hardware Failure

On Wednesday afternoon around 4 PM, a disk drive in a RAID 5 array failed.  The array belonged to a CRM 4 SQL machine that also hosted the Reporting Server for approximately 120 users.  The company uses all the CRM modules (Sales, Marketing, Service, etc.).  Within just a few seconds, a second drive failed.  We all know what that means: RAID 5 tolerates the loss of only one drive, so a high probability of data corruption is the best case.

A talented IT support person was able to get the drives back online and moved the production database files to network storage.  He then cloned the CRM SQL environment to a VMware environment, which laid the foundation for restoring SQL.  I contacted Microsoft support to identify next steps so that I would not do anything that would be unsupported in the event of future issues.  The path identified was to uninstall SQL Server from the cloned environment and install a clean copy with all the latest updates.  The company uses SQL 2005, and I completed that work around 9 PM.

The next step was to work with Microsoft support to see if we could recover the production data, as the last good SQL backup had been taken on Wednesday morning around 3 AM.  After a few hours of trying several manual steps, it was apparent that the data corruption was extensive and that, even if the recovery succeeded, the data was most likely compromised and not reliable.  We decided to restore from the last backup, which meant the users would lose any work they completed on Wednesday.  At this point, this was the best-case scenario.
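To put a number on what restoring from the last backup costs, here is a minimal sketch of the arithmetic. Only the two times (around 3 AM and around 4 PM) come from the events described here; the calendar date and the function name are placeholders chosen for illustration.

```python
from datetime import datetime


def data_loss_window_hours(last_backup: datetime, failure: datetime) -> float:
    """Return the hours of work that exist only on the failed system."""
    if failure < last_backup:
        raise ValueError("failure must not precede the last backup")
    return (failure - last_backup).total_seconds() / 3600.0


# Placeholder date (assumption); only the times of day are from the post.
LAST_BACKUP = datetime(2000, 1, 5, 3, 0)   # last good backup, around 3 AM
FAILURE = datetime(2000, 1, 5, 16, 0)      # first drive failure, around 4 PM

if __name__ == "__main__":
    print(data_loss_window_hours(LAST_BACKUP, FAILURE))
```

With a single nightly backup, everything entered between the backup and the failure is exposed, which here is roughly the whole working day.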

Long story short, I was able to successfully attach the backup files to the database, and after several hours of effort to correct security logins and make some final tweaks, the production environment was restored and CRM was able to communicate with the SQL Server.  I spent another day resolving issues with Scribe security and working with MS support to restore the Reporting Server.  Of everything, the Reporting Server restoration was the most problematic and required significant manual changes to get everything working.
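One common source of login trouble after moving a database onto a freshly installed SQL Server is orphaned users: database users whose SIDs no longer match any server login. Whether that was the exact problem in this case is not recorded here, so treat the following as a hedged sketch only. It builds T-SQL text around sp_change_users_login, a system procedure available in SQL 2005 that applies to SQL Server logins, not Windows logins. The database, user, and login names are hypothetical.

```python
def quote_literal(value: str) -> str:
    """Quote a value as a T-SQL Unicode string literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def quote_name(name: str) -> str:
    """Quote an identifier with brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def orphaned_user_report(database: str) -> str:
    """T-SQL that lists users in `database` with no matching server login."""
    return f"USE {quote_name(database)};\nEXEC sp_change_users_login 'Report';"


def relink_user(database: str, user: str, login: str) -> str:
    """T-SQL that links an existing database user to an existing SQL login."""
    return (
        f"USE {quote_name(database)};\n"
        f"EXEC sp_change_users_login 'Update_One', "
        f"{quote_literal(user)}, {quote_literal(login)};"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(orphaned_user_report("Example_MSCRM"))
    print(relink_user("Example_MSCRM", "app_user", "app_user"))
```

The script only prints statements; reviewing them before running them in a query window keeps a person in the loop while security is being repaired.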

This was a good experience.  I learned a lot and lost a lot of sleep, but in the end users are working in the application and, considering the type of failure, it could have been worse.  Moral of the story: establish a good backup strategy, create cloned environments, keep test environments as current with production as possible, and test restoring production throughout the year.
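As one way to act on the "test restoring production" advice, here is a rough sketch of a restore drill. It assembles three standard T-SQL statements: RESTORE VERIFYONLY to check that the backup is readable, RESTORE DATABASE ... WITH MOVE to restore it under a test name, and DBCC CHECKDB to check the restored copy. Every path, database name, and logical file name below is hypothetical; the real logical names can be read from the backup with RESTORE FILELISTONLY.

```python
def quote_literal(value: str) -> str:
    """Quote a value as a T-SQL Unicode string literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def quote_name(name: str) -> str:
    """Quote an identifier with brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def restore_drill_script(backup_path, test_db, data_move, log_move) -> str:
    """Build a verify / restore / check script for a test restore.

    data_move and log_move are (logical_file_name, target_path) pairs.
    """
    data_logical, data_target = data_move
    log_logical, log_target = log_move
    return "\n".join([
        f"RESTORE VERIFYONLY FROM DISK = {quote_literal(backup_path)};",
        f"RESTORE DATABASE {quote_name(test_db)}",
        f"    FROM DISK = {quote_literal(backup_path)}",
        f"    WITH MOVE {quote_literal(data_logical)} TO {quote_literal(data_target)},",
        f"         MOVE {quote_literal(log_logical)} TO {quote_literal(log_target)},",
        "         RECOVERY;",
        f"DBCC CHECKDB ({quote_name(test_db)}) WITH NO_INFOMSGS;",
    ])


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(restore_drill_script(
        r"D:\Backups\Example_MSCRM.bak",
        "Example_MSCRM_RestoreTest",
        ("Example_MSCRM", r"E:\TestData\Example_MSCRM_RestoreTest.mdf"),
        ("Example_MSCRM_log", r"E:\TestData\Example_MSCRM_RestoreTest.ldf"),
    ))
```

Restoring under a separate test name means the drill never touches the production database, and the timing of each run gives a realistic idea of how long a real recovery would take.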

Have a good weekend, all!
