Tuesday, June 3, 2008

Tough Weekend Outage

The company that hosts the SomaFM.com web server and our mail server, ThePlanet.net, had a rather large outage last weekend, which took the SomaFM web site off the air (so to speak) from 3:08 PM Pacific time on Saturday until about 3:37 AM Pacific time Monday morning (June 2nd).

Our mail server is still down, about 72 hours later. More on that in a bit.

The cause of this outage was not immediately known, and calls to The Planet's tech support lines (which had 30 minute waits) were "unrewarding" to say the least. At first they wouldn't give me any information at all (because I didn't have the proper password), saying they were only giving out information to "affected customers". I pointed out that since they had caller ID and knew that I was calling from the phone number on record for our account, that should be adequate proof to let them give me some information on what was happening. The rep finally agreed, even though he said he "could get in trouble for telling me this".

What he told me was that they had had a transformer explosion at the datacenter where our servers were located.

This seemed kind of fishy. Didn't they have adequate generator power? What about the UPSes? Blown transformers happen fairly frequently; that's one reason you have redundant power systems.

A while later, they made a public announcement about the outage at The Planet's Houston data center:

Today at approximately 5:45 p.m. [central time], a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down until power can be restored.

According to our monitoring logs, it was 5:07 PM central time, not 5:45 PM.

We received more information dated May 31 – 10:46pm (8:46 pm Pacific):

On Saturday, May 31st at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.

We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.

This time makes more sense. Seems like the UPSes did indeed work, but they weren't able to switch over to generator power. So about 10 minutes after they lost power, the UPS batteries were expended, and the facility lost power.
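
For anyone who wants to check that arithmetic, here is a quick sketch in Python. It assumes The Planet's 4:55 PM CDT figure and the 5:07 PM central time from our monitoring logs are both accurate, and that the gap between them is roughly how long the UPS batteries lasted. The variable names are just mine, and the fixed UTC offsets are an assumption that holds for daylight time in late May.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed daylight-time offsets, valid for May 31, 2008.
CDT = timezone(timedelta(hours=-5), "CDT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Time of the short, as stated in The Planet's second announcement.
short_reported = datetime(2008, 5, 31, 16, 55, tzinfo=CDT)
# Time our monitoring logs show the servers going away.
monitor_down = datetime(2008, 5, 31, 17, 7, tzinfo=CDT)

# Rough UPS runtime: the gap between the short and the servers dropping.
ups_runtime = monitor_down - short_reported

# Same moment expressed in Pacific time, to compare with the times above.
monitor_down_pacific = monitor_down.astimezone(PDT)

print(ups_runtime)                                   # 0:12:00
print(monitor_down_pacific.strftime("%H:%M %Z"))     # 15:07 PDT
```

So the gap works out to 12 minutes, which lines up with the "about 10 minutes" of battery, and 5:07 PM central is 3:07 PM Pacific, right about when the web site dropped off.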

This is also the first time they mention "the short". At first it was just a transformer fire. But now it sounds like it was a transformer explosion caused by an electrical short, which implies that some wires were so overloaded that the insulation melted and caused them to short out.

There have been lots of discussions about the blame for the problems at The Planet. I'm not going to go into that now. However, I am less than satisfied with the quality of the communications from them, and not at all happy with how they've handled the situation.

The SomaFM.com web server eventually came back while we were just finishing up restoring our backups to a new web server. (So at least we now have a tested plan and sequence for restoring from backups!)

However, as of 10:30am on June 3rd, our mail server is still not running, nor did it come back up when The Planet said that they had powered back on the part of the datacenter where it is located. After sitting on hold (with very bad music) for 35 minutes, a tech told me that our mail server machine was one of the older ones that would have to be powered on by hand... and that there were over 1000 of these machines that they would be going around and turning on one at a time. But that never happened.

The last update on The Planet's web site was kind of ominous:

This morning at approximately 2:45 a.m. CST, the temporary generator supplying power to the servers and environmental control systems located in Phase 1 of our H1 facility shut down. This was caused by some faulty current sensors in the output breaker. The sensors detected an out of balance current condition that did not exist.

At this point, I don't know when the mail servers will be working again. I guess we have to deploy a new mail server (which is also the secondary DNS server).

Wait! Another update:

Fixing the faulty breaker on the generator powering H1 Phase 1 was not successful. we have located a second generator that is currently being delivered to the facility. It is expected to arrive this afternoon and we will provide additional information regarding the new generator at that time.

That doesn't sound promising. And for all I know, our server has been blown up by a power glitch or something. Time to get working on that new mailserver, I guess!

Unfortunately, I screwed up and didn't properly back up the mail server configs and will have to recreate all that by hand, so it's not a real simple process.

But I guess it won't take too long as I won't have any interruptions from email today!

But now we do have a full backup of the SomaFM web server up and running at our rack in 365 Main's San Francisco data center. And I'm working on getting further redundancy in place so this won't impact our listeners much if it happens again.

You can follow the drama of The Planet on their Service Update web page.

And thanks for your patience with us.


Anonymous Anonymous said...

A horror weekend indeed! Now we know how much we depend on computers... and how easily it's all gone..

Keep up Rusty !

Cor Knops.

June 3, 2008 1:58 PM  
Blogger Kenji Rikitake, JJ1BDX said...

Many so-called data centers dive into chaos when their so-called trusted power supply systems unexpectedly fail. And those supplies are very fragile.

I hope Soma FM will recover soon.

Kenji Rikitake

June 4, 2008 6:26 AM  
Anonymous Lou Lesko said...

Not sure if you're in the market for a new provider, but we've been extremely happy with One World Hosting. They were really forthright when they had an outage.


June 4, 2008 2:07 PM  
Anonymous Anonymous said...

About the email server problem: today my provider sent me a message that the mail I sent you on June 1 (sic!) could not be delivered.... Mail me for full headers.

Cor Knops.

June 4, 2008 2:58 PM  
OpenID liz said...

re: the email blast from today requesting someone to drop by the data center

since the planet is a dedicated server provider and not a colocation facility it will be impossible to talk to anyone there. even if you were able to talk to anyone there, they will be backed up with hundreds of support requests and other angry customers, not to mention that there is little remedy that they could offer since it is a power issue.

good luck and hope your email is back online soon!

June 4, 2008 6:50 PM  
Anonymous Anonymous said...

The Planet told everyone to go to the forums, but when you ask questions you get no answers, and if you ask what some feel are the wrong questions, one of The Planet's managers calls customers trolls. I posted telling this manager it was sad to see a company manager calling customers trolls, as anyone that has spent any time on the Internet knows calling someone a troll is getting close to using the "N" word with someone.

After I posted that it was sad, he banned me from posting any more comments. I have yet to receive any contact from his bosses (I sent copies of the comment to all of them) so I can only assume that they are aware or feel the same way about the customer.

Folks, systems fail, it is just how it is. But to then have the company call its customers trolls for asking about info related to the service that company is being paid to provide is just too much for me. I have moved my server and I know many who, once they get the rest of their data from their servers, will be gone as well. I feel The Planet should know that for some of these folks the last straw was your manager calling the customers trolls.

In years to come, when things have calmed down, what will be remembered about this whole incident is how at a time of crisis the customers were called TROLLS by management.

June 7, 2008 8:07 AM  
