Tuesday, July 24, 2007

San Francisco power outage, SomaFM outage

There was a power outage affecting downtown San Francisco today, which also caused an outage at SomaFM's primary datacenter, 365 Main. Note that we've been there for about 2 years now, and this is the first power outage that's affected us. They had another outage right before we moved in, due to a faulty fire alarm which cut power to most of the building.

Now, a "world class datacenter" is supposed to have all sorts of redundant systems in place. And they did. But a slightly unusual series of events proved that even with all that redundancy, things can go very wrong. Here's what really went down at 365main as far as I can tell:

365 Main, like most facilities built by Above.net back in the day, doesn't have a battery-backup UPS. Instead, they have a "CPS", or continuous power system: very, very large flywheels that sit between electric motors and generators. So the power from PG&E never directly touches 365 Main. PG&E power drives the motors, which spin the flywheels, which then turn the generators (or alternators, I don't remember the exact details), which in turn power the facility. There are 10 of these on their roof (or as they call it, the mezzanine; it's basically a covered roof). These CPS units isolate the facility from power surges, brownouts and blackouts.

The flywheels (the CPS system) can run the generator at full load for up to 60 seconds according to the specs.

There are also 10 large diesel engines up on the roof, connected to these CPS units. If the power is out for more than 15 seconds (as I recall; I could be wrong on the exact time), the diesel engines start up, clutch in, and drive the flywheels.

There is a large fuel storage tank in the basement, and the fuel is pumped up to the roof. There are smaller fuel tanks on the roof as well, with enough capacity to run all the generators until the fuel starts getting pumped up to the roof.
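To put some rough numbers on that, here's a quick back-of-the-envelope sketch in Python. The 60-second ride-through and 15-second start delay come from the description above; the diesel spin-up time is purely my assumption, for illustration only.

# Rough ride-through budget for one CPS unit (all times in seconds).
# 60 s and 15 s are from the specs/recollection above; DIESEL_SPINUP
# is an assumed value for illustration only.
RIDE_THROUGH_FULL_LOAD = 60   # flywheel can carry full load this long
START_DELAY = 15              # outage must last this long before a diesel start is triggered
DIESEL_SPINUP = 12            # assumed time for the diesel to start and clutch in

margin = RIDE_THROUGH_FULL_LOAD - (START_DELAY + DIESEL_SPINUP)
print(f"Margin with a fully charged flywheel: {margin} s")   # plenty of room to spare

So with a fully charged flywheel, a single long outage should be covered comfortably. The problem, as I'll get to below, is what happens when the flywheel isn't fully charged.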

Here's what I suspect happened:

It was reported there were several brief outages in a row before the power went out for good, so I bet the CPS (flywheel) systems weren't fully back up to speed when the next sequential outage occurred. Since several of these grid power interruptions happened in a row, each shorter than the time required to trigger generator startup, the generators were never automatically started, but the CPS didn't have time to spin back up to full capacity between glitches. By the 6th power glitch, there wasn't enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come up to speed before switching over.
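As a sanity check on that theory, here's a minimal simulation sketch building on the same numbers. The recharge rate and glitch timings are invented values chosen only to illustrate the failure pattern, not anything measured at 365 Main.

# Toy model of a flywheel CPS riding through repeated short grid glitches.
# All numbers are illustrative; the point is the pattern: each sub-15-second
# glitch drains stored energy without ever triggering a diesel start, and the
# flywheel recharges more slowly than the glitches drain it.
RIDE_THROUGH = 60.0      # seconds of full-load ride-through when fully charged
START_DELAY = 15.0       # outage duration that triggers a diesel start
DIESEL_SPINUP = 12.0     # assumed time for a diesel to start and take the load
RECHARGE_RATE = 0.1      # ride-through seconds regained per second on grid power (assumed)

def simulate(glitches):
    """glitches: list of (seconds_on_grid_before_glitch, glitch_length) tuples."""
    stored = RIDE_THROUGH
    for i, (grid_time, glitch) in enumerate(glitches, start=1):
        stored = min(RIDE_THROUGH, stored + grid_time * RECHARGE_RATE)
        if glitch < START_DELAY:
            stored -= glitch                       # ride it out on the flywheel alone
        else:
            stored -= START_DELAY + DIESEL_SPINUP  # diesel start is triggered; flywheel covers the gap
        print(f"glitch {i}: {glitch:5.1f} s outage -> {max(stored, 0):4.1f} s ride-through left")
        if stored <= 0:
            print("flywheel exhausted before the diesel could take over: facility drops")
            return

# Five short glitches, then a sixth long one: the diesels finally get a start
# command, but the flywheels no longer have enough energy to cover the spin-up.
simulate([(30, 10)] * 5 + [(30, 120)])

With a fully charged flywheel, that final long outage would have been covered with room to spare; it's the accumulated drain from the earlier glitches that pushes it over the edge.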

Why they just didn't manually switch on the generators at that point is beyond me. (I bet they will next time!)

So they had a brief power outage. By our logs, it looks like it was at the most 2 minutes, but probably closer to 20 seconds or so.

So it looks like the diesels did cut over, but not before the CPS was exhausted in some cases. I'm told the whole facility didn't lose power, just most of it.

Here's the letter their NOC sent to customers about this:

This afternoon a power outage in San Francisco affected the 365 Main St. data center. In the process of 6 cascading outages, one of the outages was not protected and reset systems in many of the colo facilities of that building.

This resulted in the following:

- Some of our routers were momentarily down, causing network issues. These were resolved within minutes. Network issues would have been noticed in our San Francisco, San Jose, and Oakland facilities.

- DNS servers lost power and did not properly come back up. This has been resolved after about an hour of downtime and may have caused issues for many GNi customers that would appear as network issues

- Blades in the BC environment were reset as a result of the power loss. While all boxes seem to be back up we are investigating issues as they come in

- One of our SAN systems may have been affected. This is being checked on right now

If you have been experiencing network or DNS issues, please test your connections again. Note that blades in the DVB environment were not affected.

We apologize for this inconvenience. Once the current issues at hand are resolved, we will be investigating why the redundancy in our colocation power did not work as it should have, and we will be producing a postmortem report.

Lots of companies were affected. There was a huge line to get into the data center. It was definitely the most people I've ever seen there!


6 Comments:

Blogger David said...

Hey Rusty,

I don't know whether you've seen Valleywag's coverage of the event, but they're claiming that a drunken employee hit the EPO button. Given the scale of power outages in the surrounding area this seems unlikely, but it still gave me a little giggle. :)

http://valleywag.com/tech/breakdowns/a-drunk-employee-kills-all-of-the-websites-you-care-about-282021.php

There's quite a few more articles about it on the site.

July 24, 2007 8:07 PM  
Blogger Rusty Hodge said...

It might be believable except that 1) it happened at the same time the other power outages in the area happened and 2) there are 8 separate Colo rooms in the building, and they each have their own power system (plus 2 spare power systems on the roof).

Valleywag also reported that the outage only affected Colo 4, yet we're in Colo 7 and it was out as well.

And the generators were indeed running when I got there, and still when I left. I think they were in "safe, not sorry" mode at that point, but PG&E could also have asked them to do that to lessen the load while they repaired the busted transformer vault.

Valleywag's version is far more entertaining, of course!

July 24, 2007 8:15 PM  
Blogger sloux said...

The transfer switches constantly monitor the utility power quality to determine whether to transfer the critical load to the generator. Typically they have a user-defined wait-state programmed in to avoid starting the generators for a short-duration event (say, 1-5 seconds inhibited).

It's hard to imagine the data center did not pay careful attention to the programming of the transfer switches (and associated paralleling gear) in order to address repeated transient faults in the utility.

As an aside though, the flywheel system is not nearly as dependable or robust as a UPS system backed by flooded cell battery strings. The notion of isolation is nice, but the vast majority of engineers responsible for mission critical sites believes the UPS is far more dependable. If the generators fail to start (as apparently was the case here) a minute or two is not enough time to troubleshoot the problem(s) with the generators before the site goes down hard...very hard.

Once the generators do start, stabilize and close to the load, they do not retransfer to utility until the normal utility source has stabilized for a user-defined interval, often around 30 minutes. Once the load retransfers to utility, the generators will continue to run for a period of time to cool down. This may explain why you observe the generators continuing to run well after the event. If the data center participates in PG&E's curtailment program, it is possible they were asked to stay off the grid until the utility problems had been corrected.

July 24, 2007 10:02 PM  
Blogger Rusty Hodge said...

There are no transfer switches with the CPS systems used by 365 Main. You can read about them at http://hitec.pageprocessor.nl/?RubriekID=1991

July 25, 2007 2:06 AM  
Blogger dannyman said...

Hello,

I toured a colocation facility and was shown the generator. I asked why it was making noise, and the guide explained that the generator wasn't running; that would be much louder. What WAS running was the starter motor to crank the generator up to speed when needed. They kept that little motor running at all times so that the generator could get up to speed in time.

I think your hypothesis works very well, at which point I would wonder if the primary rationale for the flywheels isn't to replace UPS technology, but to act as a starter motor for the generators. This would certainly be an eco-friendly approach, and isn't necessarily flawed, as long as they are set to start the generators while they still have the torque.

Not having the ability to start the generators once the flywheels spun down would explain why it took 40 minutes to start them manually.

July 27, 2007 1:33 PM  
Blogger Felix said...

Good observations from sloux and Rusty. I've worked with and in mission critical NOCs, including the FAA's. The real mission critical NOC data centers NEVER get power directly from their utility; typically a UPS system backed by flooded cell battery strings, including power conditioning on the sine wave.

The other issue is that sometimes non-tech executives don't understand the budget needed to truly protect a NOC, and money is not spent on needed protection.

IEEE devotes a whole series of specifications to this subject, including IEEE 1159.

The series of "power outages" was most likely the utility distribution protection equipment going through its normal automatic relay coordination operation. It's important to note that the electric distribution system is somewhat like a living biological organism: it needs to protect and isolate the body parts having trouble so the whole system won't go down.

Felix

July 28, 2007 7:15 AM  
