System Availability and Redundancy

If you really want to guarantee system availability, you need some form of redundancy, and testing that redundancy before a failure is probably a pretty good idea. That is why this article on the recent problems with the Hubble Space Telescope, while impressive, makes me wonder why they never tested the backup.

The two-day repair began early Wednesday as NASA engineers began commanding Hubble from the ground to switch to a backup system after its main data relay channel failed last month. But the remote control fix is tricky, requiring systems to power up after nearly two decades of hibernation.

Hubble’s main science operations were silenced on Sept. 27, when the Side A channel of its Science Instrument Control and Data Handling system failed after 18 years of continuous service since the space telescope launched in April 1990.

Hubble has a backup data relay channel, Side B, but the unit and five related systems had not been powered up since the space telescope reached orbit.

OK, 18 years of uptime is impressive! I just can’t believe that in 18 years no one thought it would be a good idea to test the “Side B” backup unit, which does seem to be booting up just fine regardless. The redundancy won’t be restored until a spare part is sent up in February 2009.

Hubble’s recent glitch prompted NASA to delay the planned Oct. 14 launch of the shuttle Atlantis and seven astronauts to the orbital observatory for a final service call. That mission has been pushed to February 2009, with engineers hoping to send a spare data formatter to restore redundancy after the Side A failure.
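For systems closer to the ground, the "test your backup before you need it" idea can be sketched in a few lines. This is only an illustration and has nothing to do with how Hubble's hardware or NASA's procedures actually work. The `Channel` and `RedundantPair` classes, the `drill` and `failover` methods, and the `side_a`/`side_b` names are all made up for the example, and the self-test is a stand-in for whatever health check a real unit would expose.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Channel:
    """One side of a redundant pair (hypothetical stand-in for any primary/backup unit)."""
    name: str
    self_test: Callable[[], bool]  # returns True if the unit powers up and responds
    days_since_tested: int = 0


@dataclass
class RedundantPair:
    active: Channel
    standby: Channel
    log: List[str] = field(default_factory=list)

    def drill(self) -> bool:
        """Exercise the standby now, rather than waiting for the active side to fail."""
        ok = self.standby.self_test()
        if ok:
            self.standby.days_since_tested = 0
        self.log.append(f"drill {self.standby.name}: {'ok' if ok else 'FAILED'}")
        return ok

    def failover(self) -> None:
        """Swap roles between the active and standby channels."""
        self.active, self.standby = self.standby, self.active
        self.log.append(f"failover: {self.active.name} is now active")


# Illustrative values only: a backup that has sat untested for roughly 18 years.
pair = RedundantPair(
    active=Channel("side_a", self_test=lambda: True),
    standby=Channel("side_b", self_test=lambda: True, days_since_tested=6500),
)

# Only switch over once the standby has proven it still works.
if pair.drill():
    pair.failover()
```

The point of running `drill` on a schedule is that a dead backup shows up in the log while the primary is still healthy, which is a much better time to find out.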
