Mysterious power failure takes down Wikipedia, Wikinews
|This article mentions the Wikimedia Foundation, one of its projects, or people related to it. Wikinews is a project of the Wikimedia Foundation.|
Wednesday, February 23, 2005
On Monday, at 10.14 pm UTC, the Wikimedia server cluster experienced a total power failure, taking down Wikipedia, Wikinews, and all other Wikimedia websites.
The servers are housed in a colocation facility (colo) in Tampa, Florida, USA. They occupy two racks, with each rack receiving electricity from two independent supplies. However, both supplies have circuit-breakers in them, and both opened at the same time, leading to a total power failure. All computers immediately went down. It’s normal for fire safety regulations to prohibit uninterruptible power supplies in colos, with the colo providing its own UPS and generator instead. The circuit breakers were on the computer side of this emergency power system, so none of the computers continued to receive power to survive the breaker trip or shut down safely.
The actual reason why the circuit breakers tripped is currently unknown.
When power was restored, it was discovered that most of the MySQL databases that store the data which makes up Wikipedia et al had been corrupted. The main database and the four slaves had all damaged the data on their hard disk drives beyond the ability of the auto-correction to repair. Only one copy survived safely, on a machine that is used for report generation and maintenance tasks, which remained 31 hours backlogged while catching up after an unusually heavy update load during the previous week.
Volunteer Wikimedia engineers worked through the night rebuilding the databases from the sole good copy onto the other servers. The Wikipedia database is over 180Gb in size, making the copying process last 1.5 – 2 hours for every server it was performed upon.
Regular back-ups of the database of Wikipedia projects are maintained – the encyclopedia in its entirety was not at risk. The last database download was made on February 9; all edits since then could only have been laboriously rebuilt from logs and recovered from the damaged database requiring much more time and effort.
Limited read-only service was established late Tuesday afternoon, with editing becoming possible 24 hours after the power failure. Final repairs continue now, as well as upgrades to prevent similar issues in the future. Server-intensive features, such as categories and ‘watchlists’ that display recent changes to selected articles to registered users, remain disabled to ease the load on the recovering systems.
The process which led to the damage originated with the operating system, disk controllers, or hard drives failing to flush the data correctly.
If the power to a database server is cut mid-write, the database may be corrupted and unreadable, however the operating system, hardware, and software are designed to make this very unlikely. In a previous incident in 2004 power was also lost to a server but the database was undamaged.
To avoid such damage, each database server saves a copy of an edit to be applied to the database on a separate storage system before making the actual update to the database itself. This so-called ‘write-ahead logging’ should ensure that in the event of a system crash, the database can be rebuilt from a ‘last-good’ state by replaying the edits saved in the log.
Earlier this year popular blogging site LiveJournal suffered a similar power failure when another customer at their colocation facility pressed an Emergency Power Off button, intended for use only by firefighters. The company suffered database corruption similar to that seen at Wikimedia.
LiveJournal are now fitting UPS to their servers to ensure that they have time to shut down safely in the event of a power failure. Wikimedia was said to be investigating the possibility of fitting similar equipment at the time of this failure.
Several pundits have suggested that the use of another database, such as the proprietary database Oracle or the free PostgreSQL, would have avoided the database corruption seen at the server cluster. A post-mortem of the incident show the failure was in the operating system, or the hardware, or some combination of the two. LiveJournal, which also uses MySQL, reported similar database corruption after their power cut.
Users are reminded that during times of system failure or excessive demand, they can still search Wikipedia using Google. The articles may be viewed using Google’s cache.