GitLab, a startup with $25 million in funding, is having a “very bad day,” as the company’s interim vice president of marketing, Tim Anglade, put it to Business Insider on Wednesday after a series of human errors caused the service to go down overnight.
GitLab provides a virtual workspace for programmers to work on their code together, merging individual projects into a cohesive whole. It’s a fast-growing alternative to GitHub, the high-profile Silicon Valley startup valued at $2 billion.
GitLab was only just starting to come back online as of Wednesday morning. But even worse than the embarrassment of such major downtime, the company now has to warn a handful of its users that some of their data may be gone forever.
A bad day
The bad day started on Tuesday evening, when a GitLab system administrator tried to fix a slowdown on the site by clearing out the backup database and restarting the copying process. But the admin accidentally typed the command to delete the primary database instead, according to a blog entry.
By the time he noticed and scrambled to stop the deletion, "of around 300 GB only about 4.5 GB is left," the blog explained. Oops. The site had to be taken down for emergency maintenance while the company figured out what to do, keeping users apprised via its blog, its Twitter account, and a Google Doc that the GitLab team kept updated as new developments arose.
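The mistake — running a destructive command against the primary rather than the backup — is a classic class of operator error. A minimal sketch of one common safeguard, assuming a hypothetical host name and path (these are illustrative, not GitLab's actual infrastructure):

```shell
# Hypothetical guard: refuse a destructive command unless the operator
# is actually on the intended machine. Host names and paths below are
# made up for illustration.
safe_wipe() {
    intended_host="$1"
    target_dir="$2"
    # Compare the current machine's hostname against the intended target
    if [ "$(hostname)" != "$intended_host" ]; then
        echo "refusing: on $(hostname), expected $intended_host" >&2
        return 1
    fi
    rm -rf "$target_dir"
}

# On any machine not named db2.example.com, this refuses to delete anything:
safe_wipe "db2.example.com" "/tmp/demo-data" || echo "wipe blocked"
```

A guard like this would have forced the command to fail fast on the wrong server instead of deleting the primary's data.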
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
— GitLab.com Status (@gitlabstatus) February 1, 2017
Making matters worse, the team couldn’t just restore: “Out of 5 backup/replication techniques deployed none are working reliably or set up in the first place” the blog said. “We ended up restoring a 6 hours old backup.” That means any data created in that six-hour window may be lost forever, Anglade said.
Bad news, good news
While restoring that older version of the database, the site went down for at least six hours, Anglade said. Intermittent failures sprang up for another several hours while the team got the service back online, with everything starting to return to normal only on Wednesday morning.
The good news for users, Anglade said, is that the database that was affected didn’t actually contain anyone’s code, just stuff like comments and bug reports. Furthermore, Anglade said the many customers who installed GitLab’s software on their own servers weren’t affected, since that doesn’t connect to GitLab.com. And paying customers weren’t affected at all, the company said, minimizing the financial impact.
Anglade acknowledged the outage was bad, as is the looming possibility that some of that data may be gone, but nobody is going to have to start rewriting his or her software from scratch, and only about 1% of GitLab’s users are expected to see any lasting effects from the incident.
As for the systems administrator who made the mistake, Anglade was hesitant to place blame, saying it was the whole team's fault that none of its other backup systems were working. "It's fair to say it's more than one employee making a mistake," he said.