Back in July, when Pokémon Go mania was in full swing, Googler Dave Rensin’s wife and kids wanted to go out on a Pokémon hunting expedition.
But Rensin couldn’t go: He had to stay home, on the couch, with his laptop, working to keep the game’s servers running as the unforeseen crush of users rendered the whole thing unplayable for millions.
Normally, this wouldn’t be Google’s problem, let alone Rensin’s specifically. While Pokémon Go’s servers are indeed hosted with Google Cloud, it’s not really common for any of the major cloud providers to take any kind of hands-on approach with their customers.
This time, though, Google was trying something new.
Since 2003, Google has maintained a global Site Reliability Engineering (SRE) team, a network of engineers who practice a style of intense discipline and increasingly efficient automation to keep its own massive server infrastructure online and reliable for users across the world. The SRE team’s goal is keeping Google online without having to “feed the machine with blood” – Google-ese for throwing valuable manpower at problems that automation can solve.
“Who cares about the rest of your [system] if you can’t rely on it?” Rensin asks.
Rensin’s big idea was simple. Take those SRE engineers – all experts in site reliability – and embed them with customers, for free. The whole sales pitch behind Google Cloud is to give people access to Google’s infrastructure; this so-called Customer Reliability Engineering (CRE) program would help customers build systems the way Google does, too.
“When you join our cloud, we get married. And we have a child: It’s called your system,” Rensin tells Business Insider. And Pokémon Go became the first occasion for this CRE team to get in the saddle.
“We are all Pokémon SRE”
The CRE program was supposed to start at the end of 2016. But after Pokémon Go developer Niantic appealed directly to Google CEO Sundar Pichai for “reinforcements” as players overloaded the system, it was decided that the game would be the perfect opportunity for Google to put the CRE program to a real-world test.
Rensin’s team put up posters in the office: “We are all Pokémon SRE,” a reminder that Google CRE was now on the hook to do everything in its power to help Niantic cope with the surging demand. It took some doing, but eventually everything went “smooth as butter,” and Niantic was able to resume its international expansion of the game.
Following the success with Pokémon Go, the CRE team was called in for its second big engagement. Home Depot, the national home improvement retail chain, had about a 90-day window to make sure that its website and apps were resilient enough to withstand the rush of Thanksgiving weekend, the busiest shopping time of the year. And with Home Depot being a customer of Microsoft and Amazon Web Services, too, the job had extra complexity.
Rensin and members of the CRE team flew to Home Depot HQ in Atlanta and worked around the clock to meet that tight deadline. On Thanksgiving night, Rensin got a text from his Home Depot contact: “You know what I’m doing right now?”
Rensin braced for the worst, mentally preparing for a trip back to Atlanta to help triage a disaster scenario. But no: He just wanted Rensin to know that, for the first time in years, he was able to enjoy a quiet Thanksgiving dinner with his family.
From there, the CRE team was firmly established, and now boasts a “very large” backlog of people waiting to take advantage – though Google has partners like Pivotal and Rackspace, similarly trained in the ways of the CRE, that you can pay to “skip the line” and get similar expertise, as Rensin puts it.
The actual practice of working with the CRE team is, as Rensin also puts it, like going on a diet: Customers can commit to it at various levels, from having Google consult on their infrastructure, all the way to “running joint operations” that involve co-building the tools for monitoring and maintenance.
Even if you’re hesitant to commit all the way, Rensin says, “you’re better off than you were” – hopefully, you’ve picked up the skills you need to build a system that’s closer to Google’s standard of reliability.
The real takeaway, Rensin says, is that Google has more experience at this, sure, but the skills refined by the SRE team can be learned by anybody, if you have a willingness to learn by doing. Even Google’s most elite engineers picked it up from somewhere.
“We don’t genetically engineer our SREs,” jokes Rensin.