Dropbox: Virtualizing Mac Infrastructure at Scale
Dropbox is on a mission to design a more enlightened way of working with the world’s first smart workspace, designed to help teams be organized, stay focused, and get in sync.
Dropbox runs natively on Apple products like iPhones and MacBooks. This means reliable, scalable infrastructure for Mac and iOS app development is critical to the company’s success.
Although Dropbox’s Mac continuous integration (CI) operations ran well on an in-house VMware build cluster, the team eventually ran into problems as the size and scale of their development needs grew. A few factors contributed to Dropbox’s scalability issues, said Paul Ruan, a software engineer at Dropbox: “One is the number of engineers we have. The more engineers you have, the more builds you run. And then there’s the number of features and tests. The more features you have, the more tests you have. There’s also the fact that we support multiple OS versions, so we have to run end-to-end tests on multiple OS X versions.”
Dropbox was running a single VMware vCenter with hundreds of Mac hosts, organized into multiple clusters. The Dropbox team used the clusters to isolate work and control how many resources each workload could use; for example, they could limit iOS workloads to a specific set of Mac hosts. They used shared data stores to share VM templates across the hosts. “We run VMs using the hosts’ SSDs, and we use linked clones to get the templates off of the data stores,” said Ruan. A linked clone shares the template’s virtual disk and stores only its own changes, so it can be created quickly. VMware’s Linked Clones feature therefore allows VMs to act more like containers, which was important to Dropbox as their VMs only live for minutes or hours. “These are very short-lived VMs,” said Ruan. “If you have VMs that live for minutes only, having fast clone time is pretty important.”
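To illustrate why linked clones are fast to create, here is a minimal Python sketch of the copy-on-write idea behind them. It is a toy model, not VMware's implementation or API: the names TemplateDisk and LinkedClone and the block contents are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TemplateDisk:
    """Read-only base disk shared by every clone (hypothetical model)."""
    blocks: tuple


@dataclass
class LinkedClone:
    """A clone that stores only the blocks it has changed."""
    base: TemplateDisk
    delta: dict = field(default_factory=dict)

    def read(self, index: int) -> str:
        # Prefer the clone's own copy of a block, else fall back to the template.
        return self.delta.get(index, self.base.blocks[index])

    def write(self, index: int, data: str) -> None:
        # Writes land in the clone's delta; the shared template is never modified.
        if not 0 <= index < len(self.base.blocks):
            raise IndexError(index)
        self.delta[index] = data


# Illustrative template contents (assumed, not from the case study).
template = TemplateDisk(blocks=("boot", "os", "xcode", "tools"))

# Creating a clone copies no blocks, so creation cost does not grow with disk size.
vm_a = LinkedClone(template)
vm_b = LinkedClone(template)

# A short-lived test VM changes one block; only that block is stored in its delta.
vm_a.write(3, "test-artifacts")
```

In this model, discarding a short-lived VM amounts to dropping its small delta, while the template stays untouched for the next clone.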
While their CI pipeline worked very well, Dropbox started running into problems as they added more hosts to their cluster. “One of the issues we saw was a lot of ESXi crashes on the Macs. We didn’t know why, but it was causing us a lot of headaches. That made coordination between the teams a challenge,” said Ruan. The unexplained ESXi crashes eroded developers’ confidence in the infrastructure. When a test failed, there was some lingering doubt as to whether it was the infrastructure or the code that was causing the failure. “It’s confusing when you can’t trust the underlying infrastructure,” said Ruan. “Because then you’re questioning, ‘is my test actually flaky or is it the infrastructure?’ So it’s really important to have reliable infrastructure that you can trust.”
Dropbox didn’t have a tremendous amount of expertise in-house for running VMware on Mac computers, and there were multiple teams involved in the process, which added even more complexity. “If you think about the entire system, it’s not actually one team that’s managing everything,” said Ruan. “We have a team working on CI, there’s a team managing vCenter, and then there’s a team working on Mac hardware in the data center. As we scaled up, we found that we were stepping on each other’s toes.”
The struggles associated with growing an in-house build farm, and the expertise that it requires, led Dropbox to MacStadium. “It was a mix of having to deal with all these issues while constantly hiring engineers because we needed to add capacity quickly,” said Ruan. After experimenting with different storage, networking, and configuration options, Dropbox settled on an improved version of their original in-house Mac build cluster, still based on VMware but hosted in a more scalable cloud environment at MacStadium.
“With MacStadium, we have more expertise available to us for Macs.” Paul Ruan, Software Engineer
Impact on Dropbox
With a MacStadium cloud based on VMware, the Dropbox team is able to add Macs to their CI clusters with a simple request to MacStadium’s engineers. “If we were to try to add racks in our old data center, we would have needed to plan for the space much earlier and we would have had to figure out how to buy more Macs from Apple. Whereas with MacStadium we talk to our finance department, we get it approved, we come to MacStadium, and we get Macs in a few weeks,” said Ruan. “We have so much more flexible scaling now.”
That translates to increased flexibility elsewhere in the pipeline as well, and the ESXi crash headaches are gone. The team effectively cut the load on its in-house CI infrastructure in half by funneling half of the workload to MacStadium. Dropbox engineers built a small API wrapper around VMware that could load balance between the two environments. “We’re reducing load on our internal vCenter by spreading the load across two vCenters now,” said Ruan. “We already had this extra layer on top of vCenter, so we just had to change the code a little so that we can allocate VMs to either our internal vCenter or our MacStadium one.”
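The case study does not describe how the wrapper chooses between the two environments, so the following Python sketch is only one plausible approach: route each new VM to whichever vCenter is currently least utilized. The class names, pool names, and capacities are all assumptions made for illustration, and the sketch only tracks counts rather than calling any real vCenter API.

```python
from dataclasses import dataclass


@dataclass
class VCenterPool:
    """Bookkeeping for one vCenter environment (hypothetical model)."""
    name: str
    capacity: int      # maximum concurrent VMs, an assumed figure
    running: int = 0   # VMs currently allocated

    @property
    def free(self) -> int:
        return self.capacity - self.running


class VMAllocator:
    """Thin layer that decides which vCenter receives the next VM."""

    def __init__(self, pools):
        self.pools = {pool.name: pool for pool in pools}

    def allocate(self) -> str:
        # Consider only pools with spare capacity, then pick the least utilized.
        candidates = [p for p in self.pools.values() if p.free > 0]
        if not candidates:
            raise RuntimeError("no vCenter has free capacity")
        pool = min(candidates, key=lambda p: p.running / p.capacity)
        pool.running += 1
        return pool.name

    def release(self, name: str) -> None:
        # Called when a short-lived build VM is destroyed.
        pool = self.pools[name]
        if pool.running == 0:
            raise ValueError(f"no VMs running in {name}")
        pool.running -= 1


# Two equally sized environments, so the load splits roughly in half.
allocator = VMAllocator([
    VCenterPool("internal", capacity=2),
    VCenterPool("macstadium", capacity=2),
])
placements = [allocator.allocate() for _ in range(4)]
```

With equal capacities, consecutive requests alternate between the two pools, which mirrors the roughly even split described above. A real wrapper would also need to handle failed allocations and per-workload placement rules.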
By integrating a MacStadium cloud with their existing vCenter environment, Dropbox was able to better distribute their development workload, eliminate system crashes, and scale their infrastructure more quickly and easily. And as an added benefit, the Dropbox team has gained expertise in VMware and Mac infrastructure by partnering with MacStadium. Mac infrastructure does not have to be a core competency for their data center team. “With MacStadium, we have more expertise available to us for Macs and vSpheres in general. Internally, we don’t have a lot of vCenter experts, nor experts at managing Macs in a data center,” said Ruan. MacStadium manages the hardware so the Dropbox team can focus on building great products.
“It’s really important to have reliable infrastructure that you can trust.” Paul Ruan, Software Engineer
As Dropbox continued to grow and launch new tools, they encountered challenges with their Mac build infrastructure; frequent Mac host failures led to scalability issues and increased demand on their data center team
Supplement in-house Mac build cluster with MacStadium’s cloud environment; new injection of expertise and technology to help support growth
Faster and more flexible scaling, no more host failures, and a significantly reduced load on the internal cluster by spreading workloads across two environments