Taking Flight: When to Move to the Cloud

How MAIDAP draws the line between local and cloud development

Almost all ML projects will start on a local machine, using simple tools like Jupyter notebooks to get started quickly, and almost all ML projects will end in the cloud, using powerful infrastructure to interpret data at scale.

While the starting and ending points are clear, knowing when to switch from local to cloud development can be complicated.

At the Microsoft AI Development Acceleration Program (MAIDAP), we work with teams around the company to incorporate ML into products and services. We often create the initial versions of a pipeline on our laptops and then migrate our codebase to the cloud.

After 50+ projects, we’ve grown accustomed to making the jump from local to cloud, and we’ve found there are a few key points where a project needs to consider this migration. If your team is experimenting with ML and unsure when to move to the cloud, this guide is for you.

At the Start: Compliance and Privacy

At a large company like Microsoft, we often manage critical data for governments and large enterprises, and these partners will often require that their data remain in an isolated environment.

If you’re working with highly sensitive data, you might never develop a pipeline locally. When we find ourselves in this situation, we’ll often use cloud-based resources that closely resemble a local development environment.

For example, Azure Machine Learning (AML) has a helpful Data Science Virtual Machine service that makes it relatively simple to run a Jupyter Notebook in the cloud. Setup will take a bit longer than running locally, and you’ll have to pay for the VMs you provision, but if you need to work 100% in the cloud, AML is a great place to start.

Midway: The Training Crunch

If you can start development on a local machine, you’ll usually begin with exploratory data analysis of a sampled subset and some basic functions to manage the pipeline. This step is possible for datasets of almost all sizes, but teams will often hit a crunch when they begin training.

One constraint comes from storage capacity. We want to train our models on as much data as possible, and sometimes our local machines simply lack the space on disk. When this challenge arises, we’ll switch to cloud-based tools like Azure Data Explorer and Azure Data Factory that are designed to store and process data at Microsoft scale.
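As a rough illustration of the storage check, the sketch below compares a dataset's size against the free space on a local disk. The function name `fits_on_local_disk` and the `headroom` factor are assumptions made for this example, not part of any MAIDAP tooling; the headroom leaves room for intermediate files such as caches and checkpoints.

```python
import shutil


def fits_on_local_disk(dataset_bytes: int, path: str = ".", headroom: float = 2.0) -> bool:
    """Return True if the dataset, scaled by a headroom factor, fits in free space.

    `headroom` is an assumed safety multiplier covering intermediate
    artifacts (caches, checkpoints, processed copies of the data).
    """
    # shutil.disk_usage reports total, used, and free bytes for the
    # filesystem that contains `path`.
    free_bytes = shutil.disk_usage(path).free
    return dataset_bytes * headroom <= free_bytes


if __name__ == "__main__":
    fifty_gb = 50 * 1024**3
    print("Fits locally:", fits_on_local_disk(fifty_gb))
```

If a check like this fails, that is a reasonable signal that the data belongs in cloud storage instead of on a laptop.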

Another constraint comes from compute capacity. At MAIDAP, we use GPU-enabled laptops that allow faster training (thanks for the fancy hardware, Satya!), but even these machines can become overwhelmed when processing a large dataset or running multiple models in parallel.

Training on these datasets may still be possible, but pipelines can run for days, and we’d rather move faster. While the threshold varies between data scientists, we usually move to the cloud once a training run starts taking more than 1-2 hours. Our go-to tool for high-performance training is Azure Machine Learning, which scales to large workloads and offers helpful features for MLOps.
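One way to apply the 1-2 hour rule of thumb before committing to a full run is to time a small number of epochs and extrapolate. The sketch below is a minimal example under stated assumptions: `run_one_epoch` is a hypothetical callable standing in for your own training step, the 2-hour threshold is the upper end of the range above, and the estimate assumes every epoch takes about the same time.

```python
import time
from typing import Callable

# Assumption: upper end of the 1-2 hour range mentioned in the text.
CLOUD_THRESHOLD_HOURS = 2.0


def estimate_total_hours(
    run_one_epoch: Callable[[], None],
    total_epochs: int,
    sample_epochs: int = 1,
) -> float:
    """Time a few sample epochs and extrapolate to the full run, in hours."""
    start = time.perf_counter()
    for _ in range(sample_epochs):
        run_one_epoch()
    elapsed_seconds = time.perf_counter() - start
    seconds_per_epoch = elapsed_seconds / sample_epochs
    return seconds_per_epoch * total_epochs / 3600.0


def should_move_to_cloud(
    estimated_hours: float,
    threshold_hours: float = CLOUD_THRESHOLD_HOURS,
) -> bool:
    """Return True when the projected run time exceeds the threshold."""
    return estimated_hours > threshold_hours


if __name__ == "__main__":
    # Stand-in epoch that just sleeps; replace with a real training step.
    hours = estimate_total_hours(lambda: time.sleep(0.01), total_epochs=100)
    print(f"Estimated run: {hours:.4f} h, move to cloud: {should_move_to_cloud(hours)}")
```

The first epoch is often slower than later ones because of data loading and warm-up, so timing two or three sample epochs tends to give a steadier estimate.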

Late Game: Stability Before Production

While it’s less common, some projects will be able to run training quickly on a local machine. When this happens, the movement to the cloud can usually wait until the later stages of development.

Still, teams shouldn’t serve their models from localhost, so eventual migration is necessary. In this scenario, MAIDAP teams will usually conduct a few training and tuning runs on their local machines before moving to the cloud. The late-game decision to move to the cloud comes down to stability – essentially, a point where the model performance is consistent and the data pipeline is functioning smoothly. Once our pipeline and model are stable, we’ll move our project to the cloud, where we can focus on serving our model to users in production.
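"Consistent model performance" can be made concrete in many ways. The sketch below shows one simple possibility, not MAIDAP's actual criterion: treat the model as stable when the validation scores from the last few training runs vary by only a small fraction of their mean. The function name `is_stable` and both default thresholds are assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_stable(
    scores: Sequence[float],
    min_runs: int = 3,
    max_relative_spread: float = 0.02,
) -> bool:
    """Return True if the most recent runs score within a narrow band.

    Stability here means the standard deviation of the last `min_runs`
    scores is at most `max_relative_spread` times their mean. Both
    defaults are illustrative assumptions.
    """
    if len(scores) < min_runs:
        return False  # not enough evidence yet
    recent = list(scores[-min_runs:])
    average = mean(recent)
    if average == 0:
        return False  # relative spread is undefined at a zero mean
    return pstdev(recent) / abs(average) <= max_relative_spread


if __name__ == "__main__":
    validation_accuracy = [0.70, 0.81, 0.80, 0.81]
    print("Stable:", is_stable(validation_accuracy))
```

A check like this covers only the model side of stability; whether the data pipeline is functioning smoothly still has to be judged separately.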

Conclusion

We hope these checkpoints will serve as a helpful guide to teams beginning their work in machine learning. Every project is different, but we think these decision points cover the three most basic cases teams will encounter.

To learn more about MAIDAP, please check out our homepage. Every fall we hire a new cohort of university graduates studying AI/ML, so if you’re interested in building with us, please check the Microsoft University Hiring website.
