It’s hard to find best practices when you’re building the first version of a global-scale ML model.
Most ‘Intro to ML’ guides focus on simple pipelines and small datasets, and most ‘Intro to Big Data’ guides assume an existing ML infrastructure. So, what do you do when your first pipeline ingests terabytes of data?
On our team, we see this problem all the time. The Microsoft AI Development Acceleration Program (MAIDAP) is a rotational program for recent university graduates in ML and related fields.
We collaborate with teams across Microsoft, and many of our projects involve building the first ML features for products with hundreds of millions of users. We are constantly starting at scale.
After 40+ rotations, we’ve developed our own set of best practices for building the first versions of global-scale ML models. These standards help us answer questions like:
- What are the tradeoffs between prototyping pipelines on a laptop vs. a cloud-based GPU?
- When is a dataset too large for popular libraries like pandas, scikit-learn, and PyTorch?
- When should you switch from off-the-shelf models to custom architectures?
- How do you decide if an ML project is feasible for your team?
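Take the question about dataset size as an example: whether pandas is practical often comes down to simple arithmetic about whether the data fits in memory with headroom. One way to check is to load a small sample and measure its footprint before committing to the full dataset. A minimal sketch (the million-row DataFrame here is a made-up stand-in for a real sample):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1M rows with one int64 ID column and one float64 column.
sample = pd.DataFrame({
    "user_id": np.arange(1_000_000, dtype=np.int64),
    "score": np.random.rand(1_000_000),
})

# Measure the in-memory footprint of the sample, including object columns.
bytes_in_memory = int(sample.memory_usage(deep=True).sum())
bytes_per_row = bytes_in_memory / len(sample)

# Extrapolate: would 500M rows of this schema fit comfortably in RAM?
projected_gb = bytes_per_row * 500_000_000 / 1e9
print(f"~{bytes_per_row:.0f} bytes/row, ~{projected_gb:.0f} GB at 500M rows")
```

If the projected size approaches available memory, that's a signal to reach for chunked reads, out-of-core tools, or a distributed framework instead.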
We’ll use this blog to share our best practices, along with the stories that inspired them. To follow along, you can sign up for our newsletter below, or check back on this page.
We’re excited to get started!