Designing Pipelines for Scale

Siddharth Agrawal, Software Engineer, MAIDAP


Simplicity is the ultimate sophistication


Every massive data pipeline starts with a series of small commits. When starting a project that intends to scale, it can be challenging to decide which qualities to prioritize in the early stages of development.

At the Microsoft AI Development Acceleration Program (MAIDAP), we partner with teams across the company to build high-impact AI projects. We often start projects from scratch while simultaneously planning to grow them to a Microsoft scale. 

Across 50+ projects, we have developed techniques to ensure our pipelines can scale quickly, and we’re excited to share these techniques with the rest of the AI community.

Locate, Move, and Store Your Data

Data flow design is the foundation of pipeline design. The dataset you create will be the key to all modeling in your pipeline, so when designing for scale, it’s best to start with your data flow.

At a high level, any ML project can be described as below.

[Figure: high-level view of an ML project's data flow]

However, such a simple design is rarely feasible. Your data won't necessarily live in a single source, and you may have different strategies for storing your results. So how do you reduce all of that to one general scenario? To work through these challenges, we propose asking the following questions.

Are the data all in one location? Having a single store for training data makes modeling much simpler. When we have a project that requires data from multiple locations, we often build an aggregator service to pull all the data into a single location we can reference during training.
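
To make this concrete, here is a minimal sketch of such an aggregator, using local folders and CSV files purely for illustration (the function and path names are ours, not part of any particular SDK):

    from pathlib import Path
    import shutil

    def aggregate_sources(source_dirs, staging_dir):
        """Copy raw files from several locations into one staging directory."""
        staging = Path(staging_dir)
        staging.mkdir(parents=True, exist_ok=True)
        for src in source_dirs:
            for f in Path(src).glob("*.csv"):
                shutil.copy(f, staging / f.name)
        return staging

    # Hypothetical usage: pull two raw data drops into one training location.
    # training_data = aggregate_sources(["./drop_a", "./drop_b"], "./training_data")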

Do you need to store your intermediate results? Storing these metrics can help diagnose the performance of a model during a training run, so we often create small databases to store results and then retrieve them during model analysis. Azure Machine Learning is a great resource for this task.
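
As a rough sketch of the idea, the snippet below logs intermediate metrics to a small SQLite database from the standard library; in our projects this role is often played by a managed service such as Azure Machine Learning instead, and the table schema here is only an example:

    import sqlite3

    def log_metric(db_path, run_id, step, name, value):
        """Append one intermediate metric so it can be retrieved during model analysis."""
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS metrics "
                "(run_id TEXT, step INTEGER, name TEXT, value REAL)"
            )
            conn.execute(
                "INSERT INTO metrics VALUES (?, ?, ?, ?)",
                (run_id, step, name, value),
            )

    # Hypothetical usage inside a training loop:
    # log_metric("results.db", run_id="exp-001", step=10, name="val_loss", value=0.42)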

Can you abstract data transfer? Moving data across many sources can clutter a pipeline's design, so when we need to move our data, we abstract the implementation into separate modules.
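
One way to do this, sketched below with names we made up for illustration, is to hide data movement behind a small interface so the rest of the pipeline never touches storage details:

    from abc import ABC, abstractmethod
    from pathlib import Path

    class DataStore(ABC):
        """The interface the rest of the pipeline codes against."""

        @abstractmethod
        def read(self, key: str) -> bytes: ...

        @abstractmethod
        def write(self, key: str, payload: bytes) -> None: ...

    class LocalStore(DataStore):
        """One concrete implementation; a blob or database store could be another."""

        def __init__(self, root: str):
            self.root = Path(root)

        def read(self, key: str) -> bytes:
            return (self.root / key).read_bytes()

        def write(self, key: str, payload: bytes) -> None:
            (self.root / key).write_bytes(payload)

    # Adding a new storage backend later means a new DataStore subclass,
    # not changes to the training code that calls read() and write().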

Defining data flows is often one of the hardest parts of a project, so don’t be afraid to spend extra time on this step.


Explore Your Data

Hopefully, you’ve already explored your data when evaluating whether your project was feasible. Pipeline design is highly dependent upon the shape of the underlying data, and it’s very common to need transformations within your pipeline. 

If the data need to be transformed, you might need a preprocessor stage before even starting to clean the data. In our experience, it’s best to keep preprocessing separate from data cleaning, as these requirements can evolve separately over time. Moreover, having a separate preprocessing step keeps your model from being tied down to a specific implementation of a data source.
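
A minimal sketch of that separation, assuming pandas and purely hypothetical column names, might look like this:

    import pandas as pd

    def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
        """Source-specific reshaping: rename columns and parse types (hypothetical names)."""
        df = raw.rename(columns={"ts": "timestamp", "val": "value"})
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        return df

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Source-agnostic cleaning: drop duplicates and rows with missing values."""
        return df.drop_duplicates().dropna()

    # The model only ever sees clean(preprocess(raw)); swapping in a new data
    # source means writing a new preprocess(), not touching clean() or the model.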

Start with a Specific, Constrained Use Case

Even if you’re building a pipeline that will grow to a Microsoft scale, it’s important to start as small as possible. To know how small you can start, it helps to define your minimal constraints – the core dataset, model, and performance metrics you’ll need in an end-to-end implementation.

These constraints will be your guide as you build an initial version of your pipeline, helping you strike a balance between an unscalable, single-purpose pipeline and a highly scalable pipeline that is too general to answer specific questions.

As you scale your pipeline to new data sources and modeling efforts, you may begin to remove these constraints, but it helps to stay as small as possible at the start of the project.

Modularize Your Code

Once you have a working end-to-end pipeline for your specific, constrained use case, it’s time to modularize your code. We approach this step by creating a list of the “moving parts” in our pipeline – common examples include I/O, preprocessing, cleaning, training, and evaluation. We then look for common actions across these steps and move shared logic into dedicated modules.
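
The exact breakdown depends on the project, but a sketch of what this can look like (the module and function names here are hypothetical) is a package with one module per moving part and a thin entry point that wires them together:

    # Hypothetical layout after splitting out the moving parts:
    #
    #   pipeline/
    #       io.py             load_raw(), save_results()
    #       preprocessing.py  preprocess()
    #       cleaning.py       clean()
    #       training.py       train()
    #       evaluation.py     evaluate()

    from pipeline import io, preprocessing, cleaning, training, evaluation

    def run():
        raw = io.load_raw()
        data = cleaning.clean(preprocessing.preprocess(raw))
        model = training.train(data)
        io.save_results(evaluation.evaluate(model, data))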

In the short term, modularizing code won’t have much effect and can even seem cumbersome, but in the long term, modularized pipelines are a lifesaver. For example, we often add new data sources to a pipeline as a project develops, and having our I/O module separate means we don’t have to constantly update our logic across the pipeline.

Modularization also extends to external libraries. We often use external libraries at MAIDAP, but it’s crucial that our pipeline minimize dependence upon a specific library or version of that library. Often, we will create wrappers around external library functions so that our core pipeline is not dependent upon a specific external implementation.
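
For example, a pipeline-facing wrapper might look like the sketch below; scikit-learn is used only as an illustrative dependency, and the normalize() name is ours:

    import numpy as np

    def normalize(features: np.ndarray) -> np.ndarray:
        """Pipeline-facing normalization; the external dependency is confined here."""
        # Swapping out scikit-learn later means editing this one wrapper,
        # not every caller in the pipeline.
        from sklearn.preprocessing import StandardScaler
        return StandardScaler().fit_transform(features)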

Generalize Your Code

Once you have a focused, modularized pipeline that is specialized to your data, most of your short-term work is complete! The next steps focus on preparing to scale.

Use cases for a pipeline can evolve with customer needs, and you wouldn’t want to repeat work for a small evolution in model or requirement. The solution is to generalize your pipeline. Some questions we ask at MAIDAP are:

  • Which details are too specific to the use case? Can you make them generalizable?
  • Are your classes too specific? Can you make them extensible using interfaces? 
  • Are your functions too tied down with specific work? Can you split them into a general function and a specific call of that general function, as in the sketch after this list?
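
As a sketch of that last point, with a hypothetical “churn” use case and scikit-learn used only for illustration, the general function takes its specifics as arguments and the specific call simply supplies them:

    def train_and_score(features, labels, model_factory, metric_fn):
        """General function: everything use-case-specific arrives as an argument."""
        model = model_factory()
        model.fit(features, labels)
        return model, metric_fn(labels, model.predict(features))

    def train_churn_model(features, labels):
        """Specific call of the general function for one hypothetical use case."""
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score
        return train_and_score(features, labels, LogisticRegression, accuracy_score)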

Abstracting your code separates concerns, making it easier to update, evolve, or fix one part of your pipeline without impacting the others. This quality is crucial for scaling a pipeline.

Configure Your Pipeline

As a byproduct of generalization, you’ll need to build configuration into your pipeline. Often, your first configuration will take the generalized pipeline and refocus it back on your initial use case. This process may seem unnecessary – why go through generalization and configuration at all if you end up back at the same result?

The advantage becomes clear when your pipeline starts to scale. Small adjustments accumulate over time, and with a configurable pipeline, any evolution is simply a matter of extending the interface and expressing the differences through configuration. Configuration enables you to scale your pipeline and experiments at will.
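
A minimal sketch of the idea, with hypothetical keys and a hypothetical run_pipeline() entry point, is a default configuration plus a loader that overlays per-experiment settings:

    import json

    DEFAULT_CONFIG = {
        "data_source": "local",          # which data store implementation to construct
        "model": "logistic_regression",  # which model factory to use
        "metrics": ["accuracy"],         # which intermediate results to log
    }

    def load_config(path=None):
        """Refocus the generalized pipeline on one use case via a config file."""
        if path is None:
            return DEFAULT_CONFIG
        with open(path) as f:
            return {**DEFAULT_CONFIG, **json.load(f)}

    # Scaling to a new experiment then means a new config file, not new code:
    # run_pipeline(load_config("experiments/churn_v2.json"))  # run_pipeline is hypothetical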

Optimize Your Pipeline

This is our last step, which may seem unintuitive. 

We recommend this step last because it is very easy to fall into the trap of optimizing from the very beginning. Early optimization can hurt your development process and introduce constraints that aren’t actually required, limiting your ability to scale your pipeline.

Once you have addressed your specific use case and gone through the generalization process, you will know your pipeline’s core use cases and major functionality. These should be your targets for optimization.
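
One simple way to find those targets, sketched here with the standard library’s cProfile and the hypothetical run_pipeline()/load_config() names from the earlier sketches, is to profile an end-to-end run of a core use case before changing anything:

    import cProfile
    import pstats

    def profile_core_use_case():
        """Profile the end-to-end run for one core use case before optimizing."""
        profiler = cProfile.Profile()
        profiler.enable()
        run_pipeline(load_config("experiments/churn_v2.json"))  # hypothetical entry point
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)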

Conclusion

We understand that one size does not fit all. These questions and approaches have helped us design, develop, and deliver machine learning projects that work at our scale. As with everything else, these guidelines reflect what we have learned through experience, and they will evolve as we learn more. For your requirements, they may be overkill, or they may not go far enough.

To learn more about MAIDAP, please check out our homepage. Each year we hire a new cohort of recent university graduates studying AI/ML, and if you’re interested in joining our program, please check the Microsoft University Hiring website this fall.
