MAIDAP Blog — Beyond the metrics: modeling with UX in mind

Developing AI features that delight users goes beyond just performant models.

Would you trust a SQL query if it only gave you the correct result 85% of the time? For a typical software product, this would never ship; however, this can be a familiar experience when developing machine learning (ML) models.

Incorporating ML into features adds dimensions of consideration that are not present in typical rules-based feature development. ML models have their own idiosyncrasies: they’re not always right, they can have biases, and they evolve over time. These attributes require additional attention to create the desired outcomes for the user.

The solution is not to just build better models. Perhaps you have increased the Area Under the Curve (AUC) or the inference is blazing fast. These evaluation metrics can excite a team when it builds a feature – but users don’t care about these metrics.

Users care about the product’s value proposition and experience. ML is a powerful tool to get us there, but it takes a different strategy to overcome ML’s unique traits.

Ultimately, ML product success relies on an additional feedback loop – understanding the relationship between data science and users. Building ML features with a distinctly user-centric approach will help focus the data science on what matters most: creating intelligent experiences that bring value and delight to your customers.

At MAIDAP, we’ve developed more than 50 unique ML projects for a wide range of users. One thing has remained consistent and certain – ML features succeed when the user is the focus of the data science. Here are a few things we consider when adding ML to a feature.

The threshold of usability

Great mathematical model performance doesn’t always translate to great user value. The inverse is also true – average model performance doesn’t always equate to failure. What’s important is understanding – and eventually surpassing – the model performance threshold that divides a good user experience from a bad one.

The first iterations of your ML feature will involve discovering your users’ willingness to tolerate inaccuracies in your modeling results, which are often inevitable. The ‘threshold of usability’ is where a model’s usefulness outweighs its incorrectness from the user’s perspective. Fall below this line, and your users may churn due to frustrations with inaccuracies. Rise above this line, and your feature will reveal its core value proposition to your users.

A simple illustration of this can be found in something as commonplace as autocorrect in Microsoft Word. Autocorrect is not always correct, but it has surpassed a performance threshold such that most users will keep the feature turned on despite the occasional error. In contrast, imagine using autocorrect in its early days – if autocorrect were only sometimes correct, how many errors could you tolerate before turning the feature off?

Finding the threshold is an exercise in truly understanding what core functionality your users want. Surpassing this threshold and reaching a state of usability allows you to meaningfully validate the user experience. In addition, the user insights gathered along the way provide a foundation for subsequent iterations of the data science and engineering.
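
One way to make the threshold concrete is to compare the accuracy each user actually experienced with whether they kept the feature on. The sketch below is illustrative only: the telemetry fields (`accepted`, `shown`, `kept_enabled`), the ten accuracy buckets, and the 80% retention target are all assumptions made for the example, not a description of how any real product measures this.

```python
from collections import defaultdict


def estimate_usability_threshold(users, num_buckets=10, target_retention=0.8):
    """Estimate the lowest experienced accuracy at which users keep the feature.

    `users` is an iterable of (accepted, shown, kept_enabled) tuples, all
    hypothetical telemetry fields:
      accepted     - suggestions the user accepted (a proxy for "correct")
      shown        - suggestions the user was shown
      kept_enabled - True if the user still has the feature turned on
    Returns the lower bound of the lowest accuracy bucket such that it and
    every observed bucket above it meet `target_retention`, or None.
    """
    kept = defaultdict(int)
    total = defaultdict(int)
    for accepted, shown, kept_enabled in users:
        if shown == 0:
            continue  # no exposure, so no experienced accuracy
        # Integer arithmetic avoids floating-point bucket-boundary surprises.
        idx = min(accepted * num_buckets // shown, num_buckets - 1)
        total[idx] += 1
        kept[idx] += 1 if kept_enabled else 0

    threshold = None
    # Walk from the most accurate bucket downward; stop at the first bucket
    # where retention falls below the target.
    for idx in sorted(total, reverse=True):
        if kept[idx] / total[idx] < target_retention:
            break
        threshold = idx / num_buckets
    return threshold
```

With real data you would want far more users per bucket than a toy example provides, and you would treat the result as a starting hypothesis to validate with user research rather than a hard cutoff.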

However, reaching the threshold of usability is just the minimum viable product of your ML feature’s development. Another layer of considerations arises when a team begins to update its models.

A relationship: modeling improvements and user behavior

After reaching a threshold of usability, how do subsequent improvements in model performance change user experience? It may seem intuitive that improving model performance should equate to a proportional improvement in user value; however, this is not necessarily the case.

The correlation between model performance and user experience should be validated. It’s possible that incremental improvements to the model result in a similar growth in user value. Conversely, it’s also possible to find diminishing returns from improving the model performance. Whatever the case may be, it’s important to establish this feedback loop. The insights will help inform which model capabilities to prioritize in order to maximize feature value.
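
A lightweight way to start this feedback loop is to record one model metric and one user metric for each release, then look at how much user value each additional point of model performance bought. The sketch below assumes hypothetical inputs (model accuracy in percentage points and the percentage of active users who keep the feature on); any pair of metrics your team trusts could be substituted.

```python
def marginal_user_gain(releases):
    """Change in the user metric per unit change in the model metric.

    `releases` is a list of (model_metric, user_metric) pairs ordered by
    release. Both metrics are hypothetical placeholders, for example
    accuracy in percentage points and percent of users keeping the
    feature enabled. Returns one value per consecutive pair of releases,
    or None for a pair where the model metric did not change.
    """
    gains = []
    for (m0, u0), (m1, u1) in zip(releases, releases[1:]):
        if m1 == m0:
            gains.append(None)  # no model change, so the ratio is undefined
        else:
            gains.append((u1 - u0) / (m1 - m0))
    return gains


# Made-up numbers for illustration only.
history = [(80, 40), (85, 50), (90, 53), (95, 52)]
print(marginal_user_gain(history))
```

In this made-up history, each release adds five points of accuracy, but the user gain per point shrinks with every release and finally turns negative. A shrinking value suggests diminishing returns; a negative one signals that something other than the tracked model metric is now driving the user experience.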

Returning to the autocorrect example: Suppose that the team actively tracks user engagement against model performance and wants to test a new model. The new model is a jump in performance, with much better accuracy at correcting mistyped words; however, user adoption decreases.

Noticing this detachment between user behavior and model metrics can help reveal the most pressing user pain points. Maybe the autocorrect results are better, but the high-accuracy model is now just too slow in generating results for the user to quickly apply during their workflow. Improving the model accuracy further will not improve the user experience unless latency is solved first.

Ultimately, developing this understanding helps avoid the potential red herring of chasing ever-improved model metrics. There are many different metrics and characteristics that contribute to how a user experiences a model, and it remains important to refocus data science improvements on improving user experience.

Understanding the blind spots

It’s easy to overestimate the capabilities of a well-optimized ML model, but even the best models have blind spots and limitations.

Modeling blind spots can manifest in multiple ways, and these blind spots will be unique for each feature. While solving for every scenario is infeasible, product teams should try to understand where blind spots may arise. Perhaps the most prevalent and discussed of these blind spots is algorithmic bias.

Let’s once again return to autocorrect. Perhaps early iterations of autocorrect were trained on the English language. As a result of the underlying dataset, the resulting autocorrect model is particularly capable of handling typical English names. However, the model has a blind spot for non-English names – it marks names derived from non-English languages as misspellings of English words. How would it feel to be told by an AI that you spelled your own name incorrectly?

These blind spots will manifest in the user experience but may be hidden in model evaluation metrics or telemetry. Regardless of what the underlying cause may be, blind spots can only be addressed when an effort is made to find them. Some blind spots will be more critical to address than others, and it will be up to you to decide which to prioritize.
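
One common way to look for blind spots that an aggregate metric hides is to break the evaluation set into slices and compare each slice against the overall number. The sketch below is a minimal illustration under stated assumptions: the slice labels, the 10-point gap that counts as a blind spot, and the minimum slice size are hypothetical choices, and real slices would come from your own understanding of your users.

```python
from collections import defaultdict


def find_blind_spots(examples, max_gap=0.10, min_examples=20):
    """Flag slices whose accuracy trails overall accuracy by more than max_gap.

    `examples` is an iterable of (slice_name, is_correct) pairs, where
    slice_name is a hypothetical label such as "non_english_names".
    Slices with fewer than `min_examples` rows are skipped because their
    accuracy estimate is too noisy to act on.
    Returns (overall_accuracy, {slice_name: slice_accuracy}).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, is_correct in examples:
        total[slice_name] += 1
        correct[slice_name] += int(is_correct)

    n = sum(total.values())
    if n == 0:
        return None, {}

    overall = sum(correct.values()) / n
    flagged = {}
    for name in total:
        if total[name] < min_examples:
            continue
        accuracy = correct[name] / total[name]
        if overall - accuracy > max_gap:
            flagged[name] = accuracy
    return overall, flagged
```

Note that a slice too small to evaluate is not the same as a slice without problems. A thinly represented group in the evaluation data can itself be a sign of a blind spot worth investigating.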

Conclusion

Creating an ML feature requires different approaches from rules-based features. The main goal is not always to build the most performant model; the main goal is to empower your end users with intelligent support.

If these challenges sound interesting to you, check out the work our Program Managers do at MAIDAP – Microsoft New England.