Finding the right metrics for evaluating your bot

On Sep 23, 2018

Here’s the good news for companies building AI-powered virtual agents (aka bots) for customer service: Research from Accenture suggests that 80 percent of support chats or calls can be resolved by bots that have “good design.” The bad news is, the definition of good design changes depending on your business case.

Historically, many bot builders across industries have considered natural language understanding (NLU) a silver bullet for enabling extended, open-ended conversations. Indeed, too many implementers of failed bots have confessed to me that the plan all along was to support a handful of intents and have the bot learn phrasing and vocabulary variations “on the fly.” Unfortunately, that approach too often results in failure because NLU technology, while impressive and advancing rapidly, is not yet up to the task of artificial general intelligence.

The key to success is to design for a narrow use case and to deeply understand what users want out of the experience, including whether it’s an “I want to have a long conversation” scenario like a role-playing game (RPG), or an “I want this conversation to end as quickly as possible” one like customer service. Success is premised on understanding the difference and designing your bot accordingly (ideally, based on pre-existing customer interaction data, not guesswork and anecdote).

First, let’s talk about popular use cases for bots as of late 2018.

Why do people build bots?

My team at Chatbase keeps track of the types of bots using our analytics technology. The following chart — based on a sample of nearly 5,000 of these bots — suggests customer service and support is the most popular use case. We see this point mirrored in industry research about the growing prominence of AI-powered virtual agents for customer service. For example, Gartner has predicted that by 2020, 25 percent of all customer support interactions will involve virtual agents or bots, and Juniper Research predicts that AI-powered virtual agents will save banking and healthcare contact centers nearly $8 billion by 2022.

Clearly, customer service is currently the “killer app” for bots. But that use case doesn’t necessarily guarantee success.

Which bots succeed?

Our data also suggests a spectrum of engagement “intensity” across bots, as measured by metrics like number of turns (a turn being a single Q&A pair) or session time. The bots with the highest intensity boast double-digit turns or minutes per session, an impressive achievement. That said, intense engagement is not automatically a positive result. For some use cases, customer service for example, less intense engagement is the better outcome.
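To make the two intensity metrics concrete, here is a minimal sketch of how turns and session time could be computed from a message log. The log layout (session ID, sender, ISO timestamp) and the names LOG and session_intensity are assumptions made for illustration, not Chatbase's actual schema or API, and a turn is approximated as one user message on the assumption that each one receives a bot reply.

```python
from datetime import datetime
from statistics import median

# Hypothetical message log: (session_id, sender, ISO 8601 timestamp).
LOG = [
    ("s1", "user", "2018-09-01T10:00:00"),
    ("s1", "bot",  "2018-09-01T10:00:02"),
    ("s1", "user", "2018-09-01T10:00:30"),
    ("s1", "bot",  "2018-09-01T10:00:31"),
    ("s2", "user", "2018-09-01T11:00:00"),
    ("s2", "bot",  "2018-09-01T11:00:05"),
]


def session_intensity(log):
    """Return {session_id: (turns, seconds)} for each session in the log."""
    sessions = {}
    for session_id, sender, timestamp in log:
        sessions.setdefault(session_id, []).append(
            (sender, datetime.fromisoformat(timestamp))
        )

    result = {}
    for session_id, events in sessions.items():
        # Assumption: one user message plus its bot reply is one turn,
        # so counting user messages approximates the number of turns.
        turns = sum(1 for sender, _ in events if sender == "user")
        times = [moment for _, moment in events]
        # Session time is the span between the first and last message.
        seconds = (max(times) - min(times)).total_seconds()
        result[session_id] = (turns, seconds)
    return result


metrics = session_intensity(LOG)
median_turns = median(turns for turns, _ in metrics.values())
median_seconds = median(seconds for _, seconds in metrics.values())
```

Medians are used here rather than means because a handful of very long sessions can otherwise dominate the summary.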

A good way to evaluate bot success is to view the intensity of engagement relative to user expectations for conversation length. The chart below shows how different use cases would be mapped across those two dimensions:

In customer service scenarios, most users want answers (“What is my account balance?”) as quickly as possible, with minimal interaction — and they want to use their own personal phrasing. At the other end of the spectrum, success for RPG, social, and coaching bots is premised on lengthy sessions with multiple turns; users want to have as long a conversation as the bot can sustain. Successful bots meet the appropriate expectation.
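The idea of judging the same metric differently by use case can be sketched in a few lines. The mapping of use cases to expected conversation length follows the examples above, but the names EXPECTED_LENGTH and meets_expectation and the threshold of five turns are hypothetical placeholders, not values drawn from our data.

```python
# Hypothetical mapping from use case to the conversation length users expect.
EXPECTED_LENGTH = {
    "customer_service": "short",
    "rpg": "long",
    "social": "long",
    "coaching": "long",
}


def meets_expectation(use_case, median_turns, threshold=5):
    """Judge a bot's median turns against what its use case calls for.

    The threshold is an illustrative assumption; a real team would pick
    it from its own interaction data.
    """
    expected = EXPECTED_LENGTH[use_case]
    if expected == "short":
        # Short-conversation use cases succeed when sessions stay brief.
        return median_turns <= threshold
    # Long-conversation use cases succeed when sessions run long.
    return median_turns > threshold
```

With this sketch, a median of 12 turns counts as a success for an RPG bot and a failure for a customer service bot, which is the point: the number alone says little without the use case.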

Our data supports this conclusion: The median session time among the most popular customer support bots we’ve tracked is at the low-engagement end of the spectrum (

Context is everything

Although we don’t know the design principles behind these particularly successful bots, it stands to reason that deeply understanding user intents, including all the possible ways users might phrase their questions, would make those shorter conversations much easier to achieve. Conversely, without that understanding, most bots will struggle to recognize the user’s intent, leading to multiple conversation turns and extended session time.

By contrast, for those building bots whose primary purpose is to be social, the design needs to support longer conversations. Here, too, understanding user intent is key. If the bot is forced to rely on “fallback intents” (such as, “I didn’t get that, could you please rephrase the question?”), it could quickly frustrate users and lead to early exits.
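One rough way to watch for this problem is to track how often the bot falls back. The sketch below assumes each bot reply has already been labeled with the intent that was matched and that unmatched messages carry the label "fallback"; both the label and the function name fallback_rate are illustrative assumptions.

```python
def fallback_rate(matched_intents, fallback_label="fallback"):
    """Return the share of bot replies that used the fallback intent."""
    if not matched_intents:
        # No replies means no fallbacks; avoid dividing by zero.
        return 0.0
    fallbacks = sum(1 for intent in matched_intents if intent == fallback_label)
    return fallbacks / len(matched_intents)


# Hypothetical session: two of four replies were fallbacks.
example_rate = fallback_rate(["balance", "fallback", "hours", "fallback"])
```

A rising fallback rate alongside early exits would suggest that the bot is missing phrasing variations for intents it is supposed to handle.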

Deeply understanding user intents and all their variations doesn’t mean your work is done. Bot developers must also prioritize which intents to build and deploy first, and invest in solid copywriting and user experience planning for how the user journey should unfold.

In summary, even for narrow use cases, raw performance metrics are not sufficient for evaluating bot success. Instead, dev teams will need to look at those metrics in the proper context: How do they map to users’ expectations in the given use case? The faster bot builders come to this conclusion, the better off they will be.

Ofer Ronen is General Manager of conversational analytics service Chatbase. Previously, he was CEO of Pulse.io, an app performance monitoring service (acquired by Google) and CEO of ad network Sendori (acquired by IAC). Ronen is a startup mentor at Stanford and an angel investor in Lyft, Palantir, and Klout.