
Good AI evaluation is (mostly) just good evaluation: Practical tips for evaluating AI rollouts 


Idea In Brief

“Is it worth it?” is a valid evaluation question

AI licences are expensive. Whether AI improves quality or instead creates “workslop” is not an abstract question.

Don’t accept off-the-shelf ROI reporting, and don’t rely on one method

Vendor dashboards can overcount activity and undercount value. Context-specific frameworks can explain what’s really happening.

Segment by what people do, not who they are

Without role- and seniority-based segmentation, you’ll miss the adoption patterns that determine where value (or resistance) actually sits.

Many organisations are in the midst of rolling out AI assistants to their workforces and trying to work out whether it is worth it. Do these tools deliver value for individuals and businesses by saving time and improving quality? Or does the rise of AI-induced “workslop” create more trouble than it’s worth? These are important questions, not least because licences for tools like Microsoft 365 Copilot and ChatGPT do not come cheap. The question of whether those dollars are buying anything real is not abstract.

Nous has a large evaluation practice that spans many methods and approaches. Evaluators are not typically in the business of assessing software adoption this rigorously, but AI is both expensive and a general-purpose technology with truly transformational potential for workplaces. So, it is perhaps unsurprising that we’ve recently been undertaking a range of AI evaluations, working with clients across a range of sectors to understand whether, and to what extent, deploying AI tools has been valuable for their organisations.

In this article, we peek under the hood and identify some of the key lessons we’ve learned from these projects.

Be wary of off-the-shelf reporting

Every major AI vendor now ships a default “ROI framework” alongside the product. Microsoft has one for Copilot. The dashboards are slick. The survey instruments are ready to go. But we find they are rarely well suited to the evaluations we undertake.

The reasons are twofold. First, some of the vendor data can be systematically biased in favour of the tool: it counts prompts and accepted suggestions, not whether the output was any good or whether the claimed time savings were real. Second, each organisation needs to decide which metrics matter to it, given its operating context, and these will not necessarily be the ones the vendor has selected for its off-the-shelf reports.

The practical implication: your evaluation framework needs to be built, or at least scrutinised, by someone who knows your context and what you are trying to achieve. If you inherit one from the vendor, treat it as a starting draft at best.

Use mixed methods and take the qualitative side seriously

The corollary to not trusting any single data source is that you need more than one. Quantitative data – telemetry, survey responses, time-use estimates – tells you about the shape and scale of what is happening. Qualitative data – interviews, focus groups, observational work – tells you why it is happening, and whether what looks like engagement is in fact value.

Neither side is optional. Adoption metrics on their own are particularly misleading for AI tools: usage is wildly uneven across an organisation, and two cohorts with similar prompt counts can be getting radically different things out of the product. One might be using Copilot to draft serious analytical work; the other might be using it to reformat the same three emails every morning. The numbers do not tell you which is which.

The qualitative work is where the interesting findings live. In every engagement we have run, there are pockets of genuine excellence – teams that have worked out a way to use the tool that is materially changing how they work – and pockets of resistance, where the tool is being ignored or worked around for reasons that are usually substantive rather than irrational. Both groups have something important to teach the organisation, and neither shows up clearly in the dashboards. Getting under the hood through structured interviews, cohort-specific focus groups, and role-based round tables is how you find them.

Work to get the baseline data

If you want to claim Copilot saved your organisation 500 hours in “writing meeting minutes”, you need to know two things: how many of your people were writing meeting minutes in the first place, and how much time they were spending on the tasks Copilot now helps them with. If either number is missing or wrong, the headline figure is fiction, no matter how impressive the arithmetic looks.
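To see how sensitive that arithmetic is, here is a back-of-envelope sketch in Python. Every figure is hypothetical, standing in for numbers the evaluation itself must supply:

```python
# Hypothetical back-of-envelope check on an "hours saved" claim.
# Every input is an assumption the evaluation must establish,
# not a number the tool can report about itself.

staff_writing_minutes = 120    # baseline: people who wrote minutes at all
minutes_before = 50            # baseline: avg minutes per person per week
minutes_after = 35             # measured post-rollout
weeks = 26                     # evaluation window

hours_saved = staff_writing_minutes * (minutes_before - minutes_after) * weeks / 60
print(f"Estimated hours saved: {hours_saved:,.0f}")  # 780 with these inputs

# Shrink the baseline population to 60 and the pre-rollout estimate to
# 40 minutes, and the same arithmetic yields only 130 hours.
```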

This sounds obvious. In practice it is the single most common place evaluations fall down. Organisations want to start measuring now, in the moment, and they discover too late that there is no pre-intervention data to compare against. You cannot retrofit a baseline.

The fix is to get measurement in place before the rollout, or at the very least early in it: a lightweight pulse survey capturing self-reported time on key tasks, a snapshot of organisational telemetry on tool usage, and a clear statement of what “good” currently looks like for the functions being targeted.

Even if you don’t do this, you can still find useful information in unexpected places. Staff retention data, for example, is routinely collected by many organisations, and some have used it to gauge the impact of access to tools like Copilot. These figures can form the basis for a pre/post-intervention comparison.
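As a minimal sketch of what that comparison might look like (the team names and retention rates are hypothetical, and in practice you would control for confounders such as restructures or labour market shifts before attributing any change to the tool):

```python
# Illustrative pre/post comparison built on routinely collected retention data.

retention = {
    # team: (12-month retention before rollout, after rollout)
    "policy":   (0.86, 0.91),
    "finance":  (0.82, 0.83),
    "delivery": (0.88, 0.87),
}

for team, (before, after) in retention.items():
    delta_pts = (after - before) * 100
    print(f"{team:>8}: {before:.0%} -> {after:.0%} ({delta_pts:+.1f} pts)")
```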

Make sure you get segmentation right

A standard program evaluation in the Australian public sector will break results down by a range of demographic variables: First Nations status, culturally and linguistically diverse background, socioeconomic indicators, gender and location. These are the right cuts for most social policy work.

However, they are not necessarily the right cuts for an AI tool evaluation. What drives variation in Copilot usage is not who you are, but what you do. Consider a recent example from the Australian Public Service: seniority matters enormously. An APS3 uses a generative AI assistant very differently from an EL1, and both differ from an SES officer. Job function matters even more.

Moreover, AI is polarising inside organisations. We find that the AI enthusiasts are the first to show up to focus groups. If you do not work actively to recruit sceptics and resisters, your qualitative data will agree enthusiastically with the vendor's telemetry, and both will be wrong in the same direction.

The implication runs through the whole evaluation design: sampling, survey segmentation, focus group composition, round-table facilitation. If you do not segment your sample by the relevant factors affecting the use of AI tools, findings will be neither representative nor useful. 
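To make this concrete, here is a sketch of role- and seniority-based segmentation on a hypothetical survey extract, using pandas (the cohorts and figures are invented for the example):

```python
import pandas as pd

# Hypothetical survey extract: segment by what people do, not who they are.
responses = pd.DataFrame({
    "seniority":        ["APS3", "APS3", "EL1", "EL1", "SES", "SES"],
    "job_function":     ["admin", "admin", "policy", "policy", "exec", "exec"],
    "weekly_prompts":   [42, 38, 12, 15, 3, 5],
    "self_rated_value": [4, 5, 3, 4, 2, 2],  # 1-5 scale
})

# The overall average hides very different usage patterns per cohort.
print(responses[["weekly_prompts", "self_rated_value"]].mean())
print(responses.groupby(["seniority", "job_function"]).mean(numeric_only=True))
```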

Efficiency and quality are different questions

Most Copilot evaluations focus on time savings. The numbers are surprisingly consistent across our engagements: in the order of an hour a day for active users, a couple of hours a week on average across a cohort. That is a meaningful gain, and it is materially less than most executives expect going in.
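The two figures are easy to reconcile once you account for uneven adoption. A quick sketch, with the active-user share as a purely illustrative assumption:

```python
# Hypothetical reconciliation: an hour a day for active users is consistent
# with ~2 hours/week across the whole cohort if only a minority are active.

active_share = 0.40            # assumption: 40% of licence holders are daily users
hours_per_week_active = 5.0    # roughly an hour per working day
cohort_average = active_share * hours_per_week_active
print(f"Cohort-average hours/week: {cohort_average:.1f}")  # 2.0
```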

The more interesting question is what happens to quality. The honest answer is that quality outcomes are much harder to measure than time savings, and highly context-dependent: what counts as improved quality in a policy brief is not what counts as improved quality in a research grant application or a board paper.

The trade-off between the two is also real. Take meeting transcription, a function most organisations have switched on by default. It is almost certainly delivering an efficiency gain. It may also be producing a quieter cost: taking notes is itself a way of processing a conversation, and outsourcing it to a tool that produces a clean summary can make a meeting feel better run than it was, while thinning out the discussion it was meant to capture. Measure time saved, and measure output quality. Measure them separately, with different instruments. The interesting finding is almost always in the gap between the two.

Widen the frame beyond productivity

Most Copilot evaluations are scoped as productivity studies, and most of them should be scoped more broadly. Two pillars belong in the framework alongside individual and organisational productivity.

The first is governance and risk. Is there clear authority over how AI is being used? Are the guardrails around sensitive data working? Is there a process for someone to raise a concern? These questions are often treated as the IT team's problem and kept out of the evaluation, and the result is that material risks get surfaced late, or not at all.

The second is the technology itself: its fitness for purpose, its integration with the organisation's data estate, its interoperability with the tools people already use. Copilot is considerably more valuable when it is hooked into an organisation's assets properly than when it operates as a glorified text assistant, and the gap between those two states is usually an IT architecture question, not a user training question.

An evaluation that measures productivity but says nothing about governance or technical fit is half an answer. Sometimes it is a misleading one.

Evaluate the procurement decision, and bargain as a coalition

We have found that the organisations that get the most out of their evaluations use them to answer a forward-looking question, not a backward-looking one. The backward-looking question is “Did Copilot pay off?” The forward-looking question is “Given what we now know, what should our AI strategy and procurement posture be from here?”

In one engagement, the most valued output was not the estimate of time saved, but rather the analysis that said: do not sole-source your AI capability from a single vendor. The technology is moving too fast, experimentation across tools still has genuine value, and locking yourself in removes the pricing and strategic leverage you would otherwise have.

That leverage grows considerably if you use it collectively. Large buyers – public sector clusters, industry associations, university consortia – punch well below their weight in how they procure AI capability. Bargaining as a coalition, with a shared view of what good looks like and a shared willingness to walk away, produces better fee structures and terms than any single organisation can negotiate alone. This belongs inside the evaluation’s key lines of inquiry, not outside them.

Get in touch to discuss how you can evaluate the impact of AI in your organisation.

Connect with Joshua Sidgwick, Charlotte Bradley, Will Prothero, and Ned Lis-Clarke on LinkedIn.

This is the third and final article in our series on AI evaluation. Read the first article here and the second article here.