The evaluation imperative: Measuring what matters in AI


Idea In Brief

Don’t confuse AI adoption with real impact

Uptake is easy to track, but as a metric it can hide poor-quality use and even incentivise low-value “workslop” that only creates more work.

Always evaluate AI across multiple domains

A useful evaluation spans adoption and capability, system performance, governance and risk, value delivery, and competitiveness, tailored to your goals and context.

Use systems thinking to measure real change

Treat AI as part of a wider system of people, processes, data, and governance, and test for flow-on effects like shifting bottlenecks and overreliance on outputs.

Do you know how your organisation will measure the impact of AI? 

Right now, most don’t. Some organisations, for fear of being left behind by their competitors, are rolling out AI tools rapidly, without much thought about how they will measure the impact these tools are having. Other organisations are holding back, concerned by the risks, both regulatory and reputational. This concern is often underpinned by a strong intuition that the reality of these tools just doesn’t match the hype. 

Both these organisational archetypes are missing the opportunity to systematically measure the impact that AI is having on their organisation, so that they can best realise the gains and mitigate or prevent the harms. After all, AI is rapidly changing how we work. It is driving economy-wide change. It is a complex system change, and it is happening now. 

It is imperative that organisations understand the impacts it is having to best unlock the value that it can produce and to ensure that investments in AI tools actually produce value for money. As we explain in this article, this often requires a systems-thinking approach.

Adoption metrics are just the tip of the iceberg

Where organisations are conducting formal evaluations of the impact of AI, many focus on uptake of AI tools. This is unsurprising, and it is hardly a novel problem: a common challenge in all evaluations is that inputs and activities are much easier to measure than outcomes and impacts. 

One problem with this is that uptake or adoption of AI tools tells you nothing about whether these tools are being used well: for example, whether they are enhancing productivity, improving quality and lifting organisational capability. 

More significantly, incentives matter. An overemphasis on measuring uptake of AI tools risks being counterproductive. It can lead to a rise in so-called "workslop" – a now-common term for AI-generated output that does not create real value, and that can actually create a lot of unnecessary work for the employees who have to navigate and interpret it. 

None of this is to say that adoption and uptake are not important – especially in risk-averse organisations beset by institutional inertia – but they are only one piece of the puzzle. 

AI evaluations should focus on a range of domains

Evaluating the impact of AI requires considering a range of domains. This recognises that AI is a complex technology, with the potential to greatly enhance individual and organisational productivity, but also with considerable risks if it is not ethically and responsibly deployed. 

The appropriate domains will differ depending on the context, but the list below provides a rough guide to the focus areas that AI evaluations need to consider.  

They can be articulated and operationalised in different ways depending on the objective of your evaluation; for example, as a maturity model (if you are conducting a developmental evaluation) or in terms of a value chain (if your concern is to make the case for further investment). 

Evaluation domain | Focus question | Example methods or metrics

Adoption

Confidence, experience and capability 

Are employees adopting and effectively using AI tools in their daily work? 

User analytics give insight into metrics like frequency of use and feature adoption.

User surveys supplement these numbers with qualitative insights into confidence, experience and perceived usefulness. 

Pairing quantitative usage data with targeted survey questions gives a more complete picture of how adoption is translating into capability.

System performance

Reliability and integration 

Do the tools function reliably in your organisation’s operating context? 

Benchmarks and error rate tracking provide objective measures of reliability, but these need to be contextualised within your organisation's operating environment. 

A/B testing or time-on-task comparisons can demonstrate whether tools are genuinely improving workflows. 

Governance & risk 

Governance, risk and responsible use

Is the use of the tools being managed in a safe, responsible and compliant way? 

Internal policy reviews and compliance checks establish whether use is occurring within agreed boundaries. 

Stakeholder interviews and awareness assessments help gauge whether those boundaries are understood in practice. 

Value

Strategic alignment, value and impact

What is your current strategic intent and are AI tools delivering outcomes aligned with this? 

Outcome mapping against objectives provides a qualitative picture of alignment. 

Quantitative measures such as increases in efficiency or productivity can provide greater rigour. 

Subjective measures of quality or customer satisfaction can round out the picture.

Competition

Competitiveness and future readiness 

Is your adoption and use of AI tools competitive with your peers in the sector?

Sector benchmarking through maturity models or industry surveys offers a relative positioning. 

Capability gap analysis – mapping your current AI use against emerging applications in your sector – can highlight where you're falling behind or pulling ahead. 
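Several of the quantitative methods in the table can be prototyped directly from a tool's usage log. As a minimal sketch, the code below computes an adoption rate, per-feature uptake, and a rough frequency-of-use measure; the event log fields, users and features are entirely hypothetical, and a real analysis would draw on your platform's actual analytics export.

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage log: one record per AI-tool interaction (invented data).
events = [
    {"user": "a", "feature": "summarise", "day": date(2025, 3, 3)},
    {"user": "a", "feature": "draft", "day": date(2025, 3, 4)},
    {"user": "b", "feature": "summarise", "day": date(2025, 3, 3)},
    {"user": "c", "feature": "summarise", "day": date(2025, 3, 10)},
]
licensed_users = {"a", "b", "c", "d"}  # everyone with access to the tool

# Adoption rate: share of licensed users who used the tool at all.
active = {e["user"] for e in events}
adoption_rate = len(active) / len(licensed_users)

# Feature adoption: how many distinct users touched each feature.
by_feature = defaultdict(set)
for e in events:
    by_feature[e["feature"]].add(e["user"])

# Frequency of use: distinct active days per user (a rough engagement proxy).
days_per_user = defaultdict(set)
for e in events:
    days_per_user[e["user"]].add(e["day"])

print(f"adoption rate: {adoption_rate:.0%}")       # 75%
print({f: len(u) for f, u in by_feature.items()})  # users per feature
```

Numbers like these are the easy part; as the article stresses, they only become meaningful when paired with survey evidence on confidence and perceived usefulness.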

The importance of a systems-thinking approach

Evaluating the impact of AI in your organisation is often best enabled through a systems-thinking approach. This involves treating a system itself as the object of an evaluation, rather than a single program, tool, or intervention. System-level evaluations are a core part of Nous’ repertoire (see the callout box below).

This approach is well suited to AI because of the complex ways that AI tools can affect organisational outcomes. A systems-level evaluation begins from the recognition that AI models don't exist in a vacuum. AI models are deployed in applications, which are deployed in systems that include the people, processes, data and governance arrangements that shape how AI is actually used. 

A systems-thinking approach to an AI evaluation involves asking questions about how people adapt their workflows, how decision-making processes change, whether new risks or dependencies emerge, and whether the intended benefits actually materialise in practice.

Nous experience conducting system-level evaluations

Nous has considerable experience conducting system-level evaluations for clients across the private, not-for-profit and government sectors. We have evaluated some of society's most complex policy reforms, programs and investments.

Our approach to evaluations is documented here.

System-level evaluation may sound complicated, but at a practical level it involves being attentive to the various flow-on effects of the use of AI tools. For example, if you have recently deployed Copilot (or ChatGPT Enterprise or similar) and are seeking to understand its effects, a systems-based approach involves asking questions like: 

  • Is AI-assisted work actually better, or just faster?
  • Does what you're seeing connect to the outcomes your organisation actually cares about?
  • Have efficiency gains in one part of a workflow created a bottleneck somewhere else?
  • Are employees sufficiently critical of Copilot outputs (e.g., meeting and document summaries)?
  • Is there evidence that Copilot is becoming a substitute rather than a complement for thought? 
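The first question – better, or just faster? – can be made concrete with even simple measurements. The sketch below compares a hypothetical AI-assisted group against a control group on both time-on-task and reviewer-scored quality, using Welch's t-statistic for each difference; all numbers are invented, and a real evaluation would use a proper statistical package and a larger sample.

```python
import math
from statistics import mean, variance

# Hypothetical data: task times (minutes) and reviewer quality scores (1-10)
# for work produced with and without AI assistance. All numbers are invented.
ai_time = [31, 28, 35, 30, 27, 33, 29, 32]
control_time = [42, 39, 45, 41, 38, 44, 40, 43]
ai_quality = [7, 6, 8, 7, 6, 7, 8, 7]
control_quality = [7, 8, 7, 7, 8, 7, 7, 8]

def welch_t(a, b):
    """Welch's t-statistic: difference in means over the pooled standard error."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

speed_gain = (mean(control_time) - mean(ai_time)) / mean(control_time)
t_time = welch_t(ai_time, control_time)
t_quality = welch_t(ai_quality, control_quality)

print(f"{speed_gain:.0%} faster (t = {t_time:.1f})")
# A large |t| on time but a small |t| on quality suggests the work is
# faster but not demonstrably better -- exactly the distinction the
# first question above is probing.
```

The same pattern extends to the other questions: instrument the step before and after the AI-assisted one to spot shifting bottlenecks, or sample outputs for critical review to test for overreliance.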

This case study provides an example of how Nous approached evaluating the world's largest whole-of-government Copilot trial.

In these ways, organisations can understand the impact of AI tools through systems thinking.

Get in touch to discuss how your organisation can evaluate the impact of AI. 

Connect with Joshua Sidgwick, Charlotte Bradley, and Heidi Wilcoxon on LinkedIn.