
Impact evaluations: Delivering rigorous results within real-world constraints


Idea In Brief

Impact evaluations promote better outcomes

Impact evaluations aim to answer the critical question: did the program or policy make a difference?

RCTs are not always feasible or appropriate

Conducting an RCT can be resource-intensive, requiring substantial time and financial investment to deliver effectively.

Strong causal reasoning is integral

Understanding and clearly articulating the intended chain of causation between an intervention and the observed impacts are central to impact evaluations.

When governments and community organisations make decisions about whether to fund a program or introduce a policy, having evidence about the extent to which it works is vital. It is the difference between decision-making based on evidence, expertise and experience, and decision-making based on inertia, conjecture, heuristics or ideology. While evaluations of all kinds provide evidence to decision makers, impact evaluations stand out for their focus on understanding the effects of programs and policies through counterfactual reasoning. They ask: what would have happened in the absence of the intervention? 

Recently, there has been a greater focus across government on conducting high-quality impact evaluations – a trend embodied in the establishment of the Australian Centre for Evaluation (ACE) and one that has been enthusiastically championed by Assistant Minister Andrew Leigh. This trend is welcome. Impact evaluations help to promote effective and accountable governance, judicious use of taxpayer money and better outcomes for the beneficiaries of policies and programs. 

Impact evaluations are often regarded as synonymous with randomised controlled trials (RCTs) – and indeed RCTs are often ideal. But RCTs are not always feasible within the constraints of time, resources and the policy development lifecycle. And sometimes they are not appropriate. Fortunately, RCTs are not the only option. There are many different approaches to delivering impact evaluations that can provide rigorous methods and robust findings, all while ensuring that evaluations have real-world use.

This article describes how Nous approaches delivering impact evaluations in a range of contexts. 

Impact evaluations are vital to ensure that policies and programs are evidence-based

Impact evaluations aim to answer the critical question: did the program or policy make a difference?  Perhaps all participants in a social services or jobs program had improved health outcomes and employment prospects; but if everyone in the population also experienced these benefits, then it is unlikely the program caused this change. By understanding attribution, impact evaluations allow policymakers to distinguish between correlation and causation. In an era where governments must justify every dollar spent, impact evaluations provide crucial insights to ensure that limited resources are allocated effectively.

Despite their importance, impact evaluations have historically been under-used or done poorly. This is unsurprising. Compared to other types of evaluation, they are resource intensive, theoretically complex, methodologically rigorous and take time to deliver. In their absence, decision-makers may assume that achieving short-term outcomes automatically translates into long-term impact, but this assumption does not necessarily hold. Without robust evidence, decision-makers may continue ineffective policies or miss opportunities to achieve better outcomes.

RCTs are often ideal to measure impact, but they are not always feasible or appropriate

The core challenge of impact evaluations is to establish a defensible counterfactual: what would have happened without the program or policy? This might sound simple, but it is no mean feat – after all, the entire history of the universe is an experiment with sample size n=1. In lieu of a time machine, assessing counterfactuals is, and must be, an exercise in careful inductive reasoning, not iron-clad deduction.

RCTs are widely regarded as the benchmark for establishing a counterfactual in impact evaluations. By randomly assigning eligible program or service participants to either the intervention/treatment group or the control group (where participants experience no intervention, or the existing service), RCTs seek to ensure that any differences observed between the two groups can be confidently attributed to the intervention itself, rather than external factors. Done well, this randomisation process should mitigate biases and confounding variables as far as possible, thereby providing a robust and credible basis for understanding the true impact of a program.
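To make that logic concrete, the sketch below shows how random assignment and a simple comparison of group means might look in code. It is illustrative only – the participant numbers, outcome scores and use of Python's NumPy library are assumptions for the example, not drawn from any particular evaluation.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative only: 500 eligible participants.
n_participants = 500

# Randomly assign each participant to treatment (1) or control (0).
assignment = rng.integers(0, 2, size=n_participants)

# Hypothetical outcome scores observed after the program runs.
# (In a real RCT these would come from follow-up data collection.)
outcomes = rng.normal(loc=50, scale=10, size=n_participants) + 3 * assignment

treatment_mean = outcomes[assignment == 1].mean()
control_mean = outcomes[assignment == 0].mean()

# With successful randomisation, this difference estimates the average
# treatment effect, because the two groups should be balanced on both
# observed and unobserved characteristics.
print(f"Estimated effect: {treatment_mean - control_mean:.2f}")
```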

However, while RCTs provide a rigorous method for establishing attribution between an intervention and an outcome, they are not always feasible. Conducting an RCT can be resource-intensive, requiring substantial time and financial investment to deliver effectively – and it must typically be established at the same time as the program is designed and mobilised. Where evaluations are conducted late in the lifecycle of a policy or program, it may not be possible to conduct an RCT. Moreover, the ethical challenges of randomising participants – such as withholding potentially beneficial interventions from the control group – can mean that, quite apart from feasibility, RCTs may not be appropriate in some cases.

There is a myriad of options to assess the impacts of policies and programs

Fortunately, while RCTs are a powerful tool in the evaluator’s arsenal, they are just one tool. In scenarios where they are impractical or unethical, evaluators often turn to alternative methods. Three broad approaches to impact evaluations can be delineated (see below).

General approaches to impact evaluations

  • Experimental designs randomly assign participants to treatment and control groups to isolate the causal impact of interventions. While RCTs are the main example of this approach, other experimental designs include stepped wedge designs (which randomise the timing of the intervention) and cluster randomised trials (which randomise groups of participants rather than individuals).
  • Quasi-experimental designs comprise a range of different forms of analysis that typically combine natural observations and statistical methods to construct a counterfactual, whether over time or by groups of participants, to approximate the conditions of an experiment.
  • Non-experimental designs do not attempt to create separate control and treatment groups. Instead, they consider whether evidence is consistent with what would be expected (and potentially what has been already demonstrated in prior research), whether the intervention was producing the intended impact, and whether other factors could provide an alternative explanation.

Within these general categories, there is a broad spectrum of possible evaluation designs that evaluators can select from to understand impact (see Figure 1 below). The appropriate design will depend on a range of factors, notably the complexity of the policy or program and the ability to define a robust control. Each design can include a range of quantitative and qualitative methods to provide an account of the outcomes achieved and the degree to which they can be attributed to the intervention. 

Irrespective of the design and method, strong causal reasoning is integral to quality impact evaluations. This involves understanding and clearly articulating the intended chain of causation between an intervention and the observed impacts. Nous does this in a variety of ways. For example, clearly articulating a theory of change in conjunction with policy or program designers and the affected community can provide a conceptual basis to understand causation. This can helpfully be reinforced through more technical tools like directed acyclic graphs (or DAGs) which are graphical representations that model causal relationships between variables.  
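As a simple illustration of the DAG idea, the sketch below encodes a hypothetical theory of change for a jobs program as a directed graph and checks that it contains no cycles. The node names and the use of the networkx library are assumptions for illustration, not a model from any specific engagement.

```python
import networkx as nx

# Hypothetical causal model for a jobs program: each edge reads
# "cause -> effect". A confounder such as prior work experience points
# into both the intervention and the outcome.
dag = nx.DiGraph()
dag.add_edges_from([
    ("Prior work experience", "Program participation"),
    ("Prior work experience", "Employment outcome"),
    ("Program participation", "Skills gained"),
    ("Skills gained", "Employment outcome"),
])

# A well-formed theory of change should contain no feedback loops.
assert nx.is_directed_acyclic_graph(dag)

# Listing paths from intervention to outcome makes the intended causal chain explicit.
for path in nx.all_simple_paths(dag, "Program participation", "Employment outcome"):
    print(" -> ".join(path))
```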

Nous has considerable experience deploying a wide range of designs and methods for impact evaluations, a selection of which are described below. 

Figure 1 | A spectrum of impact evaluation design options

Natural experiments

Natural experiments leverage real-world circumstances where groups are exposed to different conditions in ways that mimic random assignment. These situations arise when external factors inadvertently create a treatment group and a comparison group. This method leverages natural variation in exposure to the intervention, such as policy changes, natural disasters, or other external shocks, comparing outcomes between those affected and unaffected. 

By comparing outcomes across these groups, natural experiments can provide valuable insights into the causal impacts of programs or interventions when RCTs are infeasible or inappropriate. Limitations include the potential for confounding variables that are not controlled for and the difficulty in finding appropriate natural experiments that align with the research questions. 

In an evaluation of a portfolio of innovation initiatives led by a state government, Nous used a natural experiment method to quantify impacts. This involved comparing the performance of the state with similar jurisdictions over the same period. We also compared the performance of the state before and after the initiative and the performance of initiative recipients with non-recipients. By developing a series of counterfactuals that each offered different insights on the intervention, we were able to provide a robust and credible account of the economic impacts of the initiatives on the state economy. 
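The before-and-after, recipient-and-non-recipient comparisons described above can be summarised with a difference-in-differences calculation. The sketch below shows the core arithmetic on a small, entirely hypothetical panel of jurisdiction-level outcomes using pandas; it is not the model used in the evaluation.

```python
import pandas as pd

# Illustrative panel: one row per jurisdiction per period, with a hypothetical
# economic outcome. "treated" marks the state that received the initiatives;
# "post" marks the period after they were launched.
df = pd.DataFrame({
    "jurisdiction": ["A", "A", "B", "B", "C", "C"],
    "treated":      [1,   1,   0,   0,   0,   0],
    "post":         [0,   1,   0,   1,   0,   1],
    "outcome":      [100, 112, 98,  103, 101, 105],
})

# Mean outcome for each combination of treated/comparison and before/after.
means = df.groupby(["treated", "post"])["outcome"].mean()

# Difference-in-differences: the treated state's before/after change, net of
# the before/after change in the comparison jurisdictions.
did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(f"Difference-in-differences estimate: {did:.2f}")  # 7.50 with these illustrative numbers
```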

Synthetic control groups

Synthetic control groups can simulate the conditions of an RCT through statistical methods such as matching techniques, regression discontinuity, or other statistical controls. Such approaches allow evaluators to approximate a counterfactual and draw causal inferences about program impacts, even when random assignment or naturally occurring comparisons are not available.

One advantage of synthetic control groups is their ability to improve causal inference when a single, comparable control group is not available. Nevertheless, limitations include the complexity of the method, the need for rich data on many potential control units, and challenges in selecting the appropriate weights to create a valid synthetic control.

In an impact evaluation for the Business Research and Innovation Initiative grants program, Nous developed a synthetic control group by using ABS data and a nearest neighbour matching method that leveraged anonymised financial records. We used this to complete a cost-benefit analysis that quantified how effective the grants program was in facilitating business investment in research and development. The evaluation report (which is available here) included recommendations for using datasets developed through the evaluation to support cost benefit analyses of future rounds. 
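To illustrate the matching step, the sketch below pairs each hypothetical grant recipient with its most similar non-recipient on observed covariates using a nearest neighbour search. The firm-level features, sample sizes and use of the scikit-learn library are assumptions for illustration, not the actual data or pipeline used in the evaluation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(seed=0)

# Hypothetical firm-level covariates (e.g. turnover, headcount, prior R&D spend),
# standardised so each feature contributes comparably to the distance metric.
grant_recipients = rng.normal(size=(20, 3))   # firms that received grants
candidate_pool = rng.normal(size=(200, 3))    # non-recipient firms

# For each recipient, find the most similar non-recipient on observed covariates.
matcher = NearestNeighbors(n_neighbors=1).fit(candidate_pool)
distances, indices = matcher.kneighbors(grant_recipients)

# The matched firms form a synthetic comparison group whose outcomes can then
# be contrasted with recipients' outcomes in a cost-benefit analysis.
synthetic_control = candidate_pool[indices.ravel()]
print(synthetic_control.shape)  # (20, 3)
```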

Interrupted time series analysis

Interrupted time series (ITS) analysis examines changes in an outcome variable by analysing data points collected at multiple intervals before and after the implementation of an intervention. By comparing post-intervention observations against the pre-intervention trend, ITS analysis can help determine whether observed changes in outcomes can be attributed to the program rather than to underlying trends or seasonal effects.

This method is especially useful for time-dependent data. However, it generally requires long-term data collection, can be susceptible to external events affecting the outcome, and it can be difficult to attribute causality using this method when multiple interventions or changes occur simultaneously.

In an evaluation of a series of programs designed to reduce the number of people under 65 years old living in residential aged care, Nous analysed aged care and NDIS data to conduct ITS analysis, using a multivariate linear model to estimate the impact of these initiatives. This method was coupled with engagement of over 300 stakeholders, including people living in residential aged care and service providers. This mixed-method approach helped our client to understand the impact of their investment and the factors that enabled and hindered change.
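A common way to operationalise ITS is segmented regression, which estimates the pre-intervention trend, the change in level at the intervention point, and any change in trend afterwards. The sketch below shows this on simulated monthly data; the variables and the use of the statsmodels library are illustrative assumptions rather than the model used in the evaluation described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)

# Hypothetical monthly series: 24 months before and 24 months after an
# intervention, with a gentle pre-existing downward trend and a further
# drop in level once the intervention starts.
months = np.arange(48)
post = (months >= 24).astype(int)
outcome = 200 - 0.5 * months - 15 * post + rng.normal(scale=3, size=48)

df = pd.DataFrame({
    "time": months,
    "post": post,
    "time_since": np.where(post == 1, months - 24, 0),
    "outcome": outcome,
})

# Segmented regression: baseline trend (time), level change at the
# intervention (post), and change in trend after it (time_since).
model = smf.ols("outcome ~ time + post + time_since", data=df).fit()
print(model.params[["post", "time_since"]])
```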

Contribution analysis

Contribution analysis is a theory-based evaluation method that focuses on understanding and validating the causal mechanisms through which an intervention leads to outcomes. It involves constructing a theory of change, gathering evidence to support or refute the causal linkages, and developing a narrative to explain how and why certain outcomes occurred.

The primary advantage of contribution analysis is its ability to provide a nuanced understanding of complex interventions, particularly those with multiple components and indirect effects. However, its limitations include the subjective nature of building and interpreting the theory of change, potential bias in evidence collection, and the difficulty in quantitatively measuring the precise contribution of the intervention amid other influencing factors.

In a rapid evaluation of a state-wide mental health program for women with complex mental health concerns and their families, Nous used a range of quantitative methods to evaluate program impacts. This included descriptive analysis to understand clients, services and packages; association analysis to determine program outcomes for women across a spectrum of needs; hypothesis testing to determine whether changes in outcomes were statistically significant; and regression analysis to identify associations between clinical outcome measures and other factors. We also conducted contribution analysis by using Cohen’s D to estimate the extent to which observed outcomes could be attributed to the program. 
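For readers unfamiliar with the effect size mentioned above, the sketch below shows one common way to compute Cohen's d as a standardised mean difference using a pooled standard deviation. The outcome scores are simulated purely for illustration.

```python
import numpy as np

def cohens_d(treated, comparison):
    """Standardised mean difference using a pooled standard deviation."""
    n1, n2 = len(treated), len(comparison)
    pooled_var = (
        (n1 - 1) * np.var(treated, ddof=1) + (n2 - 1) * np.var(comparison, ddof=1)
    ) / (n1 + n2 - 2)
    return (np.mean(treated) - np.mean(comparison)) / np.sqrt(pooled_var)

rng = np.random.default_rng(seed=7)

# Hypothetical clinical outcome scores for program clients and a benchmark group.
program_scores = rng.normal(loc=6.0, scale=2.0, size=80)
benchmark_scores = rng.normal(loc=5.0, scale=2.0, size=80)

d = cohens_d(program_scores, benchmark_scores)
print(f"Cohen's d: {d:.2f}")  # values around 0.5 are conventionally read as a medium effect
```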

The most appropriate method usually depends on desirability, feasibility and ethics

Choosing which design and method to use for an impact evaluation involves a series of considerations. The criteria that we usually find to be especially important are set out in Figure 2. While all are important, desirability is usually the first-order question. Ethics and feasibility are important hurdles for research design, but these criteria need only be assessed if an approach is first judged to be desirable.

Figure 2 | Key criteria to decide the method for an impact evaluation

Tailor your approach to the context

As governments and businesses strive to make evidence-based decisions, impact evaluations will become increasingly important. But they are not a one-size-fits-all exercise. 

The challenge – and the opportunity – is to adapt methodologies and approaches to meet the constraints and objectives of each context. A commitment to rigour – coupled with a pragmatic mindset – can ensure that evaluations are credible, robust and able to achieve real-world influence.

Get in touch to discuss how you can design and deliver high-quality, fit-for-purpose impact evaluations.

Connect with Joshua Sidgwick on LinkedIn. 

Prepared with input from Virginia Wong, Robert Sale, Annette Madvig, and Heidi Wilcoxon

This is the third in our four-part series of articles on good practice evaluation. This series focuses on the steps you can take to ensure rigorous, high-quality evaluations. Download the full series here.