
Copilot comes to Canberra: Evaluating the world’s largest whole-of-government AI trial


Artificial intelligence in the workplace is here, and organisations both public and private are looking to maximise its benefits.

In November 2023, the Australian Government announced a six-month, whole-of-government trial of Microsoft 365 Copilot, a generative AI chatbot and assistant that uses a combination of large language models to understand, summarise, predict, and generate content. The trial ran from January to June 2024 and included more than 5,000 APS staff across almost 60 participating agencies.

We were engaged by the Digital Transformation Agency to evaluate the trial, testing the extent to which the promise of AI had translated into real-world adoption by public servants. The results would help the Australian Government consider future opportunities and challenges related to AI adoption.

“The Australian government’s Copilot trial was the world’s largest public sector trial of generative AI,” says Nous principal Will Prothero. “It was particularly exciting in that it included participation from so many different agencies, which presented an amazing opportunity to gain insight into both the distinct and common benefits and challenges of AI adoption simultaneously, from dozens of different but related workforces.”

Artificial intelligence in the cockpit

Copilot was selected as the subject of the trial for several reasons. In addition to being deemed a suitable proxy for generative AI in general, offering comparable features to other off-the-shelf AI products, Copilot was already available in many agencies and could be rapidly deployed across the broader APS. Its use could also be controlled and monitored.

The trial was non-randomised, with agencies nominating staff to take part. Trial participants comprised a range of APS classifications, job families, AI experience levels, and expectations. Participants were encouraged to use Copilot in a range of business-as-usual activities:

  • Content generation – Drafting documents, emails, and PowerPoint presentations
  • Summarisation and theming – Providing overviews of meetings, documents, and email threads, and identifying key messages
  • Task management – Suggesting follow-up actions and next steps from meetings, documents, and emails
  • Data analysis – Creating formulas, analysing data sets, and producing visualisations

The trial also had clearly defined objectives, which were the subject of our subsequent evaluation.

Table outlining the Copilot evaluation objectives and lines of enquiry

Evaluating Copilot’s flight

To ensure breadth and depth of insight, we took a mixed-methods approach to our evaluation, which assessed the use, benefits, risks and unintended outcomes of Copilot in the APS during the trial.

“A mixed-methods approach, which utilises both quantitative and qualitative research techniques, was ideal for evaluating the Copilot trial because it provides a comprehensive understanding of both measurable outcomes and nuanced contextual insights,” says Prothero. “This was particularly important to ensure that the varied experiences and impacts of AI adoption across public servants and agencies were represented in the evaluation and its findings.”

“This also provides us with rich insight for more effective and tailored strategies that could be applied uniquely to different workforces,” he says.

Our qualitative and quantitative data collection methods included a centralised issues register, outreach interviews during the initial stages of the trial, pre-use, post-use and pulse surveys, post-trial interviews with key stakeholders, and focus groups. All told, more than 2000 trial participants took part in the evaluation.

Copilot users from both this evaluation and agency-specific evaluations consistently reported quality and efficiency improvements in three key areas: summarisation of content, creating first drafts, and information searches. Trial participants estimated efficiency gains of around an hour when completing one of these three tasks, with participants at junior levels (APS3-6), EL1s, and those in information and communications technology roles perceiving the most efficiencies in these activities.

In addition, 40 per cent of post-use survey respondents reported they were able to reallocate their time to higher-value activities such as staff engagement, culture building and mentoring, and building relationships with end users and stakeholders. This tracks with some of our own thinking on AI and its potential to liberate knowledge workers from drudgery and allow them to more fully engage in the uniquely human aspects of their work.

We suspect that future trials will show even greater productivity gains.

“It's likely the efficiencies reported in the trial are at the lower end of what is possible,” says project director Virginia Wong. “We probably won’t see the full productivity benefits of generative AI until it is embedded in key workflows.”

The sky’s the limit?

While trial participants across all job classifications and job families were satisfied with Copilot, and the majority wished to continue using it, the adoption of AI – in the public service and elsewhere – requires a concerted effort to address technical, cultural, and capability barriers in order to improve usage.

Our evaluation found that agencies faced both technical and cultural adoption challenges during the trial. Capability challenges were also highlighted, with trial participants requiring tailored training in agency-specific use cases as well as general generative AI training.

Cultural barriers included stigma around the use of AI – no one wants to be perceived as a slacker – while some participants reported feeling uncomfortable about being recorded and transcribed during meetings. Interviews with government agencies highlighted that generative AI may have a large impact on the composition of APS jobs and skills, especially for women and junior staff, who are perceived to be at a greater risk of job displacement. Trial participants noted the need for clear guidance and information regarding their accountabilities and acknowledged the need to have change management supports in place, including ‘champions’ within the workplace, who could demonstrate generative AI's benefits and drive adoption.

These are challenges that all organisations beginning their AI adoption journeys will need to face. In the case of the APS, agencies will need to carefully weigh the potential benefits of efficiency and quality improvements against the costs, risks, and suitability of generative AI to meet their agency’s needs.

“Ultimately this is a leadership challenge,” says Nous’ chief data scientist David Diviny. “Leaders who are curious and encourage ‘deliberate practice’ will lead the way. The costliest mistake a leader can make is to perceive the rise of generative AI as a technical problem, rather than an adaptive leadership challenge.”

Where to next?

Our evaluation of the whole-of-government Copilot trial was published in October 2024. In our report, we made a number of recommendations to the APS on AI implementation, adoption, and risk management. Our work will be used to inform future iterations of the Policy for the Responsible Use of AI in Government, which came into effect in September 2024.

The growing availability and speed of uptake in publicly available AI tools meant the APS, like other organisations, had to get on the front foot, and quickly. Its decision to undertake such a trial, and to have it rigorously evaluated, was the right one.

“It is critical that the Australian government maintains trust throughout its adoption and use of new technologies,” says Prothero. “There is, appropriately, a lot of public concern about the government’s current and future use of generative AI. This includes concerns about information security and privacy, inappropriate use of people’s data, and the potential uses of AI in government decision-making.”

“In this context, it was critical that a highly robust evaluation was undertaken by an expert evaluation partner.”

There is already much deep thinking underway about how the public service might ensure the trustworthiness of its advice as its adoption of AI advances. It is indicative of how seriously the APS is taking the issue that its first report in a new series of long-term insights briefings, ‘How might artificial intelligence affect the trustworthiness of public service delivery?’, is dedicated to the topic. There remains a lot of thinking to be done about this and other AI-related issues.

For this reason and more, it is important to remember that this was only one trial of a generative AI tool by the Australian Government. As other tools become available, and the landscape of use cases only becomes broader, the need to maximise productivity gains and remain competitive will become ever more pressing.

“This is genuinely an exciting time to be alive,” says Diviny, an optimist if ever there was one. “The innovations that generative AI make possible are limited mostly by our imagination and our willingness to change the way we do things.”

In other words, this was a glimpse at the future. There is still a lot of future to come.

Key findings from our evaluation

  • Daily usage – 1 in 3 participants used Copilot daily during the trial
  • Productivity gains – 69 per cent reported improved task completion speed and 61 per cent saw enhanced work quality
  • Time savings – Participants saved up to one hour per day, reallocating this time to higher-value activities
  • High satisfaction – 77 per cent were satisfied with the AI tool and 86 per cent wanted to continue using it beyond the trial
  • High adoption effort – Concerted effort is required to address technical, legal, cultural, and governance barriers, and to continuously build staff capability to drive adoption
  • Inclusivity and accessibility – Improvements are expected in this area, particularly for people who are neurodiverse, have a disability, or are from a culturally and linguistically diverse background
  • Job and skill concerns – These concerns are real, particularly for those in administrative roles and entry-level positions, marginalised groups, and women, while there are also concerns about a general erosion of writing skills

AI at Nous Group

At Nous, our AI Studio leads the development of internal tools that embed GenAI in group-wide processes, while also creating bespoke, project-specific solutions upon request from individuals and teams. Our leaders encourage AI adoption through communication and behaviour modelling. We are also playing a leading role in broader public discussions about AI. In 2024, we hosted the first of our Human Side of Generative AI Summits for executive-level leaders and sponsored the CEDA National AI Leadership Summit. We are a member of aicolab.org and regularly publish cutting-edge thinking about how organisations can reap the benefits of this exciting moment.

What you can learn from our work with the Digital Transformation Agency

Robust evaluation of digital initiatives provides pragmatic, actionable and evidence-based insights. Leading organisations are increasingly applying this level of rigour, but many organisations are not, missing material opportunities to derive value from their investments.

Generative AI has amazing potential to improve workforce productivity by augmenting specific types of activities, including content summarisation, report drafting, and information searches. But equity and ethical usage also need to play a role in any smart adoption of the technology.

Large-scale evaluations can often provide insights beyond the scope and terms of the project in question. This project provided deep, evidence-based insights into generative AI adoption by more than 50 different workforces. Such insights can serve as a wellspring of knowledge to inform future projects and evaluations.