by JOHN BYRON

The Australian Research Council has put a lot of thought and effort into its refresh of the Excellence in Research for Australia assessment exercise. It consulted widely in a comprehensive review over the last year or two, and has clearly devoted considerable internal and external expertise to the reform, drawing on some of the nation’s most experienced research leaders across the disciplines and a number of world-leading bibliometricians.

The resulting ERA 2023 Benchmarking and Rating Scale Consultation Paper marks the first serious rethink of how ERA operates since the results of the original ERA trial in 2009 informed the design of the first full round in 2010. If implemented, the proposed reforms will be the first significant modifications since journal rankings were dropped before the second national round in 2012.

Like any other public policy reform, the proposed changes need to answer several questions: are they necessary? Do they work? And are they worth it?

Some respondents to the 2020-21 ERA consultation (declaration of interest: including me) argued that ERA’s benefit to Australian research has all but plateaued, at the cost of as much work as ever, if not more. The architects of ERA, Professor Margaret Sheil and Leanne Harvey – then the ARC CEO and Executive General Manager, respectively – share this perspective, along with other experienced figures in university research policy.

On this view, ERA was instrumental in helping Australian researchers and research administrators refocus on research quality, in a system that had been heavily and unhelpfully obsessed with output quantity measures. With that pivot to quality achieved, it is reasonable to stop and ask whether a new round of ERA actually gives us enough back to justify all its expense and effort. (And it is a lot of expense and effort, not just for the ARC but for university administrators and – most of all – for researchers themselves.)

In other words, would it harm Australian research if we simply gave ERA a rest, and let researchers get on with, you know, researching?

The ARC has not openly pondered this issue, for the excellent reason that it was directed by its minister in his December letter of expectations to conclude the review pronto and run ERA again. Fair enough. With the more existential question moot, then, the present reform package is perhaps a response to the spectre of ERA’s diminishing returns. Since we are going around again, the paper silently asks, how can we get a better return on our mammoth collective investment?

The ARC’s answer to this challenge may be fairly technical but it is also quite inspired. The centrepiece is a realignment away from the sprawling middle of global research production onto the high quality top end, on the sound principle that at this point in our national university research trajectory we are now aiming to compete with the world’s best, not merely with the world’s average. Accordingly, the logic goes, the benchmarks need to be re-sighted towards the world’s top performers, and at higher definition to aid meaningful improvement.

When reading the consultation paper (and I recommend it – it’s a crisp 34 pages, including endpapers, detailed but eminently digestible) it is difficult not to be distracted by the prominent display of two options for implementation. This is not necessarily a trick, but it is better to set the options aside in the first instance, since they chiefly draw attention away from the two main reforms that feature in both – the introduction of the high performer benchmark, and the merger of the two lower ERA strata (one and two) into a single “below world average” band.

The latter change seems reasonable, on the face of it, since we are perhaps less interested in whether we’re a bit below world average or a lot below world average. It’s the same thinking behind the classic scale that appears on many university transcripts, differentiating between degrees of success (High Distinction / Distinction / Credit / Pass) while aggregating failure into a single Fail category. (Incidentally, the original ERA eschewed the alphabetical grading proposed in the present paper precisely to avoid the misleading echo of the classic school grading system.) The paper argues that Australian research has largely climbed out of these lower bands: accordingly, the proposed scales embody a view that it is less helpful to know whether we’re a bit sub-par or a lot sub-par than it is to make finer distinctions between performance in the upper echelons.

This makes particular sense in light of the explosion of research activity worldwide in recent years, which has generated a large volume of research outputs, much of it arguably of medium-to-low quality, thereby dragging down the world average.

This effect has accelerated the rise in our national performance relative to the global average, over and above the gains made through our objective increase in quality. Even without any national improvement, but particularly in light of it, there is good sense in benchmarking ourselves against the world’s top performers rather than the (declining) middle band.

How, though, do we do it? The ARC proposes the use of a more granular “high performer benchmark,” which compares Australian research to the top ten per cent of universities in the world in a given field. (The “world benchmark” – a modified version of the existing comparator – will also be deployed.) There are different methodologies for the so-called citation disciplines – roughly, STEM – and the peer-review disciplines – roughly, HASS – and they each bring their own challenges.
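
To make the shift concrete, here is a purely illustrative sketch – not drawn from the consultation paper, and not the ARC’s actual methodology – of how a world-average comparator and a top-decile “high performer” comparator might be derived from field-normalised citation scores. The scores, the unit of evaluation and the method itself are hypothetical.

```python
# Purely illustrative: hypothetical field-normalised citation scores, one per
# institution worldwide, for a single field of research. Not ARC data or method.
import statistics

world_scores = [0.4, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5, 2.3, 3.1]

# Existing-style comparator: the average across all institutions in the field.
world_benchmark = statistics.mean(world_scores)

# Proposed-style comparator: the average performance of the top ten per cent
# of institutions in the field.
ranked = sorted(world_scores, reverse=True)
top_decile = ranked[: max(1, len(ranked) // 10)]
high_performer_benchmark = statistics.mean(top_decile)

# A hypothetical Australian unit of evaluation, read against both yardsticks.
unit_score = 1.3
print(f"vs world average:   {unit_score / world_benchmark:.2f}x")
print(f"vs high performers: {unit_score / high_performer_benchmark:.2f}x")
```

On numbers like these a unit can sit comfortably above the world average while remaining well short of the top performers – which is precisely the distinction the finer-grained benchmark is designed to expose.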

For the citation disciplines, this attempt at greater granularity now compels attention to an issue we have pretty successfully ignored so far over the life of ERA: the fact that the various metrics are actually proxies for what we are trying to estimate – research quality – rather than the thing itself. They may well be a fair approximation, but we don’t actually know for sure, since: (a) we can’t measure quality directly; and (b) the metrics have not been systematically benchmarked against the best estimation of quality we have, which is peer review.

Take the infamous Wakefield paper of 1998 (I realise invoking this paper violates a kind of Godwin’s Law for metaresearch – call it Byron’s Law – but bear with me). Both before and after it was retracted by the Lancet, this deeply flawed paper attracted many mentions that could not be deployed as evidence of its quality – some highlighted its poor quality, while others were made by bad-faith actors in support of their own spurious agendas. Many of those mentions appeared in non-academic outlets, but many (especially of the former kind) contribute to the paper’s scholarly citation performance.

Alternatively, consider the comparative fortunes of a pretty good scientific paper in English led by an established, well-connected and highly visible American researcher from an Ivy League university, proposing an idea that usefully but incrementally advances their field; and a genuinely excellent paper presenting a breakthrough concept by a team of mid-career Filipino researchers at a provincial university, writing in Tagalog. We all know who will get the most profile: there is no way that the initial citation metrics of these two publications tell us anything useful about their relative quality. Even if the latter paper realises its full global potential in due course (and that’s a big if), it will take some time to do so – plausibly longer than an ERA-like reference period, for instance.

These may be limit cases but they illustrate some of the problems with handling proxy measures as though they themselves are the desired outcomes. Citation metrics are approximations of underlying quality that are subject to a whole lot of imperfections – some acknowledged, many invisible – that add uncertainty to their application as proxies.

We have been using this instrument without calibration from the beginning, relying on the sum of experts’ gut feel that the accuracy of citations in estimating quality is about right. Maybe that’s been good enough so far; maybe the flaws in the glass have not distorted our view all that much up until now. But with the high performance indicators placing the citations under much higher magnification, it is time to ask whether the looking glass is up to the task. In other words, is the instrument sufficiently accurate to handle this much precision? The honest answer is, nobody knows. We’re just guessing. They may be pretty good guesses, but at some level of granularity we need more certainty than that.

The ARC paper attempts to address this problem through guidance, including the provision of “characteristics”, descriptors designed to help panels align the outputs to the relevant band. The characteristics currently hinge on fairly nebulous hierarchical terms such as “exceptional”, “forefront”, “world-leading” and “clearly ahead”. To be useful they will need much closer definition, illumination with specific real-world examples, or both. But who would draft such revisions? By what means would they be calibrated to the desired standards of quality across the corresponding stratum of high-performing global universities? Would these rules of thumb – which would serve as a kind of peer review ready reckoner – be compiled by scientists and scholars (which is to say, peers) or by ARC staff? How would they be tested? Quite a lot rides on these considerations.

At first glance one might think the situation is even more challenging on the peer review side, but the ARC’s optimism is perhaps supported by the proposition that ERA peer reviewers have been concentrating on the upper end all along. The reasonable premise is that researchers’ typical bedtime reading is drawn from the best global scholarship in their fields rather than from the middling output, meaning their feel for “world average” is in fact already calibrated to a higher-than-average standard. (That is one of several plausible explanations for why the peer review disciplines’ performance has risen more steadily than the citation disciplines’ rather more meteoric climb in recent ERA rounds, alongside the flooding argument adduced above and the additional suggestion that peer review is less susceptible to gaming than the citation system.)

If ERA has been operating a form of high(ish)-performance peer review all along, it follows that it has been marking down the peer review disciplines overall relative to the citation fields, because Australian research in these fields has typically been compared to (say) the top quartile rather than to the entire collected mass of global output – a tall spire of very good to truly excellent research towering above a sprawling base of middling to mediocre work – as the citation disciplines are. It could be a big ask for the ARC (and perhaps much of the sector) to accept that proposition, although the paper itself hints in that direction with the observation that “peer review is already capable of identifying world leading research” (p.12), a statement that has no equivalent on the citation side.

As bitter a pill as it may be to swallow, it seems to me we cannot hope to have a genuinely productive high performance peer review discussion without first engaging in such a recalibration of how the system has been working so far for the peer review fields.

In light of all these challenges, it is difficult to see how the ARC could implement its proposed reforms with the required degree of rigour without a significant quantum of additional work (such as a blind peer review process to calibrate the high performer citation indicators). But if reform it must be, then Option A (with five rather than six final bands) would carry less risk of introducing error through potentially unreliable over-magnification. (And whichever option is adopted, the inclusion of dedicated guidance for the assessment of research in FoR 45 Indigenous Studies is a significant advance on previous approaches.)

Even with these reservations resolved, I remain unconvinced that the reforms would inject enough new life into ERA to make another round worthwhile. That’s not a criticism of the ARC – the option of pausing ERA is not available to it at present – or of its reform proposals, which are imaginative and carefully considered. But an incoming government would be well advised to take a step back, rather than steaming relentlessly ahead, and consider what it hopes to achieve by holding another round. The ARC’s resources have been severely squeezed in recent years, so the staff time saved by a pause on ERA would be a huge boon to the grants programme; and the nation’s researchers might welcome the opportunity to get back to the bench or the archive, the lab or the library, the field or the survey group, to keep producing more of this world-beating Australian research.

Dr John Byron is Principal Policy Adviser at QUT

