Google researchers propose a two-step method to make human translation evaluation more reliable — without doubling costs.
Normalisation adjusts scores based on percentile ranks to offset differences in question difficulty across shifts.
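The source does not spell out the exact procedure, so the following is only a minimal sketch of one common way to normalise by percentile rank. It assumes each score is tagged with the shift it came from and is replaced by its percentile rank within that shift, with ties handled by the mid-rank convention. The function name `percentile_rank_normalise` and the shift labels are hypothetical and used for illustration only.

```python
from collections import defaultdict


def percentile_rank_normalise(scored_items):
    """Replace each raw score with its percentile rank within its own shift.

    scored_items: list of (shift_label, raw_score) pairs (hypothetical layout).
    Returns a list of percentile ranks in [0, 100], in the input order.
    """
    # Collect the raw scores observed in each shift.
    by_shift = defaultdict(list)
    for shift, score in scored_items:
        by_shift[shift].append(score)

    ranks = []
    for shift, score in scored_items:
        values = by_shift[shift]
        below = sum(1 for v in values if v < score)
        equal = sum(1 for v in values if v == score)
        # Mid-rank convention: tied scores share the middle of their rank range.
        ranks.append(100.0 * (below + 0.5 * equal) / len(values))
    return ranks


# Illustrative data only: shift "b" scores sit 20 points higher than shift "a".
example = [
    ("a", 60), ("a", 70), ("a", 80),
    ("b", 80), ("b", 90), ("b", 100),
]
normalised = percentile_rank_normalise(example)
for (shift, raw), rank in zip(example, normalised):
    print(shift, raw, round(rank, 1))
```

In this toy data a raw score of 80 is the highest in shift "a" but the lowest in shift "b", so after normalisation it receives a high rank in the first shift and a low rank in the second. That is the sense in which ranking within a shift offsets differences between shifts.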