Google researchers propose a two-step method to make human translation evaluation more reliable — without doubling costs.
Normalisation adjusts scores based on percentile ranks to offset differences in question difficulty across shifts.
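The idea of percentile-rank normalisation can be illustrated with a minimal sketch. The exact formula is not given here, so this assumes the simplest definition: a score's percentile rank is the percentage of scores in the same shift that are less than or equal to it. The function name `percentile_ranks` and the two sample shifts are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Return, for each score, the percentage of scores in the same
    shift that are less than or equal to it (an assumed definition)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, s) / n for s in scores]


# Hypothetical raw scores from two shifts of differing difficulty.
shift_a = [40, 55, 70, 85]  # assumed harder shift, lower raw scores
shift_b = [60, 75, 90, 95]  # assumed easier shift, higher raw scores

ranks_a = percentile_ranks(shift_a)
ranks_b = percentile_ranks(shift_b)

for raw_a, raw_b, rank in zip(shift_a, shift_b, ranks_a):
    print(f"raw {raw_a} (shift A) and raw {raw_b} (shift B) -> percentile {rank}")
```

Under this assumption, candidates in the same relative position within their own shift receive the same percentile, even though their raw scores differ.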