Human Evaluations No Longer the Gold Standard for NLG (AI Study)

     For years, natural language generation (NLG) researchers have been recruiting humans to evaluate their models’ text outputs. This practice rests on a reasonable assumption: since NLG aims to produce human-quality text, the judgement of human evaluators should be the gold standard for model performance. But in a new study, University of Washington and Allen Institute for Artificial Intelligence researchers argue that untrained humans are not the natural language experts we’d like to think they are.
     NLG models are rapidly improving in their ability to generate longer passages of context-conditioned text. The researchers note that human evaluators tend to focus on fluency when assessing such texts’ “humanlikeness,” i.e. whether a text was produced by a human or a machine. This poses a challenge for NLG model evaluation, as content-based errors can be much harder to detect than fluency-based errors.
     The paper All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text examines human evaluators’ ability to assess and differentiate human- and machine-authored text, and finds that humans’ reliance on surface-level text qualities negatively impacts their assessments of current NLG models’ capabilities. The study also proposes a number of methods for training human evaluators on such tasks.
     The team first explores how well untrained evaluators can distinguish between state-of-the-art machine-generated text and human-generated text across three domains: stories, news articles, and recipes. The machine-generated texts come from the 175B-parameter GPT-3 and GPT-2 XL large language models.
     The results show that evaluators correctly distinguished human-generated text from GPT-2-generated text in 57.9 percent of cases, but scored only 49.9 percent — below random chance — when asked to choose between human- and GPT-3-generated text. The disappointing results suggest that text evaluations from untrained humans are essentially unreliable at the upper bounds of NLG model performance.
     In a bid to boost human evaluators’ accuracy in identifying machine- vs. human-authored text, the team proposes three approaches for training evaluators: training with instructions, training with examples, and training with comparisons.
     In experiments conducted with Amazon Mechanical Turk human evaluators, the example- and comparison-based trainings achieved the highest recall and F1 scores for the evaluators’ judgments. The researchers note, however, that evaluator agreement (Krippendorff’s α, a measure of annotator agreement that corrects for the probability of random agreement) remained low across all three of the studied domains. Further, none of the three training methods significantly improved evaluators’ ability to detect machine-generated text, indicating that further development and refinement of evaluator training procedures may be required before humans can reliably evaluate state-of-the-art NLG models.
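To illustrate why Krippendorff’s α is used here rather than raw percent agreement, the sketch below computes α for nominal labels (e.g. “human” vs. “machine” judgments). This is an illustrative implementation only, not the paper’s analysis code; the function name and the input format (one list of labels per evaluated text) are assumptions.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, one per evaluated item, each containing the
    labels assigned by the annotators who rated that item.
    Items with fewer than two ratings are not pairable and are dropped.
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values

    # Observed disagreement: within each item, count ordered pairs of
    # ratings that disagree, normalized by (ratings_in_item - 1).
    d_obs = 0.0
    for u in units:
        counts = Counter(u)
        disagreeing_pairs = sum(counts[c] * counts[k]
                                for c in counts for k in counts if c != k)
        d_obs += disagreeing_pairs / (len(u) - 1)
    d_obs /= n

    # Expected disagreement: chance rate of disagreeing pairs given the
    # overall label distribution across all pairable values.
    totals = Counter(v for u in units for v in u)
    d_exp = sum(totals[c] * totals[k]
                for c in totals for k in totals if c != k) / (n * (n - 1))

    return 1.0 - d_obs / d_exp  # 1 = perfect agreement, 0 = chance level
```

Because α corrects for chance, two annotators who always disagree score below zero, and annotators who agree no more often than the label distribution predicts score near zero — which is why “low α” is a stronger statement than “low accuracy” about the reliability of the evaluators’ judgments.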
Jul 31st, 2021