[Submitted on 7 Dec 2022]
Abstract: Existing techniques for training language models can be misaligned with the
truth: if we train models with imitation learning, they may reproduce errors
that humans make; if we train them to generate text that humans rate highly,
they may output errors that human evaluators can’t detect. We propose
circumventing this issue by directly finding latent knowledge inside the
internal activations of a language model in a purely unsupervised way.
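
To make the idea of unsupervised latent-knowledge probing concrete, here is a minimal sketch in PyTorch: a linear probe is fit on paired hidden activations for a statement phrased as true and as false, using only logical-consistency constraints rather than labels. The function name, loss terms, and hyperparameters are illustrative assumptions for exposition, not the paper's exact method.

```python
# Sketch: train a probe on unlabeled activation pairs using consistency alone.
# All names and loss terms are assumptions, not the paper's verbatim procedure.
import torch
import torch.nn as nn

def train_consistency_probe(acts_pos, acts_neg, epochs=1000, lr=1e-3):
    """acts_pos / acts_neg: (n, d) hidden activations for each statement
    phrased as "Yes" and as "No". Only the pairing is used, never labels."""
    d = acts_pos.shape[1]
    probe = nn.Linear(d, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        p_pos = torch.sigmoid(probe(acts_pos))  # P(true | "Yes" phrasing)
        p_neg = torch.sigmoid(probe(acts_neg))  # P(true | "No" phrasing)
        # Consistency: a statement and its negation should receive
        # complementary probabilities.
        consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
        # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
        confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```

Note that nothing in this objective references model outputs or ground-truth answers; the only supervision-like signal is the structural fact that each "Yes"/"No" pair must disagree, which is what lets the probe be trained on unlabeled activations.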