Abstract
Neural network language models appear to be increasingly aligned with how humans process and generate language, but identifying their weaknesses through adversarial examples is challenging due to the discrete nature of language and the complexity of human language perception. We bypass these limitations by turning the models against each other. We generated controversial sentence pairs for which two language models disagree about which sentence is more likely to occur. Considering nine language models (including n-gram models, recurrent neural networks and transformers), we created hundreds of controversial sentence pairs through synthetic optimization or by selecting sentences from a corpus. Controversial sentence pairs proved highly effective at revealing model failures and at identifying the models that aligned most closely with human judgements of which sentence is more likely. The most human-consistent model tested was GPT-2, although experiments also revealed substantial shortcomings in its alignment with human perception.
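The controversial-pair criterion can be illustrated with a minimal sketch. It assumes two toy add-one-smoothed n-gram models (a unigram and a bigram model) trained on a hypothetical four-sentence corpus, standing in for the nine models studied, and scores each sentence by its summed token log-probability. The names `UnigramLM`, `BigramLM`, `is_controversial`, `find_controversial_pairs` and `human_agreement` are illustrative, not the authors' code; the sketch covers only corpus-style selection of disagreeing pairs and a simple agreement rate with hypothetical human choices, not the paper's synthetic optimization or statistical analysis.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy training corpus (illustrative only, not data from the study).
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
    "the dog chased a cat",
]


def tokenize(sentence):
    return sentence.split()


class UnigramLM:
    """Add-one-smoothed unigram model: ignores word order entirely."""

    def __init__(self, corpus):
        tokens = [w for s in corpus for w in tokenize(s)]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for an unseen word

    def log_prob(self, sentence):
        return sum(math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
                   for w in tokenize(sentence))


class BigramLM:
    """Add-one-smoothed bigram model with a sentence-start token (no end token)."""

    def __init__(self, corpus):
        self.bigrams, self.contexts, vocab = Counter(), Counter(), set()
        for s in corpus:
            words = ["<s>"] + tokenize(s)
            vocab.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.vocab_size = len(vocab) + 1  # +1 for an unknown-word slot

    def log_prob(self, sentence):
        words = ["<s>"] + tokenize(sentence)
        return sum(math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + self.vocab_size))
                   for p, c in zip(words, words[1:]))


def prefers_first(model, s1, s2):
    """True if the model assigns s1 a higher log-probability than s2."""
    return model.log_prob(s1) > model.log_prob(s2)


def is_controversial(model_a, model_b, s1, s2):
    """A pair is controversial when the two models prefer different sentences."""
    return prefers_first(model_a, s1, s2) != prefers_first(model_b, s1, s2)


def find_controversial_pairs(model_a, model_b, candidates):
    """Corpus-style selection: keep every candidate pair the models disagree on."""
    return [(s1, s2) for s1, s2 in combinations(candidates, 2)
            if is_controversial(model_a, model_b, s1, s2)]


def human_agreement(model, trials):
    """Fraction of trials (s1, s2, choice in {1, 2}) where the model matches the human choice."""
    hits = sum((1 if prefers_first(model, s1, s2) else 2) == choice
               for s1, s2, choice in trials)
    return hits / len(trials)


if __name__ == "__main__":
    uni, bi = UnigramLM(CORPUS), BigramLM(CORPUS)
    pair = ("the the the", "the cat sat")
    print(is_controversial(uni, bi, *pair))  # True: unigram prefers the first, bigram the second
```

Here the unigram model favours "the the the" because "the" is the most frequent token, whereas the bigram model favours "the cat sat" because its word transitions occur in the corpus. Summed log-probabilities penalize longer sentences, which is why the example pair has equal length.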
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 952-964 |
Number of pages | 13 |
Journal | Nature Machine Intelligence |
Volume | 5 |
Issue number | 9 |
State | Published - 1 Sep 2023 |
ASJC Scopus subject areas
- Software
- Human-Computer Interaction
- Computer Vision and Pattern Recognition
- Computer Networks and Communications
- Artificial Intelligence