EnronSR: A Benchmark for Evaluating AI-Generated Email Replies

Moran Shay, Roei Davidson, Nir Grinberg

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Human-to-human communication is no longer merely mediated by computers; it is increasingly generated by them, including on popular communication platforms such as Gmail, Facebook Messenger, LinkedIn, and others. Yet little is known about the differences between human- and machine-generated responses in complex social settings. Here, we present EnronSR, a novel benchmark dataset based on the Enron email corpus that contains both naturally occurring human replies and AI-generated replies for the same set of messages. This resource enables benchmarking of novel language-generation models in a public and reproducible manner, and facilitates comparison against the strong, production-level baseline of Google Smart Reply, used by millions of people. Moreover, we show that AI-generated responses could align more closely with human replies in terms of when a response is offered, its length, sentiment, and semantic meaning. We further demonstrate the utility of the benchmark in a case study of GPT-3, which shows significantly better alignment with human responses than Smart Reply, albeit with no guarantees of quality or safety.
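
As a rough illustration only (not the paper's evaluation code), the sketch below shows how a human reply and a model reply might be compared along the dimensions the abstract mentions: length, sentiment, and semantic meaning. The specific metrics and libraries (NLTK's VADER for sentiment, sentence-transformers for semantic similarity) and the function name compare_replies are hypothetical stand-ins, not taken from the paper.

```python
# Illustrative sketch: compare one human reply with one model reply on
# length, sentiment, and semantic similarity. Metric choices are assumptions,
# not the EnronSR paper's official procedure.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer, util

nltk.download("vader_lexicon", quiet=True)  # lexicon needed by VADER

sia = SentimentIntensityAnalyzer()
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def compare_replies(human_reply: str, model_reply: str) -> dict:
    """Return simple alignment measures between a human and a model reply."""
    # Length difference in tokens (whitespace split as a crude tokenizer).
    length_diff = abs(len(human_reply.split()) - len(model_reply.split()))
    # Sentiment difference via VADER compound scores in [-1, 1].
    sentiment_diff = abs(
        sia.polarity_scores(human_reply)["compound"]
        - sia.polarity_scores(model_reply)["compound"]
    )
    # Semantic similarity via cosine similarity of sentence embeddings.
    emb = encoder.encode([human_reply, model_reply], convert_to_tensor=True)
    semantic_sim = util.cos_sim(emb[0], emb[1]).item()
    return {
        "length_diff": length_diff,
        "sentiment_diff": sentiment_diff,
        "semantic_similarity": semantic_sim,
    }

print(compare_replies("Sounds good, see you at 3pm.",
                      "Great, 3pm works for me."))
```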
Original language: English
Title of host publication: Proceedings of the International AAAI Conference on Web and Social Media
Pages: 2063-2075
Number of pages: 13
Volume: 18
Edition: 1
DOIs
State: Published - 28 May 2024
