CIRCUMVENTING CONCEPT ERASURE METHODS FOR TEXT-TO-IMAGE GENERATIVE MODELS

Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde

Research output: Contribution to conferencePaperpeer-review

Abstract

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. Yet, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to “erase” sensitive concepts from text-to-image models. In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we devise an algorithm to learn special input word embeddings that can retrieve “erased” concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

Original languageEnglish
StatePublished - 1 Jan 2024
Externally publishedYes
Event12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria
Duration: 7 May 202411 May 2024

Conference

Conference12th International Conference on Learning Representations, ICLR 2024
Country/TerritoryAustria
CityHybrid, Vienna
Period7/05/2411/05/24

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'CIRCUMVENTING CONCEPT ERASURE METHODS FOR TEXT-TO-IMAGE GENERATIVE MODELS'. Together they form a unique fingerprint.

Cite this