Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
Overview of CGCE. Stage 1: An LLM is used to create a dataset of paired prompts, each containing a safe prompt and a semantically similar unsafe version. Stage 2: A lightweight classifier is trained on the embeddings of these prompts to distinguish between safe and unsafe content. Stage 3: At inference time, the trained classifier acts as a plug-and-play safeguard. If an input prompt is safe, its embedding is passed directly to the generative model. If it is unsafe, the classifier acts as a refiner, using its own gradients to iteratively modify the embedding. This process steers the embedding away from the harmful concept before it is passed to the T2I or T2V model, ensuring a safe final output.
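The Stage 3 refinement can be pictured with a short sketch. The classifier architecture, mean pooling, and the `step_size`, `num_steps`, and `threshold` hyperparameters below are illustrative assumptions rather than the paper's exact configuration; the sketch only shows the general idea of gradient-based refinement of text embeddings guided by one or more concept classifiers.

```python
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    """Lightweight MLP scoring pooled text embeddings for one undesired concept.
    Architecture and pooling are illustrative, not the paper's exact design."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, seq_len, embed_dim) prompt embeddings from the text encoder
        pooled = text_embeds.mean(dim=1)           # simple mean pooling over tokens
        return self.net(pooled).squeeze(-1)        # unsafe-concept logit per prompt


@torch.enable_grad()
def refine_embedding(text_embeds, classifiers, step_size=0.1, num_steps=20, threshold=0.5):
    """Stage-3 sketch: if any classifier flags the prompt as unsafe, descend the
    aggregated unsafe scores w.r.t. the embedding until all classifiers judge it safe.
    step_size, num_steps, and threshold are hypothetical hyperparameters."""
    embeds = text_embeds.detach().clone().requires_grad_(True)
    for _ in range(num_steps):
        logits = torch.stack([clf(embeds) for clf in classifiers])  # (num_concepts, batch)
        probs = torch.sigmoid(logits)
        if probs.max() < threshold:       # all concepts judged safe: stop refining
            break
        loss = probs.sum()                # aggregate guidance across all concept classifiers
        grad, = torch.autograd.grad(loss, embeds)
        with torch.no_grad():
            embeds -= step_size * grad    # steer the embedding away from unsafe concepts
        embeds.requires_grad_(True)
    return embeds.detach()                # refined embedding passed to the T2I/T2V model
```

In this picture, the refined embedding simply replaces the original prompt embedding in the generation pipeline, which is what makes the approach plug-and-play: the generative model's weights are never modified, and safe prompts bypass the refinement loop entirely.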
Attack Success Rate (ASR) and generation quality comparison of concept erasure methods for the nudity removal task.
ASR comparison of concept erasure methods for the Van Gogh art style removal task under the UDA attack.
ASR comparison of concept erasure methods for the Church removal task under the UDA attack.
ASR comparison on T2I architectures.
ASR comparison on T2V architectures.
Efficiency comparison for the nudity removal task with the SD-v1.4 backbone. All experiments were run on a single NVIDIA A6000 GPU; reported times are averages over 50 generated images with 50 denoising steps.
Ablation study on the step size of the refinement process.
Qualitative evaluation of CGCE and other concept erasure methods on the SD-v1.4 backbone. The figure compares performance on nudity erasure, evaluating each method's ability to remove the target concept while preserving unrelated ones. Sensitive content (*) has been masked for publication.
Qualitative evaluation of CGCE and other concept erasure methods on the SD-v1.4 backbone. The figure compares performance on Van Gogh style erasure (left) and Church object erasure (right), evaluating each method's ability to remove the target concept while preserving unrelated ones.
Qualitative evaluation of CGCE's effectiveness in erasing multiple concepts, compared to baseline methods. Sensitive content (*) has been masked for publication.
Qualitative evaluation of CGCE's effectiveness in erasing nudity concepts, compared to baseline methods across different T2I architectures. Sensitive content (*) has been masked for publication.
Qualitative evaluation of CGCE's effectiveness in erasing nudity concepts, compared to baseline methods across different T2V architectures. Sensitive content (*) has been masked for publication.
@article{nguyen2025cgce,
title={CGCE: Classifier-Guided Concept Erasure in Generative Models},
author={Nguyen, Viet and Patel, Vishal M},
journal={arXiv preprint arXiv:2511.05865},
year={2025}
}