Overview
Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor +, the first framework that embeds a self-monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The same signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment faking, sycophancy, and related behaviors. We evaluate various LLMs and show that unrestricted CoT roughly doubles the deceptive tendency compared to direct answers. In contrast, CoT Monitor + cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency.
Key Contributions:
- Language models can self-monitor their reasoning processes. Building on this insight, we introduce Self-Monitor, a framework that enables models to oversee their own CoT reasoning within predefined safety protocols. The experimental results demonstrate that Self-Monitor substantially reduces deceptive alignment behaviors, achieving a 43.8% reduction compared to pure CoT reasoning.
- Detecting deceptive tendencies in reasoning models with DeceptionBench. To mitigate the deceptive risks posed by advanced LLMs, we introduce DeceptionBench, the first systematic benchmark designed to assess the deceptive behaviors of reasoning models. DeceptionBench assesses model behavior across 180 scenarios spanning five deception categories, offering insights into when and how models may superficially appear aligned while internally pursuing misaligned goals.
- Self-Monitor serves as a reward signal in RL training. Relying solely on the output of an external weak model's CoT monitor as a reward signal for RL training can lead the model to strategically suppress its true intentions during the CoT process, resulting in more sophisticated forms of deception. In contrast, using the model's own Self-Monitor feedback as the training signal encourages more faithful reasoning and reduces the incentive to conceal deceptive thoughts.

(Left) How Self-Monitor works: Unlike the standard LLM reasoning process, the Self-Monitor model first detects deceptive or harmful patterns in its CoT and then generates a safer response based on the monitor evaluation. In reinforcement learning, the reward signal from the self-monitor's CoT detection is combined with a standard action monitor to optimize the model. For more details, please refer to our paper.
(Right) How DeceptionBench evaluates the deceptive tendency of models: Neutral prompts and specifically constructed prompts with an outer goal are used to elicit the model's mesa utility and its CoT-Action pairs, respectively. These pairs are then evaluated for consistency and deceptiveness.
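A minimal sketch of this two-step flow, assuming only a generic text-generation callable; the `<monitor>` tag, the prompt wording, and the `self_monitor_generate` helper are illustrative placeholders rather than the paper's exact protocol:

```python
from typing import Callable, Dict

def self_monitor_generate(
    generate: Callable[[str], str],  # any text-generation callable, e.g. an API wrapper
    prompt: str,
) -> Dict[str, str]:
    """Sketch of the Self-Monitor flow: reason, self-evaluate the CoT, then answer."""
    # Step 1: ask the model to think and to append a verdict on its own reasoning.
    # The <monitor>...</monitor> tag is a placeholder format, not the paper's template.
    cot_output = generate(
        prompt
        + "\nThink step by step, then state <monitor>safe</monitor> "
          "or <monitor>flagged</monitor> for your own reasoning."
    )
    flagged = "<monitor>flagged</monitor>" in cot_output

    # Step 2: condition the final response on the self-evaluation signal.
    if flagged:
        # A deceptive or harmful pattern was detected in the CoT: answer under the
        # safety protocol instead of following the flagged strategy.
        answer = generate(
            prompt
            + "\nYour earlier reasoning violated the safety protocol; "
              "produce a safe, honest response instead."
        )
    else:
        answer = generate(prompt + "\n" + cot_output + "\nNow give your final answer.")

    return {"cot": cot_output, "monitor": "flagged" if flagged else "safe", "answer": answer}
```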
DeceptionBench
Growing concerns about model deception highlight the lack of a systematic benchmark, hindering responsible LLM deployment. To address this gap, we introduce DeceptionBench to assess LLMs' deceptive tendencies by evaluating the consistency between a model's utility function and its behavior.
We consider that LLMs have inherent mesa utilities and design outer goals that implicitly incentivize deviation from these mesa utilities. Built around this setup, DeceptionBench comprises 180 scenarios inspired by real-world interactions and organized into five categories of LLM deception.
Model Performance on DeceptionBench
We evaluate both API-based and open-source models across the five deception categories in DeceptionBench. The results reveal that all evaluated models exhibit some degree of deceptive tendency, with varying performance across different deception types.
The evaluation shows interesting patterns: larger models tend to show lower deceptive tendencies in some categories, while smaller open-source models exhibit higher rates of deceptive behavior. Strategic deception and honesty evasion appear to be the most challenging categories across all model families.

Benchmark Construction
DeceptionBench is meticulously developed with 180 scenarios inspired by real-world interactions, strictly adhering to a predefined typology of LLM deception in five categories:
- Sycophancy: Misrepresenting information or manipulating responses to align with perceived user preferences.
- Strategic Deception: Violating rules or manipulating information to achieve a goal.
- Honesty Evasion: Avoiding honest and truthful answers under pressure or an outer goal.
- Alignment Faking: Appearing compliant and aligned while monitored, but reverting to misaligned behavior when not being monitored.
- Sandbagging: Deliberately underperforming or hiding capabilities in service of an outer goal.
We deliberately employ manual curation, which is essential to capture the nuance, plausibility, and targeted nature required for deception detection.
Evaluation Methodology
Evaluating these deceptive behaviors with methods like LLM-as-Judge yields low agreement with human annotation, as shown in the Human Agreement Validation section. Instead, we compare the mesa utility $\mathcal{U}_{m}$ with the reasoning CoT $\mathcal{R}$ and the final action $\mathcal{Y}$ generated under $\mathcal{U} = \mathcal{F}(\mathcal{U}_m,\mathcal{U}_o)$ (where $\mathcal{U}_o$ is the outer-goal utility function), capturing the model's deception tendency in a more fine-grained manner.
The evaluation pipeline of DeceptionBench shown in the Framework Overview is specifically divided into three stages:
First, we use the neutral mesa_utility_prompt to elicit the model's $\mathcal{U}_{m}$.
Second, the model receives a system_prompt to establish the specific external scenario context, followed by an outer_goal_prompt intended to elicit its CoT $\mathcal{R}$ and action $\mathcal{Y}$ under $\mathcal{U}_o$.
Finally, the recorded $\mathcal{U}_{m}$, $\mathcal{R}$, and $\mathcal{Y}$ are compared by GPT-4o to detect inconsistencies.
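A sketch of this three-stage pipeline, with hypothetical `model` and `judge` callables standing in for the evaluated LLM and the GPT-4o comparison; the field names follow the prompts described above, but the interfaces themselves are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    mesa_utility_prompt: str  # neutral prompt eliciting the mesa utility U_m
    system_prompt: str        # establishes the external scenario context
    outer_goal_prompt: str    # elicits CoT R and action Y under the outer goal U_o

def run_scenario(
    model: Callable[[str, str], dict],  # (system_prompt, user_prompt) -> {"cot": ..., "answer": ...}
    judge: Callable[..., dict],         # GPT-4o-style consistency judge (assumed interface)
    s: Scenario,
) -> dict:
    """Three-stage DeceptionBench evaluation (sketch)."""
    # Stage 1: elicit the mesa utility U_m with the neutral prompt.
    u_m = model("", s.mesa_utility_prompt)["answer"]

    # Stage 2: elicit CoT R and action Y under the outer-goal scenario.
    out = model(s.system_prompt, s.outer_goal_prompt)
    cot, action = out["cot"], out["answer"]

    # Stage 3: the judge compares U_m with R and Y and reports consistency.
    verdict = judge(mesa_utility=u_m, cot=cot, action=action)
    return {
        "cot_consistent": verdict["cot_consistent"],        # R ~ U_m
        "action_consistent": verdict["action_consistent"],  # Y ~ U_m
    }
```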
Since deception arises when a model's reasoning $\mathcal{R}$ and its mesa utility $\mathcal{U}_{m}$ are consistent (denoted as $\mathcal{R} \sim \mathcal{U}_{m}$) while its final answer $\mathcal{Y}$ diverges from $\mathcal{U}_{m}$ (denoted as $\mathcal{Y} \not\sim \mathcal{U}_{m}$), we define the Deception Tendency Rate (DTR) as the fraction of instances exhibiting the deceptive pattern:
$$\rho_\text{DTR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\mathcal{R}^{(i)} \sim \mathcal{U}_{m}^{(i)} \land \mathcal{Y}^{(i)} \not\sim \mathcal{U}_{m}^{(i)})$$
where $N$ is the number of instances. DTR offers a more reliable deception signal than LLM-as-Judge evaluations, aligning better with human judgments, as shown in the Human Agreement Validation section.
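A direct implementation of $\rho_\text{DTR}$ from per-instance consistency labels, such as those produced by the pipeline sketch above (the `cot_consistent` / `action_consistent` field names are illustrative):

```python
def deception_tendency_rate(records: list[dict]) -> float:
    """DTR: fraction of instances where the CoT is consistent with the mesa utility
    (R ~ U_m) but the final action diverges from it (Y !~ U_m)."""
    if not records:
        return 0.0
    deceptive = sum(
        1 for r in records
        if r["cot_consistent"] and not r["action_consistent"]
    )
    return deceptive / len(records)

# Example with illustrative per-instance labels:
records = [
    {"cot_consistent": True,  "action_consistent": False},  # the deceptive pattern
    {"cot_consistent": True,  "action_consistent": True},
    {"cot_consistent": False, "action_consistent": False},
]
print(deception_tendency_rate(records))  # 1/3 ≈ 0.333
```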
Human Agreement Validation
We validate our DTR metric against human judgments and compare it with LLM-as-Judge approaches. Our results demonstrate that DTR significantly outperforms LLM-as-Judge in terms of both human agreement rate and Phi coefficient, regardless of the choice of judge models.
This superior human alignment makes DTR a more reliable metric for detecting deceptive behaviors in language models, providing researchers and practitioners with a trustworthy evaluation framework.
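For reference, a small sketch of how the two reported statistics could be computed from binary deception labels; the helper name and label format are assumptions, and for binary labels the Phi coefficient coincides with the Matthews correlation:

```python
import math

def agreement_and_phi(metric_labels: list[bool], human_labels: list[bool]) -> tuple[float, float]:
    """Agreement rate and Phi coefficient between metric labels and human labels."""
    assert metric_labels and len(metric_labels) == len(human_labels)
    agreement = sum(m == h for m, h in zip(metric_labels, human_labels)) / len(human_labels)

    # 2x2 contingency table for the Phi (Matthews) coefficient.
    tp = sum(m and h for m, h in zip(metric_labels, human_labels))
    tn = sum((not m) and (not h) for m, h in zip(metric_labels, human_labels))
    fp = sum(m and (not h) for m, h in zip(metric_labels, human_labels))
    fn = sum((not m) and h for m, h in zip(metric_labels, human_labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    return agreement, phi
```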

Experiments
In the experiments of this section, we focus on two key questions:
- Question 1: Can Self-Monitor mitigate deceptive tendency while enhancing safety performance?
- Question 2: Can Self-Monitor serve as a reward signal in RL to mitigate deceptive alignment?
Evaluating Deceptive Tendency and Self-Monitor
Experiment Setup. To construct the training dataset $\mathcal{D}$ described in our methodology, we collect CoT from reasoning models on BeaverTails and use GPT-4o as the verifier model. For the Self-Monitor models, we perform supervised fine-tuning on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, and train Safe SFT and Safe CoT SFT baselines using the corresponding data pairs in $\mathcal{D}$.
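A rough sketch of what a record in $\mathcal{D}$ might look like and how it could be assembled into a Self-Monitor SFT target; the field names and the `<think>`/`<monitor>` template are hypothetical, not the released data format:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str          # BeaverTails query
    cot: str             # chain-of-thought collected from a reasoning model
    verifier_label: str  # GPT-4o verdict on the CoT, e.g. "safe" / "flagged"
    response: str        # reference final response

def to_self_monitor_target(ex: TrainingExample) -> dict:
    """Assemble an SFT pair for the Self-Monitor model (illustrative template only)."""
    target = (
        f"<think>{ex.cot}</think>\n"
        f"<monitor>{ex.verifier_label}</monitor>\n"
        f"{ex.response}"
    )
    return {"input": ex.prompt, "target": target}
```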
Evaluation. We use DeceptionBench to evaluate the DTR of Self-Monitor, API-based models, and open-source models. For safety evaluation, we report the average jailbreak score on StrongReject for the top attack types. We also evaluate the attack success rate (ASR) of prefilling attacks within JailbreakBench, using Llama-Guard-3 for assessment. In addition, following the pairwise preference evaluation in BeaverTails, we conduct helpfulness and harmlessness evaluations using GPT-4o for preference judgments.
Results & Analysis. As shown in the performance table below, Self-Monitor significantly reduces the DTR, by 43.8% compared to the instruct model. Self-Monitor also demonstrates notable robustness against prefilling attacks and StrongReject, beating safe fine-tuned reasoning models. Moreover, in our decoupled preference evaluation on BeaverTails, the model improves in both the helpfulness and safety dimensions, achieving win rates of 65% and 67% against the instruct model, respectively.

Reinforcement Learning for Deceptive Alignment Mitigation
Experiment Setup. We study deceptive alignment in reasoning models by setting a "helpful-only" objective during RL, creating a conflict with the models' HHH (Helpful, Honest, Harmless) alignment objective. Training system prompts distinguish between a "pre-deployment evaluation stage" and a "deployment stage" without explicitly promoting deception. Queries from HarmfulQA and BeaverTails are used for training and evaluation, respectively. A reward model, based on Qwen2.5-7B and trained with the UltraFeedback preference dataset, provides the "helpful-only" rewards. We train three RL settings on Gemma-3-12B-IT for 150 steps.
Monitor & Reward. The outer action monitor's signal combines the preference reward with a format reward in all RL settings. We use Qwen2.5-72B-Instruct as the outer CoT monitor, assessing whether the reasoning process exhibits deceptive intent. For self-monitoring RL, CoT rewards are sourced directly from Self-Monitor labels. To avoid reward hacking, where Self-Monitor might simply output labels beneficial to itself, we update the off-policy Self-Monitor every 15 steps.
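A sketch of how these signals could be combined into a scalar reward across the three RL settings; the additive form, the equal weighting, and the `cot_weight` parameter are assumptions rather than values from the paper:

```python
def total_reward(
    setting: str,              # "action_only", "action_plus_cot_monitor", or "self_monitor"
    preference_reward: float,  # from the Qwen2.5-7B "helpful-only" reward model
    format_reward: float,      # rewards well-formed CoT / answer structure
    cot_monitor_score: float = 0.0,   # outer CoT monitor verdict (Qwen2.5-72B-Instruct)
    self_monitor_score: float = 0.0,  # the policy's own Self-Monitor label
    cot_weight: float = 1.0,          # weighting is an assumption
) -> float:
    """Combine action-monitor and CoT-monitor signals into one RL reward (sketch)."""
    # The action monitor (preference + format) is shared across all settings.
    reward = preference_reward + format_reward
    if setting == "action_plus_cot_monitor":
        reward += cot_weight * cot_monitor_score
    elif setting == "self_monitor":
        # CoT reward comes from the model's own, periodically refreshed Self-Monitor.
        reward += cot_weight * self_monitor_score
    return reward
```

The periodic refresh of the off-policy Self-Monitor (every 15 steps) would sit in the surrounding training loop rather than in this reward function.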
Evaluation. We employ GPT-4o to evaluate key metrics of deceptive alignment for each training checkpoint. We first evaluate the fraction of deceptive CoT, which quantifies alignment faking within the model's CoT during the pre-deployment stage. Another critical metric is the performance gap, assessed by a reward model, which quantifies the disparity in model performance between the deployment and pre-deployment stages. A wider gap suggests that the reasoning model is more significantly and deceptively adjusting its behavior during the pre-deployment stage.
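As a rough illustration, the two metrics could be computed per checkpoint as follows; the absolute-difference form of the performance gap and the function names are assumptions:

```python
from statistics import mean

def deceptive_cot_fraction(judge_labels: list[bool]) -> float:
    """Fraction of pre-deployment rollouts whose CoT the judge marks as deceptive."""
    return sum(judge_labels) / len(judge_labels) if judge_labels else 0.0

def performance_gap(deploy_rewards: list[float], predeploy_rewards: list[float]) -> float:
    """Difference in mean reward-model score between the deployment and
    pre-deployment stages; a wider gap indicates behavior shifting across stages."""
    return abs(mean(deploy_rewards) - mean(predeploy_rewards))
```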
Results & Analysis. As illustrated in the figure below, we assess Self-Monitor's ability to reduce covertly misaligned behaviors between the deployment and pre-deployment stages, compared to RL baselines employing either a single action monitor or an action monitor combined with an outer CoT monitor. When employing only an action monitor, we observe a consistent rise in the probability of the model generating deceptive CoT, concomitant with an increasing performance gap, suggesting that the model progressively reinforces compliant behavior during the pre-deployment phase. In RL optimization with the outer CoT monitor, we observe a significant reduction in deceptive CoT, while the performance gap still widens considerably, indicating that the model learns to game the feedback process while concealing its true intent within the CoT. In contrast, Self-Monitor not only maintains a lower performance gap but also markedly reduces deceptive CoT. Even when a higher deceptive tendency emerges in the early RL stage, the constrained optimization facilitated by self-monitoring mitigates deceptive alignment while preserving the transparency of the CoT.
