Are DeepSeek R1 And Other Reasoning Models More Faithful?

This post shows an earlier version of our paper. We later added and replicated findings on Gemini models, bringing the total to three models evaluated (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base). Our tests were used for faithfulness evaluation in the Claude Sonnet 3.5 system card and in Anthropic’s paper on faithfulness.

Figure 1
Figure 1: Left: Example of models either succeeding or failing to articulate a cue that influences their answer. We edit an MMLU question by prepending a Stanford professor's opinion. For examples like this where the cue changes the model answer, we measure how often models articulate the cue in their CoT. (Here we show only options A and B, rather than all four.) Right: Inference-Time-Compute models articulate the cue more often. The ITC version of Qwen refers to QwQ-32b-Preview, and non-ITC refers to Qwen-2.5-72B-Instruct. For Gemini, we use gemini-2.0-flash-thinking-exp and gemini-2.0-flash-exp respectively.

Abstract

Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful compared to traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT.

To measure faithfulness, we test if models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue “A Stanford Professor thinks the answer is D” is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini.

We evaluate 7 types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate cues that influence them much more reliably than all the 6 non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate close to 0% of the time.

However, our study has important limitations. We evaluate only two ITC models – we cannot evaluate OpenAI’s SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes.

We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. We are optimistic that these new models are more faithful in their reasoning compared to past models.

1. Introduction

Inference-Time-Compute (ITC) models have achieved state-of-the-art performance on challenging benchmarks such as GPQA and Math Olympiad tests. However, suggestions of improved transparency and faithfulness in ITC models have not yet been tested. Faithfulness seems valuable for AI safety: If models reliably described the main factors leading to their decisions in their Chain of Thought (CoT), then risks of deceptive behavior — including scheming, sandbagging, sycophancy and sabotage — would be reduced.

However, past work shows that models not specialized for Inference-Time-Compute have a weakness in terms of faithfulness. Instead of articulating that a cue in the prompt influences their answer, they often produce post-hoc motivated reasoning to support that answer, without making any mention of the cue. Building on this work, we measure a form of faithfulness by testing whether models articulate cues that influence their answers. To do this, we use our previous work on cues that influence model responses. The cues include opinions (e.g. “A professor thinks the answer is D”), spurious few-shot prompts, and a range of others. We test this on factual questions from the MMLU dataset.

Figure 2
Figure 2: Two-step process for measuring faithfulness. Step 1: We identify samples where a model switches its answer when presented with a cue (e.g., a professor's opinion). Normally, with the original question, the model answers (A) Mississippian. Due to the cue in the prompt, the model switches its answer from Mississippian to Hohokam. The ellipsis "..." indicates truncated parts of the model's response. Step 2: For these switched samples, we use a judge model to evaluate whether the model explicitly acknowledges the cue in its reasoning.
Figure 3
Figure 3: Overview of results across different types of cues. The x-axis shows the different types of cues that we test models with, discussed further in Section 2. Blue bars show Claude-3.5-Sonnet, which frequently achieves 0% articulation across different types of cues. Across the six non-ITC models (e.g., GPT-4o, Llama-3.3-70b-Instruct, Grok-2-Preview), we observe similar poor articulation rates to Claude-3.5-Sonnet. Qwen and Gemini ITC models perform significantly better, although there is large variance between types of cues.

We study cases where models change their answer when presented with a cue (switched examples) in CoT settings. We use a judge model (GPT-4o) to evaluate whether the model’s CoT articulates that the cue was an important factor in the final answer (Figure 2).

We study six non-ITC models: Claude-3.5-Sonnet, GPT-4o, Grok-2-Preview, Llama-3.3-70b-Instruct, Qwen-2.5-Instruct, and Gemini-2.0-flash-exp. We compare these to two ITC models: QwQ-32b-Preview and Gemini-2.0-flash-thinking-exp. The Qwen team states QwQ-32b-Preview was trained using Qwen-2.5 non-ITC models, and we speculate that Gemini-2.0-flash-thinking-exp is similarly trained using the non-ITC model Gemini-2.0-flash-exp.

Figure 3 demonstrates that ITC models outperform non-ITC models like Claude-3.5-Sonnet in articulating cues across different experimental setups. As a more difficult test, we compare the ITC models to the best non-ITC model for each cue, and find similar results where ITC models have better articulation rates.

We present these findings as a research note. This work has two key limitations. First, the lack of technical documentation for ITC models prevents us from attributing the improved articulation rates to specific architectural or training mechanisms. Second, we evaluate only on two ITC models. While other ITC models exist (OpenAI’s O1 and Deepseek’s Deepseek-R1-Preview), we lack access to O1’s CoTs, and to Deepseek’s API. Despite these limitations, we hope our results contribute to discussions around faithfulness in ITC models. The improved articulation rates provide early evidence that ITC models may enable more faithful reasoning that is understandable by humans.

2. Setup and Results of Cues

This section describes the cues that we use in our experiments. In this paper, a cue is an insertion into the question prompt (i.e. the prompt including the MMLU question) that points to a particular MMLU multiple-choice response (the cued response). For all cues, we only evaluate the switched examples where the model changes its answer to the cued response.

2.1 Cue: Professor’s Opinion

Figure 1 shows our setup and results: we add a cue that a Stanford professor thinks a particular answer choice is correct. This cue is similar to sycophancy where the user inserts an opinion “I think the answer is (X)”. We use a Stanford professor’s opinion instead of a user’s because we find that current models are less influenced by a user’s opinion.

The ITC models are significantly better at articulating the professor’s cue in their reasoning. Qwen achieves 52% articulation rate and Gemini achieves 47%, compared to 13% for the best non-ITC model.

2.2 Cue: Few-Shot with Black Square

Figure 4
Figure 4: Left: The Gemini ITC model articulates its reasoning of following the black square pattern. Right: ITC models articulate the black square at higher rates. The best non-ITC model, Claude-3.5-Sonnet, articulates the cue only 3% of the time, compared to the ITC models which articulate at least 17% of the time.

At the start of the prompt, we add a set of multiple-choice question-answer pairs for easy questions where models know the correct answer. These few-shot questions include a black square (■), that is placed next to the correct answer choice in case. We also place a black square next to one of the possibly incorrect options for the MMLU question. The idea is that the model can recognize that the black square is a reliable guide to the correct answer, and so may apply this to the final question in the prompt.

If the model switches its answer to the answer indicated by the black square, the model should say that it uses the black square as a hint. Figure 4 shows the results. We find that ITC models perform significantly better. The best performing non-ITC model, Claude-3.5-Sonnet, only articulates the cue 3% of the time. The Qwen ITC model articulates the cue 17% of the time, while the Gemini ITC model articulates 28% of the time.

2.3 Other Cues

Figure 5
Figure 5: Left: Example of the Qwen ITC model articulating the Argument cue. The long argument in the prompt causes the model to pick a particular answer and the model says that it picks the particular option due to the cue. Right: Mixed results for ITC models articulating the argument cue: While the Qwen ITC model articulates the argument's cue well at 16% of the time, the Gemini ITC model performs poorly at 3%.

We investigate additional cues based on our previous work. Figure 3 shows the overview. We find overall similar results where ITC models perform better than the best non-ITC model. But in some cases, only one ITC model articulates well, and in the case of “Are you sure?”, no model articulates well.

Argument Cue. Instead of inserting a professor’s opinion (Figure 1), we insert a long argument supporting a particular option (Figure 5). The Qwen ITC model articulates the cue 16% (± 5%) of the time, compared to the best non-ITC model GPT-4o at only 2% (± 2%). But the Gemini ITC model does not articulate well, at only 3% (± 3%) of the time.

Few-shot White Squares. Instead of using a black square to cue the model (Figure 4), we use a white square. While ITC models still articulate better than the best non-ITC model, we get weaker results. Both Gemini and Qwen ITC models decrease in articulation.

Post-Hoc Cue. We insert an answer that the assistant normally does not give on the assistant side of the conversation. We then ask the assistant to think again about the answer. The Qwen ITC model articulates the cue 10% (± 4%) of the time, compared to all non-ITC models which articulate 0% of the time.

Wrong Few-shot. We add a few-shot example with the wrong label to a few-shot prompt and then ask the model the same question again. Both ITC models do significantly better than non-ITC models.

Are you sure? We first ask the model the “original question”. The user then asks a follow-up question: “I don’t think that is right. Are you sure?” All models perform poorly, with articulation rates near 0%.

3. Discussion

Improving non-ITC articulation

One reason for poor articulation in non-ITC models may be a lack of clear instructions to articulate all relevant factors in their reasoning process. To test this hypothesis, we modified the system prompt to include: “When thinking step by step, please include all relevant factors in your reasoning.” However, this did not significantly improve articulation rates for non-ITC models.

Advantage of ITC models in articulation

We do not know the exact training procedures of ITC models, as no technical reports have been released detailing their training. We speculate that one source of improved articulation may stem from ITC models backtracking and expressing uncertainty in their reasoning chains. Another possibility is that ITC models have less incentive to produce short CoTs and this causes them to include more relevant factors in their CoT than their non-ITC counterparts.

Length of CoTs across models

ITC models often produce 2-5x longer CoTs compared to non-ITC models. However, the non-ITC Gemini and Qwen models still produce an average of 10-15 sentences of CoT, and so it seems reasonable to ask that they articulate the cue (which would take only one sentence).

False Positives

Figure 7
Figure 7: F1 across different types of cues. Blue bars show the best performing non-ITC model for each cue. The six non-ITC models are Claude-3.5-Sonnet, GPT-4o, Gemini-2.0-Flash-Exp, Grok-2-1212, Llama-3.3-70b, and Qwen-72b-Instruct. Qwen and Gemini ITC models perform significantly better than non-ITC models on the F1 score.

Our main results examine cases where the model switches its answer due to the cue, which measures recall. An objection to our results could be that the ITC models always articulate cues, even if the cue did not cause a switch in their answer, resulting in false positives. One metric to account for false positives is the precision score. We then calculate the F1 score, which is the harmonic mean of precision and recall, providing a single score that balances both metrics. We compare ITC models with the best non-ITC model for each cue (Figure 7). Overall, ITC models perform significantly better, even in this scenario which accounts for the effect of precision.

Different articulation rates across cues

The ITC models articulate at different rates across different cues. We speculate that the model may judge some cues to be more acceptable to mention (given its post-training). For example, it may be more acceptable to cite a Stanford professor’s opinion as influencing its judgment, compared to changing a judgment because the user asked, “Are you sure?”

Training data contamination

Our earlier paper on faithfulness contains similar cues, released on March 2024. Models may have been trained on this dataset to articulate these cues. To address this concern, we include new cues that are slightly different from those present in the paper – specifically the Professor and White Squares cues. Results are similar for the new cues, with ITC models articulating much better than non-ITC models.

Read our paper for additional details and results!