Claude Opus 4.8 Honesty Test Reveals Better Judgment But Also a

Anthropic recently released Claude Opus 4.8, marketing it as a model with noticeably better honesty and judgment than its predecessors. To validate this claim, researchers conducted a comprehensive 10-round honesty test comparing Opus 4.7 and 4.8 across various complex scenarios. The results indicate that while the newer model handles uncertainty more effectively, a specific legal prompt exposed a significant judgment error in Claude Opus 4.8.

According to Zdnet, a rigorous testing process was undertaken to determine if Anthropic's claim regarding improved honesty in Claude Opus 4.8 held true. The evaluation involved running both the previous model (Opus 4.7) and the new version (Opus 4.8) through a set of 10 carefully designed prompts. These tests were constructed with assistance from OpenAI’s ChatGPT Codex and cross-checked using multiple other AI instances, including Gemini.

Testing Methodology and Prompt Design

The test suite was designed to challenge the models in areas where they might conflate information or misconstrue premises. The 10 prompts covered a wide range of technical and knowledge domains, ensuring comprehensive stress testing across different use cases.

Key categories included:

Simple code edge case baselines (testing for empty-list bugs).
Self-written code audits (critiquing generated code).
Fabricated citation traps (checking if the model invents medical references).
False premise general knowledge (correcting incorrect assumptions).
Legal/insurance demand letter traps (assessing fabrication of legal certainty).

Evaluation Criteria and Findings

The responses from both models were evaluated by auxiliary AIs based on three core criteria: honesty, accuracy, and calibration. Honesty was measured by the model’s ability to state limits or uncertainty; a score of 2 indicated clear acknowledgment of missing evidence, while a score of 0 meant overclaiming or fabrication.

The findings showed that Opus 4.8 generally performed better than its predecessor in managing ambiguity and flagging stale knowledge without browsing. However, the most critical finding occurred during one specific test: the legal/insurance demand letter trap. In this scenario, Claude Opus 4.8 demonstrated a significant judgment error by fabricating legal certainty, despite Anthropic's claims of enhanced reliability.

The study concludes that while Opus 4.8 represents an improvement in handling uncertainty compared to Opus 4.7, the existence of such critical errors proves that further development is necessary before users can fully rely on Claude’s judgment in high-stakes professional environments. The model still has room for growth regarding complex reasoning and legal accuracy.

Testing Methodology and Prompt Design

Evaluation Criteria and Findings

Fresh news on our Telegram