
OpenAI and Anthropic models face off in coding benchmarks

The artificial intelligence landscape has shifted as OpenAI and Anthropic released new flagship models for software engineering tasks. While OpenAI's GPT-5.6 Sol leads in command-line agent performance, Anthropic's Claude Fable 5 maintains a significant advantage on complex repository-level fixes. These competing benchmarks reveal a split in capabilities where different models excel at specific coding workflows, ranging from autonomous tool coordination to end-to-end GitHub issue resolution.

Close-up of a modern smartphone screen showing a white ChatGPT icon on a bright pink background. · Image source: Yellow

According to Yellow, the latest head-to-head evaluations between OpenAI and Anthropic highlight distinct strengths in their newest frontier models. The comparison centers on GPT-5.6 Sol, the flagship of OpenAI's recent three-tier release, and Claude Fable 5, which recently returned to global availability following a brief regulatory hiatus.

Benchmark performance and terminal capabilities

OpenAI reports that GPT-5.6 Sol scored 88.8% on Terminal-Bench 2.1, a benchmark for command-line coding agents that must plan, iterate, and coordinate various tools. In the compute-heavy Ultra mode, which deploys coordinated subagents to handle complex tasks, the score rises to 91.9%. This is the highest published mark on the Terminal-Bench chart to date.

In contrast, reviewers noted that Claude Fable 5 trails Sol in terminal-specific tests, with scores between 83.4% and 84.3%. Sol's efficiency is also under scrutiny: it reportedly matches Mythos-class performance on the ExploitBench security suite while using approximately one-third of the output tokens. This cost compression is considered a vital factor for long-running autonomous agent workflows.
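To make the cost-compression point concrete, here is a minimal Python sketch of how cutting output tokens to one-third affects output-token spend. The session size and the output-token price are hypothetical placeholders, since the article gives only the "approximately one third" ratio and an input-token price. The sketch also assumes both runs are billed at the same per-token rate, which may not hold across different models.

```python
# Sketch: effect of using one-third of the output tokens on output-token spend.
# The token count and output price below are hypothetical; the article supplies
# only the "approximately one third" ratio and a $5/M *input* price.

def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Return the cost in USD of generating `output_tokens` tokens."""
    return output_tokens / 1_000_000 * usd_per_million

BASELINE_OUTPUT_TOKENS = 3_000_000   # hypothetical long-running agent session
ASSUMED_OUTPUT_PRICE = 30.0          # hypothetical USD per million output tokens

baseline_cost = output_cost_usd(BASELINE_OUTPUT_TOKENS, ASSUMED_OUTPUT_PRICE)
# Same task completed with one-third of the output tokens, at the same price.
compressed_cost = output_cost_usd(BASELINE_OUTPUT_TOKENS // 3, ASSUMED_OUTPUT_PRICE)
savings_fraction = 1 - compressed_cost / baseline_cost

print(f"baseline:   ${baseline_cost:.2f}")
print(f"compressed: ${compressed_cost:.2f}")
print(f"savings:    {savings_fraction:.0%}")
```

Under these assumptions the output spend falls by about two-thirds. Real savings would depend on actual prices and on how much of a workflow's cost comes from input tokens.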

Software engineering and accessibility gaps

Despite Sol's terminal dominance, Claude Fable 5 remains the leader on SWE-Bench Pro, which measures the ability to provide end-to-end fixes for real GitHub issues. The model scored 80.3% on this benchmark, well ahead of the older GPT-5.5 model at 58.6%, a gap of 21.7 percentage points. OpenAI has not yet published a GPT-5.6 figure for SWE-Bench Pro, and analysts suggest that closing such a wide gap may require more than an incremental update.
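The gaps behind these comparisons are simple percentage-point differences. The short Python sketch below recomputes them from the scores reported in this article. The dictionary layout and the `point_gap` helper are illustrative choices, not part of any benchmark tooling.

```python
# Scores as reported in the article (percent).
TERMINAL_BENCH = {
    "GPT-5.6 Sol": 88.8,
    "GPT-5.6 Sol (Ultra)": 91.9,
    "Claude Fable 5 (low)": 83.4,
    "Claude Fable 5 (high)": 84.3,
}
SWE_BENCH_PRO = {
    "Claude Fable 5": 80.3,
    "GPT-5.5": 58.6,
}

def point_gap(leader: float, trailer: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(leader - trailer, 1)

# Sol's standard-mode lead over Fable 5 on Terminal-Bench 2.1 (a range,
# because Fable 5's reported score is itself a range).
sol_lead_min = point_gap(TERMINAL_BENCH["GPT-5.6 Sol"], TERMINAL_BENCH["Claude Fable 5 (high)"])
sol_lead_max = point_gap(TERMINAL_BENCH["GPT-5.6 Sol"], TERMINAL_BENCH["Claude Fable 5 (low)"])

# Gain from switching Sol to Ultra mode.
ultra_gain = point_gap(TERMINAL_BENCH["GPT-5.6 Sol (Ultra)"], TERMINAL_BENCH["GPT-5.6 Sol"])

# Fable 5's lead over GPT-5.5 on SWE-Bench Pro.
swe_gap = point_gap(SWE_BENCH_PRO["Claude Fable 5"], SWE_BENCH_PRO["GPT-5.5"])

print(f"Terminal-Bench 2.1: Sol leads Fable 5 by {sol_lead_min} to {sol_lead_max} points")
print(f"Terminal-Bench 2.1: Ultra mode adds {ultra_gain} points")
print(f"SWE-Bench Pro: Fable 5 leads GPT-5.5 by {swe_gap} points")
```

The comparison is lopsided: the terminal lead is a few points, while the SWE-Bench Pro gap is measured against GPT-5.5 because no GPT-5.6 figure has been published.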

The choice between the two models currently depends on specific use cases and access levels:

  • GPT-5.6 Sol is optimized for terminal-driven agents and offers lower pricing at $5 per million input tokens.
  • Claude Fable 5 is preferred for repository-level fixes and remains globally available as of July 1.
  • Sol is currently restricted to a limited preview for roughly 20 government-cleared partners due to security considerations.

The competitive landscape was further complicated in June when regulatory concerns forced Anthropic's models offline briefly following a reported jailbreak by Amazon researchers. While Mythos 5 was restricted to vetted organizations, Fable 5 has been restored for the general public. These developments suggest that while both companies are pushing the limits of coding automation, the path to full deployment remains tied to rigorous security vetting.

FAQ

What are the costs and availability of GPT-5.6 Sol?
GPT-5.6 Sol is currently restricted to a limited preview for roughly 20 government-cleared partners due to security considerations. It is priced at $5 per million input tokens.

How does Claude Fable 5 compare to GPT-5.5 on software engineering tasks?
Claude Fable 5 scored 80.3% on the SWE-Bench Pro benchmark for fixing real GitHub issues. This significantly outperformed the older GPT-5.5 model, which scored 58.6% on the same benchmark.

What are the specific strengths of OpenAI's GPT-5.6 Sol?
GPT-5.6 Sol is optimized for terminal-driven agents and command-line coding tasks. It scored 88.8% on Terminal-Bench 2.1 (91.9% in Ultra mode) and reportedly matches Mythos-class performance on the ExploitBench security suite while using approximately one-third of the output tokens.