深度对比 · 2026-05-10 · by @zayuerweb-dev
GPT-5 vs Claude Sonnet 4.6: Which to Pick for Coding
For serious coding there are really two AI choices: OpenAI's GPT-5 (including Pro) and Anthropic's Claude Sonnet 4.6. I wired both into Cursor and ran them for a week, cross-checked against SWE-bench numbers and official prices. This piece tells you which to use for which job. None of the "they're both great" hand-waving.
30-second verdict
- Daily agent coding (Cursor / Cline / Aider): Claude Sonnet 4.6. Steady tool calling, doesn't go off-script, first on SWE-bench.
- System design, hard algorithms, fuzzy requirements: GPT-5 / GPT-5 Pro. Leads on asking the right questions and reasoning depth.
- Tight budget or high volume: GPT-5 (30-40% cheaper), or go further with Claude Haiku 4.5 / DeepSeek R1.
- Loading a whole repo to ask about it: neither is optimal, pick Gemini 2.5 Pro (2 million context).
- If you can't decide: Claude as the default, switch to GPT-5 Pro for hard problems.
Core specs compared
| Dimension | Claude Sonnet 4.6 | GPT-5 |
|---|---|---|
| Input price (per 1M tokens) | $3.00 | $2.50 |
| Output price | $15.00 | $10.00 |
| Context window | 200K (1M beta) | 400K |
| SWE-bench Verified | ~70% | ~65% |
| HumanEval | ~94% | ~96% |
| LiveCodeBench | ~72% | ~78% |
| Tool-calling reliability | Strongest | Very good |
| Cursor default | Yes | Alternative |
Sources: Anthropic / OpenAI official pricing pages, SWE-bench Verified, LiveCodeBench, and public Vellum evaluations, current as of May 2026.
Real scenario 1: fixing a cross-file bug in Cursor
Same task: "the error message doesn't show when login fails, please find and fix it." It spans three files: a front-end component, an API route, and error-handling middleware.
How Claude Sonnet 4.6 did: read the 3 relevant files, traced it to the middleware swallowing the error, produced a patch, passed type checks, and changed only the necessary lines. Done in one pass.
How GPT-5 did: read the same files, found the same root cause, but also "optimized" two unrelated early-return styles in the middleware. The code was correct, but the diff was 3x larger than Claude's. You have to manually strip out the unrelated changes.
Verdict: Claude is more restrained in agent mode. That's why Cursor, Cline, and Aider set it as the default. The bigger the codebase and the more people reviewing PRs, the more this matters.
Real scenario 2: algorithm / system design
"Design a URL shortener that handles 1 million QPS, covering consistency, capacity estimates, and a degradation plan."
GPT-5 Pro: asked clarifying questions first: "Read-heavy or write-heavy? What's the budget? Do you need custom-suffix analytics?" Then gave three designs and noted the trade-offs of each.
Claude Sonnet 4.6: went straight to a complete, good-quality design, but with a weaker instinct to ask questions.
Verdict: for open-ended questions, system design, and interview problems, GPT-5 Pro is clearly steadier. The real reasoning edge is "knowing to ask," not how prettily it writes up the answer.
Real scenario 3: cost-effectiveness
The same 100-file monorepo run through one AI code review. Estimated totals: 800K tokens in, 200K out.
- Claude Sonnet 4.6: $3 × 0.8 + $15 × 0.2 = $5.40 per run.
- GPT-5: $2.50 × 0.8 + $10 × 0.2 = $4.00 per run.
- Claude with prompt caching: ~$1.50.
- GPT-5 with batch: ~$2.00.
GPT-5's sticker price is cheaper, but Claude's prompt-caching discount is more aggressive. For a product that reuses the same system prompt repeatedly, Claude's real cost can come out below GPT-5. Test it on your own data.
When GPT-5 is the better fit
- Agents that need to ask questions. For example, having the model read a spec before writing code, GPT-5 Pro's question quality is clearly higher.
- Competitive programming, LeetCode. GPT-5 edges ahead on HumanEval and LiveCodeBench.
- Image understanding alongside. GPT-5's multimodal ability is slightly stronger.
- Already in the OpenAI ecosystem. If you use the Assistants API, file search, or Code Interpreter, migration costs are high.
- Budget-sensitive. Token prices are 30-40% cheaper.
When Claude is the must-pick
- Using Cursor / Windsurf / Cline / Aider. Every major agent tool is tuned most deeply for Claude.
- Multi-file refactors. No stray edits, keeps the code style consistent.
- Structured output. Tables, comparisons, and normalized documents are steadier with Claude.
- Long agent loops (10+ tool calls). Claude's drift rate is clearly lower.
- Teams picky about code style. Claude follows existing conventions by default; GPT-5 occasionally takes liberties.
How to use both
The most common engineer setup in 2026:
- Cursor defaulting to Claude Sonnet 4.6 for daily edits, completions, and refactors.
- Switch to GPT-5 Pro for hard problems (built into Cursor), letting it clarify first, then propose a design.
- Switch batch, low-value tasks (lint, doc generation, commit messages) to DeepSeek R1 or Claude Haiku 4.5 to save money.
- Switch whole-repo Q&A to Gemini 2.5 Pro, fitting the whole project into 2M context.
Don't bet on a single model. The frontier swaps leaders every 3-6 months. Switch when you can, and don't dig yourself a hole.
Call both with one API key
If you're building your own tool or agent, OpenRouter gives you one OpenAI-compatible endpoint that routes to GPT-5, Claude, DeepSeek, and Gemini, which makes switching and A/B testing easy. Note: OpenRouter has no public referral program; the link below is a plain recommendation.
The one-line summary
Claude Sonnet 4.6 as the default, switch to GPT-5 Pro for hard problems, switch batch tasks to DeepSeek R1. Use Gemini for whole-repo Q&A. Don't try to bet on one model. Switching is the best practice for 2026.