· Supply chain analysis

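A new benchmark shows Chinese AI models lagging far behind Western ones in deep reasoning, with data and architecture as the bottlenecks.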

English original

A new benchmark shows that Chinese AI models (Kimi, Minimax, DeepSeek) are much further behind Western frontier AI models than markets expect. The Opus, Gemini, and GPT LLMs are shown to be leading.

The benchmark, called SWE-rebench, uses new GitHub tasks:
-> Minimax claimed 80.2% on the original SWE-bench.
-> On the uncontaminated SWE-rebench, it crashed to 39.6%.

The takeaway: Chinese labs have effectively solved single-prompt reasoning and discrete coding tasks at a fraction of the cost. However, the architecture and high-quality data required for long-horizon behavior remain a severe bottleneck that distillation and optimizing for benchmarks cannot fake. Chinese models are shown to be lagging in the deep, adaptable reasoning that US hyperscalers have.

View the original post on X ↗