AI Pulse
📄 Paper Explainer

AI Can't Remember What It Just Saw? A New Benchmark Tests Exactly That

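Today's multimodal large models (the kind that can both see and talk) have a serious weak spot: they struggle to remember what they just saw. Ask one to play a card-matching game and it loses track of a card's position as soon as the card is flipped back over; ask it to walk through a maze and it loses its bearings after a turn or two. The weakness has been hard to pin down because existing benchmarks either show the model the full state, mix memory up with other agent skills, or only test recall after an episode is over. The researchers built a new benchmark, RNG-Bench, that separates "remembering the past" from "deciding what to do based on that memory." It has two games, Matching Pairs and a 3D Maze, with adjustable difficulty (grid size, visual pattern, and whether the model observes images or text). The hardest settings require roughly 128K tokens and 350 images per episode, and even today's frontier models are far from solving them. The authors also find that most of the remaining errors come from forgetting earlier observations rather than from poor decision making. This is not a tool you can use tomorrow, but if you care about when AI will really "remember what you said a moment ago," this is the yardstick for measuring it.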

📄 Original Abstract (English)

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
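To make the "forgetting versus poor action selection" distinction concrete, here is a minimal Python sketch of a Matching Pairs game played by two scripted agents: one that remembers every card it has flipped and one that remembers nothing. This is an illustration under stated assumptions, not the RNG-Bench harness. All function names are hypothetical, the agents are simple scripts rather than MLLMs, and the "toy gap" printed at the end is only the difference in average turns between the two agents; it is not the paper's Memory Gap metric, whose definition the abstract does not give.

```python
import random


def make_board(num_pairs, rng):
    """Return a shuffled list in which each card identity appears exactly twice."""
    cards = list(range(num_pairs)) * 2
    rng.shuffle(cards)
    return cards


def known_pair(seen):
    """Return two remembered positions that hold the same card, or None."""
    first_position = {}
    for position, identity in seen.items():
        if identity in first_position:
            return first_position[identity], position
        first_position[identity] = position
    return None


def play_episode(board, rng, use_memory):
    """Play until the board is cleared; return the number of turns (two flips each)."""
    remaining = set(range(len(board)))  # positions still face down
    seen = {}  # position -> identity; stays empty for the memoryless agent
    turns = 0
    while remaining:
        turns += 1
        pair = known_pair(seen)
        if pair is not None:
            # Both halves of a pair are remembered: collect it.
            first, second = pair
        else:
            # Flip a card that has never been seen, then use memory if it helps.
            unseen = sorted(remaining - set(seen))
            first = rng.choice(unseen)
            remembered = [p for p, identity in seen.items() if identity == board[first]]
            if remembered:
                second = remembered[0]
            else:
                second = rng.choice([p for p in unseen if p != first])
        if board[first] == board[second]:
            remaining -= {first, second}
            seen.pop(first, None)
            seen.pop(second, None)
        elif use_memory:
            # The cards are hidden again; only the memory agent keeps what it saw.
            seen[first] = board[first]
            seen[second] = board[second]
    return turns


def average_turns(num_pairs, episodes, use_memory, seed=0):
    """Average turns per episode; both agents face the same boards for a given seed."""
    board_rng = random.Random(seed)
    agent_rng = random.Random(seed + 1)
    total = 0
    for _ in range(episodes):
        board = make_board(num_pairs, board_rng)
        total += play_episode(board, agent_rng, use_memory)
    return total / episodes


if __name__ == "__main__":
    with_memory = average_turns(8, 200, use_memory=True)
    without_memory = average_turns(8, 200, use_memory=False)
    print(f"perfect memory: {with_memory:.1f} turns per episode")
    print(f"no memory:      {without_memory:.1f} turns per episode")
    print(f"toy gap:        {without_memory - with_memory:.1f} turns")
```

Both agents follow the same flipping rule and differ only in whether past observations are retained, so in this toy setting any difference in turns comes from memory alone. Drawing the boards from a separate seeded generator means the two agents are compared on identical instances.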

Original paper on arXiv
