📄 Paper Explainer

AI Can't Remember What It Just Saw, and a New Benchmark Exposes It

multimodal large models · memory · benchmark · forgetting · decision making

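Today's multimodal large models can chat and read images, but they fall apart on tasks where the right decision depends on remembering what they saw earlier. The researchers built two games to test exactly this: Matching Pairs (remember where each card is) and 3D Maze (stitch the paths already walked into a map). The hardest levels involve roughly 350 images and 128K tokens of context per episode, and even the strongest current models are far from clearing them. More importantly, the authors separate two causes of failure, forgetting and poor decision making, and find that most of the remaining errors come from forgetting rather than from not knowing what to choose. The takeaway: do not count on AI to be reliable in settings that need sustained memory, such as scanning surveillance footage for a suspect or playing strategy games that require remembering cards.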
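To make the role of memory concrete, here is a minimal toy sketch of a Matching Pairs game in Python. It is not the RNG-Bench harness and it does not compute the paper's Memory Gap metric. The rules (two flips per turn, matched cards removed) and the names `play_matching_pairs` and `known_pair` are assumptions made for illustration only. The sketch compares a player that recalls every card it has seen with one that forgets everything after each turn.

```python
import random


def known_pair(seen):
    """Return two remembered positions that hold the same card, or None."""
    by_card = {}
    for pos, card in seen.items():
        if card in by_card:
            return by_card[card], pos
        by_card[card] = pos
    return None


def play_matching_pairs(num_pairs, remember, seed=0):
    """Play one toy game and return the number of turns needed to clear it.

    Assumed rules: each turn flips two face-down cards; a matching pair is
    removed, a mismatch is turned face down again. With remember=True the
    player recalls every card it has seen; with remember=False it forgets
    everything after each turn.
    """
    rng = random.Random(seed)
    cards = [i // 2 for i in range(2 * num_pairs)]
    rng.shuffle(cards)
    remaining = set(range(len(cards)))  # positions still on the table
    seen = {}                           # position -> card, the player's memory
    turns = 0
    while remaining:
        turns += 1
        pair = known_pair(seen)
        if pair is None:
            # No remembered pair: flip a card the player has not seen yet.
            unknown = sorted(remaining - set(seen))
            first = rng.choice(unknown)
            match = [p for p, c in seen.items() if c == cards[first]]
            if match:
                second = match[0]  # its partner is remembered
            else:
                second = rng.choice([p for p in unknown if p != first])
            pair = (first, second)
        first, second = pair
        if cards[first] == cards[second]:
            remaining -= {first, second}
            seen.pop(first, None)
            seen.pop(second, None)
        elif remember:
            seen[first] = cards[first]
            seen[second] = cards[second]
        # A forgetful player never writes to `seen`, so it stays empty.
    return turns


if __name__ == "__main__":
    seeds = range(20)
    with_memory = [play_matching_pairs(8, True, s) for s in seeds]
    forgetful = [play_matching_pairs(8, False, s) for s in seeds]
    print("mean turns with memory:", sum(with_memory) / len(with_memory))
    print("mean turns without memory:", sum(forgetful) / len(forgetful))
```

Both players use the same selection rule, so any difference in turn count comes purely from what they remember. That is the intuition behind separating forgetting from poor action selection, although the paper's own metric may be defined quite differently.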

📄 Original Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

Original paper on arXiv
