AI Pulse
📄 Paper Explainer

AI learns to organize its own data, no more hand-written rules

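Today, when AI processes raw data such as video and images, it relies either on hand-written rules or on general-purpose models. Both are expensive and rigid, and both miss the deeper logic in the data. This paper proposes letting AI learn to "organize data" on its own: like an intelligent assistant, it actively structures messy data and distills the essentials according to your needs (for example, making a video or answering questions). The authors trained a 9B-parameter model, DataClaw_0, combining supervised fine-tuning with reinforcement learning (GRPO) to align it with complex instructions, and validated it on tasks including video generation and visual question answering: downstream models learn better from less data. It is not a tool you can start using tomorrow, but it points to a trend: future AI may no longer depend on manual annotation and may instead learn to "read" raw data by itself.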

📄 Original Abstract (English)

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refining and structuring data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline grounding generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, DataClaw_0-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData

Original paper on arXiv
