GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation

Ying Chai*1, Litao Deng*2, Ruizhi Shao1, Jiajun Zhang3, Liangjun Xing1, Hongwen Zhang2, Yebin Liu1
1 Department of Automation, Tsinghua University; 2 School of Artificial Intelligence, Beijing Normal University; 3 School of Electronic Engineering, Beijing University of Posts and Telecommunications

Abstract

Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often suffer from action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a Vision-to-4D-to-Action (V-4D-A) framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To capture time-varying scene geometry and action-aware robot motion, GAF supports three query types: reconstruction of the current scene, prediction of future frames, and estimation of an initial action from robot motion. Furthermore, the high-quality current and future frames generated by GAF enable refinement of manipulation actions through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements: GAF achieves gains of +11.5385 dB in PSNR and -0.5574 in LPIPS for reconstruction quality, and boosts the average success rate of robotic manipulation tasks by 10.33% over state-of-the-art methods.
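To make the representation concrete, below is a minimal, hypothetical PyTorch sketch of a GAF-style field: standard 3DGS attributes augmented with a learnable per-Gaussian motion term, exposing the three query types described above. All names (GaussianActionField, query_current, query_future, query_action), the robot segmentation mask, and the linear-motion assumption are illustrative choices for this sketch, not the authors' actual implementation.

import torch
import torch.nn as nn

class GaussianActionField(nn.Module):
    """Toy GAF-style field: 3DGS attributes plus a learnable motion term."""

    def __init__(self, num_gaussians: int):
        super().__init__()
        # Standard 3DGS attributes (plain RGB stands in for SH coefficients).
        self.xyz = nn.Parameter(torch.randn(num_gaussians, 3))       # centers
        self.scale = nn.Parameter(torch.rand(num_gaussians, 3))      # anisotropic scales
        self.rotation = nn.Parameter(torch.randn(num_gaussians, 4))  # quaternions
        self.opacity = nn.Parameter(torch.rand(num_gaussians, 1))
        self.rgb = nn.Parameter(torch.rand(num_gaussians, 3))
        # Learnable motion attribute: per-Gaussian displacement over the
        # prediction horizon (a linear-motion assumption for this sketch).
        self.motion = nn.Parameter(torch.zeros(num_gaussians, 3))

    def query_current(self) -> torch.Tensor:
        # Reconstruction query: Gaussian centers at the current time step.
        return self.xyz

    def query_future(self, t: float) -> torch.Tensor:
        # Prediction query: centers advanced along the motion attribute;
        # rendering these Gaussians would yield a future frame.
        return self.xyz + t * self.motion

    def query_action(self, robot_mask: torch.Tensor) -> torch.Tensor:
        # Action query: mean motion of robot-associated Gaussians as a
        # coarse initial action estimate (to be refined downstream).
        return self.motion[robot_mask].mean(dim=0)

# Usage: estimate a coarse initial action, assuming a known robot segmentation.
gaf = GaussianActionField(num_gaussians=10_000)
robot_mask = torch.zeros(10_000, dtype=torch.bool)
robot_mask[:500] = True  # hypothetical: first 500 Gaussians lie on the robot
initial_action = gaf.query_action(robot_mask)  # shape (3,): a displacement

In the method itself, the current and future frames rendered from such a field condition a GAF-guided diffusion model that refines this initial action estimate.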


Demo Overview

Part I: Demonstration of the GAF Manipulation Process

Part II: Robustness Tests Under Disturbance

Part III: Comparison with State-of-the-Art Methods