We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.
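The paper does not specify the exact layer design of the ADF-Decoder, but the "directionally constrained communication" it describes can be expressed as an asymmetric attention mask. Below is a minimal illustrative sketch (all names and the mask convention are assumptions, not the authors' implementation): camera tokens may attend to image tokens, while image tokens are blocked from attending to camera tokens, keeping scene semantics free of viewpoint cues.

```python
import numpy as np

def adf_attention_mask(n_cam: int, n_img: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed) for a
    hypothetical asymmetric dual-flow layer. Token order is assumed
    to be [camera tokens, image tokens]."""
    n = n_cam + n_img
    mask = np.ones((n, n), dtype=bool)
    # Block image-token queries (rows) from reading camera-token keys
    # (columns), so viewpoint cues cannot leak into scene features.
    mask[n_cam:, :n_cam] = False
    return mask

# With 2 camera tokens and 4 image tokens: camera rows see all 6
# tokens, image rows see only the 4 image tokens.
mask = adf_attention_mask(n_cam=2, n_img=4)
```

A mask of this form would typically be passed to a standard attention layer; the one-way flow is what maintains the clean factorization between pose and scene content within a single feed-forward pass.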
Overview of TokenSplat. TokenSplat performs feed-forward 3D Gaussian reconstruction and camera pose estimation from unposed images. A shared ViT encoder extracts image tokens, which are processed by the Canonical Scene Decoder and the Asymmetric Dual-Flow Decoder (ADF-Decoder). The fused tokens are then used by the Token-aligned Gaussian Prediction module and the camera pose head to generate dense 3D Gaussians and accurate poses.
Our method produces renderings with clearer structure.
(Qualitative comparison panels: Ours vs. SPFSplat, AnySplat, VicaSplat, and FreeSplat.)
Viewed holistically, our approach generates more structured and complete reconstructions of the entire scene, whereas previous methods often produce numerous visual artifacts or floating objects.
(Qualitative comparison panels: Ours vs. FreeSplat, SPFSplat, and AnySplat.)