We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.
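The paper does not specify the exact layer design of the ADF-Decoder, but the "directionally constrained communication" it describes can be expressed as an asymmetric attention mask. Below is a minimal illustrative sketch (all names and the mask convention are assumptions, not the authors' implementation): camera tokens may attend to image tokens, while image tokens are blocked from attending to camera tokens, keeping scene semantics free of viewpoint cues.

```python
import numpy as np

def adf_attention_mask(n_cam: int, n_img: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed) for a
    hypothetical asymmetric dual-flow layer. Token order is assumed
    to be [camera tokens, image tokens]."""
    n = n_cam + n_img
    mask = np.ones((n, n), dtype=bool)
    # Block image-token queries (rows) from reading camera-token keys
    # (columns), so viewpoint cues cannot leak into scene features.
    mask[n_cam:, :n_cam] = False
    return mask

# With 2 camera tokens and 4 image tokens: camera rows see all 6
# tokens, image rows see only the 4 image tokens.
mask = adf_attention_mask(n_cam=2, n_img=4)
```

A mask of this form would typically be passed to a standard attention layer; the one-way flow is what maintains the clean factorization between pose and scene content within a single feed-forward pass.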
Overview of TokenSplat. TokenSplat performs feed-forward 3D Gaussian reconstruction and camera pose estimation from unposed images. A shared ViT encoder extracts image tokens, which are processed by the Canonical Scene Decoder and the Asymmetric Dual-Flow Decoder (ADF-Decoder). The fused tokens are then used by the Token-aligned Gaussian Prediction module and the camera pose head to generate dense 3D Gaussians and accurate poses.
Our method produces renderings with clearer structure.
(Qualitative comparison panels: Ours vs. SPFSplat, AnySplat, VicaSplat, and FreeSplat.)
Viewed holistically, our approach generates more structured and complete reconstructions of the entire scene, whereas previous methods often produce numerous visual artifacts or floating objects.
(Qualitative comparison panels: Ours vs. FreeSplat, SPFSplat, and AnySplat.)