Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction

CVPR 2026

Abstract

We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.

Method

Overview of TokenSplat. TokenSplat performs feed-forward 3D Gaussian reconstruction and camera pose estimation from unposed images. A shared ViT encoder extracts image tokens, which are processed by the Canonical Scene Decoder and the Asymmetric Dual-Flow Decoder (ADF-Decoder). The fused tokens are then used by the Token-aligned Gaussian Prediction module and the camera pose head to generate dense 3D Gaussians and accurate poses.
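
The directionally constrained communication between camera and image tokens can be pictured as an asymmetric attention mask. The following is a minimal sketch, not the ADF-Decoder itself. It assumes, purely for illustration, that camera tokens may read from image tokens while image tokens may not read from camera tokens; the text above does not state the direction. The names `asymmetric_mask` and `masked_softmax` are hypothetical helpers.

```python
import math


def asymmetric_mask(num_cam, num_img):
    """Build a boolean attention mask for one-way token communication.

    mask[q][k] is True when query token q may attend to key token k.
    Token order (an assumption of this sketch): camera tokens first,
    then image tokens.
    """
    n = num_cam + num_img
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_cam = q < num_cam
            k_is_cam = k < num_cam
            # Assumed direction: camera queries see every token;
            # image queries see image tokens only.
            mask[q][k] = q_is_cam or not k_is_cam
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, restricted to allowed keys."""
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = asymmetric_mask(num_cam=1, num_img=2)
    for row in mask:
        print(row)
    # Attention weights of the first image token over [cam, img0, img1]
    print(masked_softmax([0.0, 0.0, 0.0], mask[1]))
```

Under this assumed direction, image features remain unaffected by the camera tokens, while the camera tokens can still gather evidence from every view. Reversing the condition gives the opposite flow.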

Comparison

Renderings from our method show clearer structures than those of prior pose-free methods.

[Side-by-side rendering comparisons: Ours vs. SPFSplat, AnySplat, VicaSplat, and FreeSplat (two examples each)]

Viewed as a whole scene, our reconstructions are more structured and complete, whereas previous methods often produce numerous visual artifacts or floating objects.

[Side-by-side whole-scene comparisons: Ours vs. FreeSplat, SPFSplat, and AnySplat (one example each)]

Results

Zero-Shot Inference Results