CoV: Chain-of-View Prompting for Spatial Reasoning


Haoyu Zhao1*  Akide Liu2*  Zeyu Zhang2*  Weijie Wang1* 
Feng Chen3  Ruihan Zhu1  Gholamreza Haffari2  Bohan Zhuang1

1ZIP Lab, Zhejiang University   2Monash University   3AIML, The University of Adelaide

*Equal contribution

Introduction




Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed, finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached.

Pipeline




Coarse-grained View Selection. In real-world scenarios, an agent's visual input typically comes from a continuous video stream whose raw frames contain substantial redundancy. Only a small subset of frames is usually relevant, and excessive irrelevant visual information can distract the model. We introduce a coarse-grained view selection agent that actively identifies and selects the key viewpoints most relevant to the current question from the available raw views, providing a strongly question-aligned visual basis for subsequent reasoning steps.
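A minimal sketch of what such a selection agent could look like. The `Frame`, `cosine`, and `select_anchor_views` names are illustrative assumptions, not the paper's actual implementation; here frame relevance is approximated by cosine similarity between per-frame embeddings and a question embedding, standing in for the VLM's own relevance judgment.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    index: int                  # position in the raw video stream
    features: Sequence[float]   # e.g. an image embedding of the frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def select_anchor_views(frames: List[Frame],
                        question_embedding: Sequence[float],
                        k: int = 4) -> List[Frame]:
    """Rank raw frames by question relevance and keep the top-k as anchors."""
    ranked = sorted(frames,
                    key=lambda f: cosine(f.features, question_embedding),
                    reverse=True)
    # Restore temporal order so the anchors read as a coherent sub-trajectory.
    return sorted(ranked[:k], key=lambda f: f.index)
```

In the actual framework the relevance judgment is made by the VLM itself via prompting; the embedding-similarity scorer above is only a stand-in to make the filtering step concrete.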

Fine-grained View Adjustment. Inspired by chain-of-thought (CoT) reasoning, we introduce a fine-grained view adjustment mechanism. Starting from the visual anchors obtained in the coarse-grained view selection stage, the VLM plans and executes a sequence of view adjustment actions, including translations, rotations, and switching between different viewpoints. Once the agent determines that sufficient information has been gathered to answer the question, it terminates the view adjustment process and produces the final answer based on the carefully constructed visual context.
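The interleaved reason-act loop above can be sketched as follows. The `chain_of_view` function, the action names, and the `policy`/`render` interfaces are assumptions made for illustration: `policy` stands in for the VLM deciding the next camera action from the views gathered so far, and `render` stands in for querying the 3D scene representation for a new observation.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical discrete action vocabulary: translations, rotations,
# switching viewpoints, and a terminal "answer" action.
ACTIONS = ("move_forward", "move_backward",
           "rotate_left", "rotate_right",
           "switch_view", "answer")

Action = Tuple[str, Any]  # (action name, payload such as the final answer)


def chain_of_view(initial_views: List[Any],
                  policy: Callable[[List[Any]], Action],
                  render: Callable[[str], Any],
                  max_steps: int = 8) -> Tuple[Any, List[Any]]:
    """Interleave reasoning (policy) with camera actions until the policy
    emits "answer" or the step budget is exhausted."""
    views = list(initial_views)
    for _ in range(max_steps):
        action, payload = policy(views)
        if action == "answer":
            return payload, views
        # Execute the camera action and fold the new observation into context.
        views.append(render(action))
    # Budget exhausted: answer from whatever context was gathered.
    return policy(views)[1], views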

Evaluation


Table 1
Table 2

Test-time Scaling


Step Distribution
Scaling

Inspired by the budget forcing strategy used in s1, we set a lower bound on the number of action steps in the CoV agent's prompt template. Compared with a minimum of one action step, raising this lower bound yields an average improvement of 2.51%.
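A sketch of how such a lower bound could be enforced programmatically (in the paper it is expressed through the prompt template). The wrapper below is a hypothetical construction: it intercepts an "answer" action issued before `min_steps` exploration steps and substitutes a fallback exploration action instead.

```python
from typing import Any, Callable, List, Tuple

Action = Tuple[str, Any]


def budget_forced_policy(base_policy: Callable[[List[Any]], Action],
                         min_steps: int,
                         fallback_action: str = "switch_view"
                         ) -> Callable[[List[Any]], Action]:
    """Wrap a policy so it cannot answer before `min_steps` actions,
    mirroring s1-style budget forcing."""
    state = {"steps": 0}

    def policy(views: List[Any]) -> Action:
        action, payload = base_policy(views)
        if action == "answer" and state["steps"] < min_steps:
            # Too early to answer: force one more exploration step instead.
            action, payload = fallback_action, None
        if action != "answer":
            state["steps"] += 1
        return action, payload

    return policy
```

Under this wrapper, an eager policy that would answer immediately is forced to take `min_steps` exploration actions first, which is the mechanism behind the reported 2.51% average gain over a minimum of one step.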


More Visualization


Visualization 1
Visualization 2
Visualization 3
Visualization 4

Citation


@article{zhao2026cov,
  title={CoV: Chain-of-View Prompting for Spatial Reasoning},
  author={Zhao, Haoyu and Liu, Akide and Zhang, Zeyu and Wang, Weijie and Chen, Feng and Zhu, Ruihan and Haffari, Gholamreza and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2601.05172},
  year={2026}
}