CoV: Chain-of-View Prompting for Spatial Reasoning


Haoyu Zhao1*  Akide Liu2*  Zeyu Zhang2*  Weijie Wang1* 
Feng Chen3  Ruihan Zhu1  Gholamreza Haffari2  Bohan Zhuang1

1ZIP Lab, Zhejiang University   2Monash University   3AIML, The University of Adelaide

*Equal contribution

Introduction




Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed, finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached.

Pipeline




Coarse-grained View Selection. In real-world scenarios, an agent's visual input typically comes from a continuous video stream whose raw frames contain substantial redundancy. Only a small subset of frames is usually relevant, and excessive irrelevant visual information can distract the model. We introduce a coarse-grained view selection agent that actively identifies and selects the key viewpoints most relevant to the current question from the available raw views, providing a strongly question-aligned visual basis for subsequent reasoning steps.
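A minimal sketch of what such a selection agent could look like. The `Frame`, `cosine`, and `select_anchor_views` names are illustrative assumptions, not the paper's actual implementation; here frame relevance is approximated by cosine similarity between per-frame embeddings and a question embedding, standing in for the VLM's own relevance judgment.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    index: int                  # position in the raw video stream
    features: Sequence[float]   # e.g. an image embedding of the frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def select_anchor_views(frames: List[Frame],
                        question_embedding: Sequence[float],
                        k: int = 4) -> List[Frame]:
    """Rank raw frames by question relevance and keep the top-k as anchors."""
    ranked = sorted(frames,
                    key=lambda f: cosine(f.features, question_embedding),
                    reverse=True)
    # Restore temporal order so the anchors read as a coherent sub-trajectory.
    return sorted(ranked[:k], key=lambda f: f.index)
```

In the actual framework the relevance judgment is made by the VLM itself via prompting; the embedding-similarity scorer above is only a stand-in to make the filtering step concrete.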

Fine-grained View Adjustment. Inspired by chain-of-thought (CoT) reasoning, we introduce a fine-grained view adjustment mechanism. Starting from the visual anchors obtained in the coarse-grained view selection stage, the VLM plans and executes a sequence of view adjustment actions, including translations, rotations, and switching between different viewpoints. Once the agent determines that sufficient information has been gathered to answer the question, it terminates the view adjustment process and produces the final answer based on the carefully constructed visual context.
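The interleaved reason-act loop above can be sketched as follows. The `chain_of_view` function, the action names, and the `policy`/`render` interfaces are assumptions made for illustration: `policy` stands in for the VLM deciding the next camera action from the views gathered so far, and `render` stands in for querying the 3D scene representation for a new observation.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical discrete action vocabulary: translations, rotations,
# switching viewpoints, and a terminal "answer" action.
ACTIONS = ("move_forward", "move_backward",
           "rotate_left", "rotate_right",
           "switch_view", "answer")

Action = Tuple[str, Any]  # (action name, payload such as the final answer)


def chain_of_view(initial_views: List[Any],
                  policy: Callable[[List[Any]], Action],
                  render: Callable[[str], Any],
                  max_steps: int = 8) -> Tuple[Any, List[Any]]:
    """Interleave reasoning (policy) with camera actions until the policy
    emits "answer" or the step budget is exhausted."""
    views = list(initial_views)
    for _ in range(max_steps):
        action, payload = policy(views)
        if action == "answer":
            return payload, views
        # Execute the camera action and fold the new observation into context.
        views.append(render(action))
    # Budget exhausted: answer from whatever context was gathered.
    return policy(views)[1], views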

Evaluation


Table 1
Table 2

Test-time Scaling


Step Distribution
Scaling

Inspired by the budget forcing strategy used in s1, we set a lower bound on the number of action steps in the CoV agent's prompt template. Compared with a minimum of one action step, raising this lower bound yields an average improvement of 2.51%.
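A sketch of how such a lower bound could be enforced programmatically (in the paper it is expressed through the prompt template). The wrapper below is a hypothetical construction: it intercepts an "answer" action issued before `min_steps` exploration steps and substitutes a fallback exploration action instead.

```python
from typing import Any, Callable, List, Tuple

Action = Tuple[str, Any]


def budget_forced_policy(base_policy: Callable[[List[Any]], Action],
                         min_steps: int,
                         fallback_action: str = "switch_view"
                         ) -> Callable[[List[Any]], Action]:
    """Wrap a policy so it cannot answer before `min_steps` actions,
    mirroring s1-style budget forcing."""
    state = {"steps": 0}

    def policy(views: List[Any]) -> Action:
        action, payload = base_policy(views)
        if action == "answer" and state["steps"] < min_steps:
            # Too early to answer: force one more exploration step instead.
            action, payload = fallback_action, None
        if action != "answer":
            state["steps"] += 1
        return action, payload

    return policy
```

Under this wrapper, an eager policy that would answer immediately is forced to take `min_steps` exploration actions first, which is the mechanism behind the reported 2.51% average gain over a minimum of one step.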


More Visualization


Visualization 1
Visualization 2
Visualization 3
Visualization 4

Citation


@article{zhao2026cov,
  title={CoV: Chain-of-View Prompting for Spatial Reasoning},
  author={Zhao, Haoyu and Liu, Akide and Zhang, Zeyu and Wang, Weijie and Chen, Feng and Zhu, Ruihan and Haffari, Gholamreza and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2601.05172},
  year={2026}
}