Embodied active visual search in panoramic worlds

EAGLE-360

Embodied Active Global-to-Local Exploration in 360°

Jingtao Xu¹ · Zizhuo Lin¹ · Jianwen Sun² · Yi Yang¹ · Yawei Luo¹,*

¹Zhejiang University · ²Central China Normal University

*Corresponding author

Interactive task

Challenge: Polar Distortion

Search the panorama and find a pink stool

This interactive demo simulates an MLLM's embodied 360° search process: survey the panorama, rotate the viewing direction, and zoom the FOV to localize the target. Move the pointer across the panorama to read azimuth and elevation. Hold the mouse button to call the perspective projection tool; keep holding and drag to inspect other directions. Scroll while holding to zoom the FOV.


Target query: a pink stool

Task motivation

Object search is different in a 360° world.

A panoramic image gives the model the whole scene at once, but it also changes the geometry of visual search. Targets can be squeezed near the poles, split across the left-right seam, or hidden as tiny evidence inside a large field of view. EAGLE-360 treats these effects as part of the task instead of cropping them away.

Figure: EAGLE-360 teaser showing panoramic target search challenges and global-to-local exploration. The task asks a model to locate a queried object directly in spherical coordinates. Our agent starts from the full panorama, then actively narrows the search with perspective projections.

01

Polar distortion

Objects near the top or bottom of an ERP panorama can be heavily warped, making ordinary crop-based recognition brittle.

02

Edge continuity

The same object may sit across the panorama seam, so left and right image boundaries must be understood as adjacent directions.

03

Hard object search

Small targets are often visible only after the model chooses a useful direction and zoom level for local inspection.
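
The first two challenges follow directly from equirectangular (ERP) geometry, which can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes azimuth in [-180°, 180°) with 0° at the image centre and elevation in [-90°, 90°] with +90° at the top row, and all function names are hypothetical.

```python
import math

def wrap_azimuth(az_deg):
    """Wrap any azimuth into [-180, 180) so the left and right image edges meet."""
    return (az_deg + 180.0) % 360.0 - 180.0

def pixel_to_angles(x, y, width, height):
    """Centre of ERP pixel (x, y) -> (azimuth, elevation) in degrees."""
    az = (x + 0.5) / width * 360.0 - 180.0
    el = 90.0 - (y + 0.5) / height * 180.0
    return az, el

def angles_to_pixel(az_deg, el_deg, width, height):
    """(azimuth, elevation) in degrees -> continuous ERP pixel coordinates."""
    x = (wrap_azimuth(az_deg) + 180.0) / 360.0 * width - 0.5
    y = (90.0 - el_deg) / 180.0 * height - 0.5
    return x, y

def azimuth_gap(a_deg, b_deg):
    """Smallest absolute azimuth difference, measured across the seam if shorter."""
    return abs(wrap_azimuth(a_deg - b_deg))

def horizontal_stretch(el_deg):
    """Factor by which ERP widens content at a given elevation (1 / cos)."""
    return 1.0 / math.cos(math.radians(el_deg))
```

Under these assumptions, azimuths of 179° and -179° are only 2° apart even though they sit at opposite image edges, and an object at 60° elevation appears about twice as wide in the ERP image as it would at the horizon.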

Core idea

From panoramic prior to precise local grounding.

EAGLE-360 studies target object search in complete equirectangular panoramas. Rather than starting from a fixed narrow view, the model first reasons over the global 360° scene, then actively calls a projection tool to inspect local perspective views and refine the final azimuth and elevation.

Input: Full panorama + target query

The model starts from the complete equirectangular image instead of a fixed perspective crop.

Action: Rotate and project

It selects azimuth, elevation, and FOV to inspect local perspective views around promising regions.

Output: Angular coordinates

The final answer reports the target direction as azimuth and elevation in the 360° scene.
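
To make the rotate-and-project action concrete, here is a minimal NumPy sketch of sampling a perspective view from an ERP image. It is an illustration under stated assumptions rather than the released tool: a square pinhole view, nearest-neighbour sampling, azimuth 0° at the image centre increasing to the right, and elevation positive upward. The function name `project_view` and its parameters are hypothetical.

```python
import numpy as np

def project_view(erp, az_deg, el_deg, fov_deg, out_size=256):
    """Sample a square pinhole view from an ERP image (nearest-neighbour)."""
    h, w = erp.shape[:2]
    # Focal length in pixels for the requested field of view.
    f = 0.5 * out_size / np.tan(np.radians(fov_deg) / 2.0)
    # Camera-frame rays: x right, y up, z forward.
    c = np.arange(out_size) + 0.5 - out_size / 2.0
    xs, ys = np.meshgrid(c, -c)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the camera: pitch up by elevation, then yaw right by azimuth.
    el, az = np.radians(el_deg), np.radians(az_deg)
    pitch = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(el), np.sin(el)],
                      [0.0, -np.sin(el), np.cos(el)]])
    yaw = np.array([[np.cos(az), 0.0, np.sin(az)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(az), 0.0, np.cos(az)]])
    world = rays @ (yaw @ pitch).T
    # World rays -> longitude/latitude -> ERP pixel indices.
    lon = np.arctan2(world[..., 0], world[..., 2])
    lat = np.arcsin(np.clip(world[..., 1], -1.0, 1.0))
    cols = np.floor((lon / (2.0 * np.pi) + 0.5) * w).astype(int) % w  # wraps at the seam
    rows = np.clip(np.floor((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
    return erp[rows, cols]
```

Because the column index is taken modulo the image width, a view centred on the seam draws pixels from both image edges, and shrinking `fov_deg` zooms in on a smaller region of the sphere at the same output resolution.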

Method

A global-to-local search loop for 360° scenes.

EAGLE-360 keeps the global panoramic context while giving the model an explicit projection action. Each turn updates the viewing direction and field of view, turning object localization into an active reasoning process over the sphere.

01

Global panoramic reasoning

The model receives the full ERP panorama and forms an initial spherical prior over likely target directions.

02

Active projection

It invokes a rotate-and-project tool to inspect perspective views with controllable azimuth, elevation, and FOV.

03

Fine-grained localization

Through multi-turn refinement, EAGLE-360 outputs the final azimuth and elevation for the target object.
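
The three stages above can be read as a simple control loop. The sketch below is a hypothetical skeleton, not the EAGLE-360 implementation: `Inspect`, `Submit`, `run_search`, the `policy` and `project` callables, and the turn budget are all illustrative assumptions, and the toy policy merely halves the FOV each turn in place of a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

@dataclass
class Inspect:
    """Request a perspective view (all angles in degrees)."""
    azimuth: float
    elevation: float
    fov: float

@dataclass
class Submit:
    """Final answer: the predicted target direction."""
    azimuth: float
    elevation: float

Action = Union[Inspect, Submit]
History = List[Tuple[Inspect, object]]

def run_search(policy: Callable[[History], Action],
               project: Callable[[float, float, float], object],
               max_turns: int = 8) -> Tuple[Optional[Submit], History]:
    """Alternate between policy decisions and projections until a submission."""
    history: History = []
    for _ in range(max_turns):
        action = policy(history)
        if isinstance(action, Submit):
            return action, history
        view = project(action.azimuth, action.elevation, action.fov)
        history.append((action, view))
    return None, history  # no submission within the (assumed) turn budget

def toy_policy(history: History) -> Action:
    """Stand-in for the model: zoom in on a fixed direction, then submit."""
    if not history:
        return Inspect(azimuth=40.0, elevation=-10.0, fov=100.0)
    last, _ = history[-1]
    if last.fov <= 30.0:
        return Submit(last.azimuth, last.elevation)
    return Inspect(last.azimuth, last.elevation, last.fov / 2.0)

answer, trace = run_search(toy_policy, project=lambda az, el, fov: None)
```

In this toy run the policy inspects at 100°, 50°, and 25° FOV before submitting. A real policy would also move the viewing direction based on what each projected view shows.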

Figure: EAGLE-360 training and inference pipeline. The training recipe combines RoPE Rolling, supervised tool-use trajectories, and GRPO to encourage accurate multi-turn spatial refinement over continuous panoramic topology.
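
For readers unfamiliar with GRPO, its core step is to score each sampled trajectory relative to the other trajectories sampled for the same prompt. The sketch below shows that standard group normalization only. The reward values are placeholders, since this page does not specify the reward design, and the function name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same query."""
    r = np.asarray(rewards, dtype=np.float64)
    # Rollouts above the group mean get positive advantage, below get negative.
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one panorama/query pair, with placeholder rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.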

Experiments

Strong search accuracy on the EAGLE-360 benchmark.

We evaluate bFOV accuracy, great-circle distance, thresholded localization, and failure rate across proprietary models, open-source MLLMs, and fine-tuned panoramic agents.

Acc.: 64.44% (final bFOV accuracy)

GCD @ 50°: 96.12% (within the angular threshold)

Fail: 2.05% (low invalid-answer rate)

Acc.: bFOV accuracy, i.e. whether the submitted view covers the target object.

GCD: Great-circle distance between predicted and ground-truth spherical coordinates, measured in degrees.

GCD @ 50°: The proportion of predictions whose GCD is no larger than 50°.

Fail: The rate of invalid, missing, or failed final submissions.
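
The two distance-based metrics can be computed as follows. This is a minimal sketch using the haversine formula, not the released evaluator. It assumes predictions and targets are (azimuth, elevation) pairs in degrees and leaves aside how failed submissions are counted. Function names are hypothetical.

```python
import math

def great_circle_distance(az1, el1, az2, el2):
    """Angle in degrees between two directions given as (azimuth, elevation)."""
    p1, p2 = math.radians(el1), math.radians(el2)
    dlon = math.radians(az2 - az1)
    # Haversine form, numerically stable for small angles.
    a = (math.sin((p2 - p1) / 2.0) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def rate_within(preds, targets, threshold_deg=50.0):
    """Percentage of predictions whose GCD to the target is <= threshold."""
    hits = sum(great_circle_distance(*p, *t) <= threshold_deg
               for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

Because the formula works on the sphere, two directions on either side of the panorama seam (for example azimuths 179° and -179° at the horizon) are correctly measured as about 2° apart, not 358°.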

Method | Acc. (%) ↑ | GCD (°) ↓ | GCD @ 50° (%) ↑ | Fail (%) ↓
Proprietary Models
GPT-4o | 7.50 | 44.26 | 80.00 | 6.39
Gemini-2.5-Pro | 20.28 | 48.89 | 78.61 | 18.33
Open-source Models
Gemma-3-4b-it | 1.39 | 95.23 | 25.00 | 10.83
Gemma-3-12b-it | 2.78 | 88.11 | 30.83 | 12.78
InternVL3.5-4b | 4.72 | 77.70 | 33.06 | 4.17
InternVL3.5-8b | 5.28 | 82.22 | 30.00 | 3.89
Qwen2.5-VL-7B-Instruct | 2.78 | 101.10 | 22.60 | 20.28
Qwen3-VL-4B-Instruct | 8.33 | 54.89 | 57.22 | 6.11
Qwen3-VL-8B-Instruct | 8.33 | 54.72 | 69.17 | 13.06
Fine-tuned Models
HVS-3B | 11.11 | 41.77 | 77.54 | 7.22
EAGLE-360 (w/o FOV) | 39.44 | 17.54 | 94.02 | 1.11
EAGLE-360 (w/o RoPE Rolling) | 46.01 | 16.15 | 94.72 | 2.70
EAGLE-360 (w/o GRPO) | 46.94 | 14.03 | 97.22 | 0.83
EAGLE-360 | 64.44 | 16.89 | 96.12 | 2.05

Ablation study

Adaptive FOV, RoPE Rolling, and GRPO each matter.

The full EAGLE-360 model reaches 64.44% accuracy. Removing adaptive FOV causes the largest drop (25.00 points, to 39.44%), while removing RoPE Rolling or GRPO costs 18.43 and 17.50 points respectively. GRPO slightly worsens the average distance metrics (GCD rises from 14.03° to 16.89° and GCD @ 50° falls from 97.22% to 96.12%) in exchange for much higher final-answer accuracy, which suggests that reinforcement learning mainly improves the active search and submission policy.

Variant | Acc. (%) ↑ | Δ Acc. | GCD (°) ↓ | GCD @ 50° (%) ↑ | Fail (%) ↓
EAGLE-360 (w/o FOV) | 39.44 | -25.00 | 17.54 | 94.02 | 1.11
EAGLE-360 (w/o RoPE Rolling) | 46.01 | -18.43 | 16.15 | 94.72 | 2.70
EAGLE-360 (w/o GRPO) | 46.94 | -17.50 | 14.03 | 97.22 | 0.83
EAGLE-360 | 64.44 | 0.00 | 16.89 | 96.12 | 2.05

Response-step analysis

GRPO learns to inspect before answering.

SFT tends to submit early: most answers appear at Step 2, before the model has enough local evidence. GRPO shifts submission toward later turns, improves bFOV accuracy during intermediate search, and produces larger negative distance deltas, meaning each projection step more often moves the prediction closer to the target.

Figure: Response-step submission rate and bFOV accuracy for SFT and GRPO. GRPO delays submission from early turns to the main search window and reaches much higher bFOV accuracy during tool-use refinement.

Figure: Average distance delta per response step for SFT and GRPO. Negative deltas indicate improvement. GRPO yields stronger distance reductions at early steps and continues to refine after SFT has largely saturated.
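
One plausible way to compute a per-step distance delta is sketched below. It assumes the delta at each step is the change in great-circle distance between the current viewing direction and the target, which this page does not spell out, so treat it as an illustration and not the authors' exact definition. Function names are hypothetical.

```python
import math

def great_circle_distance(az1, el1, az2, el2):
    """Angle in degrees between two (azimuth, elevation) directions."""
    p1, p2 = math.radians(el1), math.radians(el2)
    dlon = math.radians(az2 - az1)
    a = (math.sin((p2 - p1) / 2.0) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def distance_deltas(view_directions, target):
    """Change in distance-to-target between consecutive steps (negative = closer)."""
    dists = [great_circle_distance(az, el, *target) for az, el in view_directions]
    return [later - earlier for earlier, later in zip(dists, dists[1:])]

# A trajectory that approaches a target at (0°, 0°) along the horizon.
deltas = distance_deltas([(60.0, 0.0), (30.0, 0.0), (10.0, 0.0)], target=(0.0, 0.0))
```

For this toy trajectory the deltas are approximately -30° and -20°: both steps move the view closer to the target, with the larger correction coming first.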

Qualitative cases

Multi-turn projection makes the search path inspectable.

These examples show how EAGLE-360 progressively centers the target and reduces FOV. Each trajectory records the model's reasoning, the projection tool call, and the final spherical prediction.

Resources

Paper, code, public test split, and checkpoint are released.

The arXiv paper describes the method and experiments. The public repository contains the evaluator and panoramic model patches. The released Hugging Face dataset includes the public test split, and the model repository hosts the merged EAGLE-360 Qwen3-VL GRPO checkpoint.

Citation

Cite EAGLE-360.

@misc{xu2026eagle360embodiedactiveglobaltolocal,
      title={EAGLE-360: Embodied Active Global-to-Local Exploration in 360$^\circ$},
      author={Jingtao Xu and Zizhuo Lin and Jianwen Sun and Yi Yang and Yawei Luo},
      year={2026},
      eprint={2607.02479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2607.02479},
}