LongFly Framework Boosts UAV Vision-and-Language Navigation Success Rates
Researchers have introduced LongFly, a spatiotemporal context-modeling framework designed to improve long-horizon vision-and-language navigation (VLN) for unmanned aerial vehicles (UAVs). According to the preprint posted on arXiv (ID 2512.22010) in December 2025, the approach raises the success rate by 7.89% and the success-weighted path length metric by 6.33% over existing baselines, in both familiar and novel environments.
Key Challenges in UAV Vision‑and‑Language Navigation
UAVs deployed for post‑disaster search and rescue must process dense visual information, adapt to rapidly shifting viewpoints, and navigate dynamic structures over extended distances. Current navigation methods often struggle to maintain accurate semantic alignment and stable path planning when faced with such long‑horizon spatiotemporal complexities.
Overview of the LongFly Architecture
LongFly adopts a history‑aware modeling strategy that converts fragmented, redundant observations into compact, expressive representations. The framework consists of three primary modules: a slot‑based historical image compression component, a spatiotemporal trajectory encoding unit, and a prompt‑guided multimodal integration layer.
Slot‑Based Historical Image Compression
The first module dynamically distills multi‑view historical images into fixed‑length contextual vectors. By organizing past visual data into slots, the system reduces redundancy while preserving salient features needed for downstream reasoning.
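The paper's abstract does not specify the compression mechanism, but the idea of pooling a variable-length image history into a fixed number of slots can be sketched with simple attention pooling. Everything here (function names, slot count, feature dimension, the use of softmax pooling) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def slot_compress(history_feats, slot_queries, temperature=1.0):
    """Distill a variable-length sequence of historical image features
    into a fixed number of slot vectors via soft attention pooling.
    history_feats: (T, D) features from past multi-view frames.
    slot_queries:  (K, D) slot vectors (random here for illustration;
                   learnable in a real model).
    Returns a (K, D) context whose size is independent of history length T."""
    scores = slot_queries @ history_feats.T / temperature        # (K, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                # softmax over time
    return weights @ history_feats                               # (K, D)

rng = np.random.default_rng(0)
T, K, D = 37, 8, 16                      # 37 past frames squeezed into 8 slots
feats = rng.normal(size=(T, D))
slots = rng.normal(size=(K, D))
ctx = slot_compress(feats, slots)
print(ctx.shape)                         # (8, 16) regardless of T
```

The key property this illustrates is the one the article describes: however many frames accumulate during a long flight, the downstream model always sees a fixed-length, redundancy-reduced summary.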
Spatiotemporal Trajectory Encoding
The second module captures both temporal dynamics and spatial structure of UAV flight paths. It encodes trajectory information to reflect how the vehicle’s position evolves over time, enabling more accurate anticipation of future waypoints.
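One common way to capture both aspects the article mentions, spatial structure and temporal order, is to combine per-step motion deltas with sinusoidal time embeddings. This is a minimal sketch under that assumption; the paper's actual encoder is not described in the abstract:

```python
import numpy as np

def encode_trajectory(positions, d_model=16):
    """Encode a UAV flight path as per-step features that combine
    spatial deltas (how the vehicle moved) with sinusoidal time
    embeddings (when it moved).
    positions: (T, 3) sequence of (x, y, z) waypoints.
    Returns a (T, 3 + d_model) feature sequence."""
    T = positions.shape[0]
    deltas = np.diff(positions, axis=0, prepend=positions[:1])   # (T, 3), first row is zero
    t = np.arange(T)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    time_emb = np.concatenate([np.sin(t * freqs),
                               np.cos(t * freqs)], axis=1)       # (T, d_model)
    return np.concatenate([deltas, time_emb], axis=1)

rng = np.random.default_rng(1)
path = np.cumsum(rng.normal(size=(25, 3)), axis=0)  # a random 25-step flight
enc = encode_trajectory(path)
print(enc.shape)                                    # (25, 19)
```

The deltas reflect spatial evolution while the time embedding keeps each step's position in the sequence, which is the kind of information a waypoint predictor needs to anticipate where the vehicle goes next.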
Prompt‑Guided Multimodal Integration
The final component merges the encoded spatiotemporal context with current sensor inputs using a prompt‑based mechanism. This design supports time‑aware reasoning and robust waypoint prediction, helping the UAV maintain alignment with language instructions.
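The abstract does not detail the prompt mechanism, but the general pattern of letting the current observation attend over history-derived "prompt" tokens can be sketched as follows. The function name, single-head attention, and residual fusion are assumptions for illustration only:

```python
import numpy as np

def prompt_fuse(current_obs, context_tokens):
    """Fuse the live observation with history-derived context tokens
    (e.g. compressed slots plus trajectory codes) via single-head
    attention, returning a context-conditioned representation that a
    waypoint-prediction head could consume.
    current_obs:    (D,) feature of the current camera view.
    context_tokens: (M, D) prompt tokens supplied by the history modules."""
    D = current_obs.shape[0]
    scores = context_tokens @ current_obs / np.sqrt(D)  # (M,) relevance of each token
    w = np.exp(scores - scores.max())
    w /= w.sum()                                        # softmax attention weights
    return current_obs + w @ context_tokens             # (D,) residual fusion

rng = np.random.default_rng(2)
obs = rng.normal(size=(16,))
ctx_tokens = rng.normal(size=(12, 16))
fused = prompt_fuse(obs, ctx_tokens)
print(fused.shape)                                      # (16,)
```

The residual form keeps the current observation dominant while letting relevant pieces of the compressed history bias the prediction, which matches the article's description of time-aware, instruction-aligned waypoint reasoning.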
Experimental Validation
Evaluation on benchmark UAV VLN environments showed that LongFly consistently outperforms state-of-the-art baselines. The reported gains of 7.89% in success rate and 6.33% in success weighted by path length (SPL) held in both seen and unseen test settings, indicating strong generalization.
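For readers unfamiliar with the second metric: success weighted by path length (SPL) discounts each successful episode by how much longer the flown path was than the shortest one. A small computation makes the definition concrete (the episode values below are made up for illustration):

```python
def success_weighted_path_length(successes, shortest, actual):
    """SPL: mean over episodes of s_i * l_i / max(p_i, l_i), where
    s_i is the 0/1 success flag, l_i the shortest-path length to the
    goal, and p_i the length of the path the agent actually flew."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest, actual)]
    return sum(terms) / len(terms)

# episode 1: success via the shortest path        -> contributes 1.0
# episode 2: success, but flew 20 m for a 10 m route -> contributes 0.5
# episode 3: failure                               -> contributes 0.0
spl = success_weighted_path_length([1, 1, 0],
                                   [10.0, 10.0, 10.0],
                                   [10.0, 20.0, 15.0])
print(spl)  # 0.5
```

This is why a model can improve SPL without improving raw success rate (by taking more direct routes), and why the article reports the two numbers separately.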
This report is based on the abstract of the research paper, distributed as an open-access preprint; the full text is available via arXiv.