# Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

**Source:** https://arxiv.org/abs/2602.06382
**Fetched:** 2026-02-13
**Type:** Research Paper

---

## Paper Information

- **arXiv ID:** 2602.06382
- **Authors:** Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie
- **Submission Date:** February 6, 2026

## Abstract

The researchers present "an end-to-end framework for vision-driven humanoid locomotion" that addresses two key challenges: perception noise in sim-to-real transfer and conflicting learning objectives across diverse terrains.

## Core Contribution

The paper proposes an end-to-end approach to humanoid locomotion that operates directly on raw depth pixels, eliminating the need for separate perception and control modules.

## Technical Approach

### Perception Realism

The team developed a high-fidelity depth simulation that captures the "stereo matching artifacts and calibration uncertainties inherent in real-world sensing" (an illustrative noise-augmentation sketch appears at the end of this note).

### Knowledge Transfer: Vision-Aware Behavior Distillation

They propose "vision-aware behavior distillation," which combines latent-space alignment with noise-invariant auxiliary tasks to transfer knowledge from privileged height maps to noisy depth observations (see the distillation-loss sketch at the end of this note).

### Terrain Versatility

The approach integrates "terrain-specific reward shaping" with multi-critic and multi-discriminator learning to handle the distinct dynamics of different terrain types (see the multi-critic sketch at the end of this note).

## Validation

The policy was tested on humanoid platforms equipped with stereo depth cameras, demonstrating capability on extreme challenges (high platforms, wide gaps) as well as fine-grained tasks such as bidirectional staircase traversal.

## Significance

This work advances vision-based locomotion by directly bridging the sim-to-real gap for depth-based perception, enabling humanoid robots to traverse challenging terrain without hand-crafted perception pipelines.
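## Sketch: Simulated Depth-Sensor Noise (illustrative)

The paper's depth-noise model is not reproduced here. The function below is a minimal sketch of how stereo-style artifacts (matching noise near depth edges, dropout holes, calibration drift, disparity-like quantization) might be injected into clean simulated depth; all artifact types, parameter names, and magnitudes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical depth-noise augmentation for clean simulated depth images.
# Artifact types and magnitudes are illustrative, not the paper's noise model.
import numpy as np

def augment_depth(depth, rng=None, hole_prob=0.02, edge_noise_std=0.05,
                  calib_scale_std=0.01, quant_step=0.004):
    """depth: (H, W) array in meters; returns a noisy copy."""
    rng = rng or np.random.default_rng()
    noisy = depth.copy()

    # Calibration uncertainty: small global scale error from imperfect stereo calibration.
    noisy *= 1.0 + rng.normal(0.0, calib_scale_std)

    # Stereo-matching noise that grows near depth discontinuities (edges).
    gy, gx = np.gradient(depth)
    edge_mag = np.sqrt(gx ** 2 + gy ** 2)
    noisy += rng.normal(0.0, edge_noise_std, depth.shape) * np.clip(edge_mag / 0.1, 0.0, 1.0)

    # Random dropout "holes" where stereo matching fails (0 marks invalid pixels).
    holes = rng.random(depth.shape) < hole_prob
    noisy[holes] = 0.0

    # Disparity-style quantization: depth resolution degrades in coarse steps.
    noisy = np.round(noisy / quant_step) * quant_step
    return noisy
```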
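## Sketch: Vision-Aware Behavior Distillation Loss (illustrative)

A minimal sketch of a distillation objective matching the description above: a frozen teacher that sees privileged height maps supervises a depth-pixel student through latent alignment, a noise-invariance term over two independently re-noised views of the same depth frame, and action imitation. The module interfaces, loss terms, and weights are assumptions, not the paper's exact formulation.

```python
# Hypothetical distillation objective: latent alignment + noise invariance + action imitation.
# `teacher` and `student` are assumed callables returning (latent, action).
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, height_map, depth_a, depth_b, proprio,
                      w_latent=1.0, w_inv=0.5, w_action=1.0):
    with torch.no_grad():                        # teacher is frozen during distillation
        z_teacher, a_teacher = teacher(height_map, proprio)

    z_a, a_student = student(depth_a, proprio)   # student sees noisy depth pixels
    z_b, _ = student(depth_b, proprio)           # same scene, independently re-noised

    latent_loss = F.mse_loss(z_a, z_teacher)        # align student latent to privileged latent
    inv_loss = F.mse_loss(z_a, z_b.detach())        # latent should be invariant to sensor noise
    action_loss = F.mse_loss(a_student, a_teacher)  # imitate the teacher's actions

    return w_latent * latent_loss + w_inv * inv_loss + w_action * action_loss
```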
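## Sketch: Multi-Critic Value Function with Terrain-Specific Rewards (illustrative)

A minimal sketch of the multi-critic idea: one value head per terrain type, selected by terrain id at evaluation time, paired with per-terrain reward weights. Terrain names, reward terms, and weights are illustrative guesses; the paper's multi-discriminator component is omitted.

```python
# Hypothetical multi-critic value network plus terrain-specific reward weights.
# Terrain list, reward terms, and weights are illustrative, not taken from the paper.
import torch
import torch.nn as nn

TERRAINS = ["flat", "stairs", "gaps", "platforms"]

class MultiCriticValue(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.critics = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, 1))
            for _ in TERRAINS
        )

    def forward(self, obs, terrain_id):
        # Evaluate every critic, then pick the one matching each environment's terrain.
        values = torch.stack([c(obs).squeeze(-1) for c in self.critics], dim=-1)  # (B, T)
        return values.gather(-1, terrain_id.unsqueeze(-1)).squeeze(-1)            # (B,)

# Terrain-specific reward shaping: same reward terms, different weights per terrain.
REWARD_WEIGHTS = {
    "flat":      {"tracking": 1.0, "feet_air_time": 0.5, "torque": -1e-4},
    "stairs":    {"tracking": 1.0, "feet_air_time": 1.0, "torque": -5e-5},
    "gaps":      {"tracking": 1.5, "feet_air_time": 1.5, "torque": -5e-5},
    "platforms": {"tracking": 1.5, "feet_air_time": 1.0, "torque": -1e-4},
}
```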