# Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

**Source:** https://arxiv.org/abs/2602.06382
**Fetched:** 2026-02-13
**Type:** Research Paper

---

## Paper Information

- **arXiv ID:** 2602.06382
- **Authors:** Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie
- **Submission Date:** February 6, 2026

## Abstract

The researchers present "an end-to-end framework for vision-driven humanoid locomotion" that addresses two key challenges: perception noise in sim-to-real transfer and conflicting learning objectives across diverse terrains.

## Core Contribution

The paper proposes an end-to-end approach to humanoid locomotion that operates directly on raw depth pixels, eliminating the need for separate perception and control modules.

## Technical Approach

### Perception Realism

The team developed a high-fidelity depth simulation that captures the "stereo matching artifacts and calibration uncertainties inherent in real-world sensing" (an illustrative noise-augmentation sketch appears at the end of this note).

### Knowledge Transfer: Vision-Aware Behavior Distillation

They propose "vision-aware behavior distillation," which combines latent-space alignment with noise-invariant auxiliary tasks to transfer knowledge from privileged height maps to noisy depth observations (see the distillation-loss sketch at the end of this note).

### Terrain Versatility

The approach integrates "terrain-specific reward shaping" with multi-critic and multi-discriminator learning to handle the distinct dynamics of different terrain types (see the multi-critic sketch at the end of this note).

## Validation

The policy was tested on humanoid platforms equipped with stereo depth cameras, demonstrating capability on extreme challenges (high platforms, wide gaps) as well as fine-grained tasks such as bidirectional staircase traversal.

## Significance

This work advances vision-based locomotion by directly bridging the sim-to-real gap for depth-based perception, enabling humanoid robots to traverse challenging terrain without hand-crafted perception pipelines.
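## Sketch: Simulated Depth-Sensor Noise (illustrative)

The paper's depth-noise model is not reproduced here. The function below is a minimal sketch of how stereo-style artifacts (matching noise near depth edges, dropout holes, calibration drift, disparity-like quantization) might be injected into clean simulated depth; all artifact types, parameter names, and magnitudes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical depth-noise augmentation for clean simulated depth images.
# Artifact types and magnitudes are illustrative, not the paper's noise model.
import numpy as np

def augment_depth(depth, rng=None, hole_prob=0.02, edge_noise_std=0.05,
                  calib_scale_std=0.01, quant_step=0.004):
    """depth: (H, W) array in meters; returns a noisy copy."""
    rng = rng or np.random.default_rng()
    noisy = depth.copy()

    # Calibration uncertainty: small global scale error from imperfect stereo calibration.
    noisy *= 1.0 + rng.normal(0.0, calib_scale_std)

    # Stereo-matching noise that grows near depth discontinuities (edges).
    gy, gx = np.gradient(depth)
    edge_mag = np.sqrt(gx ** 2 + gy ** 2)
    noisy += rng.normal(0.0, edge_noise_std, depth.shape) * np.clip(edge_mag / 0.1, 0.0, 1.0)

    # Random dropout "holes" where stereo matching fails (0 marks invalid pixels).
    holes = rng.random(depth.shape) < hole_prob
    noisy[holes] = 0.0

    # Disparity-style quantization: depth resolution degrades in coarse steps.
    noisy = np.round(noisy / quant_step) * quant_step
    return noisy
```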
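## Sketch: Vision-Aware Behavior Distillation Loss (illustrative)

A minimal sketch of a distillation objective matching the description above: a frozen teacher that sees privileged height maps supervises a depth-pixel student through latent alignment, a noise-invariance term over two independently re-noised views of the same depth frame, and action imitation. The module interfaces, loss terms, and weights are assumptions, not the paper's exact formulation.

```python
# Hypothetical distillation objective: latent alignment + noise invariance + action imitation.
# `teacher` and `student` are assumed callables returning (latent, action).
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, height_map, depth_a, depth_b, proprio,
                      w_latent=1.0, w_inv=0.5, w_action=1.0):
    with torch.no_grad():                        # teacher is frozen during distillation
        z_teacher, a_teacher = teacher(height_map, proprio)

    z_a, a_student = student(depth_a, proprio)   # student sees noisy depth pixels
    z_b, _ = student(depth_b, proprio)           # same scene, independently re-noised

    latent_loss = F.mse_loss(z_a, z_teacher)        # align student latent to privileged latent
    inv_loss = F.mse_loss(z_a, z_b.detach())        # latent should be invariant to sensor noise
    action_loss = F.mse_loss(a_student, a_teacher)  # imitate the teacher's actions

    return w_latent * latent_loss + w_inv * inv_loss + w_action * action_loss
```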
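## Sketch: Multi-Critic Value Function with Terrain-Specific Rewards (illustrative)

A minimal sketch of the multi-critic idea: one value head per terrain type, selected by terrain id at evaluation time, paired with per-terrain reward weights. Terrain names, reward terms, and weights are illustrative guesses; the paper's multi-discriminator component is omitted.

```python
# Hypothetical multi-critic value network plus terrain-specific reward weights.
# Terrain list, reward terms, and weights are illustrative, not taken from the paper.
import torch
import torch.nn as nn

TERRAINS = ["flat", "stairs", "gaps", "platforms"]

class MultiCriticValue(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.critics = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, 1))
            for _ in TERRAINS
        )

    def forward(self, obs, terrain_id):
        # Evaluate every critic, then pick the one matching each environment's terrain.
        values = torch.stack([c(obs).squeeze(-1) for c in self.critics], dim=-1)  # (B, T)
        return values.gather(-1, terrain_id.unsqueeze(-1)).squeeze(-1)            # (B,)

# Terrain-specific reward shaping: same reward terms, different weights per terrain.
REWARD_WEIGHTS = {
    "flat":      {"tracking": 1.0, "feet_air_time": 0.5, "torque": -1e-4},
    "stairs":    {"tracking": 1.0, "feet_air_time": 1.0, "torque": -5e-5},
    "gaps":      {"tracking": 1.5, "feet_air_time": 1.5, "torque": -5e-5},
    "platforms": {"tracking": 1.5, "feet_air_time": 1.0, "torque": -1e-4},
}
```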