---
id: learning-and-ai
title: "Learning & AI"
status: established
source_sections: "reference/sources/github-unitree-rl-gym.md, reference/sources/github-unitree-rl-lab.md, reference/sources/github-xr-teleoperate.md, reference/sources/paper-bfm-zero.md, reference/sources/paper-gait-conditioned-rl.md"
related_topics: [simulation, locomotion-control, manipulation, sdk-programming, whole-body-control, motion-retargeting, push-recovery-balance]
key_equations: []
key_terms: [gait_conditioned_rl, curriculum_learning, sim_to_real, lerobot, xr_teleoperate, teleoperation]
images: []
examples: []
open_questions:
  - "Optimal reward function design for G1 locomotion"
  - "Training time estimates for different policy types"
  - "How to fine-tune the stock locomotion policy"
  - "LLM-based task planning integration status (firmware v3.2+)"
---

# Learning & AI

Reinforcement learning, imitation learning, and AI-based control for the G1.

## 1. Reinforcement Learning

### Official RL Frameworks

| Framework      | Repository                     | Base Library        | Sim Engine | G1 Support | Tier |
|----------------|--------------------------------|---------------------|------------|-----------:|------|
| unitree_rl_gym | unitreerobotics/unitree_rl_gym | legged_gym + rsl_rl | Isaac Gym  | Yes        | T0   |
| unitree_rl_lab | unitreerobotics/unitree_rl_lab | Isaac Lab           | Isaac Lab  | G1-29dof   | T0   |

### unitree_rl_gym — Complete RL Pipeline

The primary framework for training locomotion policies: [T0]

- **Supported robots:** Go2, H1, H1_2, G1
- **Algorithm:** PPO (via rsl_rl)
- **Training:** Parallel environments, GPU/CPU device selection, checkpoint management
- **Pipeline:** Train → Play → Sim2Sim (MuJoCo validation) → Sim2Real (unitree_sdk2_python)
- **Deployment:** Python scripts and C++ binaries with network interface configuration

### unitree_rl_lab — Isaac Lab Integration

Advanced RL training on NVIDIA Isaac Lab: [T0]

- **Supported robots:** Go2, H1, G1-29dof
- **Simulation backends:** Isaac Lab (NVIDIA) and MuJoCo (cross-sim validation)
- **Deployment:** Simulation → Sim-to-sim → Real robot via unitree_sdk2
- **Language mix:** Python 65.1%, C++ 31.3%

### Key RL Research on G1

| Paper | Contribution | Validated on G1? | Tier |
|-------|-------------|-----------------|------|
| Gait-Conditioned RL (arXiv:2505.20619) | Multi-phase curriculum, gait-specific reward routing | Yes | T1 |
| Getting-Up Policies (arXiv:2502.12152) | Two-stage fall recovery via RL | Yes | T1 |
| HoST (arXiv:2502.08378) | Multi-critic RL for diverse posture recovery | Yes | T1 |
| Fall-Safety (arXiv:2511.07407) | Unified prevention + mitigation + recovery | Yes (zero-shot) | T1 |
| Vision Locomotion (arXiv:2602.06382) | End-to-end depth-based locomotion | Yes | T1 |
| Safe Control (arXiv:2502.02858) | Projected Safe Set for collision avoidance | Yes | T1 |

## 2. Imitation Learning

### Data Collection — Teleoperation

| System             | Device                      | Repository                         | Features                      |
|--------------------|-----------------------------|------------------------------------|-------------------------------|
| XR Teleoperate     | Vision Pro, PICO 4, Quest 3 | unitreerobotics/xr_teleoperate     | Hand tracking, data recording |
| Kinect Teleoperate | Azure Kinect DK             | unitreerobotics/kinect_teleoperate | Body tracking, safety wake-up |

### Training Frameworks

| Framework           | Repository                                | Purpose                                   |
|---------------------|-------------------------------------------|-------------------------------------------|
| unitree_IL_lerobot  | unitreerobotics/unitree_IL_lerobot        | Modified LeRobot for G1 dual-arm training |
| HuggingFace LeRobot | huggingface.co/docs/lerobot/en/unitree_g1 | Standard LeRobot with G1 config           |

**LeRobot G1 integration:** Supports both 29-DOF and 23-DOF versions, includes gr00t_wbc locomotion integration for whole-body control during manipulation tasks. [T1]

### Imitation Learning Workflow

```
1. Teleoperate (XR/Kinect) → record episodes
2. Process data → extract observation-action pairs
3. Train policy (LeRobot / custom) → behavior cloning or diffusion policy
4. Deploy → unitree_sdk2 on real robot
```

## 3. Policy Deployment

### Deployment Options

| Method              | Language | Latency | Use Case                      |
|---------------------|----------|---------|-------------------------------|
| unitree_sdk2_python | Python   | Higher  | Prototyping, research         |
| unitree_sdk2 (C++)  | C++      | Lower   | Production, real-time control |

### Deployment Checklist

1. **Validate in simulation** — Run policy in unitree_mujoco or Isaac Lab
2. **Cross-sim validate** — Test in a second simulator (Sim2Sim)
3. **Low-gain start** — Deploy with reduced gains initially
4. **Tethered testing** — Support robot with a safety harness for first real-world tests
5. **Gradual ramp-up** — Increase to full gains after verifying stability

### Safety Wrappers

When deploying custom policies, add safety layers: [T2 — Best practice]

- Joint limit clamping (see [[equations-and-bounds]])
- Torque saturation limits
- Fall detection with emergency stop
- Velocity bounds for safe walking speeds

## 4. Foundation Models

### BFM-Zero (arXiv:2511.04131)

First behavioral foundation model for real humanoids: [T1]

- **Key innovation:** Promptable control without retraining (reward optimization, pose reaching, motion tracking)
- **Training:** Motion capture data regularization + online off-policy unsupervised RL
- **Validation:** Deployed on G1 hardware
- **Significance:** Enables flexible task specification without policy retraining

### Behavior Foundation Model (arXiv:2509.13780)

- Uses masked online distillation with Conditional Variational Autoencoder (CVAE)
- Models behavioral distributions from large-scale datasets
- Tested on G1 (1.3m, 29-DOF) [T1]

### LLM Integration (Firmware v3.2+)

- Preliminary LLM integration support on EDU models [T2]
- Natural language task commands via Jetson Orin [T2]
- Status and capabilities not yet fully documented — see open questions

## 5. Motion Tracking Policies

RL policies trained to imitate reference motions (from mocap) while maintaining balance: [T1 — Research papers]

| Framework | Paper | Approach | G1 Validated? |
|---|---|---|---|
| BFM-Zero | arXiv:2511.04131 | Foundation model with motion tracking mode | Yes |
| H2O | arXiv:2403.01623 | Real-time human-to-humanoid tracking | Humanoid (not G1 specifically) |
| OmniH2O | arXiv:2406.08858 | Multi-modal input tracking | Humanoid |
| HumanPlus | arXiv:2406.10454 | RGB camera shadow → imitation | Humanoid |

**BFM-Zero** is the most directly G1-relevant: it provides a "motion tracking" mode where the policy receives a reference pose and tracks it while maintaining balance. Zero-shot generalization to unseen motions. Open-source. See [[motion-retargeting]] for the full retargeting pipeline.

**Key insight:** These policies learn to simultaneously track the reference motion AND maintain balance. Push recovery is implicit — the same policy handles both. Training with perturbation curriculum further enhances robustness. See [[push-recovery-balance]].

## 6. Residual Policy Learning

Training a small correction policy on top of an existing base controller: [T1 — Established technique]

```
a_final = a_base + α * a_residual    (α ∈ [0, 1] for safety scaling)
```

- **Base policy:** Stock G1 controller or a pre-trained locomotion policy
- **Residual policy:** Small network trained to improve specific behavior (e.g., push recovery)
- **Scaling factor α:** Limits maximum deviation from base behavior

**Use case for G1:** Enhance the stock controller's push recovery without replacing it entirely. Train the residual in simulation with perturbation curriculum, deploy as an overlay. See [[push-recovery-balance]] §3b.

## 7. Perturbation Curriculum

Training RL policies with progressively increasing external disturbances: [T1 — Multiple G1 papers]

```
Stage 1: No perturbations (learn basic locomotion)
Stage 2: Small random pushes (10-30N, occasional)
Stage 3: Medium pushes (30-80N, more frequent)
Stage 4: Large pushes (80-200N) + terrain variation
Stage 5: Large pushes + concurrent upper-body task
```

This is the primary method for achieving the "always-on balance" goal. Papers arXiv:2505.20619 and arXiv:2511.07407 demonstrate this approach on real G1 hardware. See [[push-recovery-balance]] §3a for detailed parameters.

## Key Relationships

- Trains in: [[simulation]] (MuJoCo, Isaac Lab, Isaac Gym)
- Deploys via: [[sdk-programming]] (unitree_sdk2 DDS interface)
- Controls: [[locomotion-control]] (RL-trained gait policies)
- Controls: [[manipulation]] (learned manipulation policies)
- Data from: [[manipulation]] (teleoperation → imitation learning)
- Enables: [[motion-retargeting]] (RL-based motion tracking policies)
- Enables: [[push-recovery-balance]] (perturbation curriculum, residual policies)
- Coordinated by: [[whole-body-control]] (WBC training frameworks)
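
**Worked sketch (residual composition with joint-limit clamping):** The residual formula from §6 and the joint limit clamping listed under Safety Wrappers in §3 can be combined in a few lines. The following is a minimal sketch under stated assumptions, not code from unitree_sdk2 or any Unitree repository: the function name `compose_action`, the three-joint action size, and the limit values are hypothetical placeholders, and actions are assumed to be joint position targets in radians.

```python
def compose_action(a_base, a_residual, alpha, q_min, q_max):
    """Compute a_final = a_base + alpha * a_residual, then clamp per joint.

    Hypothetical helper for illustration only. All arguments except
    alpha are equal-length sequences of floats (one entry per joint).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Residual composition: alpha scales how far the correction may
    # move the command away from the base controller's output.
    combined = [b + alpha * r for b, r in zip(a_base, a_residual)]

    # Safety wrapper: clamp every joint target into its allowed range.
    return [min(max(a, lo), hi) for a, lo, hi in zip(combined, q_min, q_max)]


# Placeholder limits, NOT G1 specifications.
q_min = [-1.0, -1.0, -1.0]
q_max = [1.0, 1.0, 1.0]

a_base = [0.2, 0.9, -0.5]      # output of the base controller (assumed)
a_residual = [0.4, 0.4, -2.0]  # output of the residual network (assumed)

a_final = compose_action(a_base, a_residual, alpha=0.5, q_min=q_min, q_max=q_max)
print(a_final)
```

With `alpha=0.0` the result reduces to the clamped base action, so α can serve as a fallback switch as well as a scaling factor. Clamping after composition, rather than clamping the residual alone, keeps the final command inside the limits even when the base action is already near a bound.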