| id | title | status | source_sections | related_topics | key_equations | key_terms | images | examples | open_questions |
|---|---|---|---|---|---|---|---|---|---|
| motion-retargeting | Motion Capture & Retargeting | established | reference/sources/paper-bfm-zero.md, reference/sources/paper-h2o.md, reference/sources/paper-omnih2o.md, reference/sources/paper-humanplus.md, reference/sources/dataset-amass-g1.md, reference/sources/github-groot-wbc.md, reference/sources/community-mocap-retarget-tools.md | [whole-body-control joint-configuration simulation learning-and-ai equations-and-bounds push-recovery-balance] | [inverse_kinematics kinematic_scaling] | [motion_retargeting mocap amass smpl kinematic_scaling inverse_kinematics] | [] | [] | [What AMASS motions have been successfully replayed on physical G1? What is the end-to-end latency from mocap capture to robot execution? Which retargeting approach gives best visual fidelity on G1 (IK vs. RL)? Can video-based pose estimation (MediaPipe/OpenPose) provide sufficient accuracy for G1 retargeting?] |
Motion Capture & Retargeting
Capturing human motion and replaying it on the G1, including the kinematic mapping problem, data sources, and execution approaches.
1. The Retargeting Problem
A human has ~200+ degrees of freedom (skeleton + soft tissue). The G1 has 23-43 DOF depending on configuration. Retargeting must solve four mismatches: [T1 — Established robotics problem]
| Mismatch | Human | G1 (29-DOF) | Challenge |
|---|---|---|---|
| DOF count | ~200+ | 29 | Many human motions have no G1 equivalent |
| Limb proportions | Variable | Fixed (1.32m height, 0.6m legs, ~0.45m arms) | Workspace scaling needed |
| Joint ranges | Very flexible | Constrained (e.g., knee 0-165°, hip pitch ±154°) | Motions may exceed limits |
| Dynamics | ~70kg average | ~35kg, different mass distribution | Forces/torques don't scale linearly |
What Works Well on G1
- Walking, standing, stepping motions
- Upper-body gestures (waving, pointing, reaching)
- Pick-and-place style manipulation
- Simple dance or expressive motions
What's Difficult or Impossible
- Motions requiring finger dexterity (without hands attached)
- Deep squats or ground-level motions (joint limit violations)
- Fast acrobatic motions (torque/speed limits)
- Motions requiring more DOF than available (e.g., spine articulation with 1-DOF waist)
2. Retargeting Approaches
2a. IK-Based Retargeting (Classical)
Solve inverse kinematics to map human end-effector positions to G1 joint angles: [T1]
Pipeline:
Mocap data (human skeleton) → Extract key points (hands, feet, head, pelvis)
→ Scale to G1 proportions → Solve IK per frame → Smooth trajectory
→ Check joint limits → Execute or reject
Tools:
- Pinocchio: C++/Python rigid body dynamics with fast IK solver (see whole-body-control)
- MuJoCo IK: Built-in inverse kinematics in MuJoCo simulator
- Drake: MIT's robotics toolbox with optimization-based IK
- IKPy / ikflow: Lightweight Python IK libraries
Pros: Fast, interpretable, no training required, deterministic
Cons: Frame-by-frame IK can produce jerky motions; doesn't account for dynamics/balance; may violate torque limits even when joint limits are satisfied
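The per-frame IK step can be sketched with damped least-squares (Levenberg-Marquardt) on a toy planar 2-link arm — the link lengths, forward kinematics, and target below are illustrative, not the G1 model, which would need Pinocchio or MuJoCo for real FK/Jacobians:

```python
import numpy as np

L1, L2 = 0.22, 0.23  # illustrative link lengths, not actual G1 segments

def fk(q):
    """Planar 2-link forward kinematics: joint angles -> end-effector (x, y)."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Analytic Jacobian of fk with respect to the two joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def ik_step(q, target, damping=1e-2):
    """One damped least-squares IK update; damping keeps the solve
    well-conditioned near singular configurations."""
    err = target - fk(q)
    J = jacobian(q)
    dq = np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
    return q + dq

q = np.array([0.3, 0.5])
target = np.array([0.25, 0.25])  # reachable: |target| < L1 + L2
for _ in range(100):
    q = ik_step(q, target)
assert np.linalg.norm(fk(q) - target) < 1e-4  # converged
```

Running this per mocap frame, seeded from the previous frame's solution, is the "Solve IK per frame" step of the pipeline above; the jerkiness noted in the cons comes from each frame being solved independently.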
2b. Optimization-Based Retargeting
Solve a trajectory optimization over the full motion: [T1]
minimize Σ_t || FK(q_t) - x_human_t ||^2 (tracking error)
+ Σ_t || q_t - q_{t-1} ||^2 (smoothness)
subject to q_min ≤ q_t ≤ q_max (joint limits)
CoM_t ∈ support_polygon_t (balance)
|| tau_t || ≤ tau_max (torque limits)
no self-collision (collision avoidance)
Tools: CasADi, Pinocchio + ProxQP, Drake, Crocoddyl
Pros: Globally smooth, respects all constraints, can enforce balance
Cons: Slow (offline only); requires an accurate dynamics model; formulation is complex
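On a single joint the tracking + smoothness trade-off above reduces to a box-constrained least-squares problem, which a generic solver handles directly; a minimal sketch with SciPy (the reference trajectory, joint limits, and smoothness weight are made up, and the balance/torque/collision constraints are omitted):

```python
import numpy as np
from scipy.optimize import minimize

T = 50
x_human = 1.2 * np.sin(np.linspace(0, 2 * np.pi, T))  # illustrative reference
q_min, q_max = -1.0, 1.0                              # illustrative joint limits
w_smooth = 5.0

def cost(q):
    track = np.sum((q - x_human) ** 2)    # Sigma_t ||q_t - x_human_t||^2
    smooth = np.sum(np.diff(q) ** 2)      # Sigma_t ||q_t - q_{t-1}||^2
    return track + w_smooth * smooth

# L-BFGS-B enforces the joint-limit box constraint q_min <= q_t <= q_max
res = minimize(cost, x0=np.clip(x_human, q_min, q_max),
               bounds=[(q_min, q_max)] * T, method="L-BFGS-B")
q_opt = res.x
assert np.all(q_opt >= q_min - 1e-8) and np.all(q_opt <= q_max + 1e-8)
```

The full problem adds the CoM, torque, and collision constraints per timestep, which is why the real tools (CasADi, Crocoddyl) are built around sparse nonlinear programming rather than a dense generic solver.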
2c. RL-Based Motion Tracking (Recommended for G1)
Train an RL policy that imitates reference motions while maintaining balance: [T1 — Multiple papers validated on G1]
Pipeline:
Mocap data → Retarget to G1 skeleton (rough IK) → Use as reference
→ Train RL policy in sim: reward = tracking + balance + energy
→ Deploy on real G1 via sim-to-real transfer
This is the approach used by BFM-Zero, H2O, OmniH2O, and HumanPlus. The RL policy learns to:
- Track the reference motion as closely as possible
- Maintain balance even when the reference motion would be unstable
- Respect joint and torque limits naturally (they're part of the sim environment)
- Recover from perturbations (if trained with perturbation curriculum)
Key advantage: Balance is baked into the policy — you don't need a separate balance controller.
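The reward = tracking + balance + energy shaping can be sketched as below; the exponential forms and weights are generic assumptions for illustration, not the reward terms of any specific paper:

```python
import numpy as np

def imitation_reward(q, q_ref, com_xy, support_center_xy, tau,
                     w_track=1.0, w_balance=0.5, w_energy=0.01):
    """Illustrative shaped reward for RL motion tracking.
    Forms and weights are assumptions, not taken from BFM-Zero/H2O."""
    r_track = np.exp(-2.0 * np.sum((q - q_ref) ** 2))             # joint tracking
    r_balance = np.exp(-10.0 * np.sum((com_xy - support_center_xy) ** 2))
    r_energy = -np.sum(tau ** 2)                                  # torque penalty
    return w_track * r_track + w_balance * r_balance + w_energy * r_energy
```

Because joint and torque limits are enforced by the simulator itself, the policy never needs an explicit feasibility term — infeasible actions simply produce bad rollouts and low reward.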
Key RL Motion Tracking Frameworks
| Framework | Paper | G1 Validated? | Key Feature |
|---|---|---|---|
| BFM-Zero | arXiv:2511.04131 | Yes | Zero-shot generalization to unseen motions, open-source |
| H2O | arXiv:2403.01623 | On humanoid (not G1 specifically) | Real-time teleoperation |
| OmniH2O | arXiv:2406.08858 | On humanoid | Multi-modal input (VR, RGB, mocap) |
| HumanPlus | arXiv:2406.10454 | On humanoid | RGB camera → shadow → imitate |
| GMT | Generic Motion Tracking | In sim | Tracks diverse AMASS motions |
2d. Hybrid Approach: IK + WBC
Use IK for the upper body, WBC for balance: [T1 — GR00T-WBC approach]
Mocap data → IK retarget (upper body only: arms, waist)
→ Feed to GR00T-WBC as upper-body targets
→ WBC locomotion policy handles legs/balance automatically
→ Execute on G1
This is likely the most practical near-term approach for the G1, using GR00T-WBC as the coordination layer. See whole-body-control for details.
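The split amounts to partitioning the joint vector: the retargeter owns the waist/arm targets, the locomotion policy owns the legs. A sketch, assuming the common 29-DOF ordering of 12 leg joints followed by 3 waist and 14 arm joints — verify the index map against your robot's model file before use:

```python
import numpy as np

# Assumed 29-DOF ordering: legs [0:12], waist [12:15], arms [15:29].
# Check against the actual G1 joint index map (g1.xml / URDF).
LEG_IDX = np.arange(0, 12)
UPPER_IDX = np.arange(12, 29)

def upper_body_targets(q_retargeted):
    """Extract only the waist + arm joints to feed the WBC as targets;
    the legs stay under the locomotion policy's control."""
    return q_retargeted[UPPER_IDX]

q = np.zeros(29)
assert upper_body_targets(q).shape == (17,)
```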
3. Motion Capture Sources
3a. AMASS — Archive of Motion Capture as Surface Shapes
The largest publicly available human motion dataset: [T1]
| Property | Value |
|---|---|
| Motions | 11,000+ sequences from 15 mocap datasets |
| Format | SMPL body model parameters |
| G1 retarget | Available on HuggingFace (unitree) — pre-retargeted |
| License | Research use (check individual sub-datasets) |
G1-specific: Unitree has published AMASS motions retargeted to the G1 skeleton on HuggingFace. This provides ready-to-use reference trajectories for RL training or direct playback.
3b. CMU Motion Capture Database
Classic academic motion capture archive: [T1]
| Property | Value |
|---|---|
| Subjects | 144 subjects |
| Motions | 2,500+ sequences |
| Categories | Walking, running, sports, dance, interaction, etc. |
| Formats | BVH, C3D, ASF+AMC |
| License | Free for research |
| URL | mocap.cs.cmu.edu |
3c. Real-Time Sources (Live Mocap)
| Source | Device | Latency | Accuracy | G1 Integration |
|---|---|---|---|---|
| XR Teleoperate | Vision Pro, Quest 3, PICO 4 | Low (~50ms) | High (VR tracking) | Official (unitreerobotics/xr_teleoperate) |
| Kinect | Azure Kinect DK | Medium (~100ms) | Medium | Official (kinect_teleoperate) |
| MediaPipe | RGB camera | Low (~30ms) | Low-Medium | Community, needs retarget code |
| OpenPose | RGB camera | Medium | Medium | Community, needs retarget code |
| OptiTrack/Vicon | Marker-based system | Very low (~5ms) | Very high | Custom integration needed |
For the user's goal (mocap → robot), the XR teleoperation system is the most direct path for real-time, while AMASS provides offline motion libraries.
3d. Video-Based Pose Estimation
Extract human pose from standard RGB video without mocap hardware: [T2]
- MediaPipe Pose: 33 landmarks, real-time on CPU, Google
- OpenPose: 25 body keypoints, GPU required
- HMR2.0 / 4DHumans: SMPL mesh recovery from single image — richer than keypoints
- MotionBERT: Temporal pose estimation from video sequences
These are lower fidelity than marker-based mocap but require only a webcam. HumanPlus (arXiv:2406.10454) uses RGB camera input specifically for humanoid shadowing.
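Turning estimator keypoints into a joint angle is a simple vector computation; for example, the elbow angle from shoulder/elbow/wrist landmarks (the coordinates below are made up, and real MediaPipe landmarks would need normalization and a depth estimate):

```python
import numpy as np

def joint_angle(a, b, c):
    """Interior angle at b (radians) formed by points a-b-c,
    e.g. shoulder-elbow-wrist for the elbow joint."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards rounding error

shoulder, elbow, wrist = [0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.3, 0.25, 0.0]
angle_deg = np.degrees(joint_angle(shoulder, elbow, wrist))  # 90.0
```

Per-joint angles computed this way can seed the kinematic retargeting step directly for 1-DOF joints like the elbow and knee; 3-DOF joints (hips, shoulders) need full rotations, not a single angle.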
4. The Retargeting Pipeline
End-to-end pipeline from human motion to G1 execution:
┌───────────────┐     ┌───────────────┐     ┌────────────────┐
│ Motion        │     │ Skeleton      │     │ Kinematic      │
│ Source        │────►│ Extraction    │────►│ Retargeting    │
│ (mocap/video) │     │ (SMPL/joints) │     │ (scale + IK)   │
└───────────────┘     └───────────────┘     └───────┬────────┘
                                                    │
                                                    ▼
┌───────────────┐     ┌───────────────┐     ┌────────────────┐
│ Execute on    │     │ WBC / RL      │     │ Feasibility    │
│ Real G1       │◄────│ Policy        │◄────│ Check          │
│ (sdk2)        │     │ (balance +    │     │ (joint limits, │
└───────────────┘     │  tracking)    │     │  stability)    │
                      └───────────────┘     └────────────────┘
Step 1: Motion Source
- Offline: AMASS dataset, CMU mocap, recorded demonstrations
- Real-time: XR headset, Kinect, RGB camera
Step 2: Skeleton Extraction
- AMASS: Already in SMPL format, extract joint angles
- BVH/C3D: Parse standard mocap formats
- Video: Run pose estimator (MediaPipe, OpenPose, HMR2.0)
- Output: Human joint positions/rotations per frame
Step 3: Kinematic Retargeting
- Map human skeleton to G1 skeleton (limb length scaling)
- Solve IK for each frame or use direct joint angle mapping
- Handle DOF mismatch (project higher-DOF human motion to G1 subspace)
- Clamp to G1 joint limits (see equations-and-bounds)
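The scaling and clamping steps above can be sketched as follows; the human segment lengths are rough adult averages and the G1 figures are the approximate values from the mismatch table in section 1, so treat all numbers as assumptions:

```python
import numpy as np

# Illustrative segment lengths (metres): G1 values are the approximate
# figures from section 1; the human values are a rough adult average.
HUMAN_LEG, G1_LEG = 0.90, 0.60
HUMAN_ARM, G1_ARM = 0.70, 0.45

def scale_targets(feet_pos, hand_pos, pelvis_pos):
    """Scale human end-effector positions (relative to the pelvis)
    by the robot/human limb-length ratio before solving IK."""
    feet = pelvis_pos + (feet_pos - pelvis_pos) * (G1_LEG / HUMAN_LEG)
    hands = pelvis_pos + (hand_pos - pelvis_pos) * (G1_ARM / HUMAN_ARM)
    return feet, hands

def clamp_to_limits(q, q_min, q_max):
    """Project retargeted joint angles into the G1's joint limits."""
    return np.clip(q, q_min, q_max)

pelvis = np.zeros(3)
feet, hands = scale_targets(np.array([0.0, 0.0, -0.9]),
                            np.array([0.7, 0.0, 0.0]), pelvis)
# feet z scales to -0.6, hand x scales to 0.45
```

Scaling relative to the pelvis preserves the motion's shape while shrinking its workspace; scaling in absolute coordinates instead would shift the whole motion and break foot-ground contact.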
Step 4: Feasibility Check
- Verify all joint angles within limits
- Check CoM remains within support polygon (static stability)
- Estimate required torques (inverse dynamics) — reject if exceeding actuator limits
- Check for self-collisions
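The first two checks above (joint limits and static CoM stability) can be sketched per frame; the support polygon here is a hypothetical rectangular footprint, and the torque and self-collision checks are omitted:

```python
import numpy as np

def in_convex_polygon(point, vertices):
    """Static stability test: is the CoM ground projection inside a
    convex, counter-clockwise support polygon? Uses edge cross products."""
    v = np.asarray(vertices, dtype=float)
    p = np.asarray(point, dtype=float)
    edges = np.roll(v, -1, axis=0) - v
    to_p = p - v
    cross = edges[:, 0] * to_p[:, 1] - edges[:, 1] * to_p[:, 0]
    return bool(np.all(cross >= 0))

def frame_feasible(q, q_min, q_max, com_xy, support_polygon):
    """Reject a frame if any joint exceeds its limits or the CoM
    leaves the support polygon (torque/collision checks omitted)."""
    if np.any(q < q_min) or np.any(q > q_max):
        return False
    return in_convex_polygon(com_xy, support_polygon)

# Hypothetical two-foot footprint as a CCW rectangle (metres)
poly = [(-0.1, -0.15), (0.1, -0.15), (0.1, 0.15), (-0.1, 0.15)]
assert frame_feasible(np.zeros(3), -np.ones(3), np.ones(3), (0.0, 0.0), poly)
assert not frame_feasible(np.zeros(3), -np.ones(3), np.ones(3), (0.5, 0.0), poly)
```

Note this is only a static check: dynamic motions can be feasible with the CoM momentarily outside the support polygon, which is exactly what the RL/WBC execution policies in Step 5 handle.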
Step 5: Execution Policy
- Direct playback: Send retargeted joint angles via rt/lowcmd (no balance guarantee)
- WBC execution: Feed to GR00T-WBC as upper-body targets, let locomotion policy handle balance
- RL tracking: Use trained motion tracking policy (BFM-Zero style) that simultaneously tracks and balances
Step 6: Deploy on Real G1
- Via unitree_sdk2_python (prototyping) or unitree_sdk2 C++ (production)
- 500 Hz control loop, 2ms DDS latency
- Always validate in simulation first (see simulation)
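A fixed-rate loop for the 500 Hz control cycle can be sketched as below; `send_command` is a placeholder callback (a real deployment would publish `rt/lowcmd` via the SDK, which also provides its own timing utilities):

```python
import time

CONTROL_HZ = 500
DT = 1.0 / CONTROL_HZ  # 2 ms period

def control_loop(n_steps, send_command):
    """Fixed-rate loop skeleton: send one command per 2 ms tick,
    sleeping the remainder of each period. Accumulates against an
    absolute deadline so timing errors don't drift."""
    next_t = time.perf_counter()
    for step in range(n_steps):
        send_command(step)            # placeholder: publish rt/lowcmd here
        next_t += DT
        remaining = next_t - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)

ticks = []
control_loop(100, ticks.append)       # ~0.2 s of simulated control
assert len(ticks) == 100
```

`time.sleep` is not hard-real-time; for production the C++ SDK on a real-time kernel is the appropriate path, as the note above suggests.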
5. SMPL Body Model
SMPL (Skinned Multi-Person Linear model) is the standard representation for human body shape and pose in mocap datasets: [T1]
- Parameters: 72 pose parameters (24 joints x 3 rotations) + 10 shape parameters
- Output: 6,890 vertices mesh + joint locations
- Extensions: SMPL-X (hands + face), SMPL+H (hands)
- Relevance: AMASS uses SMPL, so retargeting from AMASS means mapping SMPL joints → G1 joints
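The 72-parameter pose vector is just 24 stacked axis-angle rotations, so extracting per-joint rotations is a reshape; joint indices below follow the standard SMPL ordering but should be double-checked against the smplx package before use:

```python
import numpy as np

# SMPL pose: 72 values = 24 joints x 3 axis-angle components each.
pose = np.random.default_rng(0).normal(size=72) * 0.1
joint_rots = pose.reshape(24, 3)          # one axis-angle vector per joint

# Standard SMPL ordering (verify against smplx): pelvis=0, l_knee=4, r_knee=5
PELVIS, L_KNEE, R_KNEE = 0, 4, 5
left_knee_aa = joint_rots[L_KNEE]
left_knee_angle = np.linalg.norm(left_knee_aa)  # rotation magnitude (rad)
assert joint_rots.shape == (24, 3)
```

For 1-DOF G1 joints like the knee, the axis-angle magnitude (projected onto the joint axis) is the quantity to map; 3-DOF joints take the full rotation.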
SMPL to G1 Joint Mapping (Approximate)
| SMPL Joint | G1 Joint(s) | Notes |
|---|---|---|
| Pelvis | Waist (yaw) | G1 has 1-3 waist DOF vs. SMPL's 3 |
| L/R Hip | left/right_hip_pitch/roll/yaw | Direct mapping, 3-DOF each |
| L/R Knee | left/right_knee | Direct mapping, 1-DOF |
| L/R Ankle | left/right_ankle_pitch/roll | Direct mapping, 2-DOF |
| L/R Shoulder | left/right_shoulder_pitch/roll/yaw | Direct mapping, 3-DOF |
| L/R Elbow | left/right_elbow | Direct mapping, 1-DOF |
| L/R Wrist | left/right_wrist_yaw(+pitch+roll) | 1-DOF (23-DOF) or 3-DOF (29-DOF) |
| Spine | Waist (limited) | SMPL has 3 spine joints, G1 has 1-3 waist |
| Head/Neck | — | G1 has no head/neck DOF |
| Fingers | Hand joints (if equipped) | Only with Dex3-1 or INSPIRE |
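The table above can be written down as a name-level mapping; the G1 joint names follow the common URDF convention but must be verified against the actual model file (g1.xml) before use:

```python
# Name-level SMPL -> G1 joint mapping mirroring the table above.
# Joint names are assumptions from the common naming convention;
# verify against g1.xml / the URDF before relying on them.
SMPL_TO_G1 = {
    "pelvis":         ["waist_yaw"],
    "left_hip":       ["left_hip_pitch", "left_hip_roll", "left_hip_yaw"],
    "right_hip":      ["right_hip_pitch", "right_hip_roll", "right_hip_yaw"],
    "left_knee":      ["left_knee"],
    "right_knee":     ["right_knee"],
    "left_ankle":     ["left_ankle_pitch", "left_ankle_roll"],
    "right_ankle":    ["right_ankle_pitch", "right_ankle_roll"],
    "left_shoulder":  ["left_shoulder_pitch", "left_shoulder_roll", "left_shoulder_yaw"],
    "right_shoulder": ["right_shoulder_pitch", "right_shoulder_roll", "right_shoulder_yaw"],
    "left_elbow":     ["left_elbow"],
    "right_elbow":    ["right_elbow"],
    "left_wrist":     ["left_wrist_yaw"],
    "right_wrist":    ["right_wrist_yaw"],
    "head":           [],   # G1 has no head/neck DOF
    "neck":           [],
}

n_g1 = sum(len(v) for v in SMPL_TO_G1.values())
assert n_g1 == 23  # matches the 23-DOF configuration (1-DOF wrists)
```

The 29-DOF variant expands the wrist entries to 3-DOF and the waist to 3-DOF; SMPL joints with empty mappings (head, neck, spine beyond the waist) are simply dropped during retargeting.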
6. Key Software & Repositories
| Tool | Purpose | Language | License |
|---|---|---|---|
| GR00T-WBC | End-to-end WBC + retargeting for G1 | Python/C++ | Apache 2.0 |
| Pinocchio | Rigid body dynamics, IK, Jacobians | C++/Python | BSD-2 |
| xr_teleoperate | Real-time VR mocap → G1 | Python | Unitree |
| unitree_mujoco | Simulate retargeted motions | C++/Python | BSD-3 |
| smplx (Python) | SMPL body model processing | Python | MIT |
| rofunc | Robot learning from human demos + retargeting | Python | MIT |
| MuJoCo Menagerie | G1 model (g1.xml) for IK/simulation | MJCF | BSD-3 |
Key Relationships
- Requires: joint-configuration (target skeleton — DOF, joint limits, link lengths)
- Executed via: whole-body-control (WBC provides balance during playback)
- Stabilized by: push-recovery-balance (perturbation robustness during execution)
- Trained in: simulation (RL tracking policies trained in MuJoCo/Isaac)
- Training methods: learning-and-ai (RL, imitation learning frameworks)
- Bounded by: equations-and-bounds (joint limits, torque limits for feasibility)