Paper Review — HumanPlus: Humanoid Shadowing and Imitation from Humans

A paper review and code to get started

In this article, we will explore the paper “HumanPlus” by Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. The goal of this paper was to develop a full-stack framework that enables high-degree-of-freedom humanoids to learn robust whole-body control and autonomous skills from human demonstrations.

First, let’s talk about the motivation for humanoids. Humans have built spaces for other humans to perform tasks, from unloading a dishwasher and washing clothes to working on assembly lines and even more dangerous occupations like handling hazardous materials. Humanoids are well suited to these human-centric tasks because they are designed to mimic the human form, unlike other robots that are not intended to operate smoothly in human environments. Another benefit is that humanoids can take over dangerous jobs that humans currently do, such as handling hazardous materials and working in factories. These already-dangerous settings are also a sensible fit given current technology limitations, since today’s humanoids can themselves be hazardous to the humans around them; the current state of the art in human-robot interaction still has many shortcomings in terms of safety. A final primary motivation is data. Modern machine learning models for robots require large datasets of demonstrations to learn manipulation and whole-body control, but collecting high-quality robot data is expensive and time-consuming. Humans, however, generate abundant motion data in everyday activities, and large-scale human motion datasets already exist, which can be used to train humanoid models far more efficiently.

Paper Overview

HumanPlus Robot Hardware Overview (from the paper)

The hardware they used was a Unitree H1 robot with 33 degrees of freedom in total, including 6 degrees of freedom in each hand and 1 degree of freedom in each wrist, plus two egocentric RGB cameras on the head. The robot weighs about 60kg (roughly 130lbs) and stands about 180cm (5’11”) tall. You can see the rest of the humanoid’s specifications in the image from the HumanPlus paper above.

HumanPlus Pipeline (from the paper)

HumanPlus System Pipeline

Furthermore, the HumanPlus framework is designed to train humanoids using human motion data and to let the humanoid collect its own real-world robot data for autonomous skill learning. The system relies on two decoder-only transformer models: the Humanoid Shadowing Transformer (HST), trained on human motion-capture data, and the Humanoid Imitation Transformer (HIT), trained on real-world teleoperation data for imitation learning.

Data collection starts with teleoperation: a human operator performs a set of tasks in view of a single external RGB camera set up next to the humanoid. The camera feed is passed to pretrained transformers for pose estimation, WHAM for whole-body pose estimation and HaMeR for hand pose estimation. These models convert raw RGB frames into SMPL-X body poses and MANO hand poses in real time, which are then mapped to the humanoid. Because these pose-estimation transformers run in real time at high FPS, data collection needs no special equipment, keeping the process low cost: just one external RGB camera and a single human operator.
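
To make that flow concrete, here is a minimal sketch of the per-frame shadowing loop, assuming hypothetical wrapper functions (estimate_body_pose, estimate_hand_pose, retarget_to_robot, send_target_pose) rather than the actual HumanPlus APIs:

import cv2  # OpenCV is used here just to grab frames from the external RGB camera

# Placeholder wrappers around the components described above (WHAM, HaMeR, retargeting, HST).
# These are illustrative stand-ins, not the real HumanPlus code.
def estimate_body_pose(frame):    # WHAM: RGB frame -> SMPL-X body pose
    ...

def estimate_hand_pose(frame):    # HaMeR: RGB frame -> MANO hand poses
    ...

def retarget_to_robot(body_pose, hand_poses):  # human pose -> 33-DoF robot target pose
    ...

def send_target_pose(target):     # hand the target to the low-level controller (HST)
    ...

camera = cv2.VideoCapture(0)      # the single external RGB camera
while True:
    ok, frame = camera.read()
    if not ok:
        break
    body_pose = estimate_body_pose(frame)
    hand_poses = estimate_hand_pose(frame)
    target = retarget_to_robot(body_pose, hand_poses)
    send_target_pose(target)      # HST tracks this target in real time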

Once the human pose is estimated, it is translated into a form the humanoid can use. HumanPlus does this via real-time retargeting that maps the human’s body configuration, which has many more degrees of freedom, onto the robot’s 33 degrees of freedom. This produces a target pose for the humanoid, which the low-level controller uses to decide how the robot will move.
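
As a rough illustration, here is a simplified version of the retarget_to_robot placeholder from the sketch above, assuming the estimated body and hand poses have already been merged into one dictionary of human joint angles; the joint correspondence and limits shown are made up for illustration, not the paper’s actual mapping:

import numpy as np

# Illustrative correspondence between robot joints and human joint angles.
# The real mapping and joint limits come from the robot's URDF and the paper's retargeting scheme.
JOINT_MAP = {
    "left_elbow": "left_elbow_flexion",
    "right_elbow": "right_elbow_flexion",
    # ... remaining robot joints
}
JOINT_LIMITS = {"left_elbow": (-2.0, 2.0), "right_elbow": (-2.0, 2.0)}  # radians, example values

def retarget_to_robot(human_angles):
    """Copy each corresponding human joint angle onto the robot joint and clip to its limits."""
    target = {}
    for robot_joint, human_joint in JOINT_MAP.items():
        lo, hi = JOINT_LIMITS[robot_joint]
        target[robot_joint] = float(np.clip(human_angles[human_joint], lo, hi))
    return target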

Next, the output target poses are passed to the HST, a low-level controller trained on the AMASS dataset, which contains 40 hours of human motion-capture data. During training, HST learns to convert human joint trajectories into stable humanoid movement using Proximal Policy Optimization (PPO) reinforcement learning in simulation. When deployed on the humanoid, the HST transfers zero-shot, meaning the robot can mimic human motion accurately in real time with no fine-tuning required.
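
For intuition, here is a heavily simplified sketch of what PPO training against motion-capture references could look like; the policy, simulator, amass_clips, and ppo_update interfaces are assumptions, and the toy tracking reward below stands in for the paper’s more detailed reward design:

import numpy as np

def tracking_reward(sim_pose, reference_pose):
    # Toy reward: negative squared tracking error between the simulated and reference poses.
    return -float(np.sum((np.asarray(sim_pose) - np.asarray(reference_pose)) ** 2))

def train_hst(policy, simulator, amass_clips, ppo_update, num_iterations=1000):
    """Simplified loop: roll the policy out against sampled motion clips,
    reward it for tracking the reference, and improve it with PPO."""
    for _ in range(num_iterations):
        clip = amass_clips.sample()                     # reference human motion from AMASS
        obs = simulator.reset(clip)                     # humanoid starts near the clip's first pose
        rollout = []
        for reference_pose in clip:
            action = policy(obs, reference_pose)        # joint position targets
            obs, sim_pose = simulator.step(action)
            rollout.append((obs, action, tracking_reward(sim_pose, reference_pose)))
        ppo_update(policy, rollout)                     # standard PPO policy/value update
    return policy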

The HIT model is trained on data gathered through the system’s real-time teleoperation capability: a human operator shadows the humanoid via the single external RGB camera while the humanoid records its own egocentric binocular RGB camera streams and proprioceptive state (joint angles, torques, and velocities). This creates a dataset from the robot’s own viewpoint, eliminating the need for more time-intensive or expensive data collection methods such as ALOHA, kinesthetic teaching, or a VR headset. The gathered demonstration data is then used to train the HIT, which allows the robot to perform the given tasks autonomously.
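
A minimal sketch of what recording such a demonstration episode might look like, assuming hypothetical robot.get_egocentric_images(), robot.get_proprioception(), and robot.get_current_target_pose() helpers and an HDF5 layout chosen for illustration (not necessarily the repo’s actual format):

import h5py
import numpy as np

def record_episode(robot, num_steps, path="demo_episode.h5"):
    """Log egocentric images, proprioception, and commanded poses for one teleoperated episode."""
    images, states, actions = [], [], []
    for _ in range(num_steps):
        images.append(robot.get_egocentric_images())     # two RGB frames from the binocular head cameras
        states.append(robot.get_proprioception())        # joint angles, velocities, torques
        actions.append(robot.get_current_target_pose())  # the pose the operator commanded
    with h5py.File(path, "w") as f:
        f.create_dataset("observations/images", data=np.asarray(images))
        f.create_dataset("observations/qpos", data=np.asarray(states))
        f.create_dataset("actions", data=np.asarray(actions))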

Transformer Architecture During Inference (from the paper)

Deep Dive Into the Transformer Architecture

First, the Humanoid Imitation Transformer (HIT) is a vision-based high-level policy trained via supervised imitation learning from data collected during teleoperation. In simpler terms, it takes the robot’s egocentric images and proprioception (its current position) and predicts how it should move.

What the HIT takes as input:

  • Two egocentric RGB images, encoded using a pretrained residual network (ResNet) encoder.
  • Proprioception, including joint angles, joint velocities, head pose, and camera pose.
  • Positional embeddings that tell the transformer where each token sits in the input sequence.

Using these inputs, HIT predicts target poses for 50 future timesteps at 25Hz. In other words, the humanoid receives the pose sequence it should perform over the next 2 seconds, which gives a stable action trajectory rather than a single-step action command. Additionally, HIT includes a feature-prediction branch that predicts the future visual features the robot expects to see if it moves according to its predicted actions. The loss function is an L2 loss, or mean squared error, which measures how close the predicted pose sequence and the predicted image features are to the data actually collected during teleoperation. Minimizing this loss teaches the model both to match the recorded future poses and to anticipate what the robot’s cameras should see once its actions are executed; combining proprioception with visual prediction makes the policy more robust and helps it generalize to new environments, since it does not rely on proprioception alone.
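
To make the architecture more concrete, here is a minimal PyTorch-style sketch of a HIT-like policy; the token layout, ResNet-18 backbone, dimensions, and number of layers are illustrative assumptions, not the paper’s exact configuration:

import torch
import torch.nn as nn
import torchvision

class HITLikePolicy(nn.Module):
    """Simplified HIT-style policy: two egocentric images plus proprioception in,
    a chunk of 50 future target poses and predicted future visual features out."""
    def __init__(self, proprio_dim=40, action_dim=33, chunk_size=50, hidden_dim=512):
        super().__init__()
        self.chunk_size = chunk_size
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")   # pretrained ResNet encoder
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])    # drop the classification head
        self.img_proj = nn.Linear(512, hidden_dim)
        self.proprio_proj = nn.Linear(proprio_dim, hidden_dim)
        self.query_embed = nn.Parameter(torch.randn(chunk_size, hidden_dim))  # one query per future step
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)    # self-attention stack
        self.pose_head = nn.Linear(hidden_dim, action_dim)               # future target poses
        self.feature_head = nn.Linear(hidden_dim, 512)                   # future visual features

    def forward(self, images, proprio):
        # images: (B, 2, 3, H, W) egocentric pair; proprio: (B, proprio_dim)
        b = images.shape[0]
        feats = self.backbone(images.flatten(0, 1)).flatten(1)           # (B*2, 512)
        img_tokens = self.img_proj(feats).view(b, 2, -1)                 # two image tokens
        proprio_token = self.proprio_proj(proprio).unsqueeze(1)          # one proprioception token
        queries = self.query_embed.unsqueeze(0).expand(b, -1, -1)        # future-step queries
        tokens = torch.cat([img_tokens, proprio_token, queries], dim=1)
        h = self.transformer(tokens)[:, -self.chunk_size:, :]            # read out the query slots
        return self.pose_head(h), self.feature_head(h)

# Training objective (sketch): L2 / mean-squared-error loss on both heads against teleoperation data,
# e.g. loss = F.mse_loss(pred_poses, future_poses) + F.mse_loss(pred_feats, future_image_features)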

Now, the Humanoid Shadowing Transformer (HST) is a low-level motor controller that converts high-level trajectories from the HIT into stable and precise humanoid movements. The HST runs at 50Hz and outputs joint position setpoints for the torso, hips, knees, shoulders, and hands on the humanoid.

What the HST takes as input:

  • The current proprioception (root pose, joint positions, velocities, and previous actions).
  • The target pose output either by the HIT (during autonomous operation) or by the human pose estimator (during teleoperation).

HST is pretrained entirely in simulation via PPO reinforcement learning using the AMASS dataset. Because it is trained as a generalizable low-level controller, it learns human-like joint-coordination patterns and whole-body poses rather than how to perform specific tasks.

Furthermore, the HST outputs joint-position targets, which are passed to a 1000Hz PD controller that converts them into joint torques. This frequency hierarchy, with HIT at 25Hz, HST at 50Hz, and the PD controller at 1000Hz, enables smooth execution, fast error correction, and real-time stability.
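
As a minimal sketch of the PD step at the bottom of this hierarchy (the gains shown are illustrative assumptions; real gains are per-joint and robot-specific):

import numpy as np

KP, KD = 200.0, 5.0   # illustrative PD gains

def pd_torque(q_target, q, dq, kp=KP, kd=KD):
    """One 1000Hz PD step: convert joint-position targets into joint torques."""
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(dq)

# Frequency hierarchy: HIT replans at 25Hz, HST emits joint targets at 50Hz,
# and the PD loop runs at 1000Hz, so each HST target is held for about 20 PD steps.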

Overall, during teleoperation, the HST enables the robot to mimic the human operator’s movements accurately. During autonomous execution, the HST receives target poses from the HIT and executes them robustly and precisely.

Tasks Performed by HumanPlus (from the paper)

Results

The HumanPlus system is tested on various tasks, both autonomous and teleoperated. The autonomous skills include putting on a shoe and standing up to walk, folding clothes, unloading objects from a warehouse, typing “AI” on a keyboard, and many others. While these tasks may seem simple for humans, each involves many steps, and the humanoid must produce the correct joint positions, torques, and velocities to execute them. For example, putting on a shoe and standing up to walk includes flipping the shoe, picking it up, putting it on, pressing it down to secure the fit, tying the laces, and so on. The HumanPlus system achieves these tasks through its transformer models. Across these tasks, the HIT policy outperforms baselines such as a monocular variant (the same as HIT but with only one head camera, and therefore without the binocular depth cue) and the Action Chunking Transformer (ACT), a prior imitation-learning model used in ALOHA-style bimanual robots, tying ACT only on three tasks: folding clothes, unloading items in a warehouse, and greeting another robot.

Another way the HumanPlus system outperforms previous data-collection baselines (such as ALOHA, kinesthetic teaching, and a Meta Quest VR headset) is cost and efficiency: it ties with kinesthetic teaching for the lowest price among the methods. However, kinesthetic teaching requires physical contact to guide the robot, which raises safety concerns given the robot’s unpredictability. HumanPlus also requires only one operator, whereas the other systems need more, making their data collection less efficient. Finally, HumanPlus is the only one of the four systems that allows whole-body movement; it is faster on every metric of the object-rearranging task, finishing about 1–3 seconds ahead of any other method; it achieves a 100% success rate on the standing-up task; and it is the only system that can rearrange lower objects.

Key Contributions and Overview

The HumanPlus paper makes several key contributions to humanoid learning. The first is an end-to-end framework that enables humanoids to learn directly from human data via pose estimation, teleoperation, and imitation learning, all running in real time with just a single $50 RGB camera, keeping costs low. The second is the two-level decoder-only transformer architecture: the Humanoid Shadowing Transformer (HST) for low-level whole-body control and the Humanoid Imitation Transformer (HIT) for learning autonomous skills from demonstrations. Finally, the low-cost, single-operator setup controls 33 degrees of freedom and achieves state-of-the-art performance, completing some tasks up to 3 seconds faster than previous approaches with a 60–100% success rate across the performed tasks. Overall, HumanPlus outperforms previous methods such as ALOHA, kinesthetic teaching, and VR headsets like the Meta Quest in terms of accuracy, cost, and efficiency.

Future Directions

The authors suggest several future directions. The first is hardware improvements: humanoids need more flexible joints, since the current robot has only one joint per ankle and five joints per arm, and increasing the degrees of freedom would make it more agile and capable of smoother actions. Better cameras would also help, because the current cameras are fixed on the robot’s head, so the hands and other parts of the body go out of view while the robot moves; a movable camera that can track whichever body part the robot is currently using might improve performance. Another suggested improvement is better motion retargeting, meaning a more flexible way to translate human movements to the robot so it can learn from a wider variety of human demonstrations.

How to Install and Use HumanPlus Code

Now we will walk through how to install and run the HumanPlus code. To follow along, go to the HumanPlus GitHub repository: https://github.com/MarkFzp/humanplus

First, open your terminal, clone the repo, and change into the folder:

# 1. Clone the repo
git clone https://github.com/MarkFzp/humanplus.git
cd humanplus

Set Up and Train HST

The commands below install the reinforcement-learning training code (rsl_rl and legged_gym, which runs on top of Isaac Gym) and then train the Humanoid Shadowing Transformer (HST) in simulation using AMASS motions. HST learns whole-body pose tracking, which enables teleoperation and stable control.

cd HST/rsl_rl && pip install -e .
cd ../legged_gym && pip install -e .
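# Train the shadowing policy in simulation (assumes Isaac Gym is already installed for legged_gym)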
python legged_gym/scripts/train.py \
--run_name hst_train \
--headless \
--sim_device cuda:0 \
--rl_device cuda:0

The command below plays back the trained HST policy in simulation so you can see the low-level controller tracking poses.

python legged_gym/scripts/play.py \
--run_name hst_train \
--checkpoint -1 \
--headless \
--sim_device cuda:0 \
--rl_device cuda:0

Set Up HIT Environment

This creates a Python environment and installs the dependencies for HIT, which learns autonomous skills from teleoperation data. It includes Torch, vision modules, MuJoCo, and the DETR vision backbone.

conda create -n HIT python=3.8.10
conda activate HIT
pip install torchvision torch pyquaternion pyyaml rospkg pexpect \
mujoco==2.3.7 dm_control==1.0.14 opencv-python matplotlib \
einops packaging h5py ipython getkey wandb chardet h5py_cache
cd HIT/detr && pip install -e .

This script trains HIT using the robot’s egocentric camera data + proprioception collected during shadowing. HIT learns to predict future poses and future visual features, enabling long-horizon autonomous tasks like folding clothes or rearranging objects.

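# Train HIT on the collected demonstrations; --chunk_size 50 corresponds to the 50 future
# timesteps (about 2 seconds at 25Hz) described in the architecture section above.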
python imitate_episodes_h1_train.py \
--task_name data_fold_clothes \
--policy_class HIT \
--chunk_size 50 \
--hidden_dim 512 \
--dec_layers 6 \
--batch_size 48 \
--lr 1e-5
