Humanoid Touch Dream provides a whole-body learning framework for versatile contact-rich humanoid loco-manipulation.
Abstract.
Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes.
In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid data collection system that combines VR-based teleoperation with human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder–decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks, Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving, HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that combining robust whole-body execution, scalable humanoid data collection, and predictive touch-centered learning enables versatile, high-dexterity humanoid manipulation in the real world.
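The touch dreaming objective described above — behavioral cloning on action chunks, augmented with auxiliary prediction of future hand-joint forces and future tactile latents — can be sketched as a weighted sum of regression losses. This is a minimal illustrative sketch; the shapes, names, and loss weights below are assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical shapes for one training sample (illustrative only).
H = 16   # action-chunk horizon
A = 29   # whole-body action dimension
F = 12   # hand-joint force dimension
Z = 8    # tactile latent dimension

rng = np.random.default_rng(0)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def touch_dreaming_loss(pred_actions, gt_actions,
                        pred_forces, gt_forces,
                        pred_tactile_z, gt_tactile_z,
                        w_force=0.1, w_tactile=0.1):
    """Single-stage BC objective augmented with touch dreaming:
    regress the action chunk and, as auxiliary targets, the future
    hand-joint forces and future tactile latents.
    The weights w_force and w_tactile are assumed, not reported."""
    l_bc = mse(pred_actions, gt_actions)            # behavioral cloning
    l_force = mse(pred_forces, gt_forces)           # dream future forces
    l_tactile = mse(pred_tactile_z, gt_tactile_z)   # dream future tactile latents
    return l_bc + w_force * l_force + w_tactile * l_tactile

# Toy tensors standing in for decoder outputs and demonstration targets.
pa, ga = rng.normal(size=(H, A)), rng.normal(size=(H, A))
pf, gf = rng.normal(size=(H, F)), rng.normal(size=(H, F))
pz, gz = rng.normal(size=(H, Z)), rng.normal(size=(H, Z))
loss = touch_dreaming_loss(pa, ga, pf, gf, pz, gz)
```

Because all three terms share the same Transformer trunk in HTD, gradients from the auxiliary touch targets shape the representation used for action prediction.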
Autonomous Policies
Policy rollouts for cat litter scooping. All videos are played at original speed (1X) unless otherwise noted. Most videos were recorded while the left hand was experiencing intermittent communication failures (see Video 2), shortly before it went fully offline.
Policy rollouts for towel folding. All videos are played at original speed (1X) unless otherwise noted.
Policy rollouts for Insert-T with a clearance of 3.5 mm. All videos are played at original speed (1X) unless otherwise noted.
Policy rollouts for tea serving. All videos are played at original speed (1X).
Policy rollouts for book organization. All videos are played at original speed (1X) unless otherwise noted.
Policy Performance. Comparison of success rate and task score across five contact-rich tasks. HTD (Ours) consistently outperforms ACT baselines with and without touch input, achieving the highest average success rate and task score.
Touch Dreaming Visualization
Explore touch dreaming predictions interactively. The left panel shows the robot's head camera view; the right panel visualizes predicted vs. ground-truth touch signals. Switch between Force, Latent Tactile, and Raw Tactile modes. For the latent tactile heatmaps, each latent dimension is independently normalized over the episode, with a minimum-range threshold derived from the most active dimension across all fingers to distinguish active from inactive latent contact regions. Note that this per-dimension normalization amplifies subtle changes and prediction errors in the latent space for better visibility.
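The per-dimension normalization with a minimum-range floor described above can be sketched as follows. The array layout, the `floor_frac` knob, and the function name are assumptions for illustration, not details from the released visualization code.

```python
import numpy as np

def normalize_latents(latents, floor_frac=0.05):
    """Per-dimension min-max normalization over an episode.

    `latents` has shape (T, fingers, dims). The minimum-range floor is
    a fraction of the largest per-dimension range across all fingers,
    so near-constant (inactive) dimensions are not stretched to full
    scale. `floor_frac` is an assumed knob, not a reported value.
    """
    lo = latents.min(axis=0)                  # (fingers, dims)
    hi = latents.max(axis=0)
    span = hi - lo                            # per-dimension range
    floor = floor_frac * span.max()           # from the most active dimension
    denom = np.maximum(span, floor)           # clamp tiny ranges
    return (latents - lo) / np.maximum(denom, 1e-8)
```

Without the floor, a dimension that barely moves over the episode would be stretched to the full color range and read as spurious contact activity.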
Head Camera (Right Eye)
Ablation Study. Effect of touch dreaming on Insert-T and Towel Folding tasks. Dream Latent Tactile achieves the best overall performance, demonstrating the benefit of predicting future latent tactile representations.
Whole-Body Controller
Whole-body Controller under Teleoperation. All videos are played at original speed (1X).
Metric        Ours              AMO               FALCON
E_v (m/s)     0.1420 ± 0.0568   0.1779 ± 0.0642   0.1641 ± 0.0309
E_ω (rad/s)   0.1806 ± 0.0534   0.1540 ± 0.0316   0.1874 ± 0.0263
E_h (m)       0.0280 ± 0.0438   0.0568 ± 0.0814   0.1299 ± 0.0082
E_y (rad)     0.0126 ± 0.0051   0.1540 ± 0.0534   0.1215 ± 0.0111
E_p (rad)     0.0487 ± 0.1796   0.1519 ± 0.1254   (not tracked)
E_r (rad)     0.0157 ± 0.0065   0.0735 ± 0.0447   (not tracked)
Tracking Error Comparison. Our whole-body controller achieves the lowest tracking error across most metrics compared to AMO (Li et al., RSS 2025) and FALCON (Zhang et al., L4DC 2026). Bold values indicate the best result in each row.
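A tracking-error metric of this form (mean ± standard deviation per quantity) can be computed as the per-step absolute difference between the commanded and measured signal, aggregated over an episode. This is our reading of the E_* metrics, sketched below; the function name and the toy base-height trace are hypothetical.

```python
import numpy as np

def tracking_error(cmd, meas):
    """Mean and std of per-step absolute tracking error.

    Assumes each E_* metric is the absolute command-vs-measurement
    error aggregated over a teleoperation episode (an assumption,
    not a stated definition).
    """
    err = np.abs(np.asarray(cmd) - np.asarray(meas))
    return err.mean(), err.std()

# Toy example: commanded vs. measured base height, i.e. E_h in meters.
cmd_h  = np.array([0.78, 0.78, 0.75, 0.72, 0.72])
meas_h = np.array([0.78, 0.77, 0.74, 0.71, 0.72])
mu, sd = tracking_error(cmd_h, meas_h)
```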
BibTeX
@misc{niu2026htd,
title={Learning Versatile Humanoid Manipulation with Touch Dreaming},
author={Yaru Niu and Zhenlong Fang and Binghong Chen and Shuai Zhou and Revanth Senthilkumaran and Hao Zhang and Bingqing Chen and Chen Qiu and H. Eric Tseng and Jonathan Francis and Ding Zhao},
year={2026},
eprint={2604.13015},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2604.13015},
}