Humanoid Touch Dream provides a whole-body learning framework for versatile contact-rich humanoid loco-manipulation.
Abstract.
Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes.
In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid data collection system that combines VR-based teleoperation with human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder–decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks, Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving, HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that combining robust whole-body execution, scalable humanoid data collection, and predictive touch-centered learning enables versatile, high-dexterity humanoid manipulation in the real world.
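The touch dreaming objective described above — behavioral cloning on action chunks, augmented with auxiliary prediction of future hand-joint forces and future tactile latents — can be sketched as a weighted sum of regression losses. This is a minimal illustrative sketch; the shapes, names, and loss weights below are assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical shapes for one training sample (illustrative only).
H = 16   # action-chunk horizon
A = 29   # whole-body action dimension
F = 12   # hand-joint force dimension
Z = 8    # tactile latent dimension

rng = np.random.default_rng(0)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def touch_dreaming_loss(pred_actions, gt_actions,
                        pred_forces, gt_forces,
                        pred_tactile_z, gt_tactile_z,
                        w_force=0.1, w_tactile=0.1):
    """Single-stage BC objective augmented with touch dreaming:
    regress the action chunk and, as auxiliary targets, the future
    hand-joint forces and future tactile latents.
    The weights w_force and w_tactile are assumed, not reported."""
    l_bc = mse(pred_actions, gt_actions)            # behavioral cloning
    l_force = mse(pred_forces, gt_forces)           # dream future forces
    l_tactile = mse(pred_tactile_z, gt_tactile_z)   # dream future tactile latents
    return l_bc + w_force * l_force + w_tactile * l_tactile

# Toy tensors standing in for decoder outputs and demonstration targets.
pa, ga = rng.normal(size=(H, A)), rng.normal(size=(H, A))
pf, gf = rng.normal(size=(H, F)), rng.normal(size=(H, F))
pz, gz = rng.normal(size=(H, Z)), rng.normal(size=(H, Z))
loss = touch_dreaming_loss(pa, ga, pf, gf, pz, gz)
```

Because all three terms share the same Transformer trunk in HTD, gradients from the auxiliary touch targets shape the representation used for action prediction.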
Autonomous Policies
Policy rollouts for cat litter scooping. All videos are played at original speed (1X) unless otherwise noted. Most videos were recorded while the left hand was experiencing intermittent communication failures (see Video 2), shortly before it went fully offline.
Policy rollouts for towel folding. All videos are played at original speed (1X) unless otherwise noted.
Policy rollouts for Insert-T with a clearance of 3.5 mm. All videos are played at original speed (1X) unless otherwise noted.
Policy rollouts for tea serving. All videos are played at original speed (1X).
Policy rollouts for book organization. All videos are played at original speed (1X) unless otherwise noted.
Policy Performance. Comparison of success rate and task score across five contact-rich tasks. HTD (Ours) consistently outperforms ACT baselines with and without touch input, achieving the highest average success rate and task score.
Touch Dreaming Visualization
Explore touch dreaming predictions interactively. The left panel shows the robot's head camera view; the right panel visualizes predicted vs. ground-truth touch signals. Switch between Force, Latent Tactile, and Raw Tactile modes. For the latent tactile heatmaps, each latent dimension is independently normalized over the episode, with a minimum-range threshold derived from the most active dimension across all fingers to distinguish active from inactive latent contact regions. Note that this per-dimension normalization amplifies subtle changes and prediction errors in the latent space for better visibility.
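The per-dimension normalization with a minimum-range floor described above can be sketched as follows. The array layout, the `floor_frac` knob, and the function name are assumptions for illustration, not details from the released visualization code.

```python
import numpy as np

def normalize_latents(latents, floor_frac=0.05):
    """Per-dimension min-max normalization over an episode.

    `latents` has shape (T, fingers, dims). The minimum-range floor is
    a fraction of the largest per-dimension range across all fingers,
    so near-constant (inactive) dimensions are not stretched to full
    scale. `floor_frac` is an assumed knob, not a reported value.
    """
    lo = latents.min(axis=0)                  # (fingers, dims)
    hi = latents.max(axis=0)
    span = hi - lo                            # per-dimension range
    floor = floor_frac * span.max()           # from the most active dimension
    denom = np.maximum(span, floor)           # clamp tiny ranges
    return (latents - lo) / np.maximum(denom, 1e-8)
```

Without the floor, a dimension that barely moves over the episode would be stretched to the full color range and read as spurious contact activity.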
Head Camera (Right Eye)
Ablation Study. Effect of touch dreaming on Insert-T and Towel Folding tasks. Dream Latent Tactile achieves the best overall performance, demonstrating the benefit of predicting future latent tactile representations.
Whole-Body Controller
Whole-body Controller under Teleoperation. All videos are played at original speed (1X).
Metric        Ours              AMO               FALCON
E_v (m/s)     0.1420 ± 0.0568   0.1779 ± 0.0642   0.1641 ± 0.0309
E_ω (rad/s)   0.1806 ± 0.0534   0.1540 ± 0.0316   0.1874 ± 0.0263
E_h (m)       0.0280 ± 0.0438   0.0568 ± 0.0814   0.1299 ± 0.0082
E_y (rad)     0.0126 ± 0.0051   0.1540 ± 0.0534   0.1215 ± 0.0111
E_p (rad)     0.0487 ± 0.1796   0.1519 ± 0.1254   (not tracked)
E_r (rad)     0.0157 ± 0.0065   0.0735 ± 0.0447   (not tracked)
Tracking Error Comparison. Our whole-body controller achieves the lowest tracking error across most metrics compared to AMO (Li et al., RSS 2025) and FALCON (Zhang et al., L4DC 2026). Bold values indicate the best result in each row.
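A tracking-error metric of this form (mean ± standard deviation per quantity) can be computed as the per-step absolute difference between the commanded and measured signal, aggregated over an episode. This is our reading of the E_* metrics, sketched below; the function name and the toy base-height trace are hypothetical.

```python
import numpy as np

def tracking_error(cmd, meas):
    """Mean and std of per-step absolute tracking error.

    Assumes each E_* metric is the absolute command-vs-measurement
    error aggregated over a teleoperation episode (an assumption,
    not a stated definition).
    """
    err = np.abs(np.asarray(cmd) - np.asarray(meas))
    return err.mean(), err.std()

# Toy example: commanded vs. measured base height, i.e. E_h in meters.
cmd_h  = np.array([0.78, 0.78, 0.75, 0.72, 0.72])
meas_h = np.array([0.78, 0.77, 0.74, 0.71, 0.72])
mu, sd = tracking_error(cmd_h, meas_h)
```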
BibTeX
@misc{niu2026htd,
title={Learning Versatile Humanoid Manipulation with Touch Dreaming},
author={Yaru Niu and Zhenlong Fang and Binghong Chen and Shuai Zhou and Revanth Senthilkumaran and Hao Zhang and Bingqing Chen and Chen Qiu and H. Eric Tseng and Jonathan Francis and Ding Zhao},
year={2026},
eprint={2604.13015},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2604.13015},
}