Baijun Chen5,2*, Weijie Wan6*, Tianxing Chen4*, Xianda Guo7,2*, Congsheng Xu1, Yuanyang Qi3, Haojie Zhang3, Longyan Wu8, Tianling Xu1, Zixuan Li6, Yizhe Wu3, Rui Li3,9, Xiaokang Yang1, Ping Luo4, Wei Sui2,†, and Yao Mu1,†
1ScaleLab, Shanghai Jiao Tong University
2D-Robotics
3ViTai Robotics
4The University of Hong Kong
5Nanjing University
6Shenzhen University
7Wuhan University
8Fudan University
9Tsinghua University
*Equal Contribution †Corresponding Authors
Visuo-tactile perception is critical for contact-rich manipulation, enabling robots to handle intricate tasks where vision alone is insufficient due to occlusion or poor lighting. However, progress has been significantly hindered by the scarcity of diverse tactile datasets and the lack of unified evaluation platforms: existing solutions typically support only a single sensor type or rely on non-standardized benchmarks.
We present UniVTAC, a comprehensive solution to these challenges. By integrating high-fidelity simulation with a robust representation learning framework, UniVTAC enables the development of generalizable tactile policies that transfer effectively to the real world. Our method improves success rates by 17.1 percentage points on simulated benchmarks and by 25 points in real-world sim-to-real transfer over vision-only baselines. Our main contributions are:
- A unified simulation environment supporting varied sensors (GelSight, ViTai, Xense) with intuitive APIs for scalable, contact-rich data generation.
- A pretrained tactile representation model learning shape, contact dynamics, and pose via multi-pathway supervision on synthetic data.
- A systematic suite of 8 diverse manipulation tasks (pose reasoning, shape perception, insertion) to evaluate tactile policies.
The UniVTAC platform is built to democratize visuo-tactile research. It offers a scalable, GPU-accelerated simulation environment capable of generating massive amounts of labeled tactile data. Unlike previous works that target a single sensor, our platform provides unified support for three distinct tactile sensors: GelSight Mini (optical), ViTai GF225 (soft gel-based), and Xense WS (force/torque), ensuring broad applicability across different hardware setups.
Figure: the three supported tactile sensors (GelSight Mini, ViTai GF225, Xense WS).
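For illustration, the snippet below sketches how one might instantiate the simulator and attach a tactile sensor for data generation. All names here (`univtac`, `TactileSim`, `SensorConfig`, `attach_sensor`) are assumptions made for this sketch, not the platform's published API.

```python
# Hypothetical sketch: configure the GPU-accelerated simulator with one of the
# supported tactile sensors. Module and class names are illustrative assumptions.
import univtac

sim = univtac.TactileSim(device="cuda")          # GPU-accelerated physics + rendering

sensor = univtac.SensorConfig(
    model="gelsight_mini",                       # or "vitai_gf225", "xense_ws"
    resolution=(320, 240),                       # assumed tactile image size
    markers=True,                                # render marker array for flow labels
)
sim.attach_sensor(sensor, link="left_fingertip")

obs = sim.reset()
tactile_frame = obs["tactile/left"]              # per-step tactile image used for dataset generation
```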
To further lower the barrier to entry, UniVTAC provides a suite of high-level control APIs. These primitives allow researchers to script complex manipulation sequences without dealing with low-level physics integration; a hypothetical usage sketch follows the list below.
- Adaptive velocity control based on depth feedback to prevent clipping (interpenetration).
- Collision-free trajectory generation via cuRobo.
- Stable object placement using trajectory optimization.
- Safe contact initiation without penetration, so that initial tactile readings are valid.
- Small-scale rotations to induce shear force patterns.
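As an example of how these primitives compose, here is a hypothetical script for a contact-rich grasp-and-insert sequence. The function names (`plan_motion`, `approach`, `make_contact`, `micro_rotate`, `place`) mirror the list above but are assumed for this sketch rather than taken from the actual API.

```python
# Hypothetical composition of the high-level primitives listed above.
# All imported names are illustrative; the platform's real primitives may differ.
from univtac.primitives import plan_motion, approach, make_contact, micro_rotate, place

def insert_tube_demo(robot, tube, hole_pose):
    # Collision-free motion to a pre-grasp pose (cuRobo-style planning).
    plan_motion(robot, target=tube.grasp_pose())

    # Depth-feedback approach: slow down near the object to avoid clipping.
    approach(robot, tube, max_speed=0.05)

    # Initiate contact without penetration so the first tactile frames are valid.
    make_contact(robot, tube, target_force=1.0)

    # Small rotations induce shear patterns that help localize the grasp.
    micro_rotate(robot, axis="z", angle_deg=3.0)

    # Trajectory-optimized placement of the tube into the hole.
    place(robot, tube, goal=hole_pose)
```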
The core innovation of our learning framework is the UniVTAC Encoder. We employ a ResNet-18 backbone initialized with ImageNet weights and fine-tune it with a multi-task objective. By supervising the model on three distinct pathways (Shape, Contact, and Pose), we ensure that the learned representations are not only discriminative for geometry but also aware of physical dynamics and spatial configuration. This disentanglement is crucial for generalization across different objects and tasks; a minimal architectural sketch follows the pathway descriptions below.
- Shape pathway: focuses on recovering the intrinsic geometry of the object, using reconstruction supervision to disentangle object shape from sensor deformation artifacts.
- Contact pathway: models the local interaction dynamics; it is trained to predict surface deformation and marker flow, which are direct proxies for force and slip events.
- Pose pathway: anchors the tactile signals in the global metric space by regressing the pose of the object relative to the sensor, enabling precise manipulation.
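The PyTorch sketch below illustrates the general multi-pathway design: a shared ResNet-18 backbone with three supervision heads. Only the backbone choice and the three supervision targets come from the description above; the head architectures, output resolutions, and loss weights are assumptions made for this sketch.

```python
# Minimal sketch of a multi-pathway tactile encoder: a shared ResNet-18 backbone
# with shape, contact, and pose heads. Head designs and dimensions are assumed,
# not the paper's exact architecture.
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights


class MultiPathwayTactileEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer

        # Shape pathway: decode the embedding back into a coarse tactile depth map.
        self.shape_head = nn.Sequential(
            nn.Linear(feat_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )
        # Contact pathway: predict a 2-channel marker-flow field (proxy for force/slip).
        self.contact_head = nn.Sequential(
            nn.Linear(feat_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 2, kernel_size=3, padding=1),
        )
        # Pose pathway: regress object pose relative to the sensor (xyz + quaternion).
        self.pose_head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 7))

    def forward(self, tactile_rgb):
        z = self.backbone(tactile_rgb).flatten(1)           # (B, 512)
        return {
            "embedding": z,
            "shape": self.shape_head(z),                    # (B, 1, 32, 32)
            "contact_flow": self.contact_head(z),           # (B, 2, 32, 32)
            "pose": self.pose_head(z),                      # (B, 7)
        }


def multi_pathway_loss(out, depth_gt, flow_gt, pose_gt):
    """Example multi-task objective on synthetic labels; the weights are assumptions."""
    l_shape = nn.functional.mse_loss(out["shape"], depth_gt)
    l_contact = nn.functional.mse_loss(out["contact_flow"], flow_gt)
    l_pose = nn.functional.l1_loss(out["pose"], pose_gt)
    return l_shape + l_contact + 0.5 * l_pose
```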
The effectiveness of our encoder is visualized below. The model successfully reconstructs high-fidelity tactile images and accurately infers contact areas, verifying that it has captured the underlying physical properties of the interaction.
To rigorously evaluate tactile policies, we establish the UniVTAC Benchmark, comprising 8 diverse tasks of increasing complexity. They range from simple object classification to intricate insertion operations that require continuous tactile feedback, and fall into three categories (the full task list is enumerated in the sketch below and in the results table):
- Pose reasoning: tasks requiring the robot to infer the precise orientation and position of held objects using only tactile feedback.
- Shape perception: tasks that differentiate objects based on the local geometric features extracted during exploration.
- Insertion: tasks with complex contact dynamics and heavy visual occlusion, where the policy must adjust its actions in real time based on contact forces.
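For reference, the eight tasks (named after the columns of the results table below) could be enumerated programmatically as follows; the `make_env` loader and its arguments are hypothetical placeholders, not the benchmark's actual entry point.

```python
# The eight UniVTAC benchmark tasks, named after the results-table columns.
# make_env() and its keyword arguments are hypothetical placeholders.
from univtac.benchmark import make_env  # assumed entry point

UNIVTAC_TASKS = [
    "lift_bottle", "pull_out_key", "lift_can", "put_bottle",
    "insert_hole", "insert_hdmi", "insert_tube", "grasp_classify",
]

for task in UNIVTAC_TASKS:
    env = make_env(task, sensor="gelsight_mini", seed=0)
    obs = env.reset()
    print(task, obs["tactile/left"].shape)
```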
We conducted extensive evaluations using Action Chunking with Transformers (ACT) and VITaL policies. The results demonstrate that incorporating the UniVTAC encoder consistently boosts performance. In particular, for high-precision tasks such as 'Insert Tube' and 'Insert HDMI', tactile feedback is indispensable and provides a significant advantage over vision-only baselines. A sketch of one plausible fusion scheme follows the results table.
| Method | Lift Bottle | Pull-out Key | Lift Can | Put Bottle | Insert Hole | Insert HDMI | Insert Tube | Grasp Classify | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ACT (Vision Only) | 42.0% | 28.0% | 20.0% | 28.0% | 19.0% | 15.0% | 45.0% | 50.0% | 30.9% |
| VITaL | 72.0% | 47.0% | 8.0% | 32.0% | 25.0% | 6.0% | 34.0% | 100.0% | 40.5% |
| ACT + UniVTAC | 71.0% | 46.0% | 29.0% | 31.0% | 24.0% | 28.0% | 56.0% | 99.0% | 48.0% |
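The sketch below shows one plausible way to condition an ACT-style policy on the frozen UniVTAC embedding: project the 512-d tactile feature to the transformer width and append it as an extra observation token for cross-attention. This is an illustrative design under our own assumptions, not the exact integration used in the experiments.

```python
# Illustrative fusion of a frozen tactile embedding into an ACT-style policy.
# Dimensions, layer counts, and the token-append scheme are assumptions.
import torch
import torch.nn as nn


class TactileConditionedPolicy(nn.Module):
    def __init__(self, d_model=512, chunk_size=20, action_dim=7):
        super().__init__()
        self.tactile_proj = nn.Linear(512, d_model)       # UniVTAC embedding -> extra token
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_queries = nn.Parameter(torch.randn(chunk_size, d_model) * 0.02)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, visual_tokens, tactile_embedding):
        # visual_tokens: (B, N, d_model) from the camera encoder
        # tactile_embedding: (B, 512) from the frozen UniVTAC encoder
        tactile_token = self.tactile_proj(tactile_embedding).unsqueeze(1)
        memory = torch.cat([visual_tokens, tactile_token], dim=1)
        queries = self.action_queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        h = self.decoder(queries, memory)                 # queries cross-attend to vision + touch
        return self.action_head(h)                        # (B, chunk_size, action_dim) action chunk
```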
Validating simulation results on real hardware is the ultimate test of any tactile learning framework. We deployed our trained policies on a real-world system comprising a Tianji Marvin robot equipped with ViTai GF225 sensors. The experimental setup mirrors the simulation environments, specifically targeting the 'Insert Tube', 'Insert USB', and 'Bottle Upright' tasks which involve significant contact uncertainty.
Figure 5 details our physical rig and the objects used. Despite the inevitable domain gap between simulated and real tactile readings, our UniVTAC encoder—trained purely on synthetic data—demonstrated remarkable robustness. Figure 6 highlights key frames from successful execution sequences, showing how the robot utilizes tactile feedback to correct 6D pose errors during insertion.
| Task | Vision Only | Vision + UniVTAC | Gain |
|---|---|---|---|
| Insert Tube | 55.0% | 85.0% | +30.0% |
| Insert USB | 15.0% | 25.0% | +10.0% |
| Bottle Upright | 60.0% | 95.0% | +35.0% |
| Average | 43.3% | 68.3% | +25.0% |
Below are video demonstrations comparing our UniVTAC-enhanced policy against the vision-only baseline.
@misc{chen2026univtacunifiedsimulationplatform,
title={UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking},
author={Baijun Chen and Weijie Wan and Tianxing Chen and Xianda Guo and Congsheng Xu and Yuanyang Qi and Haojie Zhang and Longyan Wu and Tianling Xu and Zixuan Li and Yizhe Wu and Rui Li and Xiaokang Yang and Ping Luo and Wei Sui and Yao Mu},
year={2026},
eprint={2602.10093},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.10093},
}