UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking

Baijun Chen5,2*, Weijie Wan6*, Tianxing Chen4*, Xianda Guo7,2*, Congsheng Xu1, Yuanyang Qi3, Haojie Zhang3, Longyan Wu8, Tianling Xu1, Zixuan Li6, Yizhe Wu3, Rui Li3,9, Xiaokang Yang1, Ping Luo4, Wei Sui2,†, and Yao Mu1,†

1ScaleLab, Shanghai Jiao Tong University 2D-Robotics 3ViTai Robotics 4The University of Hong Kong
5Nanjing University 6Shenzhen University 7Wuhan University 8Fudan University 9Tsinghua University

*Equal Contribution    †Corresponding Authors

UniVTAC Teaser

Overview

Visuo-tactile perception is critical for contact-rich manipulation, enabling robots to handle intricate tasks where vision alone is insufficient due to occlusion or lighting conditions. However, progress has been significantly hindered by the scarcity of diverse tactile datasets and the lack of unified evaluation platforms. Existing solutions often struggle with limited sensor support or non-standardized benchmarks.

We present UniVTAC, a comprehensive solution addressing these challenges. By integrating high-fidelity simulation with a robust representation learning framework, UniVTAC facilitates the development of generalizable tactile policies that transfer effectively to the real world. Compared with vision-only baselines, our approach improves average success rates by +17.1% on the simulated benchmark and by +25% in real-world sim-to-real transfer.

UniVTAC Platform

A unified simulation environment supporting varied sensors (GelSight, ViTai, Xense) with intuitive APIs for scalable, contact-rich data generation.

UniVTAC Encoder

A pretrained tactile representation model learning shape, contact dynamics, and pose via multi-pathway supervision on synthetic data.

UniVTAC Benchmark

A systematic suite of 8 diverse manipulation tasks spanning pose reasoning, shape perception, and contact-rich interaction, designed to evaluate tactile policies.

UniVTAC Platform

The UniVTAC platform is built to democratize visuo-tactile research. It offers a scalable, GPU-accelerated simulation environment capable of generating large volumes of labeled tactile data. Unlike prior work that targets a single sensor, our platform provides unified support for three distinct tactile sensors: GelSight Mini (optical), ViTai GF225 (soft gel-based), and Xense WS (force/torque), ensuring broad applicability across different hardware setups.

UniVTAC Data Generation Pipeline
Figure 1: The UniVTAC data generation pipeline. The system automates scene setup, object randomization, and multi-modal data collection (RGB, depth, tactile), streamlining large-scale dataset creation for policy training.
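
As a concrete illustration, the snippet below sketches how such large-scale data generation might be configured and launched. The univtac package name and the DataGenConfig/DataGenerator interfaces are assumptions made for this example only, not the platform's confirmed API.

```python
# Hedged sketch: batched, GPU-parallel tactile data generation with a
# configurable sensor model. All names here are illustrative, not the
# platform's confirmed API.
import univtac  # hypothetical package name

config = univtac.DataGenConfig(
    sensor="gelsight_mini",              # also: "vitai_gf225", "xense_ws"
    num_envs=256,                        # parallel GPU-simulated environments
    randomize_objects=True,              # randomize object set and poses
    modalities=["rgb", "depth", "tactile"],
)

generator = univtac.DataGenerator(config)
for batch in generator.run(num_episodes=10_000):
    # Each batch carries synchronized multi-modal frames plus simulator
    # ground truth (contact forces, object poses) usable as labels.
    batch.save("data/gelsight_mini/")
```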

Supported Tactile Sensors

GelSight Mini
ViTai GF225
Xense WS

Automated Manipulation APIs

To further lower the barrier to entry, UniVTAC provides a suite of high-level control APIs. These primitives let researchers script complex manipulation sequences without handling low-level physics integration; a usage sketch follows the list below.

Grasp

Adaptive velocity control based on depth feedback to prevent clipping.

Move

Collision-free trajectory generation via cuRobo.

Place

Stable object placement using trajectory optimization.

Probe

Safe contact initiation without penetration, yielding clean tactile readings.

Rotate

Small-scale rotations to induce shear force patterns.
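
The listing below is a minimal sketch of how these primitives might be chained to script an insertion episode. The package name univtac, the make_env constructor, and the exact method signatures are illustrative assumptions rather than the platform's confirmed API.

```python
# Hedged sketch: chaining high-level primitives for a contact-rich episode.
# "univtac", "make_env", and the method signatures below are illustrative
# stand-ins for the platform's actual API.
import univtac  # hypothetical package name

env = univtac.make_env(task="insert_tube", sensor="vitai_gf225")
obs = env.reset()

env.grasp("tube")                          # adaptive-velocity grasp from depth feedback
env.move(target_pose=obs["hole_pose"],     # collision-free trajectory via cuRobo
         offset=[0.0, 0.0, 0.05])
env.probe(axis="z")                        # initiate contact without penetration
env.rotate(axis="z", angle_deg=5.0)        # induce shear to localize the hole rim
env.place("tube", pose=obs["hole_pose"])   # trajectory-optimized placement

rgb, depth, tactile = env.get_observations()  # multi-modal readings for logging
```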

UniVTAC Encoder

The core innovation of our learning framework is the UniVTAC Encoder. We employ a ResNet-18 backbone initialized with ImageNet weights, which is then fine-tuned using a multi-task objective. By supervising the model on three distinct pathways—Shape, Contact, and Pose—we ensure that the learned representations are not only discriminative for geometry but also aware of physical dynamics and spatial configuration. This disentanglement is crucial for generalization across different objects and tasks.

UniVTAC Encoder Architecture
Figure 2: Multi-pathway encoder architecture. The network processes tactile images to simultaneously reconstruct sensor views, estimate contact forces, and predict object pose.
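
A minimal PyTorch sketch of such a multi-pathway encoder is given below, assuming a pooled 512-d ResNet-18 feature and simple illustrative heads; the head sizes, the 64x64 reconstruction target, and the 9-d pose parameterization are assumptions, not the exact UniVTAC architecture.

```python
# Hedged sketch: ResNet-18 trunk with shape / contact / pose heads, in the
# spirit of the multi-pathway design described above. Head shapes are
# illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class MultiPathwayTactileEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        # Shared trunk: everything up to (and including) global average pooling.
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])
        # Shape pathway: reconstruct a (downsampled) tactile view from the feature.
        self.shape_head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64),
        )
        # Contact pathway: predict a 3-D contact-force proxy.
        self.contact_head = nn.Linear(feat_dim, 3)
        # Pose pathway: regress relative object pose (3-D translation + 6-D rotation).
        self.pose_head = nn.Linear(feat_dim, 9)

    def forward(self, tactile_img):
        z = self.trunk(tactile_img).flatten(1)            # (B, 512) shared embedding
        recon = self.shape_head(z).view(-1, 3, 64, 64)    # shape pathway output
        force = self.contact_head(z)                      # contact pathway output
        pose = self.pose_head(z)                          # pose pathway output
        return z, recon, force, pose
```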

Shape Perception

Focuses on recovering the intrinsic geometry of the object. It uses reconstruction supervision to disentangle object shape from sensor deformation artifacts.

Contact Perception

Models the local interaction dynamics. This pathway is trained to predict surface deformation and marker flow, which are direct proxies for force and slip events.

Pose Perception

Anchors the tactile signals in the global metric space by regressing the pose of the object relative to the sensor, enabling precise manipulation.
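
Training on the three pathways can then be expressed as a weighted multi-task objective, sketched below for the encoder sketch above; the loss weights and target keys are assumptions for illustration only.

```python
# Hedged sketch: combined multi-pathway objective. Target formats and the
# equal default weights are assumptions, not the paper's exact recipe.
import torch.nn.functional as F

def multi_pathway_loss(outputs, targets, w_shape=1.0, w_contact=1.0, w_pose=1.0):
    _, recon, force, pose = outputs
    loss_shape = F.mse_loss(recon, targets["tactile_img"])       # reconstruction
    loss_contact = F.mse_loss(force, targets["contact_force"])   # deformation / force proxy
    loss_pose = F.l1_loss(pose, targets["object_pose"])          # relative pose regression
    return w_shape * loss_shape + w_contact * loss_contact + w_pose * loss_pose
```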

The effectiveness of our encoder is visualized below. The model successfully reconstructs high-fidelity tactile images and accurately infers contact areas, verifying that it has captured the underlying physical properties of the interaction.

Tactile Reconstruction Results
Figure 3: Qualitative results of tactile image reconstruction. Top row: Input tactile images. Bottom row: Reconstruction results from the Shape Perception pathway.

UniVTAC Benchmark

To rigorously evaluate tactile policies, we establish the UniVTAC Benchmark, a suite of 8 diverse tasks of increasing complexity, ranging from simple object classification to intricate insertion operations that require continuous tactile feedback. An evaluation sketch follows the task list below.

UniVTAC Benchmark Tasks
Figure 4: The 8 benchmark tasks. Tasks are categorized into Pose Reasoning, Shape Perception, and Contact-Rich Interaction.

Pose Reasoning

Tasks requiring the robot to infer the precise orientation and position of held objects using only tactile feedback.

  • Lift Bottle
  • Lift Can
  • Put Bottle in Shelf

Shape Perception

Focuses on differentiating objects based on their local geometric features extracted during exploration.

  • Grasp Classify

Contact-Rich Interaction

Tasks with complex contact dynamics and heavy visual occlusion, where the policy must adjust its actions in real time based on contact forces.

  • Insert Hole
  • Insert Tube
  • Insert HDMI
  • Pull Out Key
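
The sketch below shows one way a policy could be scored across the eight tasks; the task identifiers and the gym-style environment interface are assumptions mirroring the task names above, not the benchmark's confirmed API.

```python
# Hedged sketch: success-rate evaluation over the eight benchmark tasks.
# Task IDs and the env/step interface are illustrative assumptions.
import univtac  # hypothetical package name

TASKS = [
    "lift_bottle", "lift_can", "put_bottle_in_shelf",              # pose reasoning
    "grasp_classify",                                              # shape perception
    "insert_hole", "insert_tube", "insert_hdmi", "pull_out_key",   # contact-rich
]

def evaluate(policy, episodes_per_task=100):
    results = {}
    for task in TASKS:
        env = univtac.make_env(task=task, sensor="vitai_gf225")
        successes = 0
        for _ in range(episodes_per_task):
            obs, done, info = env.reset(), False, {}
            while not done:
                obs, reward, done, info = env.step(policy(obs))
            successes += int(info.get("success", False))
        results[task] = successes / episodes_per_task
    results["avg"] = sum(results.values()) / len(TASKS)
    return results
```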

Simulation Experiments

We conducted extensive evaluations using Action Chunking with Transformers (ACT) and VITaL policies. The results show that incorporating the UniVTAC encoder consistently boosts performance. In particular, for high-precision tasks such as 'Insert Tube' and 'Insert HDMI', tactile feedback is indispensable and provides a significant advantage over vision-only baselines (a conditioning sketch follows the table below).

| Method | Lift Bottle | Pull-out Key | Lift Can | Put Bottle | Insert Hole | Insert HDMI | Insert Tube | Grasp Classify | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ACT (Vision Only) | 42.0% | 28.0% | 20.0% | 28.0% | 19.0% | 15.0% | 45.0% | 50.0% | 30.9% |
| VITaL | 72.0% | 47.0% | 8.0% | 32.0% | 25.0% | 6.0% | 34.0% | 100.0% | 40.5% |
| ACT + UniVTAC | 71.0% | 46.0% | 29.0% | 31.0% | 24.0% | 28.0% | 56.0% | 99.0% | 48.0% |
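
In the "ACT + UniVTAC" setting, the policy is conditioned on the pretrained tactile embedding. A minimal sketch of one plausible integration is shown below: the frozen encoder's feature is projected and handed to the ACT model as an extra conditioning token. The extra_tokens interface and the dimensions are assumptions; the actual integration may differ.

```python
# Hedged sketch: feeding a frozen UniVTAC-style tactile embedding into an
# ACT-style policy. The base policy interface (extra_tokens) is an assumption.
import torch
import torch.nn as nn

class TactileConditionedACT(nn.Module):
    def __init__(self, tactile_encoder, act_policy, tactile_dim=512, token_dim=256):
        super().__init__()
        self.tactile_encoder = tactile_encoder.eval()    # pretrained, kept frozen
        for p in self.tactile_encoder.parameters():
            p.requires_grad = False
        self.tactile_proj = nn.Linear(tactile_dim, token_dim)
        self.act_policy = act_policy                     # base ACT policy

    def forward(self, rgb, proprio, tactile_img):
        with torch.no_grad():
            z, *_ = self.tactile_encoder(tactile_img)    # (B, 512) tactile embedding
        tactile_token = self.tactile_proj(z)             # project to the policy token size
        # Assumed interface: the ACT policy accepts extra conditioning tokens.
        return self.act_policy(rgb, proprio, extra_tokens=tactile_token)
```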

Real-World Experiments

Validating simulation results on real hardware is the ultimate test of any tactile learning framework. We deployed our trained policies on a real-world system comprising a Tianji Marvin robot equipped with ViTai GF225 sensors. The experimental setup mirrors the simulation environments, specifically targeting the 'Insert Tube', 'Insert USB', and 'Bottle Upright' tasks, which involve significant contact uncertainty.

Figure 5 details our physical rig and the objects used. Despite the inevitable domain gap between simulated and real tactile readings, our UniVTAC encoder—trained purely on synthetic data—demonstrated remarkable robustness. Figure 6 highlights key frames from successful execution sequences, showing how the robot utilizes tactile feedback to correct 6D pose errors during insertion.

Real World Task Setup
Figure 5: Real-world experimental setup. A Tianji Marvin arm is equipped with ViTai sensors to perform manipulation tasks.
Real World Key Frames
Figure 6: Key interaction frames from real-world experiments. The tactile feedback enables the robot to make micro-adjustments for successful insertion.

| Task | Vision Only | Vision + UniVTAC | Gain |
|---|---|---|---|
| Insert Tube | 55.0% | 85.0% | +30.0% |
| Insert USB | 15.0% | 25.0% | +10.0% |
| Bottle Upright | 60.0% | 95.0% | +35.0% |
| Average | 43.3% | 68.3% | +25.0% |

Below are video demonstrations comparing our UniVTAC-enhanced policy against the vision-only baseline.

Insert Tube

UniVTAC Success
Vision Only Failure

Insert USB

UniVTAC Success
Vision Only Failure

Bottle Upright

UniVTAC Success
Vision Only Failure

BibTeX

@misc{chen2026univtacunifiedsimulationplatform,
    title={UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking}, 
    author={Baijun Chen and Weijie Wan and Tianxing Chen and Xianda Guo and Congsheng Xu and Yuanyang Qi and Haojie Zhang and Longyan Wu and Tianling Xu and Zixuan Li and Yizhe Wu and Rui Li and Xiaokang Yang and Ping Luo and Wei Sui and Yao Mu},
    year={2026},
    eprint={2602.10093},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2602.10093}, 
}