UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking

Baijun Chen5,2*, Weijie Wan6*, Tianxing Chen4*, Xianda Guo7,2*, Congsheng Xu1, Yuanyang Qi3, Haojie Zhang3, Longyan Wu8, Tianling Xu1, Zixuan Li6, Yizhe Wu3, Rui Li3,9, Xiaokang Yang1, Ping Luo4, Wei Sui2,†, and Yao Mu1,†

1ScaleLab, Shanghai Jiao Tong University 2D-Robotics 3ViTai Robotics 4The University of Hong Kong
5Nanjing University 6Shenzhen University 7Wuhan University 8Fudan University 9Tsinghua University

*Equal Contribution    †Corresponding Authors

UniVTAC Teaser

Overview

Visuo-tactile perception is critical for contact-rich manipulation, enabling robots to handle intricate tasks where vision alone is insufficient due to occlusion or lighting conditions. However, progress has been significantly hindered by the scarcity of diverse tactile datasets and the lack of unified evaluation platforms. Existing solutions often struggle with limited sensor support or non-standardized benchmarks.

We present UniVTAC, a comprehensive solution addressing these challenges. By integrating high-fidelity simulation with a robust representation learning framework, UniVTAC facilitates the development of generalizable tactile policies that transfer effectively to the real world. Compared with vision-only baselines, our approach improves average success rates by +17.1% on the simulated benchmark and by +25% in real-world sim-to-real transfer.

UniVTAC Platform

A unified simulation environment supporting varied sensors (GelSight, ViTai, Xense) with intuitive APIs for scalable, contact-rich data generation.

UniVTAC Encoder

A pretrained tactile representation model learning shape, contact dynamics, and pose via multi-pathway supervision on synthetic data.

UniVTAC Benchmark

A systematic suite of 8 diverse manipulation tasks spanning pose reasoning, shape perception, and contact-rich interaction, designed to evaluate tactile policies.

UniVTAC Platform

The UniVTAC platform is built to democratize visuo-tactile research. It offers a scalable, GPU-accelerated simulation environment capable of generating large volumes of labeled tactile data. Unlike prior work that targets a single sensor, our platform provides unified support for three distinct tactile sensors: GelSight Mini (optical), ViTai GF225 (soft gel-based), and Xense WS (force/torque), ensuring broad applicability across different hardware setups.

UniVTAC Data Generation Pipeline
Figure 1: The UniVTAC data generation pipeline. The system automates scene setup, object randomization, and multi-modal data collection (RGB, depth, tactile), streamlining large-scale dataset creation for policy training.
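
As a concrete illustration, the snippet below sketches how such large-scale data generation might be configured and launched. The univtac package name and the DataGenConfig/DataGenerator interfaces are assumptions made for this example only, not the platform's confirmed API.

```python
# Hedged sketch: batched, GPU-parallel tactile data generation with a
# configurable sensor model. All names here are illustrative, not the
# platform's confirmed API.
import univtac  # hypothetical package name

config = univtac.DataGenConfig(
    sensor="gelsight_mini",              # also: "vitai_gf225", "xense_ws"
    num_envs=256,                        # parallel GPU-simulated environments
    randomize_objects=True,              # randomize object set and poses
    modalities=["rgb", "depth", "tactile"],
)

generator = univtac.DataGenerator(config)
for batch in generator.run(num_episodes=10_000):
    # Each batch carries synchronized multi-modal frames plus simulator
    # ground truth (contact forces, object poses) usable as labels.
    batch.save("data/gelsight_mini/")
```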

Supported Tactile Sensors

GelSight Mini
ViTai GF225
Xense WS

Automated Manipulation APIs

To further lower the barrier to entry, UniVTAC provides a suite of high-level control APIs. These primitives let researchers script complex manipulation sequences without handling low-level physics integration; a usage sketch follows the list below.

Grasp

Adaptive velocity control based on depth feedback to prevent clipping.

Move

Collision-free trajectory generation via cuRobo.

Place

Stable object placement using trajectory optimization.

Probe

Safe contact initiation without penetration, yielding clean tactile readings.

Rotate

Small-scale rotations to induce shear force patterns.
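
The listing below is a minimal sketch of how these primitives might be chained to script an insertion episode. The package name univtac, the make_env constructor, and the exact method signatures are illustrative assumptions rather than the platform's confirmed API.

```python
# Hedged sketch: chaining high-level primitives for a contact-rich episode.
# "univtac", "make_env", and the method signatures below are illustrative
# stand-ins for the platform's actual API.
import univtac  # hypothetical package name

env = univtac.make_env(task="insert_tube", sensor="vitai_gf225")
obs = env.reset()

env.grasp("tube")                          # adaptive-velocity grasp from depth feedback
env.move(target_pose=obs["hole_pose"],     # collision-free trajectory via cuRobo
         offset=[0.0, 0.0, 0.05])
env.probe(axis="z")                        # initiate contact without penetration
env.rotate(axis="z", angle_deg=5.0)        # induce shear to localize the hole rim
env.place("tube", pose=obs["hole_pose"])   # trajectory-optimized placement

rgb, depth, tactile = env.get_observations()  # multi-modal readings for logging
```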

UniVTAC Encoder

The core innovation of our learning framework is the UniVTAC Encoder. We employ a ResNet-18 backbone initialized with ImageNet weights, which is then fine-tuned using a multi-task objective. By supervising the model on three distinct pathways—Shape, Contact, and Pose—we ensure that the learned representations are not only discriminative for geometry but also aware of physical dynamics and spatial configuration. This disentanglement is crucial for generalization across different objects and tasks.

UniVTAC Encoder Architecture
Figure 2: Multi-pathway encoder architecture. The network processes tactile images to simultaneously reconstruct sensor views, estimate contact forces, and predict object pose.
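
A minimal PyTorch sketch of such a multi-pathway encoder is given below, assuming a pooled 512-d ResNet-18 feature and simple illustrative heads; the head sizes, the 64x64 reconstruction target, and the 9-d pose parameterization are assumptions, not the exact UniVTAC architecture.

```python
# Hedged sketch: ResNet-18 trunk with shape / contact / pose heads, in the
# spirit of the multi-pathway design described above. Head shapes are
# illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class MultiPathwayTactileEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        # Shared trunk: everything up to (and including) global average pooling.
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])
        # Shape pathway: reconstruct a (downsampled) tactile view from the feature.
        self.shape_head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64),
        )
        # Contact pathway: predict a 3-D contact-force proxy.
        self.contact_head = nn.Linear(feat_dim, 3)
        # Pose pathway: regress relative object pose (3-D translation + 6-D rotation).
        self.pose_head = nn.Linear(feat_dim, 9)

    def forward(self, tactile_img):
        z = self.trunk(tactile_img).flatten(1)            # (B, 512) shared embedding
        recon = self.shape_head(z).view(-1, 3, 64, 64)    # shape pathway output
        force = self.contact_head(z)                      # contact pathway output
        pose = self.pose_head(z)                          # pose pathway output
        return z, recon, force, pose
```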

Shape Perception

Focuses on recovering the intrinsic geometry of the object. It uses reconstruction supervision to disentangle object shape from sensor deformation artifacts.

Contact Perception

Models the local interaction dynamics. This pathway is trained to predict surface deformation and marker flow, which are direct proxies for force and slip events.

Pose Perception

Anchors the tactile signals in the global metric space by regressing the pose of the object relative to the sensor, enabling precise manipulation.
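
Training on the three pathways can then be expressed as a weighted multi-task objective, sketched below for the encoder sketch above; the loss weights and target keys are assumptions for illustration only.

```python
# Hedged sketch: combined multi-pathway objective. Target formats and the
# equal default weights are assumptions, not the paper's exact recipe.
import torch.nn.functional as F

def multi_pathway_loss(outputs, targets, w_shape=1.0, w_contact=1.0, w_pose=1.0):
    _, recon, force, pose = outputs
    loss_shape = F.mse_loss(recon, targets["tactile_img"])       # reconstruction
    loss_contact = F.mse_loss(force, targets["contact_force"])   # deformation / force proxy
    loss_pose = F.l1_loss(pose, targets["object_pose"])          # relative pose regression
    return w_shape * loss_shape + w_contact * loss_contact + w_pose * loss_pose
```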

The effectiveness of our encoder is visualized below. The model successfully reconstructs high-fidelity tactile images and accurately infers contact areas, verifying that it has captured the underlying physical properties of the interaction.

Tactile Reconstruction Results
Figure 3: Qualitative results of tactile image reconstruction. Top row: Input tactile images. Bottom row: Reconstruction results from the Shape Perception pathway.

UniVTAC Benchmark

To rigorously evaluate tactile policies, we establish the UniVTAC Benchmark, a suite of 8 diverse tasks of increasing complexity, ranging from simple object classification to intricate insertion operations that require continuous tactile feedback. An evaluation sketch follows the task list below.

UniVTAC Benchmark Tasks
Figure 4: The 8 benchmark tasks. Tasks are categorized into Pose Reasoning, Shape Perception, and Contact-Rich Interaction.

Pose Reasoning

Tasks requiring the robot to infer the precise orientation and position of held objects using only tactile feedback.

  • Lift Bottle
  • Lift Can
  • Put Bottle in Shelf

Shape Perception

Focuses on differentiating objects based on their local geometric features extracted during exploration.

  • Grasp Classify

Contact-Rich Interaction

Tasks with complex contact dynamics and heavy visual occlusion, where the policy must adjust its actions in real time based on contact forces.

  • Insert Hole
  • Insert Tube
  • Insert HDMI
  • Pull Out Key
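
The sketch below shows one way a policy could be scored across the eight tasks; the task identifiers and the gym-style environment interface are assumptions mirroring the task names above, not the benchmark's confirmed API.

```python
# Hedged sketch: success-rate evaluation over the eight benchmark tasks.
# Task IDs and the env/step interface are illustrative assumptions.
import univtac  # hypothetical package name

TASKS = [
    "lift_bottle", "lift_can", "put_bottle_in_shelf",              # pose reasoning
    "grasp_classify",                                              # shape perception
    "insert_hole", "insert_tube", "insert_hdmi", "pull_out_key",   # contact-rich
]

def evaluate(policy, episodes_per_task=100):
    results = {}
    for task in TASKS:
        env = univtac.make_env(task=task, sensor="vitai_gf225")
        successes = 0
        for _ in range(episodes_per_task):
            obs, done, info = env.reset(), False, {}
            while not done:
                obs, reward, done, info = env.step(policy(obs))
            successes += int(info.get("success", False))
        results[task] = successes / episodes_per_task
    results["avg"] = sum(results.values()) / len(TASKS)
    return results
```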

Simulation Experiments

We conducted extensive evaluations using Action Chunking with Transformers (ACT) and VITaL policies. The results show that incorporating the UniVTAC encoder consistently boosts performance. In particular, for high-precision tasks such as 'Insert Tube' and 'Insert HDMI', tactile feedback is indispensable and provides a significant advantage over vision-only baselines (a conditioning sketch follows the table below).

| Method | Lift Bottle | Pull-out Key | Lift Can | Put Bottle | Insert Hole | Insert HDMI | Insert Tube | Grasp Classify | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ACT (Vision Only) | 42.0% | 28.0% | 20.0% | 28.0% | 19.0% | 15.0% | 45.0% | 50.0% | 30.9% |
| VITaL | 72.0% | 47.0% | 8.0% | 32.0% | 25.0% | 6.0% | 34.0% | 100.0% | 40.5% |
| ACT + UniVTAC | 71.0% | 46.0% | 29.0% | 31.0% | 24.0% | 28.0% | 56.0% | 99.0% | 48.0% |
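
In the "ACT + UniVTAC" setting, the policy is conditioned on the pretrained tactile embedding. A minimal sketch of one plausible integration is shown below: the frozen encoder's feature is projected and handed to the ACT model as an extra conditioning token. The extra_tokens interface and the dimensions are assumptions; the actual integration may differ.

```python
# Hedged sketch: feeding a frozen UniVTAC-style tactile embedding into an
# ACT-style policy. The base policy interface (extra_tokens) is an assumption.
import torch
import torch.nn as nn

class TactileConditionedACT(nn.Module):
    def __init__(self, tactile_encoder, act_policy, tactile_dim=512, token_dim=256):
        super().__init__()
        self.tactile_encoder = tactile_encoder.eval()    # pretrained, kept frozen
        for p in self.tactile_encoder.parameters():
            p.requires_grad = False
        self.tactile_proj = nn.Linear(tactile_dim, token_dim)
        self.act_policy = act_policy                     # base ACT policy

    def forward(self, rgb, proprio, tactile_img):
        with torch.no_grad():
            z, *_ = self.tactile_encoder(tactile_img)    # (B, 512) tactile embedding
        tactile_token = self.tactile_proj(z)             # project to the policy token size
        # Assumed interface: the ACT policy accepts extra conditioning tokens.
        return self.act_policy(rgb, proprio, extra_tokens=tactile_token)
```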

Real-World Experiments

Validating simulation results on real hardware is the ultimate test of any tactile learning framework. We deployed our trained policies on a real-world system comprising a Tianji Marvin robot equipped with ViTai GF225 sensors. The experimental setup mirrors the simulation environments, specifically targeting the 'Insert Tube', 'Insert USB', and 'Bottle Upright' tasks, which involve significant contact uncertainty.

Figure 5 details our physical rig and the objects used. Despite the inevitable domain gap between simulated and real tactile readings, our UniVTAC encoder—trained purely on synthetic data—demonstrated remarkable robustness. Figure 6 highlights key frames from successful execution sequences, showing how the robot utilizes tactile feedback to correct 6D pose errors during insertion.

Real World Task Setup
Figure 5: Real-world experimental setup. A Tianji Marvin arm is equipped with ViTai sensors to perform manipulation tasks.
Real World Key Frames
Figure 6: Key interaction frames from real-world experiments. The tactile feedback enables the robot to make micro-adjustments for successful insertion.

| Task | Vision Only | Vision + UniVTAC | Gain |
|---|---|---|---|
| Insert Tube | 55.0% | 85.0% | +30.0% |
| Insert USB | 15.0% | 25.0% | +10.0% |
| Bottle Upright | 60.0% | 95.0% | +35.0% |
| Average | 43.3% | 68.3% | +25.0% |

Below are video demonstrations comparing our UniVTAC-enhanced policy against the vision-only baseline.

Insert Tube

UniVTAC Success
Vision Only Failure

Insert USB

UniVTAC Success
Vision Only Failure

Bottle Upright

UniVTAC Success
Vision Only Failure

BibTeX

@misc{chen2026univtacunifiedsimulationplatform,
    title={UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking}, 
    author={Baijun Chen and Weijie Wan and Tianxing Chen and Xianda Guo and Congsheng Xu and Yuanyang Qi and Haojie Zhang and Longyan Wu and Tianling Xu and Zixuan Li and Yizhe Wu and Rui Li and Xiaokang Yang and Ping Luo and Wei Sui and Yao Mu},
    year={2026},
    eprint={2602.10093},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2602.10093}, 
}