The "Data Thirst" of Embodied Intelligence: Is There a Solution?

Edited by Aya From Gasgoo

Gasgoo Munich – Much like the early days of autonomous driving, embodied intelligence has hit a "data drought" moment.

Analysis suggests training embodied robots requires hundreds of billions of interaction data points. Yet the entire industry currently holds only a few million entries. Set hundreds of billions (on the order of 10^11) against a few million (on the order of 10^6), and the gap is roughly 100,000-fold.

Bridging such a massive chasm is clearly impossible for any single enterprise or institution working alone.

Recognizing this, the sector is shifting away from going it alone, embracing industrial collaboration instead. From startups and giants to local governments, players are flocking to joint data initiatives. The goal: break down data "silos" at the source and fuel the industry's evolution with richer resources.

Cracking the "Data Drought" Requires Collaboration

Recently, the "Embodied AI Open Source Dataset Community" officially launched. Guided by the Ministry of Industry and Information Technology and initiated by the OpenAtom Foundation, the project is led by Leju Robot in partnership with core players including Robbyant, Shanghai Jiao Tong University, and Unitree.

Image source: Leju Robot

Two years ago, this news might have been just a brief item in the embodied intelligence sector. Today, in 2026, it carries a very different weight.

As the first community of its kind backed by a national platform, the project's goal is blunt: shatter the four shackles holding back the industry—data silos, high collection costs, inefficient labeling, and weak model generalization.

Put simply: fixing the embodied "data drought" is beyond the reach of any lone player.

Because the sector relies heavily on the data flywheel, companies have spent years building proprietary collection systems, treating data as a core asset. But entering 2026, the massive gap acts like a mirror, revealing a hard truth: no single company can fill this void on its own.

The industry mindset is shifting subtly. National platforms are stepping up to organize, and even former rivals like Leju, Unitree, and AgiBot are opening up, forming alliances and open-sourcing their datasets.


Image source: TARS

Datasets like AgiBot World, Leju LET, Galbot's DexonomySim, and TARS's WIYH have all been open-sourced. They cover multimodal training, dexterous manipulation for humanoid robots, and whole-body motion.

TARS, for instance, launched the "Embodied Data Spark Plan," aiming to facilitate 100 million hours of data sharing. Meanwhile, Horizon Robotics, D-Robotics, and Wuwen.ai recently jointly announced a plan for an open-source dataset exceeding 10,000 hours.

Why the sudden enthusiasm for open-sourcing datasets?

"For companies, open-sourcing carries little risk," an industry insider noted. "Some may use it to gain influence, but once data is open, it allows for exchange and joint innovation. The value generated can be far greater."

In other words, the scenarios and data a single company can access are inherently limited. Open-sourcing brings in more developers to spot bugs and suggest improvements. This isn't just sharing; it's crowdsourced R&D.

If corporate open-sourcing represents horizontal collaboration by market forces, then government intervention represents vertical infrastructure investment.

According to incomplete statistics from Gasgoo Auto Institute, local government orders for data collection robots exceeded 1 billion yuan in 2025 alone.

Data from research firm Interact Analysis shows that by the end of 2025, more than 50 national or regional humanoid robot data centers were operational or in planning across China, spread over roughly 19 provinces. More than half of them were already in use.

In terms of scale, facilities like Shanghai Zhangjiang Robot Valley and the Beijing Shijingshan Embodied Intelligence Comprehensive Training Ground have each deployed nearly 100 data collection robots.

Behind these numbers lies a clear judgment: data collection is shifting from a "corporate activity" to a "government project."

Image source: JD.com

But it is JD.com that has truly taken this collective action to its peak.

JD.com recently announced plans to accumulate 5 million hours of real-world human video data within a year, surpassing 10 million hours in two years. It also aims to collect 1 million hours of robot body data simultaneously—a scale that leaves most companies in the dust.

The company has already built a leading robotics data center, establishing a full-process pipeline covering collection, labeling, training, and validation.

Even more striking is the mobilization scale: JD.com plans to involve hundreds of thousands of people. This includes over 100,000 internal employees across various roles and up to 500,000 external industry personnel. In Suqian alone, more than 100,000 citizens will participate. From homes and offices to logistics hubs, retail stores, and medical facilities, the initiative covers over 100 specific scenarios.

If successfully executed, this plan could become "the largest data collection campaign in human history."

Yet, amidst the buzz, a question arises: the industry has long known data is critical for embodied intelligence, so why is it only reaching this fever pitch now?

According to analysts at Gasgoo Auto Institute, it's because motion control for embodied robots has matured. The lack of real-world data is now the biggest bottleneck in training a general-purpose "brain."

For the past two years, the sector focused on hardware and motion breakthroughs—getting robots to walk or run stably, or grasp objects more flexibly. These issues are gradually being solved. As robot bodies become more agile, their "brains" are struggling to keep up.

And to train a truly general-purpose robot "brain," massive amounts of high-quality data are the essential fuel.

However, while JD.com's data collection plan has gone viral, it hasn't been without skepticism.

"Using real business scenarios and 'human wave tactics' to get massive data is theoretically feasible and hits the pain point of the data drought," the Gasgoo analyst argued. "But success hinges on capturing high-quality motion data that includes force and tactile feedback. Otherwise, it risks becoming a pile of inefficient video data."

That remark hits at the core of embodied data collection: scale does not equal quality, and raw video alone is not usable training data.

Hundreds of thousands of people wearing devices while shopping or delivering packages will generate massive visual data. That might teach a robot "what is a door" or "what is an apple," but can it teach them "how much force to apply when holding an egg so it doesn't break"?

The answer remains unknown.

Even with Data, How You Use It Matters More

For embodied intelligence, the industry's shift from fragmentation to collaboration solves the question of where data comes from.

Beneath the surface, a deeper convergence is happening: the boundaries between different data technology routes are blurring.

At NVIDIA GTC 2026, Chelsea Finn, co-founder of Physical Intelligence (PI), stated plainly that many assumed robots with human-like forms would best learn from human videos. In reality, when robot data is sufficiently diverse, models find it easier to bridge the gap between "human data" and "robot data."

"So we don't just use real robot data; we use other sources, especially web and human videos," Finn said. "The goal is to train a model with true generalization capability—one that works across different forms, environments, and tasks."

It sounds complex, but it translates to one thing: don't bet everything on a single data source.

The analyst at Gasgoo Auto Institute agrees. While portable UMI (Universal Manipulation Interface) collection effectively balances quality and scale, that doesn't mean teleoperation or simulation synthesis will be replaced. "A more realistic scenario is a layered system where choices are made based on the stage," they noted.

Image source: Spirit AI

Spirit AI, for instance, is firmly pushing a "diversity"-centric scaling path. The company has accumulated over 200,000 hours of multi-type real interaction data, spanning internet videos, teleoperation, and wearable collection. It expects the total to exceed 1 million hours by 2026.

Lv Jun, a research scientist at Noematrix, likewise noted that, given teleoperation's advantages in data quality and model training, the company continues to use it alongside UMI.

So the question arises: while the industry agrees on integrating multiple data collection routes, how exactly should they be combined?

One answer cited repeatedly is: layered usage, playing to strengths. Specifically, use low-cost data for breadth in pre-training, and high-precision data for depth with real robots.

At GTC, Agility CTO Pras Velagapudi illustrated a "pyramid" for data acquired via teleoperation, UMI, simulation, and human videos. At the peak sits teleoperation data—the hardest to obtain and smallest in volume, but highest in quality. Below that are UMI, first-person perspective data, and general videos. The further down you go, the easier the collection and the larger the volume, but the lower the information density.

His view is clear: use peak data for the most core tasks, while maximizing the use of base-level data for pre-training models as a starting point.
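As a rough illustration (not any company's actual pipeline), this layered usage can be sketched as a sampling mixture: pre-training leans on the pyramid's plentiful base tiers, while fine-tuning concentrates on the scarce, high-quality peak. All tier names, volumes, and quality scores below are hypothetical placeholders.

```python
import random

# Hypothetical data tiers, ordered from the pyramid's base to its peak.
# Volumes and quality scores are illustrative, not real industry figures.
TIERS = {
    "web_video":     {"volume_hours": 10_000_000, "quality": 0.2},
    "egocentric":    {"volume_hours": 1_000_000,  "quality": 0.4},
    "umi":           {"volume_hours": 100_000,    "quality": 0.7},
    "teleoperation": {"volume_hours": 10_000,     "quality": 1.0},
}

def mixture_weights(tiers, phase):
    """Return normalized sampling weights per tier for a training phase."""
    if phase == "pretrain":
        # Breadth: weight by sheer volume, so base tiers dominate.
        raw = {k: v["volume_hours"] for k, v in tiers.items()}
    else:
        # Depth: weight by quality (strongly) with only a weak volume term,
        # so the scarce teleoperation tier dominates fine-tuning.
        raw = {k: v["quality"] ** 4 * v["volume_hours"] ** 0.1
               for k, v in tiers.items()}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

def sample_batch(tiers, phase, batch_size, rng):
    """Draw a batch of tier labels according to the phase's mixture."""
    weights = mixture_weights(tiers, phase)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)

if __name__ == "__main__":
    rng = random.Random(0)
    pre = sample_batch(TIERS, "pretrain", 1000, rng)
    ft = sample_batch(TIERS, "finetune", 1000, rng)
    print("pretrain teleop fraction:", pre.count("teleoperation") / 1000)
    print("finetune teleop fraction:", ft.count("teleoperation") / 1000)
```

The exponents are arbitrary knobs; the point is only the shape of the policy: volume drives the pre-training mixture, quality drives the fine-tuning mixture.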

Notably, this logic is becoming the industry's common language.

Unitree founder Wang Xingxing agrees. He argues for maximizing video, internet, and simulation data during pre-training to build a base model, then improving the efficiency of real robot data usage. This way, less real-world data is needed to keep the system running.

"Even if you have 10,000 robots and 10,000 people collecting data, the results might not be ideal," he noted. "Issues like data quality, hardware discrepancies, and sensor differences complicate things. More machines don't guarantee linear improvements in data efficacy." He believes the industry should boost data utilization rates by leveraging more video and simulation data, reducing reliance on large-scale real-robot collection.

Skild AI CEO Deepak Pathak offered a vivid analogy to explain this approach: it's like a child learning from adults. Their body proportions are completely different, but through observation and practice, they still learn.


Image source: Beijing Release

However, despite the consensus on integrating technical routes, one fact cannot be ignored: an invisible hand is quietly shaping the industry landscape during this battle over data collection methods.

"For data collection firms in particular, government assistance pushes them to partner with robot manufacturers and local authorities," an industry insider said. "That creates opportunities to build teleoperation-heavy centers that generate immediate revenue, leaving less incentive to develop methods like UMI."

The statement is polite, but the subtext is clear: government support is a double-edged sword. While it can rapidly deploy data infrastructure and accelerate the industry, it risks creating path dependency—potentially delaying the adoption of more flexible, lower-cost solutions like UMI in China.

Consider this: without government support and subsidies, would so many data centers still be using teleoperation? The answer is obvious.

Conclusion

With policy, industry, and capital injecting momentum simultaneously, and multiple technical routes—teleoperation, UMI, simulation, human video learning—advancing in parallel, the embodied intelligence data dilemma is shifting from "can it be solved?" to "when will it be solved?"

Autonomous driving in its infancy faced the same data shortage. Yet through collaboration across the supply chain, the accumulation of massive real-world road data, and algorithm iteration, that industry successfully broke through from L2 toward higher levels of autonomy.

Embodied intelligence will inevitably follow a similar trajectory. Different technical routes will learn from each other and complement their advantages in competition, ultimately shaking off the shackles of the "data drought."

Gasgoo not only offers timely news and profound insight into China's auto industry, but also helps with business connection and expansion for suppliers and purchasers via multiple channels and methods. Buyer service: buyer-support@gasgoo.com Seller service: seller-support@gasgoo.com

All Rights Reserved. Do not reproduce, copy and use the editorial content without permission. Contact us: autonews@gasgoo.com