Download and Deploy: Building Robots That Work Anywhere

For more than 50 years, robots have excelled at performing the same task millions of times in factories, but they have struggled when asked to do a million different things once, the way humans naturally operate throughout their day. Recent NYU Courant PhD graduate Nur Muhammad “Mahi” Shafiullah and NYU undergraduate alumnus Haritheja Etukuru, working with CDS-affiliated Courant Assistant Professor Lerrel Pinto, have developed Robot Utility Models that tackle this fundamental limitation, achieving 90% success rates in completely new environments without any additional training or fine-tuning.

The work, presented in the paper “Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments,” represents a departure from the standard approach in robotics. While vision and language models have moved to zero-shot deployments that work immediately in new contexts, robot learning has remained stuck in a pattern of pretraining followed by environment-specific fine-tuning.

At the Computer Vision and Pattern Recognition (CVPR) conference this year, Shafiullah and Etukuru demonstrated just how far their approach has come. “People would come by and say, ‘I have something in my pocket — let’s see what it is.’ They’d pull out headphones, and the robot would pick them up with no problem,” Shafiullah recounted. The robot had never seen those particular headphones before; the team had simply downloaded their policy onto robots shipped to the conference hotel. Their efforts earned them the “Best Demo Award” among the 70 research demos at the conference.

The system relies on three key innovations. First, the team developed Stick-v2, a handheld data collection tool that costs just $25 in materials plus an iPhone Pro. The tool’s portability and instant deployment capability proved crucial for gathering diverse training data. Second, they collected approximately 1,000 demonstrations across 40 different environments for five core tasks: opening doors, opening drawers, picking up bags, picking up tissues, and reorienting fallen objects. Third, they integrated GPT-4o to detect failures and automatically retry tasks, boosting success rates by an additional 15.6%.
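To make that third piece concrete, here is a minimal sketch of what a GPT-4o failure-detection-and-retry loop could look like. The prompt wording, the retry limit, and the helpers `execute_policy`, `capture_image`, and `reset_to_start_pose` are illustrative assumptions rather than the team’s actual implementation; only the overall pattern, querying a vision-language model to verify success and retrying on failure, reflects the approach described in the paper.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def execute_policy(task: str) -> None:
    """Hypothetical stand-in for running the learned manipulation policy."""


def capture_image() -> bytes:
    """Hypothetical stand-in for grabbing a JPEG frame from the robot camera."""
    return b""


def reset_to_start_pose() -> None:
    """Hypothetical stand-in for rewinding the robot before another attempt."""


def task_succeeded(task: str, image_bytes: bytes) -> bool:
    """Ask GPT-4o whether the camera frame shows the task completed."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Did the robot successfully complete this task: "
                         f"'{task}'? Answer only YES or NO."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return "YES" in response.choices[0].message.content.upper()


def run_with_retries(task: str, max_attempts: int = 3) -> bool:
    """Execute the policy, verify success with GPT-4o, and retry on failure."""
    for _ in range(max_attempts):
        execute_policy(task)
        if task_succeeded(task, capture_image()):
            return True
        reset_to_start_pose()
    return False
```

A self-verification loop along these lines is what the team credits for the additional 15.6% in success rate.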

The importance of data diversity emerged as a critical finding. When comparing models trained on 200 demonstrations each from 5–6 environments versus 25 demonstrations each from 40 environments, the diverse dataset dramatically outperformed the concentrated one — particularly for the object reorientation task, where performance dropped by 50% without environmental diversity.

Training these models required substantial computational resources, provided through a grant from Microsoft Research that gave the team access to Azure compute clusters. “We didn’t just have to train one model; we had to iterate. So we had to train hundreds of these models,” Shafiullah explained. The compute allowed them to experiment extensively before ultimately creating policies small enough to run on-device: each final model is just 60 megabytes. The grant will also support continued collaboration with Microsoft Research on frontier research in multi-modal manipulation, novel hardware for increased dexterity, and training robot models at scale.

The project benefited from NYU’s unique ecosystem of students willing to volunteer their time and even their homes for testing. Master’s and undergraduate students helped collect data and run evaluations, with some opening their apartments for robot testing. “This kind of infrastructure is very hard to find,” Shafiullah noted, describing it as “organic, grassroots level experimentation” that would be difficult to replicate at industry labs.

Among their findings, the choice of learning algorithm mattered less than expected. While VQ-BeT and Diffusion Policy performed best, even simpler approaches like ACT and MLP-BC weren’t far behind. The implication is clear: high-quality, diverse data matters more than algorithmic sophistication.

The team tested their approach extensively, running 2,950 real-world robot trials across homes in New York City, Jersey City, and Pittsburgh. The robots succeeded in environments ranging from retail stores to conference venues, using both Hello Robot’s Stretch platform and UFactory’s xArm.

Current limitations center on hardware constraints: the two-fingered gripper cannot handle round doorknobs, and the lack of force feedback makes it difficult to gauge when to stop pulling on drawer handles. The system also assumes the robot starts out facing its target; navigation is not yet incorporated.

Looking ahead, Shafiullah sees applications in assistive robotics, particularly for elderly care and people with disabilities. The team has been collaborating with Hello Robot, whose Stretch robots are designed specifically for assistive applications. For someone with quadriplegia who controls a robot with a mouth controller, being able to simply click on an object for the robot to retrieve represents a significant quality-of-life improvement.

While the Robot Utility Models project will continue at NYU with other PhD students taking the lead, both researchers are moving to new positions — Shafiullah as a postdoc at BAIR at UC Berkeley and Meta Fundamental AI Research (FAIR), and Etukuru beginning his PhD at BAIR. The project’s code, data, models, and hardware designs are all open-sourced at robotutilitymodels.com.

By Stephen Thomas
