Why humanoid robots learn everyday tasks faster than expected


Last September, roboticist Benjie Holson released the “Humanoid Olympic Games”: a set of increasingly difficult tests for humanoid robots that he demonstrated himself while dressed in a silver suit. The challenges, such as opening a door with a round doorknob, started out simple, at least for a human, and progressed to “gold medal” tasks such as buttoning and hanging a men’s shirt and using a key to open a door.

Holson’s point was that the difficult tasks are not the dazzling ones. While other competitions feature robots that play sports and dance, Holson argued that the robots we actually want are the ones that can wash and cook.

He expected the challenges would take years to solve. Instead, over the course of a few months, robotics company Physical Intelligence completed 11 of the 15 challenges – from bronze to gold – with a robot that washed windows, spread peanut butter and used a doggie bag.



Scientific American spoke with Holson about why vision-only, or camera-based, systems exceeded his expectations and how close we are to a genuinely useful machine. He has since released a new, more difficult set of challenges.

(An edited transcript of the interview follows.)

You designed these challenges to be difficult. Were you surprised at how quickly the results came?

It was so much faster than I expected. When I chose the challenges, I tried to calibrate them so that some bronzes would be done in the first month or two, then silvers and golds in the next six months, and the hardest ones might take a year or a year and a half. Having them do almost all of them in the first three months is wild.

What made it possible?

I started with the premise that we have things that look impressive on a fairly narrow set of tasks – vision only, no touch, simple manipulators, not incredible precision. That limits what you can be good at. I tried to think of tasks that would require breaking out of that set. It turns out I underestimated what is possible with just vision and simple manipulators.

When I visited Physical Intelligence, I learned that they have no force sensing. They do all of this 100 percent vision-based. The key-insertion task, spreading peanut butter – I thought those would require force sensing. But apparently you just throw several video demonstrations at it and it works.

How do you train a robot to do that without coding it line by line?

It's all learned by demonstration. Someone remotely operates the robot doing the task hundreds of times, a model is trained on that data, and then the robot can do the task on its own.
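This demonstration-driven approach is essentially behavioral cloning: treat the teleoperation log as supervised data and fit a policy that imitates the operator. Here's a minimal sketch, with made-up data shapes and a linear model standing in for the large neural networks real systems use:

```python
# Minimal sketch of learning from demonstration (behavioral cloning).
# Hypothetical data shapes; a real system would consume camera images
# and train a large neural network rather than a linear model.
import numpy as np

rng = np.random.default_rng(0)

# Pretend teleoperation log: 500 steps of (observation -> operator action).
# Here the "true" operator policy is a fixed linear map plus a little noise.
W_true = rng.normal(size=(8, 3))          # 8-dim observation -> 3-dim action
obs = rng.normal(size=(500, 8))           # observations recorded during demos
actions = obs @ W_true + 0.01 * rng.normal(size=(500, 3))

# Supervised learning: fit a policy that imitates the operator (least squares).
W_learned, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# At run time the robot just applies the learned policy to new observations.
new_obs = rng.normal(size=(1, 8))
predicted_action = new_obs @ W_learned
```

With a few hundred demonstrations the recovered policy lands very close to the operator's, which is the whole bet behind this training style: no task-specific code, just data.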

There is a lot of confusion about whether large language models (LLMs) are useful for robots. Are they?

I used to be quite dubious about the utility of LLMs in robotics. The problem they were good at solving two or three years ago was high-level planning – "If I want to make tea, what are the steps?" But ordering the steps is the easy part. Picking up the teapot and filling it is the really challenging part.

On the other hand, people have started building vision-action models using the same transformer architecture that powers LLMs. You can use transformers for text in, text out, or images in, text out – but also images in, robot actions out.

The nice thing is that they start with models that are pretrained on text, images, maybe video. Before you even start training on your specific task, the AI already understands what a teapot is, what water is, that you might want to fill a teapot with water. So when you train it on your task, it doesn't have to start with "Let me figure out what geometry is." It can start with "I see, we move teapots around" – which is wild that it works.
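The "images in, robot actions out" idea can be sketched with a toy attention layer: image patches become tokens, the same attention math used in LLMs mixes them, and a small head emits a continuous action. Every dimension and weight below is invented for illustration; real vision-action models are vastly larger and start from pretrained vision-language weights rather than random ones:

```python
# Toy sketch of a vision-action model: image-patch tokens go through
# one attention layer, and a small head maps the result to an action.
import numpy as np

rng = np.random.default_rng(1)
d = 16          # token embedding size (toy)
n_patches = 9   # a 3x3 grid of image patches, each embedded as a token

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """One attention layer: every patch token attends to every other one."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v

# Random weights stand in for pretrained ones.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W_action = rng.normal(size=(d, 7)) * 0.1   # head: pooled features -> 7-DoF action

patch_tokens = rng.normal(size=(n_patches, d))   # stand-in for an embedded camera image
features = self_attention(patch_tokens, Wq, Wk, Wv)
action = features.mean(axis=0) @ W_action        # e.g. seven joint targets
```

The structural point is that nothing in the attention math cares whether the tokens started life as words or as image patches – only the embedding and the output head change.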

How did you come up with the “Olympic” tasks?

Part of it was a challenge, and part of it was a prediction. I was trying to think of the next set of things that we can't do now but that someone will soon be able to do.

Humans rely on touch to do things like find keys in a pocket. How do we get around that in robotics?

That's a very good question that we don't know the answer to yet. Touch sensing is far behind cameras – much worse, more expensive and more delicate. We have been working with cameras for a long time.

The big question is: Are cameras enough? Both Physical Intelligence and Sunday Robotics (which completed the bronze medal task of rolling matched socks) have bet that putting a camera on the wrist, very close to the fingers, allows you to sort of see forces by seeing how everything is smushed. When the robot grabs something, the rubber on its fingers deflects, and it infers forces from that deflection. When it spreads peanut butter on bread, the robot watches the knife bend and the bread compress and judges the forces from that. It works much better than I expected.
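As a rough illustration of the idea (not Physical Intelligence's actual method): if the fingertip rubber behaves like a spring, a wrist camera that measures how many pixels it compresses gives a crude force estimate via Hooke's law. The stiffness and camera-calibration constants below are invented:

```python
# Illustrative sketch of inferring contact force from visible deflection.
# Both constants are made-up placeholders for a real calibration.
STIFFNESS_N_PER_M = 800.0      # hypothetical fingertip-rubber stiffness
METERS_PER_PIXEL = 0.0002      # hypothetical wrist-camera scale at the fingers

def estimate_force(deflection_pixels: float) -> float:
    """Rough contact-force estimate (newtons) from observed rubber compression."""
    deflection_m = deflection_pixels * METERS_PER_PIXEL
    return STIFFNESS_N_PER_M * deflection_m

# The harder the grip, the more the rubber visibly smushes:
for px in (0, 5, 15):
    print(f"{px:>2} px deflection -> {estimate_force(px):.2f} N")
```

A real system would not use an explicit formula like this – the neural network learns the visual-deflection-to-behavior mapping implicitly from demonstrations – but the physical signal it exploits is the same.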

What about safety?

The energy needed to stay balanced is often quite high. If a robot starts to fall, it has to swing a leg forward with very fast, hard acceleration to catch itself in time. Your system has to be able to inject a lot of energy into the world – and that's what's unsafe.

I'm a big fan of centaur robots – a mobile wheeled base with arms and a head. Safety-wise, that's an easier way to get there quickly. If a humanoid loses power, it falls over. The general plan seems to be to make a robot so incredibly valuable that we as a society create a new safety class for it – like bikes and cars. They are dangerous, but so valuable that we tolerate the risk.

Have these results changed your timeline?

I used to think home robots were at least 15 years away. Now I think at least six. The difference is that I thought it would take much longer to plausibly do a useful thing in a human space, even as a demo.

But roboticists have seen time and time again that there's a long way between "it worked in a lab and I got a video" and "I can sell a product." Waymo hit the roads in 2009; I couldn't buy a ride until 2024. It takes a long time to reach that level of reliability.

What is the biggest bottleneck left?

Reliability and safety. The things Physical Intelligence shows are incredibly impressive, but if you put the robot at a different table, with different lighting, and hand it a different sock, it might not work. Each step toward generalization seems to take an order of magnitude more data, turning days of data collection into weeks or months.
