TechCrunch AI • 103 days ago

Robotics startup Physical Intelligence announces a new model that performs tasks it was never trained on

IMP

8/10

Key Summary

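San Francisco-based robotics startup Physical Intelligence (PI) has announced π0.7 (pronounced "pi zero point seven"), a robot foundation model that can perform new tasks it was never explicitly trained on. The model demonstrates "compositional generalization," combining skills learned in different contexts to solve problems it has not seen before. This suggests that robotic AI may be able to reach the kind of inflection point seen with large language models (LLMs), where performance climbs faster than the amount of data alone would predict. Limitations remain, including the inability to carry out complex multi-step tasks autonomously and a heavy dependence on prompt engineering, but the result matters to industry because robots can now be guided in real time through natural-language instructions without additional training.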

Translated Text

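Physical Intelligence, a two-year-old robotics startup based in San Francisco, has quietly become one of the most closely watched AI companies in the Bay Area. On Thursday the company published new research showing that its latest model can direct robots to perform tasks they were never explicitly trained on, a capability that surprised even the company's own researchers.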

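The new model, called π0.7, represents an early but meaningful step toward the long-sought goal of a general-purpose robot brain: one that can be pointed at an unfamiliar task, instructed in plain language, and actually complete it. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching an inflection point similar to the one seen with large language models (LLMs), where capabilities begin compounding beyond what the underlying data would seem to predict.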

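The paper's core claim is compositional generalization: the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to robot training has been essentially rote memorization: collect data on a specific task, train a specialist model on that data, then repeat the process for every new task. Physical Intelligence says π0.7 breaks that pattern.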

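Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor focused on AI for robotics, explained: "Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways, the capabilities are going up more than linearly with the amount of data." He added, "That much more favorable scaling property is something we've seen in other domains, like language and vision."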

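The paper's most striking demonstration involves an air fryer, an appliance the model had barely seen in training. When the research team investigated, they found only two relevant episodes in the entire training dataset: one in which a different robot merely pushed an air fryer closed, and one from an open-source dataset in which yet another robot placed a plastic bottle inside an air fryer on someone's instructions. Even so, the model combined those fragments with broader web-based pretraining data into a functional understanding of how the appliance works.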

Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student, said, "It's very hard to track down where the knowledge is coming from, or where it will succeed or fail."

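Still, with no coaching at all, the model made a reasonable attempt at using the appliance to cook a sweet potato. When a human walked the robot through the task with step-by-step verbal instructions, much as one would explain something to a new employee, the robot completed it successfully. This coaching capability matters because it suggests robots could be deployed in new environments and improved in real time without additional data collection or model retraining.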

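So what does it all mean? The researchers do not hide the model's limitations and are careful not to get ahead of themselves. In at least one case, they attributed a failure to their own team. "Sometimes the failure mode is not on the robot or on the model," Balakrishna said. "It's on us. Not being good at prompt engineering." He described an early air fryer experiment in which the success rate was only 5% and then jumped to 95% after the team spent about 30 minutes refining how the task was explained to the model.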

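The model also cannot yet carry out complex multi-step tasks autonomously from a single high-level command. "You can't tell it, 'Hey, go make me some toast,'" Levine admitted. "But if you walk it through, 'for the toaster, open this part, push that button, do this,' then it actually tends to work pretty well."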

The research team also acknowledged that standardized benchmarks for robotics do not really exist.

View original (English)

Physical Intelligence, the two-year-old, San Francisco-based robotics startup that has quietly become one of the most closely watched AI companies in the Bay Area, published new research Thursday showing that its latest model can direct robots to perform tasks they were never explicitly trained on, a capability the company's own researchers say caught them off guard.

The new model, called π0.7, represents what the company describes as an early but meaningful step toward the long-sought goal of a general-purpose robot brain: One that can be pointed at an unfamiliar task, coached through it in plain language, and actually pull it off. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching an inflection point similar to what the field saw with large language models, where capabilities begin compounding in ways that outpace what the underlying data would seem to predict.

But first: The core claim in the paper is compositional generalization, the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to robot training has been essentially rote memorization: collect data on a specific task, train a specialist model on that data, then repeat for every new task. π0.7, Physical Intelligence says, breaks that pattern.

"Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways," says Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor focused on AI for robotics, "the capabilities are going up more than linearly with the amount of data. That much more favorable scaling property is something we've seen in other domains, like language and vision."

The paper's most striking demonstration involves an air fryer the model had essentially never seen in training. When the research team investigated, they found only two relevant episodes in the entire training dataset: One where a different robot merely pushed the air fryer closed, and one from an open-source dataset where yet another robot placed a plastic bottle inside one on someone's instructions. The model had somehow synthesized those fragments, plus broader web-based pretraining data, into a functional understanding of how the appliance works.

"It's very hard to track down where the knowledge is coming from, or where it will succeed or fail," says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student.

Still, with zero coaching, the model made a passable attempt at using the appliance to cook a sweet potato. With step-by-step verbal instructions (essentially, a human walking the robot through the task the way you might explain something to a new employee), it performed successfully. That coaching capability matters because it suggests robots could be deployed in new environments and improved in real time without additional data collection or model retraining.

So what does it all mean? The researchers aren't shy about the model's limitations and are careful not to get ahead of themselves. In at least one case, they point the finger squarely at their own team. "Sometimes the failure mode is not on the robot or on the model," Balakrishna says. "It's on us. Not being good at prompt engineering." He describes an early air fryer experiment that produced a 5% success rate. After spending about half an hour refining how the task was explained to the model, it jumped to 95%, he says.

The model also isn't yet capable of executing complex multi-step tasks autonomously from a single high-level command. "You can't tell it, 'Hey, go make me some toast'," Levine says. "But if you walk it through, 'for the toaster, open this part, push that button, do this,' then it actually tends to work pretty well."

The team also acknowledged that standardized benchmarks for robotics don't really exist, which makes external validation of their claims difficult. Instead, the company measured π0.7 against its own previous specialist models (purpose-built systems trained on individual tasks) and found that the generalist model matched their performance across a range of complex work including making coffee, folding laundry, and assembling boxes.

What may be most notable about the research, if you take the researchers at their word, is not any single demo but the degree to which the results surprised them, people whose job it is to know exactly what is in the training data and therefore what the model should and shouldn't be able to do.

"My experience has always been that when I deeply know what's in the data, I can kind of just guess what the model will be able to do," Balakrishna says. "I'm rarely surprised. But the last few months have been the first time where I'm genuinely surprised. I just bought a gear set randomly and asked the robot, 'Hey, can you rotate this gear?' And it just worked."

Levine recalled the moment researchers first encountered GPT-2 generating a story about unicorns in the Andes. "Where the heck did it learn about unicorns in Peru?" he says. "That's such a weird combination. And I think that seeing that in robotics is really special."

Naturally, critics will point to an uncomfortable asymmetry here: Language models had the entire internet to learn from. Robots don't, and no amount of clever prompting fully closes that gap. But when asked where he expects the skepticism, Levine points somewhere else entirely.

"The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring," he says. "The robot is not doing a backflip." He pushes back on that framing, arguing that the distinction between an impressive robot demo and a robotic system that actually generalizes is precisely the point.
Generalization, he suggests, will always look less dramatic than a carefully choreographed stunt, but it is considerably more useful.

The paper itself uses careful hedging language throughout, describing π0.7 as showing "early signs" of generalization and "initial demonstrations" of new capabilities. These are research results, not a deployed product, and Physical Intelligence has been restrained from the start about commercial timelines. When asked directly when a system based on these findings might be ready for real-world deployment, Levine declines to speculate. "I think there's good reason to be optimistic, and certainly it's progressing faster than I expected a couple of years ago," he says. "But it's very hard for me to answer that question."

Physical Intelligence has raised over $1 billion to date and was most recently valued at $5.6 billion. A significant part of the investor enthusiasm around the company traces to Lachy Groom, a co-founder who spent years as one of Silicon Valley's most well-regarded angel investors, backing Figma, Notion, and Ramp, among others, before deciding that Physical Intelligence was the company he'd been looking for. That pedigree has helped the startup attract serious institutional money even as it has refused to offer investors a commercialization timeline. The company is now said to be in discussions for a new round that would nearly double that figure to $11 billion. The team declined to comment.

Topics: AI, Exclusive, Physical Intelligence, Robotics

Connie Loizos, Editor in Chief & General Manager

Robotics, Foundation Models, Human-Robot Interaction, AI Generalization, Physical Intelligence