MIT Tech Review • 103 days ago

How Robots Learn: A Short History of Modern Robotics

Key Summary

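Silicon Valley's roboticists once dreamed big, but what they actually built amounted to little more than factory robot arms and the Roomba robot vacuum. The adoption of simulation-based reinforcement learning around 2015 and the arrival of ChatGPT in 2022, however, brought a revolutionary change in how robots interact with the world. With the rise of AI models that learn from vast amounts of data to predict the next action, an investment boom is under way: $6.1 billion flowed into humanoid robots in 2025 alone.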

Translated Text

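Roboticists once dreamed big, but what they actually built was small. They aimed to match or exceed the extraordinary complexity of the human body, then ended up spending their entire careers refining robotic arms for auto plants. They aimed for C-3PO and ended up with the Roomba, a robot vacuum. The real ambition of these researchers was the robot of science fiction: one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine would help people with limited mobility, ease loneliness, or take over work too dangerous for humans. For those more focused on profit, it meant a limitless supply of labor that needed no wages. Either way, a long history of failure left most of Silicon Valley hesitant to bet on useful robots.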

But things have now changed. There is still no finished machine, yet money is pouring in. In 2025 alone, companies and investors put $6.1 billion into humanoid robots, four times the amount invested in 2024. What happened? A revolution in how machines learn to interact with the world.

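For example, imagine installing a pair of robot arms in your home for a single purpose: folding clothes. How would the robot learn to do that? You could start by writing rules. Check how much deformation the fabric can tolerate before tearing. Identify the shirt's collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly a set distance. Repeat for the right sleeve. If the shirt is rotated, adjust the plan accordingly. If a sleeve is twisted, straighten it. The number of rules explodes very quickly, but a complete accounting of every situation could produce reliable results. This was the original approach of robotics: anticipating every possible situation and writing it into code in advance.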

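Around 2015, the cutting edge began to take a different approach: build a digital simulation of the robot arms and the clothes, then give the program a reward signal each time it folds successfully and a penalty each time it fails. Over millions of iterations, the program tries all sorts of techniques through trial and error and improves on its own. This is the same principle by which AI became good at playing games.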
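The reward-and-penalty loop described above can be illustrated with a deliberately tiny sketch. It is not how any lab's system worked: it swaps in an epsilon-greedy bandit, one of the simplest trial-and-error learners, and the folding technique names, success rates, and the `simulate_fold` function are all invented for illustration.

```python
import random

# Hypothetical folding "techniques" and their hidden success rates inside the
# simulator. These names and numbers are assumptions made up for this sketch.
TRUE_SUCCESS_RATE = {"sleeves_first": 0.8, "halves": 0.5, "roll": 0.2}


def simulate_fold(technique, rng):
    """Return +1 (reward) if the simulated fold succeeds, -1 (penalty) if not."""
    return 1 if rng.random() < TRUE_SUCCESS_RATE[technique] else -1


def learn_by_trial_and_error(episodes=20000, epsilon=0.1, seed=0):
    """Estimate the average reward of each technique purely from experience."""
    rng = random.Random(seed)
    techniques = list(TRUE_SUCCESS_RATE)
    value = {t: 0.0 for t in techniques}  # running average reward per technique
    count = {t: 0 for t in techniques}    # how often each technique was tried
    for _ in range(episodes):
        # Occasionally explore a random technique; otherwise use the best so far.
        if rng.random() < epsilon:
            choice = rng.choice(techniques)
        else:
            choice = max(techniques, key=value.get)
        reward = simulate_fold(choice, rng)
        count[choice] += 1
        # Incremental update of the running average.
        value[choice] += (reward - value[choice]) / count[choice]
    return value


if __name__ == "__main__":
    estimates = learn_by_trial_and_error()
    print(max(estimates, key=estimates.get), estimates)
```

The learner is never told which technique is best. It only sees rewards and penalties, and over many episodes its estimates drift toward each technique's true average outcome. Real systems of this kind learn continuous motor control rather than choosing among three labels, but the feedback loop is the same in spirit.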

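The arrival of ChatGPT in 2022 catalyzed the current boom. Large language models (LLMs), trained on vast amounts of text, work not through trial and error but by learning to predict which word should come next in a sentence. Similar models adapted to robotics were soon able to take in images, sensor readings, and the positions of a robot's joints and predict the next action the machine should take, issuing dozens of motor commands every second.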
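To make "predict the next action from what the robot observes" concrete, here is a minimal sketch under strong assumptions. Real systems use large neural networks trained on huge datasets. This toy replaces the model with a nearest-neighbor lookup over three invented demonstrations, and the observation layout, the dynamics, and the 30-step loop are all hypothetical.

```python
import math

# Hypothetical demonstration data: (observation, action) pairs.
# observation = (joint_angle, distance_to_target); action = one motor command.
DEMONSTRATIONS = [
    ((0.0, 1.0), 0.5),
    ((0.5, 0.5), 0.25),
    ((1.0, 0.0), 0.0),
]


def predict_next_action(observation):
    """Return the action recorded for the most similar demonstrated observation."""
    nearest = min(DEMONSTRATIONS, key=lambda pair: math.dist(pair[0], observation))
    return nearest[1]


def control_loop(observation, steps=30):
    """Issue one predicted motor command per step (30 steps stands in for
    roughly one second of control at an assumed 30 commands per second)."""
    commands = []
    joint, distance = observation
    for _ in range(steps):
        action = predict_next_action((joint, distance))
        commands.append(action)
        # Toy dynamics (an assumption): the command moves the joint and
        # shrinks the remaining distance to the target.
        joint += action * 0.1
        distance = max(0.0, distance - action * 0.1)
    return commands


if __name__ == "__main__":
    print(control_loop((0.0, 1.0)))
```

Like next-word prediction, nothing here is scored by trial and error. The predictor simply maps the current observation to the action that recorded data suggests should come next, and the loop repeats that many times per second.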

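This conceptual shift, toward reliance on AI models that absorb large amounts of data, appears to work whether the useful robot needs to talk with people, move through an environment, or carry out complex tasks. It was also paired with a new approach to learning: deploying robots in their working environment even before they are perfect so that they can learn on the job. Today, Silicon Valley's roboticists are dreaming big again. Here is how that came about.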

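Jibo

A movable social robot, Jibo could hold conversations long before the age of large language models (LLMs).

Cynthia Breazeal, a robotics researcher at MIT, introduced Jibo to the world in 2014: a robot with no arms, no legs, and no face. In fact, it looked like a desk lamp. Breazeal's goal was to create a social robot for the home, and the idea raised $3.7 million in a crowdfunding campaign. Early preorders were priced at $749. The early Jibo could introduce itself and dance to entertain children, but that was about all it could do. The vision, however, was always for it to become a kind of embodied assistant that handled everything from scheduling and email to storytelling. Jibo won a devoted user base, but the company ultimately shut down in 2019.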

In retrospect, one thing Jibo really needed was better language capabilities. It had to compete with Apple's Siri and Amazon's Alexa, and at the time all of these technologies relied heavily on scripting (pre-written commands).

Original Text (English)

Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3PO; end up with the Roomba.

The real ambition for many of these researchers was the robot of science fiction: one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of failure left most of Silicon Valley hesitant to bet on helpful robots.

That has changed. The machines are yet unbuilt, but the money is flowing: Companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024. What happened? A revolution in how machines have learned to interact with the world.

Imagine you’d like a pair of robot arms installed in your home purely to do one thing: fold clothes. How would it learn to do that? You could start by writing rules. Check the fabric to figure out how much deformation it can tolerate before tearing. Identify a shirt’s collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly this distance. Repeat for the right sleeve. If the shirt is rotated, turn the plan accordingly. If the sleeve is twisted, correct it. Very quickly the number of rules explodes, but a complete accounting of them could produce reliable results. This was the original craft of robotics: anticipating every possibility and encoding it in advance.

Around 2015, the cutting edge started to do things differently: Build a digital simulation of the robotic arms and the clothes, and give the program a reward signal every time it folds successfully and a ding every time it fails. This way, it gets better by trying all sorts of techniques through trial and error, with millions of iterations, the same way AI got good at playing games.

The arrival of ChatGPT in 2022 catalyzed the current boom. Trained on vast amounts of text, large language models work not through trial and error but by learning to predict what word should come next in a sentence. Similar models adapted to robotics were soon able to absorb pictures, sensor readings, and the position of a robot’s joints and predict the next action the machine should take, issuing dozens of motor commands every second.

This conceptual shift, to reliance on AI models that ingest large amounts of data, seems to work whether that helpful robot is supposed to talk to people, move through an environment, or even do complicated tasks. And it was paired with other ideas about how to accomplish this new way of learning, like deploying robots even if they aren’t yet perfect so they can learn from the environment they’re meant to work in. Today, Silicon Valley roboticists are dreaming big again. Here’s how that happened.

Jibo

A movable social robot carried out conversations long before the age of LLMs.

An MIT robotics researcher named Cynthia Breazeal introduced an armless, legless, faceless robot called Jibo to the world in 2014. It looked, in fact, like a lamp. Breazeal’s aim was to create a social robot for families, and the idea pulled in $3.7 million in a crowdsourced funding campaign. Early preorders cost $749. The early Jibo could introduce itself and dance to entertain kids, but that was about it. The vision was always for it to become a sort of embodied assistant that could handle everything from scheduling and emails to telling stories. It earned a number of devoted users, but ultimately the company shut down in 2019.

In retrospect, one thing that Jibo really needed was better language capabilities. It was competing against Apple’s Siri and Amazon’s Alexa, and all those technologies at the time relied on heavy scripting. In broad terms, when you spoke to them, software would translate your speech into text, analyze what you wanted, and create a response pulled from preapproved snippets. Those snippets could be charming, but they were also repetitive and simply boring, downright robotic. That was especially a challenge for a robot that was supposed to be social and family oriented.

What has happened since, of course, is a revolution in how machines can generate language. Voice mode from any leading AI provider is now engaging and impressive, and multiple hardware startups are trying (and failing) to build products that take advantage of it. But that comes with a new risk: While scripted conversations can’t really go off the rails, ones generated by AI certainly can. Some popular AI toys have, for example, talked to kids about how to find matches and knives.

OpenAI Dactyl

A robot hand trained with simulations tries to model the unpredictability and variation of the real world.

By 2018, every leading robotics lab was trying to scrap the old scripted rules and train robots through trial and error. OpenAI tried to train its robotic hand, Dactyl, virtually, with digital models of the hand and of the palm-size cubes Dactyl was supposed to manipulate. The cubes had letters and numbers on their faces; the model might set a task like “Rotate the cube so the red side with the letter O faces upward.”

Here’s the problem: A robotic hand might get really good at doing this in its simulated world, but when you take that program and ask it to work on a real version in the real world, the slight differences between the two can cause things to go awry. Colors might be slightly different, or the deformable rubber in the robot’s fingertips could turn out to be stretchier than it was in simulation.

The solution is called domain randomization. You essentially create millions of simulated worlds that all vary slightly and randomly from one another. In each one the friction might be less, or the lighting more harsh, or the colors darkened. Exposure to enough of this variation means the robots will be better able to manipulate the cube in the real world. The approach worked on Dactyl, and one year later it was able to use the same core techniques to do something harder: solving Rubik’s Cubes (though it worked only 60% of the time, and just 20% when the scrambles were particularly hard).

Still, the limits of simulation mean that this technique plays a far smaller role today than it did in 2018. OpenAI shuttered its robotics effort in 2021 but has recently started the division up again, reportedly focusing on humanoids.

Google DeepMind RT-2

Training on images from across the internet helps robots translate language into action.

Around 2022, Google’s robotics team was up to some strange things. It spent 17 months handing people robot controllers and filming them doing everything from picking up bags of chips to opening jars. The team ended up cataloguing 700 different tasks.

The point was to build and test one of the first large-scale foundation models for robotics. As with large language models, the idea was to input lots of text, tokenize it into a format an algorithm could work with, and then generate an output. Google’s RT-1 received input about what the robot was looking at and how the many parts of the robotic arm were positioned; then it took an instruction and translated it into motor commands to move the robot. When it had seen tasks before, it carried out 97% of them successfully; it succeeded at 76% of the instructions it hadn’t seen before.

The second iteration, RT-2, came out the following year and went even further. Instead of training on data specific to robotics, it went broad: It trained on more general images from across the internet, like the vision-language models lots of researchers were working on at the time. That allowed the robot to interpret where certain objects were in the scene. “All these other things were unlocked,” says Kanishka Rao, a roboticist
