MarkTechPost, 48 days ago

MolmoAct Implementation Tutorial for Spatial Perception and Robot Action Prediction

Key Summary

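This tutorial covers a hands-on implementation of MolmoAct, an action-reasoning model that reasons spatially and predicts robot control actions from visual observations. It walks step by step through the full workflow: environment setup, multi-view image input, depth reasoning and visual trace visualization driven by natural language instructions, and generation of executable robot outputs.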

Translated Text

Artificial Intelligence Applications, Technology, Computer Vision, Editors Pick, Machine Learning, Physical AI, Staff Tutorials

In this tutorial, we walk through MolmoAct step by step and build a practical understanding of how action-reasoning models can reason in space from visual observations. We set up the environment, load the model, prepare multi-view image inputs, and explore how MolmoAct produces depth-aware reasoning, visual traces, and actionable robot outputs from natural language instructions. As we move through the workflow, we run inference and examine how the model parses actions, visualizes trajectories, and supports more advanced processing pipelines for robotics-oriented tasks.


print("=" * 80)
print("🔧 SECTION 1: INSTALLATION AND SETUP")
print("=" * 80)

import subprocess
import sys

def install_packages():
    """Install all required packages for MolmoAct"""
    packages = [
        "torch>=2.0.0",
        "torchvision",
        "transformers==4.52",
        "accelerate",
        "einops",
        "Pillow",
        "numpy",
        "matplotlib",
        "requests",
        "scipy",
        "huggingface_hub",
    ]
    # Install each package quietly with the interpreter that runs this notebook
    for package in packages:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
    print("✅ All packages installed successfully!")

install_packages()

print("\n" + "=" * 80)
print("📚 SECTION 2: IMPORTS AND CONFIGURATION")
print("=" * 80)

import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
from typing import List, Tuple, Dict, Optional, Union
import json
import time
from dataclasses import dataclass
import warnings
import re

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# NOTE: the excerpt prints `device` without defining it; this line is an
# assumed reconstruction of the missing GPU-detection step.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️ Device: {device}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

print("\n" + "=" * 80)
print("🤖 SECTION 3: MOLMOACT MODEL LOADER")
print("=" * 80)

@dataclass
class MolmoActConfig:
    """Configuration for MolmoAct model"""
    model_name: str = "allenai/MolmoAct-7B-D-0812"
    torch_dtype: str = "bfloat16"
    device_map: str = "auto"
    max_new_tokens: int = 256
    temperature: float = 0.0
    do_sample: bool = False

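We set up the tutorial and prepare the environment needed to run MolmoAct in Google Colab. We install all required packages, import the core libraries, and configure the runtime to detect whether GPU acceleration is available. We also define the base configuration class that stores the main model settings used throughout the rest of the tutorial.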
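As a minimal sketch of how a configuration class like this might be used, the snippet below repeats the tutorial's fields so it runs on its own and derives a variant with `dataclasses.replace`. The names `default_cfg` and `long_cfg` and the 512-token value are illustrative assumptions, not part of the original tutorial.

```python
from dataclasses import dataclass, replace


@dataclass
class MolmoActConfig:
    """Same fields as the tutorial's config class, repeated so the sketch is self-contained."""
    model_name: str = "allenai/MolmoAct-7B-D-0812"
    torch_dtype: str = "bfloat16"
    device_map: str = "auto"
    max_new_tokens: int = 256
    temperature: float = 0.0
    do_sample: bool = False


# Default settings, as defined in the tutorial
default_cfg = MolmoActConfig()

# Hypothetical variant: allow longer generations while keeping every other field.
# replace() returns a new instance and leaves default_cfg untouched.
long_cfg = replace(default_cfg, max_new_tokens=512)

print(default_cfg.max_new_tokens)
print(long_cfg.max_new_tokens)
print(long_cfg.model_name == default_cfg.model_name)
```

Keeping the settings in one dataclass means a variant like this could be passed to the model wrapper without touching the loading or inference code.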


class MolmoActModel:
    """
    MolmoAct Model Wrapper for Easy Inference

    This class provides a high-level interface for:
    - Loading and managing the model
    - Running inference with proper prompting
    - Parsing outputs (depth, trace, actions)
    - Batch processing
    """

    def __init__(self, config: Optional[MolmoActConfig] = None):
        self.config = config or MolmoActConfig()
        self.model = None
        self.processor = None
        self._loaded = False

    def load(self) -> None:
        """Load the MolmoAct model and processor"""
        if self._loaded:
            print("⚠️ Model already loaded!")
            return

        print(f"🔄 Loading MolmoAct model: {self.config.model_name}")
        print(" This may take a few minutes on first run...")

        from transformers import AutoModelForImageTextToText, AutoProcessor

        # Resolve the dtype name from the config (e.g. "bfloat16") to a torch dtype
        dtype = getattr(torch, self.config.torch_dtype)

        print(" 📥 Loading model weights...")
        self.model = AutoModelForImageTextToText.from_pretrained(
            self.config.model_name,
            trust_remote_code=True,
            torch_dtype=dtype,
            device_map=self.config.device_map,
        )

        print(" 📥 Loading processor...")
        try:
            self.processor = AutoProcessor.from_pretrained(
                self.config.model_name,
                trust_remote_code=True,
            )
            if hasattr(self.processor, 'tokenizer'):
                self.processor.tokenizer.padding_side = "left"
        except TypeError as e:
            # Fallback: assemble the processor manually from its components
            if "prompt_templates" in str(e):
                print(" ⚠️ Handling custom processor configuration...")
                from transformers.dynamic_module_utils import get_class_from_dynamic_module
                processor_class = get_class_from_dynamic_module(
                    "processing_molmoact.MolmoActProcessor",
                    self.config.model_name,
                    trust_remote_code=True,
                )
                from transformers import AutoTokenizer, AutoImageProcessor
                tokenizer = AutoTokenizer.from_pretrained(
                    self.config.model_name,
                    trust_remote_code=True,
                    padding_side="left",
                )
                image_processor = AutoImageProcessor.from_pretrained(
                    self.config.model_name,
                    trust_remote_code=True,
                )
                self.processor = processor_class(
                    image_processor=image_processor,
                    tokenizer=tokenizer,
                )
            else:
                raise e

        self._loaded = True
        print("✅ Model loaded successfully!")
        self._print_model_info()

    def _print_model_info(self) -> None:
        """Print model information"""
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"\n📊 Model Statistics:")
        print(f" Total Parameters: {total_params / 1e9:.2f}B")
        print(f" Trainable Parameters: {trainable_params / 1e9:.2f}B")
        print(f" Model dtype: {next(self.model.parameters()).dtype}")

    def build_prompt(self, instruction: str) -> str:
        """
        Build the reasoning prompt for MolmoAct

        The prompt structure is crucial for MolmoAct to generate:
        1. Depth perception tokens
        2. Visual trajectory trace
        3. Action predictions
        """
        prompt = (
            f"The task is {instruction}. "
            "What is the action that the robot should take. "
            f"To figure out the action that the robot should take to {instruction}, "
            "let's think through it step by step. "
            "First, what is the depth map for the first image? "
            "Second, what is the trajectory of the end effector in the first image? "
            "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
            "along with other images from different camera views as additional information, "
            "what is the action that the robot should take?"
        )
        return prompt

We begin building the main MolmoAct model wrapper that makes inference easier to manage. We load the model and processor, handle custom processor initialization logic, and print useful model statistics once loading is complete. We also define a prompt-building method that structures the reasoning query to guide the model toward depth, trace, and action generation.

    @torch.inference_mode()
    def generate(
        self,
        images: List[Image.Image],
        instruction: str,
        max_new_tokens: Optional[int] = None,
    ) -> Dict:
        """
        Generate action reasoning from images and instruction

        Args:
            images: List of PIL Images
            instruction: Task instruction
            max_new_tokens: Override default max tokens

        Returns:
            Dictionary containing:
            - text: Generated reasoning text
            - depth: Parsed depth tokens
            - trace: Parsed visual trace coordinates
            - action: Parsed action values
        """
        if not self._loaded:
            raise RuntimeError("Model not loaded! Call .load() first.")

        prompt = self.build_prompt(instruction)
        max_tokens = max_new_tokens or self.config.max_new_tokens

        text = self.processor.apply_chat_template(
            [{"role": "user", "content": [dict(type="text", text=prompt)]}],
            tokenize=False,
            add_generation_prompt=True,
        )

        inputs = self.processor(
            images=[images],
            text=text,
            padding=True,
            return_tensors="pt",
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=self.config.do_sample,
            )

        # Keep only the newly generated tokens by slicing off the prompt prefix
        generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
        generated_text = self.processor.batch_decode(
            generated_tokens,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        result = {
            "text": generated_text,
            "depth": self._safe_parse_depth(generated_text),
            "trace": self._safe_parse_trace(generated_text),
            "action": self._safe_parse_action(generated_text, unnorm_key="molmoact"),
            "action_raw": self._safe_parse_action(generated_text, unnorm_key=None),
        }
        return result

    def _safe_parse_depth(self, text: