MarkTechPost • 66 days ago

Microsoft Releases 'Webwright', an Open-Source Web Agent That Controls the Browser Through Code

Key Summary

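Microsoft Research has open-sourced Webwright, a framework in which the agent writes code in a terminal environment to control the browser, moving away from the usual approach of predicting one click at a time from screenshots or DOM text. This mirrors how a developer writes an automation script, and it lets complex multi-step web interactions be expressed as compact code. On the Odysseys benchmark, Webwright scores well above the base GPT-5.4 model (60.1% versus 33.5%), and the approach is drawing attention for playing to the code generation and debugging strengths of recent LLMs.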

Translated Body

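Most web agents today drive a browser one action at a time. The model receives the current page state, as a screenshot or DOM text, and predicts the next click, keypress, or scroll. This action-at-a-time design made sense when language models had limited reasoning ability. As models have become better at writing and debugging code, that rigid loop has become a constraint rather than a helpful structure.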

Microsoft Research's AI Frontiers lab took a different approach. Their new open-source framework, Webwright, gives the agent a terminal instead of a stateful browser session. The agent writes Playwright code to control the browser, runs bash commands, inspects logs, and iteratively refines its scripts. Playwright is an open-source browser automation library, also from Microsoft, that supports programmatic control of Chromium, Firefox, and WebKit browsers.

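What Webwright Does Differently

Webwright separates the agent from the browser and treats the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session but the code and logs in the local workspace. This is the same model a developer uses when writing an RPA (Robotic Process Automation) script: instead of manually clicking through a site each time, they write a script once. That script can be rerun, adapted, and shared. Webwright applies this idea to LLM-powered agents.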
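To make the "write the script once, rerun it later" idea concrete, here is a minimal sketch of the kind of Playwright script an agent might leave in its workspace. It is not taken from Webwright itself: the selectors, field values, and the `fill_form` helper are hypothetical, and the script assumes Playwright for Python and Chromium are installed.

```python
"""Sketch of a rerunnable browser automation script (not Webwright's own code)."""
from pathlib import Path

# Hypothetical selectors and values, chosen only for illustration.
FORM_FIELDS = {
    "#name": "Ada Lovelace",
    "#email": "ada@example.com",
}


def fill_form(url: str, workspace: Path, fields: dict) -> Path:
    """Open `url`, fill every field, and save a screenshot into the workspace."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    workspace.mkdir(parents=True, exist_ok=True)
    screenshot = workspace / "after_fill.png"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # One loop covers the whole form instead of one predicted action per field.
        for selector, value in fields.items():
            page.fill(selector, value)
        page.screenshot(path=str(screenshot))
        # The browser is disposable; the script and the screenshot are what persist.
        browser.close()
    return screenshot
```

Calling `fill_form("https://example.com/form", Path("outputs/demo"), FORM_FIELDS)` would rerun the whole interaction from scratch; the URL and output path here are placeholders.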

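The system has three core components: a Runner, a Model Endpoint, and a terminal Environment. The runner is about 150 lines of code, the model interface about 550 lines, and the environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, only a single agent loop. All intermediate code, logs, screenshots, and results are stored in the workspace, which makes each run easy to inspect.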

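The Agent Loop

The Runner sends the current context to the model. The model returns a thinking block and a shell command. That command runs in the Environment, which returns terminal output, logs, screenshots, or error tracebacks. These observations go back into the context, and the loop continues. Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions, such as selecting a date or filling out an entire form, as a compact program. Loops, functions, and abstractions let the agent generalize across similar tasks without repeatedly predicting similar sequences of low-level steps.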
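The loop described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not Webwright's actual Runner: the `Step` and `Runner` classes and the `scripted_model` stand-in for a real model endpoint are hypothetical names, and observations are reduced to plain terminal output.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    thinking: str        # the model's reasoning for this step
    command: str         # shell command to execute in the environment
    done: bool = False   # whether the model reports the task as complete


@dataclass
class Runner:
    model: Callable[[list], Step]
    context: list = field(default_factory=list)

    def run_command(self, command: str) -> str:
        """Run a shell command and return its combined stdout and stderr."""
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        return (result.stdout + result.stderr).strip()

    def run(self, task: str, max_steps: int = 100) -> list:
        """Send context to the model, execute its command, feed back the output."""
        self.context.append(f"TASK: {task}")
        for _ in range(max_steps):
            step = self.model(self.context)
            if step.done:
                break
            observation = self.run_command(step.command)
            self.context.append(f"$ {step.command}\n{observation}")
        return self.context


def scripted_model(context: list) -> Step:
    """Stand-in for an LLM endpoint: issues one command, then reports done."""
    if len(context) == 1:
        return Step("Check that the shell works.", "echo hello")
    return Step("Finished.", "", done=True)
```

Running `Runner(scripted_model).run("demo")` yields a two-entry context: the task line and the echoed command with its output. A real endpoint would replace `scripted_model` and would typically emit Playwright scripts rather than `echo`.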

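Two Engineering Challenges

The two core problems are premature 'done' and context explosion. With open-ended bash actions, the model must report completion itself, and it often claims success when the task is not actually finished. The researchers addressed this with a gate: before emitting done: true, the agent must generate a self-reflection config, run the final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgement, which outputs success or failure. Otherwise, the flag is dropped and the agent retries.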
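One way to picture the gate is as a check that stands between the model's claim and the final `done` flag. The sketch below is an assumption-laden illustration, not the project's implementation: `rerun_in_fresh_folder`, `gate_done`, and the `judge` callable (standing in for the self-reflection judgement) are hypothetical, and requiring a zero exit code is an extra condition added here for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def rerun_in_fresh_folder(script: Path) -> tuple:
    """Execute the final script from an empty temporary working directory."""
    with tempfile.TemporaryDirectory() as fresh:
        result = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=fresh, capture_output=True, text=True, timeout=300,
        )
    return result.returncode, result.stdout + result.stderr


def gate_done(claimed_done: bool, script: Path, judge: Callable[[str], bool]) -> bool:
    """Keep the done flag only if the rerun succeeds and the judge approves the log."""
    if not claimed_done:
        return False
    returncode, log = rerun_in_fresh_folder(script)
    # Assumption for this sketch: a non-zero exit code always fails the gate.
    return returncode == 0 and bool(judge(log))
```

If `gate_done` returns `False`, the caller would drop the flag and let the loop continue, which matches the retry behaviour described above.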

As for context length, long coding trajectories quickly exceed context limits, so the history is compacted into a single summary every 20 steps.
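The compaction rule is simple enough to sketch directly. The interval of 20 steps comes from the article; everything else here is an assumption for illustration, including the `maybe_compact` name and the `summarize` callable, which stands in for whatever produces the summary (presumably a model call).

```python
from typing import Callable

COMPACT_EVERY = 20  # interval stated in the article


def maybe_compact(
    history: list,
    step: int,
    summarize: Callable[[list], str],
    every: int = COMPACT_EVERY,
) -> list:
    """Replace the whole history with one summary entry on every `every`-th step."""
    if step > 0 and step % every == 0:
        return [summarize(history)]
    return history
```

In a loop, the caller would append each new observation and then call `maybe_compact(history, step, summarize)`, so the context never holds more than one summary plus the observations gathered since the last compaction.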

Benchmark Results

Webwright was evaluated on two benchmarks: Online-Mind2Web and Odysseys. Online-Mind2Web contains 300 tasks across 136 widely used sites and uses an automated LLM-as-a-Judge evaluation framework.

GPT-5.4 achieves 86.67% overall accuracy with a 100-step budget, the highest among all open-source harness recipes in the AutoEval category of the Online-Mind2Web benchmark. Claude Opus 4.7 reached 84.7% overall but did better on hard tasks at N=100 steps (80.5% versus 76.6% for GPT-5.4).

The researchers also reproduced a GPT-5.4 baseline in a conventional screenshot-based agent setting, where the model predicts x,y coordinates for clicks and typing actions.

Original Text (English)

Most web agents today drive a browser one action at a time. The model receives the current page state, as a screenshot or DOM text, and predicts the next click, keypress, or scroll. This action-at-a-time design made sense when language models had limited reasoning ability. As models have become more capable at writing and debugging code, that rigid loop has become a constraint rather than a structure that helps.

Microsoft Research's AI Frontiers lab built a different approach. Their new open-source framework, Webwright, gives the agent a terminal instead of a stateful browser session. The agent writes Playwright code to control browsers, runs bash commands, inspects logs, and iteratively refines scripts. Playwright is an open-source browser automation library, also from Microsoft, that supports programmatic control of Chromium, Firefox, and WebKit browsers.

What Webwright Does Differently

Webwright separates the agent from the browser and treats the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session but the code and logs in the local workspace. This is the same model a developer uses when writing an RPA (Robotic Process Automation) script. Instead of manually clicking through a site each time, they write a script once. That script can be rerun, adapted, and shared. Webwright applies this to LLM-powered agents.

The system has three core components: a Runner, a Model Endpoint, and a terminal Environment. The runner is about 150 lines of code, the model interface about 550 lines, and the environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, just a single agent loop.
All intermediate code, logs, screenshots, and results are stored in the workspace, making each run easy to inspect.

The Agent Loop

The Runner sends the current context to the model. The model returns a thinking block and a shell command. That command runs in the Environment, which returns terminal output, logs, screenshots, or error tracebacks. These observations go back into context, and the loop continues. Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions, such as selecting a date or filling out an entire form, as a compact program. Loops, functions, and abstractions allow the agent to generalize across similar tasks without repeatedly predicting similar sequences of low-level steps.

Two Engineering Challenges

Premature 'done' and context explosion are the two core issues. With open-ended bash actions, the model must self-report completion and often claims success without actually finishing. They added a gate: the agent must generate a self-reflection config, run a final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgement that outputs success or failure before emitting done: true. Otherwise, the flag is dropped and it retries. For context length, long coding trajectories quickly exceed context limits, so they compact history every 20 steps into a single summary.

Benchmark Results

Webwright was evaluated on two benchmarks: Online-Mind2Web and Odysseys. Online-Mind2Web contains 300 tasks across 136 widely used sites and uses an automated LLM-as-a-Judge evaluation framework.

GPT-5.4 achieves 86.67% overall accuracy with a 100-step budget, the highest among all open-sourced harness recipes in the AutoEval category of the Online-Mind2Web benchmark. Claude Opus 4.7 reached 84.7% overall but performed better on hard tasks at N=100 steps: 80.5% versus 76.6% for GPT-5.4.
They also reproduced a GPT-5.4 baseline in a conventional screenshot-based agent setting, where the model predicts x,y coordinates for clicks and typing actions. Using the same underlying model, Webwright achieves substantial gains across all three difficulty categories, highlighting the benefit of the code-driven, terminal-based approach over step-by-step coordinate prediction.

Odysseys evaluates long-horizon browsing tasks spanning multiple websites. Tasks average 272.3 words of instructions. In the April 2026 leaderboard, the best-performing model was Opus 4.6, with a top score of 44.5. Webwright powered by GPT-5.4 reaches 60.1%, a 35.1% relative improvement over the previous state of the art. Compared to the base GPT-5.4 performance of 33.5%, this corresponds to a 79.4% relative improvement, or 26.6 absolute points.

Cost Analysis

Claude Opus 4.7 needs fewer steps to solve each task (mean 21.9 steps) than GPT-5.4 (mean 26.3 steps). However, Claude Opus 4.7 is priced significantly higher than GPT-5.4 ($5 vs. $2.50 per 1M input tokens, and $25 vs. $15.00 per 1M output tokens, April 2026), which makes its average per-task cost higher ($6.09 vs. $2.37 for GPT-5.4). The first 50 steps deliver 82% accuracy, and the next 50 steps deliver 3–4 additional points.

Small Model Performance

The research team also tested Qwen3.5-9B on the hard split of Online-Mind2Web. When tasks are augmented with pre-built reusable tool scripts, Qwen3.5-9B achieves 66.2% on Online-Mind2Web websites with more than five tools. This shows that smaller, lower-cost models can handle complex web tasks when paired with a pre-built tool library.

Marktechpost's Visual Explainer: Webwright Quick Start Guide

01 / 05: Overview. What Is Webwright?

Webwright is an open-source, terminal-native web agent framework from Microsoft Research.
Instead of predicting one browser click at a time, the agent writes Playwright code, runs bash commands, and stores reusable scripts in a local workspace.

- ~1,000 lines of harness code across 3 modules, with no hidden orchestration
- Single agent loop: Runner, Model Endpoint, and terminal Environment
- 86.7% on Online-Mind2Web | 60.1% on Odysseys with GPT-5.4
- Backends: OpenAI, Anthropic, OpenRouter
- Scripts reusable in Claude Code, Codex, OpenClaw

    # GitHub repository
    github.com/microsoft/Webwright

02 / 05: Prerequisites. What You Need Before Installing

Confirm the following are ready before running any install commands.

- Python 3.10+: required minimum runtime
- Chromium: installed via Playwright in the next step
- API key: OpenAI, Anthropic, or OpenRouter
- Git: to clone the repository

    # Check your Python version
    python --version
    # Must return Python 3.10 or higher

03 / 05: Installation. Clone and Install Webwright

Clone the repo, install in editable mode, then install Chromium for Playwright browser control.

    # 1. Clone the repository
    git clone https://github.com/microsoft/Webwright
    cd Webwright

    # 2. Install the package in editable mode
    pip install -e .

    # 3. Install Chromium for Playwright
    playwright install chromium

The -e flag means local source edits apply immediately without reinstalling.

04 / 05: Running a Task. Run Your First Web Task

Export your API key, then pass a task instruction and start URL to the CLI.

    # Export your key
    export OPENAI_API_KEY="sk-..."
    export ANTHROPIC_API_KEY="sk-ant-..."

    # Run a task
    python -m webwright.run.cli \
      -c base.yaml -c model_openai.yaml \
      -t "Find cheapest economy flight SEA to JFK on 2026-05-15" \
      --start-url https://www.google.com/flights \
      --task-id demo_openai \
      -o outputs/default

Flag          Description
-c            Config file from src/webwright/config/ (stackable)
-t            Task instruction in plain English
--start-url   Initial URL for the browser session
--task-id     Output subfolder name
-o            Root output directory for logs and scripts

05 / 05: Claude Code Integration. Use Webwright as a Claude Code Skill

Webwright ships a built-in Claude Code skill. No separate LLM API key is needed beyond your Claude Code subscription. Claude Code reads PNG screenshots natively.

    # Proje
