MarkTechPost • 97 days ago

Google Cloud AI Unveils 'ReasoningBank,' a Memory Framework That Learns from Both Success and Failure

IMP

8/10

Key Summary

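Most of today's AI agents share a fundamental limitation: they do not remember the successes and failures of earlier tasks, so they repeat the same mistakes. Researchers from Google Cloud AI and collaborating universities have announced ReasoningBank, a memory framework that extracts the reasons behind successes and failures into reusable reasoning strategies and stores them. The framework raises agent success rates (from 40.5% to 48.8% on WebArena with Gemini-2.5-Flash), and combining it with MaTTS, a scaling technique that generates multiple trajectories and distills insights from them, improves autonomous problem solving further (56.3% versus 46.7% for the no-memory baseline on WebArena with Gemini-2.5-Pro).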

Translated Body

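Most AI agents today have a fundamental amnesia problem. Deploy one to browse the web, resolve GitHub issues, or navigate a shopping platform, and it approaches every task as if it had never seen anything like it. No matter how many times it has struggled with the same type of problem, it repeats the same mistakes. Valuable lessons evaporate the moment a task ends.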

Researchers from Google Cloud AI, the University of Illinois Urbana-Champaign (UIUC), and Yale University have introduced ReasoningBank, a memory framework that goes beyond recording what an agent did: it distills why an outcome succeeded or failed into reusable, generalizable reasoning strategies.

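The Limits of Existing Agent Memory

To see why ReasoningBank matters, it helps to understand how existing agent memory actually works. Two widely used approaches are trajectory memory (used in a system called Synapse) and workflow memory (used in Agent Workflow Memory, or AWM). Trajectory memory stores raw action logs: every click, scroll, and typed query the agent executed. Workflow memory goes a step further and extracts reusable step-by-step procedures from successful runs. Both have critical blind spots. Raw trajectories are noisy and too long to be directly useful for new tasks. Workflow memory mines only successful attempts, so the rich learning signal buried in the many failures is discarded entirely.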

How ReasoningBank Works

ReasoningBank runs as a closed-loop memory process with three stages around each completed task: memory retrieval, memory extraction, and memory consolidation. Before starting a new task, the agent queries ReasoningBank with embedding-based similarity search and retrieves the top-k most relevant memory items. The retrieved items are injected directly into the agent's system prompt as additional context. Notably, the default is k=1, a single memory item per task. Ablation experiments show that retrieving more memories actually hurts performance: the success rate falls from 49.7% at k=1 to 44.4% at k=4. The quality and relevance of retrieved memory matter far more than the quantity.
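The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, and the three-dimensional embeddings are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, memory_items, k=1):
    """Return the k items whose stored embeddings are most similar to the query."""
    ranked = sorted(
        memory_items,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def inject_into_system_prompt(base_prompt, retrieved):
    """Append retrieved items to the system prompt as extra context (wording is assumed)."""
    if not retrieved:
        return base_prompt
    lines = [f"- {item['title']}: {item['content']}" for item in retrieved]
    return base_prompt + "\n\nRelevant past experience:\n" + "\n".join(lines)


# Hypothetical memory items with toy embeddings, for illustration only.
MEMORY = [
    {"title": "Check pagination",
     "content": "Look for 'Next Page' links before treating a list as complete.",
     "embedding": [0.9, 0.1, 0.0]},
    {"title": "Verify filters",
     "content": "Confirm active search filters match the task requirements.",
     "embedding": [0.0, 1.0, 0.0]},
    {"title": "Re-read the task",
     "content": "Cross-check the current view against the task before submitting.",
     "embedding": [0.2, 0.2, 0.9]},
]

top = retrieve_top_k([1.0, 0.0, 0.0], MEMORY, k=1)
prompt = inject_into_system_prompt("You are a web agent.", top)
```

With k=1 the prompt receives only the closest item, which matches the default the article reports; raising k is a one-argument change.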

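Once the task is finished, a Memory Extractor, powered by the same backbone large language model (LLM) as the agent, analyzes the trajectory and distills it into structured memory items. Each item has three components: a title (a concise strategy name), a description (a one-sentence summary), and content (one to three sentences of distilled reasoning steps or operational insights). Crucially, the extractor treats successful and failed trajectories differently: successes contribute validated strategies, while failures supply counterfactual pitfalls and preventative lessons.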
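The three-part memory item maps naturally onto a small record type. The sketch below assumes Python dataclasses; the instruction strings are hypothetical placeholders, since the article does not reproduce the extractor's actual prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    title: str        # concise strategy name
    description: str  # one-sentence summary
    content: str      # 1-3 sentences of distilled reasoning or operational insight


def extraction_instruction(succeeded: bool) -> str:
    """Pick a different instruction per outcome (text is hypothetical, not from the paper)."""
    if succeeded:
        return "Summarize the strategy that made this trajectory succeed as reusable memory items."
    return "Identify what went wrong and state preventative lessons as reusable memory items."


def format_for_prompt(item: MemoryItem) -> str:
    """Render one item as plain text suitable for a prompt."""
    return f"{item.title}\n{item.description}\n{item.content}"


# An illustrative item distilled from a failed run (contents are invented).
example = MemoryItem(
    title="Do not stop at the first page",
    description="Lists may continue past the first page of results.",
    content="Check for pagination controls before concluding that an entry is missing.",
)
```

Keeping the success and failure instructions separate mirrors the article's point that the two kinds of trajectory contribute different kinds of lesson.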

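To decide whether a trajectory succeeded without access to ground-truth labels at test time, the system uses an LLM-as-a-Judge. Given the user query, the trajectory, and the final page state, the judge outputs a binary "Success" or "Failure" verdict. The judge does not need to be perfect: ablation experiments show that ReasoningBank remains robust even when judge accuracy drops to around 70%. New memory items are appended directly to the ReasoningBank store, which is maintained as JSON with pre-computed embeddings for fast cosine similarity search, completing the loop.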
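The judge-then-consolidate step can be sketched as follows. Everything here is an assumption made for illustration: `judge` and `extractor` are caller-supplied callables standing in for LLM calls, `toy_embed` is a placeholder rather than a real embedding model, and embedding the title plus description is a guess, since the article does not say which text is embedded.

```python
import json


def toy_embed(text, dim=8):
    """Placeholder embedding: bucket character codes into dim counters. NOT a real model."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def consolidate(store, new_items, embed=toy_embed):
    """Append new items to the store, pre-computing one embedding per item."""
    for item in new_items:
        record = dict(item)
        record["embedding"] = embed(item["title"] + " " + item["description"])
        store.append(record)
    return store


def close_loop(store, query, trajectory, final_state, judge, extractor, embed=toy_embed):
    """Judge a finished trajectory, extract memory items, and append them to the store."""
    verdict = judge(query, trajectory, final_state)
    if verdict not in ("Success", "Failure"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    items = extractor(trajectory, verdict == "Success")
    return consolidate(store, items, embed)


def dump_store(store):
    """Serialize the store, embeddings included, as JSON text."""
    return json.dumps(store, ensure_ascii=False, indent=2)


def load_store(text):
    """Load a store previously produced by dump_store."""
    return json.loads(text)
```

Because embeddings are computed once at write time, retrieval later only needs similarity arithmetic over stored vectors.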

MaTTS: Pairing Memory with Test-Time Scaling

The researchers go a step further and introduce memory-aware test-time scaling (MaTTS), which links ReasoningBank with test-time compute scaling, a technique that has already proven effective in math reasoning and coding tasks. The insight is simple but important: test-time scaling generates multiple trajectories for the same task. Instead of picking the single best answer and discarding the rest, MaTTS uses the full set of trajectories.
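The difference between picking one answer and keeping the whole set can be shown with a rough sketch. The function names and the trajectory fields (`id`, `ok`, `steps`) are hypothetical, and grouping by verdict is only one plausible way to organize the set before memory extraction.

```python
def best_of_n(trajectories, score):
    """Conventional test-time scaling: keep the top-scoring trajectory, drop the rest."""
    return max(trajectories, key=score)


def contrast_set(trajectories, judge):
    """Keep every trajectory, grouped by verdict, so both outcomes can inform memory."""
    groups = {"Success": [], "Failure": []}
    for trajectory in trajectories:
        groups[judge(trajectory)].append(trajectory)
    return groups


# Three invented rollouts for the same task.
ROLLOUTS = [
    {"id": 1, "ok": False, "steps": 5},
    {"id": 2, "ok": True, "steps": 7},
    {"id": 3, "ok": True, "steps": 4},
]

# Prefer successful rollouts, then shorter ones.
winner = best_of_n(ROLLOUTS, score=lambda t: (t["ok"], -t["steps"]))
groups = contrast_set(ROLLOUTS, judge=lambda t: "Success" if t["ok"] else "Failure")
```

Here `best_of_n` returns a single rollout, while `contrast_set` retains all three, including the failed one that selection alone would throw away.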

Original Text (English)

Most AI agents today have a fundamental amnesia problem. Deploy one to browse the web, resolve GitHub issues, or navigate a shopping platform, and it approaches every single task as if it has never seen anything like it before. No matter how many times it has stumbled on the same type of problem, it repeats the same mistakes. Valuable lessons evaporate the moment a task ends.

A team of researchers from Google Cloud AI, the University of Illinois Urbana-Champaign and Yale University introduces ReasoningBank, a memory framework that doesn't just record what an agent did: it distills why something worked or failed into reusable, generalizable reasoning strategies.

The Problem with Existing Agent Memory

To understand why ReasoningBank is important, you need to understand what existing agent memory actually does. Two popular approaches are trajectory memory (used in a system called Synapse) and workflow memory (used in Agent Workflow Memory, or AWM). Trajectory memory stores raw action logs: every click, scroll, and typed query an agent executed. Workflow memory goes a step further and extracts reusable step-by-step procedures from successful runs only. Both have critical blind spots. Raw trajectories are noisy and too long to be directly useful for new tasks. Workflow memory only mines successful attempts, which means the rich learning signal buried in every failure (and agents fail a lot) gets completely discarded.

How ReasoningBank Works

ReasoningBank operates as a closed-loop memory process with three stages that run around every completed task: memory retrieval, memory extraction, and memory consolidation.
Before an agent starts a new task, it queries ReasoningBank using embedding-based similarity search to retrieve the top-k most relevant memory items. Those items get injected directly into the agent's system prompt as additional context. Importantly, the default is k=1, a single retrieved memory item per task. Ablation experiments show that retrieving more memories actually hurts performance: success rate drops from 49.7% at k=1 to 44.4% at k=4. The quality and relevance of retrieved memory matter far more than quantity.

Once the task is finished, a Memory Extractor, powered by the same backbone LLM as the agent, analyzes the trajectory and distills it into structured memory items. Each item has three components: a title (a concise strategy name), a description (a one-sentence summary), and content (1–3 sentences of distilled reasoning steps or operational insights). Crucially, the extractor treats successful and failed trajectories differently: successes contribute validated strategies, while failures supply counterfactual pitfalls and preventative lessons.

To decide whether a trajectory was successful or not, without access to ground-truth labels at test time, the system uses an LLM-as-a-Judge, which outputs a binary "Success" or "Failure" verdict given the user query, the trajectory, and the final page state. The judge doesn't need to be perfect; ablation experiments show ReasoningBank remains robust even when judge accuracy drops to around 70%. New memory items are then appended directly to the ReasoningBank store, maintained as JSON with pre-computed embeddings for fast cosine similarity search, completing the loop.

MaTTS: Pairing Memory with Test-Time Scaling

The research team goes further and introduces memory-aware test-time scaling (MaTTS), which links ReasoningBank with test-time compute scaling, a technique that has already proven powerful in math reasoning and coding tasks.
The insight is simple but important: scaling at test time generates multiple trajectories for the same task. Instead of just picking the best answer and discarding the rest, MaTTS uses the full set of trajectories as rich contrastive signals for memory extraction.

MaTTS comes in two forms. Parallel scaling generates k independent trajectories for the same query, then uses self-contrast (comparing what went right and wrong across all trajectories) to extract higher-quality, more reliable memory items. Sequential scaling iteratively refines a single trajectory using self-refinement, capturing intermediate corrections and insights as memory signals. The result is a positive feedback loop: better memory guides the agent toward more promising rollouts, and richer rollouts forge even stronger memory. The paper notes that at k=5, parallel scaling (55.1% SR) edges out sequential scaling (54.5% SR) on WebArena-Shopping: sequential gains saturate quickly once the model reaches a decisive success or failure, while parallel scaling keeps providing diverse rollouts that the agent can contrast and learn from.

Results Across Three Benchmarks

Tested on WebArena (a web navigation benchmark spanning shopping, admin, GitLab, and Reddit tasks), Mind2Web (which tests generalization across cross-task, cross-website, and cross-domain settings), and SWE-Bench-Verified (a repository-level software engineering benchmark with 500 verified instances), ReasoningBank consistently outperforms all baselines across all three datasets and all tested backbone models. On WebArena with Gemini-2.5-Flash, ReasoningBank improved overall success rate by +8.3 percentage points over the memory-free baseline (40.5% → 48.8%), while reducing average interaction steps by up to 1.4 compared to no-memory and up to 1.6 compared to other memory baselines.
The efficiency gains are sharpest on successful trajectories: on the Shopping subset, for example, ReasoningBank cut 2.1 steps from successful task completions (a 26.9% relative reduction). The agent reaches solutions faster because it knows the right path, not simply because it gives up on failed attempts sooner.

On Mind2Web, ReasoningBank delivers consistent gains across cross-task, cross-website, and cross-domain evaluation splits, with the most pronounced improvements in the cross-domain setting, where the highest degree of strategy transfer is required and where competing methods like AWM actually degrade relative to the no-memory baseline.

On SWE-Bench-Verified, results vary meaningfully by backbone model. With Gemini-2.5-Pro, ReasoningBank achieves a 57.4% resolve rate versus 54.0% for the no-memory baseline, saving 1.3 steps per task. With Gemini-2.5-Flash, the step savings are more dramatic: 2.8 fewer steps per task (30.3 → 27.5) alongside a resolve rate improvement from 34.2% to 38.8%.

Adding MaTTS (parallel scaling, k=5) pushes results further. ReasoningBank with MaTTS reaches 56.3% overall SR on WebArena with Gemini-2.5-Pro, compared to 46.7% for the no-memory baseline, while also reducing average steps from 8.8 to 7.1 per task.

Emergent Strategy Evolution

One of the most striking findings is that ReasoningBank's memory doesn't stay static; it evolves. In a documented case study, the agent's initial memory items for a "User-Specific Information Navigation" strategy resemble simple procedural checklists: "actively look for and click on 'Next Page,' 'Page X,' or 'Load More' links."
As the agent accumulates experience, those same memory items mature into adaptive self-reflections, then into systematic pre-task checks, and eventually into compositional strategies like "regularly cross-reference the current view with the task requirements; if current data doesn't align with expectations, reassess available options such as search filters and alternative sections." The research team describes this as emergent behavior resembling the learning dynamics of reinforcement learning, happening entirely at test time, without any model weight updates.

Key Takeaways

Failure is finally a learning signal: Unlike existing agent memory systems (Synapse, AWM) that only learn from successful trajectories, R

Agent, Memory Framework, Google Cloud, Large Language Model, Self-Learning