MarkTechPost • 111 days ago

Google Unveils a Multi-Agent Framework That Automatically Completes Papers from Unstructured Data

IMP

8/10

Key Summary

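A Google Cloud AI Research team has proposed 'PaperOrchestra', a multi-agent system that turns nothing more than a researcher's rough ideas and experimental logs into a conference-ready LaTeX paper. The system automates literature review, figure generation, API-based citation verification, and body writing, addressing the limitations of earlier automated research tools.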

Translated Text

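Writing a research paper is grueling work. Even after the experiments are finished, a researcher still spends weeks turning messy lab notes, scattered results tables, and half-formed ideas into a logical, polished manuscript that follows a conference's detailed format. For many new researchers, this conversion step is where papers stall.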

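The Google Cloud AI Research team proposes 'PaperOrchestra', a multi-agent system that autonomously converts unstructured pre-writing materials (a rough idea summary and raw experimental logs) into a submission-ready LaTeX manuscript. The finished manuscript includes a literature review, generated figures, and API-verified citations.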

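The Core Problem It's Solving

Earlier automated writing systems such as PaperRobot could generate text incrementally, but they could not handle the full complexity that a data-driven scientific narrative demands. Recent end-to-end autonomous research frameworks, AI Scientist-v1 (which introduced automated experimentation and drafting via code templates) and its successor AI Scientist-v2 (which increases autonomy using agentic tree search), automate the entire research loop, but their writing modules are tightly coupled to their own internal experimental pipelines. Handing them your data does not yield a paper; they do not function as standalone writing tools.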

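Meanwhile, systems specialized in literature review, such as AutoSurvey2 and LiRA, produce comprehensive surveys but lack the contextual awareness needed to write a targeted Related Work section that clearly compares and positions a specific new method against prior art. CycleResearcher requires a pre-existing structured BibTeX reference list as input, an artifact that is rarely available early in the writing process, and it fails entirely on unstructured inputs.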

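The result was a gap: no existing tool could take the unconstrained, human-provided materials that a real researcher is likely to have after finishing experiments and independently produce a complete, rigorous manuscript. PaperOrchestra is designed specifically to fill that gap.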

How the Pipeline Works

PaperOrchestra orchestrates five specialized agents that work in sequence, with two of them running in parallel.

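Step 1: Outline Agent. This agent reads the idea summary, the experimental log, the LaTeX conference template, and the conference guidelines, then produces a structured outline in JSON. The outline contains a visualization plan (specifying which plots and diagrams to generate), a targeted literature search strategy that separates macro-level context for the Introduction from micro-level methodology clusters for Related Work, and a section-level writing plan with citation hints for every dataset, optimizer, evaluation metric, and baseline method mentioned in the provided materials.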
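To make the shape of such an outline concrete, here is a minimal sketch of what a JSON outline with those three parts might look like. The article does not give the actual schema, so every key name and value below is a hypothetical placeholder chosen for illustration.

```python
import json

# Hypothetical outline structure. Every key name and value is an assumption
# made for illustration; it is not the schema used by PaperOrchestra.
outline = {
    # Which plots and diagrams the Plotting Agent should produce.
    "visualization_plan": [
        {"figure_id": "fig_overview", "kind": "diagram",
         "goal": "Show the overall method pipeline"},
        {"figure_id": "fig_main_results", "kind": "plot",
         "goal": "Compare the method against baselines"},
    ],
    # Search strategy split into macro context and micro method clusters.
    "literature_search": {
        "introduction_macro_context": ["broad problem area and motivation"],
        "related_work_method_clusters": ["prior method family A",
                                         "prior method family B"],
    },
    # Per-section writing plan with hints about what needs a citation.
    "section_plan": [
        {
            "section": "Experiments",
            "key_points": ["report the main results table",
                           "describe datasets and metrics"],
            "citation_hints": ["dataset", "optimizer", "metric", "baseline"],
        }
    ],
}

outline_json = json.dumps(outline, indent=2)
print(outline_json)
```

Keeping the plan as plain JSON means the downstream agents can each read only the part they need, for example the plotting step reading `visualization_plan` alone.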

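Steps 2 and 3: Plotting Agent and Literature Review Agent (run in parallel). The Plotting Agent executes the visualization plan using PaperBanana, an academic illustration tool. PaperBanana uses a Vision-Language Model (VLM) critic to evaluate generated images against the design objectives and revise them iteratively. At the same time, the Literature Review Agent runs a two-phase citation pipeline. It first uses a large language model (LLM) equipped with web search to identify candidate papers, then verifies each one through the Semantic Scholar API. During verification it checks for a valid fuzzy title match using Levenshtein distance, retrieves the abstract and metadata, and enforces a temporal cutoff tied to the conference's submission deadline. Hallucinated or unverifiable references are all discarded. The verified citations are compiled into a BibTeX file, which the agent uses to draft the Introduction and Related Work sections. A hard constraint applies here: at least 90% of the gathered literature pool must actually be cited.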
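The verification logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's implementation: the `lookup` callable stands in for a Semantic Scholar query and makes no real API call, the record fields `title`, `published`, and `key` are hypothetical, and the 0.1 normalized-distance threshold is an assumed value that the article does not specify.

```python
from datetime import date
from typing import Callable, Iterable, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def titles_match(proposed: str, found: str, max_ratio: float = 0.1) -> bool:
    """Fuzzy match: normalized edit distance must not exceed max_ratio (assumed)."""
    p, f = proposed.lower().strip(), found.lower().strip()
    longest = max(len(p), len(f))
    if longest == 0:
        return False
    return levenshtein(p, f) / longest <= max_ratio


def verify_candidates(candidates: Iterable[str],
                      lookup: Callable[[str], Optional[dict]],
                      cutoff: date) -> List[dict]:
    """Keep records that are found, title-matched, and published by the cutoff."""
    verified = []
    for title in candidates:
        record = lookup(title)  # hypothetical stand-in for a metadata lookup
        if record is None:
            continue  # unverifiable reference: discard
        if not titles_match(title, record["title"]):
            continue  # title mismatch, possibly hallucinated: discard
        if record["published"] > cutoff:
            continue  # published after the submission deadline: discard
        verified.append(record)
    return verified


def coverage_ok(verified_keys: Iterable[str], cited_keys: Iterable[str],
                minimum: float = 0.9) -> bool:
    """Check that at least `minimum` of the verified pool is actually cited."""
    pool = set(verified_keys)
    if not pool:
        return True
    return len(pool & set(cited_keys)) / len(pool) >= minimum
```

A draft of the Introduction and Related Work sections would pass the final check only if `coverage_ok` returns `True` for the citation keys it uses.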

Step 4: Section Writing Agent. This agent takes everything generated so far.


Original Text (English)

Writing a research paper is brutal. Even after the experiments are done, a researcher still faces weeks of translating messy lab notes, scattered results tables, and half-formed ideas into a polished, logically coherent manuscript formatted precisely to a conference's specifications. For many fresh researchers, that translation work is where papers go to die.

A team at Google Cloud AI Research proposes 'PaperOrchestra', a multi-agent system that autonomously converts unstructured pre-writing materials (a rough idea summary and raw experimental logs) into a submission-ready LaTeX manuscript, complete with a literature review, generated figures, and API-verified citations.

The Core Problem It's Solving

Earlier automated writing systems, like PaperRobot, could generate incremental text sequences but couldn't handle the full complexity of a data-driven scientific narrative. More recent end-to-end autonomous research frameworks like AI Scientist-v1 (which introduced automated experimentation and drafting via code templates) and its successor AI Scientist-v2 (which increases autonomy using agentic tree-search) automate the entire research loop, but their writing modules are tightly coupled to their own internal experimental pipelines. You can't just hand them your data and expect a paper. They're not standalone writers.

Meanwhile, systems specialized in literature reviews, such as AutoSurvey2 and LiRA, produce comprehensive surveys but lack the contextual awareness to write a targeted Related Work section that clearly positions a specific new method against prior art. CycleResearcher requires a pre-existing structured BibTeX reference list as input, an artifact rarely available at the start of writing, and fails entirely on unstructured inputs.
The result is a gap: no existing tool could take unconstrained human-provided materials (the kind of thing a real researcher might actually have after finishing experiments) and produce a complete, rigorous manuscript on its own. PaperOrchestra is built specifically to fill that gap.

How the Pipeline Works

PaperOrchestra orchestrates five specialized agents that work in sequence, with two running in parallel:

Step 1: Outline Agent. This agent reads the idea summary, experimental log, LaTeX conference template, and conference guidelines, then produces a structured JSON outline. This outline includes a visualization plan (specifying what plots and diagrams to generate), a targeted literature search strategy separating macro-level context for the Introduction from micro-level methodology clusters for the Related Work, and a section-level writing plan with citation hints for every dataset, optimizer, metric, and baseline method mentioned in the materials.

Steps 2 and 3: Plotting Agent and Literature Review Agent (parallel). The Plotting Agent executes the visualization plan using PaperBanana, an academic illustration tool that uses a Vision-Language Model (VLM) critic to evaluate generated images against design objectives and iteratively revise them. Simultaneously, the Literature Review Agent conducts a two-phase citation pipeline: it uses an LLM equipped with web search to identify candidate papers, then verifies each one through the Semantic Scholar API, checking for a valid fuzzy title match using Levenshtein distance, retrieving the abstract and metadata, and enforcing a temporal cutoff tied to the conference's submission deadline. Hallucinated or unverifiable references are discarded. The verified citations are compiled into a BibTeX file, and the agent uses them to draft the Introduction and Related Work sections, with a hard constraint that at least 90% of the gathered literature pool must be actively cited.
Step 4: Section Writing Agent. This agent takes everything generated so far (the outline, the verified citations, the generated figures) and authors the remaining sections: abstract, methodology, experiments, and conclusion. It extracts numeric values directly from the experimental log to construct tables and integrates the generated figures into the LaTeX source.

Step 5: Content Refinement Agent. Using AgentReview, a simulated peer-review system, this agent iteratively optimizes the manuscript. After each revision, the manuscript is accepted only if the overall AgentReview score increases, or ties with net non-negative sub-axis gains. Any overall score decrease triggers an immediate revert and halt. Ablation results show this step is critical: refined manuscripts dominate unrefined drafts with 79%–81% win rates in automated side-by-side comparisons, and deliver absolute acceptance rate gains of +19% on CVPR and +22% on ICLR in AgentReview simulations.

The full pipeline makes approximately 60–70 LLM API calls and completes in a mean of 39.6 minutes per paper, only about 4.5 minutes more than AI Scientist-v2's 35.1 minutes, despite running significantly more LLM calls (40–45 for AI Scientist-v2 vs. 60–70 for PaperOrchestra).

The Benchmark: PaperWritingBench

The research team also introduces PaperWritingBench, described as the first standardized benchmark specifically for AI research paper writing. It contains 200 accepted papers from CVPR 2025 and ICLR 2025 (100 from each venue), selected to test adaptation to different conference formats: double-column for CVPR versus single-column for ICLR.
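Returning to the Content Refinement Agent in Step 5, its accept-or-revert rule can be sketched as below. This is one reading of the rule as the article states it, not the system's code: "net non-negative sub-axis gains" is interpreted here as the sum of per-axis score changes being at least zero, and the `revise` and `review` callables, the `max_rounds` limit, and the choice to keep trying after a rejected tie are all assumptions.

```python
from typing import Callable, Dict, Tuple

# A review is (overall score, per-axis scores). The axis names are hypothetical.
Review = Tuple[float, Dict[str, float]]


def accept_revision(old: Review, new: Review) -> bool:
    """Accept if the overall score rises, or ties with net non-negative axis gains."""
    old_overall, old_axes = old
    new_overall, new_axes = new
    if new_overall > old_overall:
        return True
    if new_overall == old_overall:
        # Assumed reading: sum the per-axis deltas and require a total >= 0.
        net_gain = sum(new_axes[k] - old_axes[k] for k in old_axes)
        return net_gain >= 0
    return False


def refine(manuscript: str,
           revise: Callable[[str], str],
           review: Callable[[str], Review],
           max_rounds: int = 3) -> str:
    """Iteratively revise, keeping the last accepted version of the manuscript."""
    best, best_review = manuscript, review(manuscript)
    for _ in range(max_rounds):
        candidate = revise(best)
        candidate_review = review(candidate)
        if candidate_review[0] < best_review[0]:
            break  # overall score dropped: keep `best` (revert) and halt
        if accept_revision(best_review, candidate_review):
            best, best_review = candidate, candidate_review
        # A tie with a net negative axis change is rejected; this sketch
        # simply tries another revision on the next round (an assumption).
    return best
```

Because a candidate replaces `best` only when it is accepted, the loop can never return a manuscript that scores lower overall than the one it started with.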
For each paper, an LLM was used to reverse-engineer two inputs from the published PDF: a Sparse Idea Summary (high-level conceptual description, no math or LaTeX) and a Dense Idea Summary (retaining formal definitions, loss functions, and LaTeX equations), alongside an Experimental Log derived by extracting all numeric data and converting figure insights into standalone factual observations. All materials were fully anonymized, stripping author names, titles, citations, and figure references.

This design isolates the writing task from any specific experimental pipeline, using real accepted papers as ground truth, and it reveals something important. For Overall Paper Quality, the Dense idea setting substantially outperforms Sparse (43%–56% win rates vs. 18%–24%), since more precise methodology descriptions enable more rigorous section writing. But for Literature Review Quality, the two settings are nearly equal (Sparse: 32%–40%, Dense: 28%–39%), meaning the Literature Review Agent can autonomously identify research gaps and relevant citations without relying on detail-heavy human inputs.

The Results

In automated side-by-side (SxS) evaluations using both Gemini-3.1-Pro and GPT-5 as judge models, PaperOrchestra dominated on literature review quality, achieving absolute win margins of 88%–99% over AI baselines. For overall paper quality, it outperformed AI Scientist-v2 by 39%–86% and the Single Agent by 52%–88% across all settings.

Human evaluation, conducted with 11 AI researchers across 180 paired manuscript comparisons, confirmed the automated results. PaperOrchestra achieved absolute win rate margins of 50%–68% over AI baselines in literature review quality, and 14%–38% in overall manuscript quality. It also achieved a 43% tie/win rate against the human-written ground truth in literature synthesis, a notable result for a fully automated system.

The citation coverage numbers tell a particularly clear story. AI baselines averaged only 9.75–14.18 citations per paper, inflating their F1 scores on the must-cite (P0) reference category while leaving "good-to-cite" (P1) recall near zero. PaperOrchestra generated an average of 45.73–47.98 citations, closely mirroring the ~59 citations found in human-written papers, and improved P1 Recall by 12.59%–13.75% over the strongest baselines. Under the ScholarPeer evaluation framework, PaperOrchestra achieved simulated acceptance rates of 84% on CVPR and 81%
