메뉴
HN
Hacker News 11일 전

8B 모델 에이전트 성능 53%→99% 끌어올린 가드레일 'Forge'

IMP
8/10
핵심 요약

자체 호스팅되는 소형 LLM(8B)의 도구 호출 및 에이전트 성능을 극적으로 끌어올려주는 'Forge' 라이브러리가 소개되었습니다. 파싱 오류 복구, 재시도 넛지, 컨텍스트 관리 등의 가드레일 기술을 통해 소형 모델로도 복잡한 다단계 에이전트 워크플로우에서 99%에 육박하는 높은 성공률을 기록할 수 있습니다. OpenAI 호환 프록시 서버 모드를 지원하여 기존 클라이언트(예: Cursor, Continue 등)에 쉽게 통합해 성능을 높일 수 있는 것이 큰 장점입니다.

번역된 본문

Forge: 자체 호스팅 LLM 도구 호출을 위한 신뢰성 레이어

Forge는 가드레일(파싱 오류 복구, 재시도 넛지, 단계 강제) 및 컨텍스트 관리(VRAM 인식 예산, 계층형 압축)를 통해 8B 로컬 모델을 다단계 에이전트 워크플로우에서 최고 수준으로 끌어올려 줍니다. 현재 최고의 자체 호스팅 구성(Ministral-3 8B Instruct Q8, llama-server 기반)은 Forge의 26가지 시나리오 평가 스위트 전체에서 86.5%를 기록했으며, 가장 어려운 단계에서는 76%를 기록했습니다.

활용 방법은 다음과 같습니다:

WorkflowRunner — 도구를 정의하고, 백엔드를 선택하여 구조화된 에이전트 루프를 실행합니다. Forge는 시스템 프롬프트, 도구 실행, 컨텍스트 압축 및 가드레일에 이르는 전체 수명 주기를 관리합니다.

SlotWorker — 자동 선점 기능을 통해 공유 추론 슬롯에 대한 우선순위 대기열 액세스를 추가합니다. 전문 워크플로우가 하나의 GPU 슬롯을 공유하는 다중 에이전트 아키텍처에 적합합니다. Forge를 직접 기반으로 구축할 때 가장 좋습니다.

가드레일 미들웨어(Middleware) — 자체 오케스트레이션 루프 내에서 Forge의 신뢰성 스택(조합 가능한 미들웨어)을 사용합니다. 루프는 사용자가 제어하며, Forge는 응답을 검증하고, 형식이 잘못된 도구 호출을 복구하며, 필수 단계를 강제합니다.

프록시 서버(Proxy server) — 모든 클라이언트(opencode, Continue, aider 등)와 로컬 모델 서버 사이에 위치하는 즉시 사용 가능한 OpenAI 호환 프록시(python -m forge.proxy)입니다. 가드레일을 투명하게 적용하므로 클라이언트는 더 똑똑한 모델과 대화하고 있다고 생각하게 됩니다. Ollama, llama-server(llama.cpp), Llamafile 및 Anthropic을 백엔드로 지원합니다.

요구 사항: Python 3.12+ 실행 중인 LLM 백엔드(아래 참조)

설치: pip install forge-guardrails # 핵심 모듈만 pip install "forge-guardrails[anthropic]" # Anthropic 클라이언트 포함

개발용 설치: git clone https://github.com/antoinezambelli/forge.git cd forge pip install -e ".[dev]"

백엔드 설정 (택 1): llama-server (권장 — 평가 상위 10개 구성 모두 llama-server에서 실행됨):

https://github.com/ggml-org/llama.cpp/releases 에서 설치

llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Ollama (대안 — 설정이 더 쉽지만, 까다로운 워크로드에서는 성능이 약간 낮음):

https://ollama.com/download 에서 설치

ollama pull ministral-3:8b-instruct-2512-q4_K_M

Anthropic (API, 로컬 GPU 불필요): pip install -e ".[anthropic]" export ANTHROPIC_API_KEY=sk-...

자세한 지침은 백엔드 설정(Backend Setup)을, 하드웨어에 맞는 모델은 모델 가이드(Model Guide)를 참조하세요.

빠른 시작: import asyncio from pydantic import BaseModel, Field from forge import ( Workflow, ToolDef, ToolSpec, WorkflowRunner, OllamaClient, ContextManager, TieredCompact, )

def get_weather(city: str) -> str: return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel): city: str = Field(description="City name")

workflow = Workflow( name="weather", description="Look up weather for a city.", tools={ "get_weather": ToolDef( spec=ToolSpec( name="get_weather", description="Get current weather", parameters=GetWeatherParams, ), callable=get_weather, ), }, required_steps=[], terminal_tool="get_weather", system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.", )

async def main(): client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True) ctx = ContextManager( strategy=TieredCompact(keep_recent=2), budget_tokens=8192 ) runner = WorkflowRunner(client=client, context_manager=ctx) await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

다단계 워크플로우, 멀티턴 대화 및 백엔드 자동 관리에 대해서는 사용자 가이드(User Guide)를 참조하세요. 장기 실행 세션(CLI, 채팅 서버, 음성 비서)을 구축하는 경우, 일시적인 메시지 필터링에 대한 중요한 지침은 장기 실행 세션 권고안(long-running session advisory)을 참조하세요.

프록시 서버: 로컬 모델 서버를 즉시 대체할 수 있습니다. OpenAI 호환 클라이언트를 프록시로 가리키면 Forge의 가드레일을 무료로 얻을 수 있습니다.

외부 모드 — 사용자가 llama-server를 관리하고 Forge가 이를 프록시함

python -m forge.proxy --backend-url http://localhost:8080 --port 8081

관리형... (원문 누락)

원문 보기
원문 보기 (영어)
forge A reliability layer for self-hosted LLM tool-calling. Forge lifts an 8B local model to the top of its class on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction). The current top self-hosted config (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across forge's 26-scenario eval suite — and 76% on the hardest tier. Three ways to use it: WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly. Guardrails middleware — Use forge's reliability stack ( composable middleware ) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps. Proxy server — Drop-in OpenAI-compatible proxy ( python -m forge.proxy ) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model. Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends. Requirements Python 3.12+ A running LLM backend (see below) Install pip install forge-guardrails # core only pip install " forge-guardrails[anthropic] " # + Anthropic client For development: git clone https://github.com/antoinezambelli/forge.git cd forge pip install -e " .[dev] " Backend setup (pick one) llama-server (recommended — top 10 eval configs all run on llama-server): # Install from https://github.com/ggml-org/llama.cpp/releases llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080 Ollama (alternative — easier setup, slightly weaker on harder workloads): # Install from https://ollama.com/download ollama pull ministral-3:8b-instruct-2512-q4_K_M Anthropic (API, no local GPU needed): pip install -e " .[anthropic] " export ANTHROPIC_API_KEY=sk-... See Backend Setup for full instructions and Model Guide for which model fits your hardware. Quick Start import asyncio from pydantic import BaseModel , Field from forge import ( Workflow , ToolDef , ToolSpec , WorkflowRunner , OllamaClient , ContextManager , TieredCompact , ) def get_weather ( city : str ) -> str : return f"72°F and sunny in { city } " class GetWeatherParams ( BaseModel ): city : str = Field ( description = "City name" ) workflow = Workflow ( name = "weather" , description = "Look up weather for a city." , tools = { "get_weather" : ToolDef ( spec = ToolSpec ( name = "get_weather" , description = "Get current weather" , parameters = GetWeatherParams , ), callable = get_weather , ), }, required_steps = [], terminal_tool = "get_weather" , system_prompt_template = "You are a helpful assistant. Use the available tools to answer the user." , ) async def main (): client = OllamaClient ( model = "ministral-3:8b-instruct-2512-q4_K_M" , recommended_sampling = True ) ctx = ContextManager ( strategy = TieredCompact ( keep_recent = 2 ), budget_tokens = 8192 ) runner = WorkflowRunner ( client = client , context_manager = ctx ) await runner . run ( workflow , "What's the weather in Paris?" ) asyncio . run ( main ()) For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide . If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages. Proxy Server Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free. # External mode — you manage llama-server, forge proxies it python -m forge.proxy --backend-url http://localhost:8080 --port 8081 # Managed mode — forge starts llama-server and the proxy together python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081 Then configure your client to use http://localhost:8081/v1 as the API base URL. Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response ( finish_reason: "stop" ) and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis. Backends Backend Best for Native FC? Ollama Easiest setup, model management built-in Yes llama-server Best performance, full control Yes (with --jinja ) Llamafile Single binary, zero dependencies No (prompt-injected) Anthropic Frontier baseline, hybrid workflows Yes See Backend Setup for installation and Model Guide for which model to pick. Running Tests python -m pytest tests/ -v --tb=short python -m pytest tests/ --cov=forge --cov-report=term-missing Eval Harness 26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See Eval Guide for full CLI reference. # llama-server (start in another terminal first; see Eval Guide) python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf " path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf " --runs 10 --stream --verbose # Batch eval (JSONL output, automatic resume) python -m tests.eval.batch_eval --config all --runs 50 # Reports (ASCII table, HTML dashboard, markdown views) python -m tests.eval.report eval_results.jsonl Project Structure src/forge/ __init__.py # Public API exports errors.py # ForgeError hierarchy server.py # setup_backend(), ServerManager, BudgetMode core/ messages.py # Message, MessageRole, MessageType, MessageMeta workflow.py # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow inference.py # run_inference() — shared front half (compact, fold, validate, retry) runner.py # WorkflowRunner — the agentic loop slot_worker.py # SlotWorker — priority-queued slot access steps.py # StepTracker guardrails/ nudge.py # Nudge dataclass response_validator.py # ResponseValidator, ValidationResult step_enforcer.py # StepEnforcer, StepCheck error_tracker.py # ErrorTracker clients/ base.py # ChunkType, StreamChunk, LLMClient protocol ollama.py # OllamaClient (native FC) llamafile.py # LlamafileClient (native FC or prompt-injected) anthropic.py # AnthropicClient (frontier baseline) context/ manager.py # ContextManager, CompactEvent strategies.py # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact hardware.py # HardwareProfile, detect_hardware() prompts/ templates.py # Tool prompt builders (prompt-injected path) nudges.py # Retry and step-enforcement nudge templates tools/ respond.py # Synthetic respond tool (respond_tool(), respond_spec()) proxy/ proxy.py # ProxyServer — programmatic start/stop API server.py # Raw asyncio HTTP server, SSE streaming handler.py # Request handler — bridge between HTTP and run_inference convert.py # OpenAI messages ↔ forge Messages conversion tests/ unit/ # 865 deterministic tests — no LLM backend required eval/ # Eval harness — model qualification against real backends Documentation User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory Model Guide — Which model and backend for your hardware Backend Setup — Backend installation and server setup Eval Guide — Eval harness CLI reference, batch eval Architecture — Full design document Workflow Internals — Workflow design and runner internals Contributing — How to set u