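This tutorial shows how to build a pipeline for tracing, prompt management, evaluation, and experiments with Langfuse, an open-source LLM engineering platform. A built-in mock LLM lets you practice every core feature without a paid API key, which makes it useful for testing before adopting Langfuse in production. Along the way, you learn how to observe the behavior of an LLM application and improve it systematically.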
In this tutorial, we implement a Langfuse (an open-source LLM engineering platform) pipeline for tracing, prompt management, scoring, datasets, and experiments. We build a complete workflow that runs with either a real OpenAI API key or a deterministic mock LLM, so we can understand every major Langfuse feature without depending on paid model access.
We start by setting up credentials and connecting to Langfuse. We trace simple function calls, instrument a small RAG pipeline, manage prompts centrally, attach evaluation scores, and run dataset-based experiments. We also see how Langfuse helps us observe, evaluate, and improve LLM applications in a structured, production-ready way.
[Environment setup]
We start with code that installs the required Langfuse and OpenAI packages. We then collect the Langfuse credentials, choose the correct Langfuse region or self-hosted URL, and optionally accept an OpenAI API key. Finally, we initialize the Langfuse client, verify authentication, and confirm whether we are using OpenAI or the built-in mock LLM.
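The region choice described above can be isolated into a small pure function, which makes it easy to check without prompting for input. This is a minimal sketch under stated assumptions: `resolve_host`, `EU_HOST`, and `US_HOST` are hypothetical names, not part of the Langfuse SDK, and the two cloud URLs are the ones used in this tutorial's setup code. The sketch keeps a pasted self-hosted URL exactly as typed, since URL paths can be case-sensitive.

```python
# Hypothetical helper for illustration; not part of the Langfuse SDK.
EU_HOST = "https://cloud.langfuse.com"
US_HOST = "https://us.cloud.langfuse.com"


def resolve_host(choice: str) -> str:
    """Map a region answer ("", "EU", "US") or a pasted URL to a Langfuse host."""
    raw = choice.strip()
    if raw.lower().startswith(("http://", "https://")):
        # Self-hosted: keep the URL as typed, dropping only a trailing slash.
        return raw.rstrip("/")
    if raw.lower() == "us":
        return US_HOST
    # Anything else, including an empty answer, falls back to the EU default.
    return EU_HOST


for answer in ("", " US ", "https://langfuse.example.internal/"):
    print(repr(answer), "->", resolve_host(answer))
```

Keeping this logic free of `input()` calls means the interactive prompt only has to pass its answer through, and the mapping itself can be covered by a plain unit test.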
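[LLM calls and the mock LLM]
When a real OpenAI key is available, this code loads the Langfuse-integrated OpenAI SDK and uses it to generate responses. For the case where no OpenAI key is provided, it implements a mock LLM built on a predefined table of country capitals and simple response logic. This lets us exercise Langfuse's tracing and generation features and inspect the results visually.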
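A deterministic mock like the one described above is easy to exercise offline. The sketch below is an illustrative variant, not the tutorial's exact helper: `mock_answer`, `CAPITALS`, and `FALLBACK` are hypothetical names. It matches country names on word boundaries, so "Indiana" does not trigger the entry for India, which a plain substring test would.

```python
import re

# Small lookup table standing in for a model (countries as in the tutorial's mock).
CAPITALS = {
    "france": "Paris",
    "germany": "Berlin",
    "japan": "Tokyo",
    "italy": "Rome",
    "spain": "Madrid",
    "india": "New Delhi",
}

FALLBACK = "This is a mock response."


def mock_answer(user_text: str) -> str:
    """Return a canned answer; the same input always gives the same output."""
    text = user_text.lower()
    for country, capital in CAPITALS.items():
        # \b anchors the match to whole words, so "indiana" does not match "india".
        if re.search(rf"\b{re.escape(country)}\b", text):
            return capital
    return FALLBACK


print(mock_answer("What is the capital of Japan?"))  # Tokyo
print(mock_answer("Tell me about Indiana."))         # This is a mock response.
```

Because the output depends only on the input text, traces produced with such a mock are reproducible from run to run, which helps when comparing experiments.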
    import subprocess, sys

    def pip_install(*pkgs):
        # Install (or upgrade) packages quietly with the current interpreter's pip.
        subprocess.run([sys.executable, "-m", "pip", "install", "-qU", *pkgs], check=True)

    pip_install("langfuse", "openai")

    import os
    from getpass import getpass

    def _ask(var, prompt, secret=True, default=None):
        # Reuse a value that is already in the environment; otherwise prompt for it.
        if os.environ.get(var):
            return os.environ[var]
        val = (getpass(prompt) if secret else input(prompt)).strip()
        if not val and default is not None:
            val = default
        os.environ[var] = val
        return val

    print("Enter your Langfuse credentials (input is hidden):")
    _ask("LANGFUSE_PUBLIC_KEY", " Langfuse PUBLIC key (pk-lf-...): ")
    _ask("LANGFUSE_SECRET_KEY", " Langfuse SECRET key (sk-lf-...): ")

    region_raw = input(" Region - EU (default) / US / or paste a self-hosted URL: ").strip()
    region = region_raw.lower()
    if region.startswith("http"):
        HOST = region_raw  # keep a self-hosted URL exactly as typed
    elif region in ("2", "us"):
        HOST = "https://us.cloud.langfuse.com"
    else:
        HOST = "https://cloud.langfuse.com"
    os.environ["LANGFUSE_HOST"] = HOST

    OPENAI_API_KEY = getpass(" OpenAI key (optional, press Enter to skip): ").strip()
    if OPENAI_API_KEY:
        os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
    USE_OPENAI = bool(OPENAI_API_KEY)
    DEFAULT_MODEL = "gpt-4o-mini" if USE_OPENAI else "mock-llm-v1"

    from langfuse import get_client, observe, propagate_attributes, Evaluation

    langfuse = get_client()
    assert langfuse.auth_check(), "Auth failed - double-check keys/region."
    print(f"\n✅ Connected to Langfuse at {HOST}")
    print(f" LLM backend: {'OpenAI (' + DEFAULT_MODEL + ')' if USE_OPENAI else 'built-in mock'}\n")

We begin by installing the required Langfuse and OpenAI packages inside the Colab environment. We then collect Langfuse credentials, choose the correct Langfuse region or self-hosted URL, and optionally accept an OpenAI API key. We finally initialize the Langfuse client, verify authentication, and confirm whether we are using OpenAI or the built-in mock LLM.

    if USE_OPENAI:
        # Langfuse's drop-in wrapper around the OpenAI SDK; calls made through it are traced.
        from langfuse.openai import openai

    _MOCK_FACTS = {
        "france": "Paris", "germany": "Berlin", "japan": "Tokyo",
        "italy": "Rome", "spain": "Madrid", "india": "New Delhi",
    }

    def _mock_answer(user_text: str) -> str:
        t = user_text.lower()
        for country, capital in _MOCK_FACTS.items():
            if country in t:
                return capital
        if "langfuse" in t:
            return ("Langfuse is an open-source LLM engineering platform for "
                    "observability, prompt management, evaluation and datasets.")
        return "This is a mock response. Provide an OpenAI key for real generations."

    def llm_chat(messages, *, model=DEFAULT_MODEL, temperature=0.3,
                 name=None, langfuse_prompt=None) -> str:
        """Return assistant text; the call is traced as a Langfuse generation."""
        if USE_OPENAI:
            kwargs = dict(model=model, messages=messages, temperature=temperature)
            if name:
                kwargs["name"] = name
            if langfuse_prompt:
                kwargs["langfuse_prompt"] = langfuse_prompt
            resp = openai.chat.completions.create(**kwargs)
            return resp.choices[0].message.content

        last_user = next((m["content"] for m in reversed(messages)
                          if m["role"] == "user"), "")
        answer = _mock_answer(last_user)
        gen_kwargs = dict(as_type="generation", name=name or "mock-llm",
                          model=model, input=messages)
        if langfuse_prompt is not None:
            gen_kwargs["prompt"] = langfuse_prompt
        with langfuse.start_as_current_observation(**gen_kwargs) as gen:
            gen.update(output=answer,
                       usage_details={"input_tokens": 24, "output_tokens": 12})
        return answer

    print("PART 1 ── Decorator tracing -------------------------------------------")

    @observe()
    def write_story(topic: str) -> str:
        return llm_chat(
            [{"role": "user", "content": f"Write a one-sentence story about {topic}."}],
            name="story-generation",
        )

    @observe()
    def story_pipeline(topic: str) -> str:
        return write_story(topic)

    print(" →", story_pipeline("a debugging robot"))

We define the LLM helper that supports both real OpenAI generations and deterministic mock responses. We also make sure that even the mock path creates a proper Langfuse generation observation, so the tutorial remains fully traceable without an OpenAI key. We then demonstrate basic decorator-based tracing by wrapping a simple story-generation pipeline with @observe.
    print("\nPART 2 ── Manual RAG trace --------------------------------------------")

    _KB = {
        "refund": "Refunds are processed within 5–7 business days to the original method.",
        "warranty": "All products carry a 1-year limited manufacturer warranty.",
    }

    @observe(name="retrieve")
    def retrieve(question: str):
        q = question.lower()
        # Keyword match against the KB keys; fall back to every entry if nothing matches.
        hits = [v for k, v in _KB.items() if k in q] or list(_KB.values())
        return hits[:2]

    @observe(name="rag-pipeline")
    def rag_pipeline(question: str, user_id="user-42", session_id="sess-001"):
        with propagate_attributes(user_id=user_id, session_id=session_id,
                                  tags=["rag", "support-bot", "tutorial"]):
            context = "\n".join(retrieve(question))
            answer = llm_chat(
                [{"role": "system",
                  "content": "Answer the question using ONLY the provided context."},
                 {"role": "user",
                  "content": f"Context:\n{context}\n\nQuestion: {question}"}],
                name="rag-answer",
            )
            # Read the trace ID while the trace is still active; once the decorated
            # call has returned there is no current trace to read it from.
            return answer, langfuse.get_current_trace_id()

    rag_answer, rag_trace_id = rag_pipeline("How long do refunds take?")
    print(" →", rag_answer)

We build a small manual RAG pipeline using a simple in-memory knowledge base with refund and warranty information. We trace the retrieval step separately and use propagate_attributes to attach user ID, session ID, and tags across the full trace. We then run a refund-related question and capture the trace ID from inside the traced function, where the trace is still active, so we can attach scores to it later.

    print("\nPART 3 ── Prompt management -------------------------------------------")

    langfuse.create_prompt(
        name="support-agent",
        type="chat",
        prompt=[
            {"role": "system",
             "content": "You are a {{tone}} customer-support agent for {{company}}. "
                        "Be concise."},
            {"role": "user", "content": "{{question}}"},
        ],
        labels=["production"],
        config={"model": DEFAULT_MODEL, "temperature": 0.2},
    )

    prompt = langfuse.get_prompt("support-agent", type="chat")
    compiled = prompt.compile(tone="friendly", company="Acme",
                              question="Do you offer express shipping?")
    print(" compiled prompt:", compiled)

    @observe(name="prompt-managed-call")
    def answer_with_managed_prompt():
        return llm_chat(compiled, name="support-reply", langfuse_prompt=prompt)

    print(" →", answer_with_managed_prompt())

    print("\nPART 4 ── Scoring -----------------------------------------------------")

    def keyword_overlap(answer: str, expected_keyword: str) -> float:
        # 1.0 if the keyword appears in the answer (case-insensitive), else 0.0.
        return 1.0 if expected_keyword.lower() in (answer or "").lower() else 0.0

    langfuse.create_score(
        name="groundedness",
        value=keyword_overlap(rag_answer, "5"),
        trace_id=rag_trace_id,
        data_type="NUMERIC",
        comment="Heuristic: mentions the documented refund window.",
    )
    langfuse.create_score(name="user_feedback", value="helpful",
                          trace_id=rag_trace_id, data_type="CATEGORICAL")
    langfuse.create_score(name="resolved", value=1,
                          trace_id=rag_trace_id, data_type="BOOLEAN")

    @observe(name="scored-call")
    def scored_call():
        out = llm_chat([{"role": "user", "content": "What is the capital of Japan?"}],
                       name="capital-q")
        with langfuse.start_as_current_observation(as_type="span", name="grade") as span:
            span.score(name="correct", v