MarkTechPost • 111 days ago

Building a Document Intelligence Pipeline with Google LangExtract and OpenAI

IMP

7/10

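Key Summary

A hands-on tutorial on using Google's LangExtract library with OpenAI models to convert unstructured text into structured, machine-readable data. Its core is extracting entities and risks from documents such as contracts and meeting notes, then visualizing them interactively so they can feed analysis and workflow automation pipelines. It is a useful guide for developers and data practitioners.

Translated Article

In this tutorial, we explore how to use Google's LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring an OpenAI API key so that a language model can perform the extraction. We also build a reusable extraction pipeline that handles a range of document types, including contracts, meeting notes, product announcements, and operational logs.

Through carefully designed prompts and example annotations, we demonstrate how LangExtract identifies entities, actions, deadlines, risks, and other structured attributes and maps each one to its exact position in the source text (its source span). We also visualize the extracted information and organize it into tabular datasets for use in downstream analytics, automation workflows, and decision-making systems.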
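To make the idea of a source span concrete, here is a minimal sketch in plain Python with no LangExtract dependency. The `Span` dataclass and the `ground` helper are hypothetical stand-ins, not the library's own types or alignment logic; they only illustrate that a grounded extraction carries character offsets that slice back to the exact source wording, and that a paraphrase cannot be located this way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a grounded extraction (not LangExtract's type)."""
    extraction_class: str
    extraction_text: str
    start: int
    end: int


def ground(source: str, extraction_class: str, extraction_text: str) -> Optional[Span]:
    """Locate the first exact occurrence of extraction_text in source.

    Returns None when the text does not appear verbatim, which is why a
    paraphrased extraction cannot be mapped to a source span this way.
    """
    start = source.find(extraction_text)
    if start == -1:
        return None
    return Span(extraction_class, extraction_text, start, start + len(extraction_text))


source = "Acme Corp shall deliver the equipment by March 15, 2026."
span = ground(source, "deadline", "by March 15, 2026")
paraphrase = ground(source, "deadline", "before mid-March 2026")

# The offsets slice back to the exact wording in the source.
if span is not None:
    print(span.start, span.end, repr(source[span.start:span.end]))
# The paraphrase is not found verbatim, so there is no span to report.
print(paraphrase)
```

This is also why the prompts in the tutorial insist on exact text spans: an extraction that rewords the source has nothing to point back to.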

!pip -q install -U "langextract[openai]" pandas IPython

import os
import json
import textwrap
import getpass
import pandas as pd

OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

import langextract as lx
from IPython.display import display, HTML

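We install the required libraries, including LangExtract, Pandas, and IPython, so that the Colab environment is ready for structured extraction. We prompt for the OpenAI API key with getpass, which does not echo the input, and store it in an environment variable for use at runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.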
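A small variation on the key setup, sketched under the assumption that the notebook may also run where the key is already exported: check the environment first and prompt only when the variable is missing. The helper name `resolve_api_key` is hypothetical and not part of the tutorial or of LangExtract.

```python
import getpass
import os


def resolve_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is absent."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass does not echo the typed value, so the key stays out of notebook output.
        key = getpass.getpass(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} is empty")
    # Store the cleaned value so later code can read it from the environment.
    os.environ[var_name] = key
    return key
```

Calling `resolve_api_key()` once at the top of the notebook would then replace the unconditional prompt, so re-running the cell in an already configured session does not ask again.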

MODEL_ID = "gpt-4o-mini"

def run_extraction(
    text_or_documents,
    prompt_description,
    examples,
    output_stem,
    model_id=MODEL_ID,
    extraction_passes=1,
    max_workers=4,
    max_char_buffer=1800,
):
    # Run the extraction with the prompt and few-shot examples.
    result = lx.extract(
        text_or_documents=text_or_documents,
        prompt_description=prompt_description,
        examples=examples,
        model_id=model_id,
        api_key=os.environ["OPENAI_API_KEY"],
        fence_output=True,
        use_schema_constraints=False,
        extraction_passes=extraction_passes,
        max_workers=max_workers,
        max_char_buffer=max_char_buffer,
    )

    jsonl_name = f"{output_stem}.jsonl"
    html_name = f"{output_stem}.html"

    # Save the annotated document as JSONL in the current directory.
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")

    # Build the visualization and write it to an HTML file.
    html_content = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        # Handle both an object that exposes .data and a plain string.
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)

    return result, jsonl_name, html_name

def extraction_rows(result):
    rows = []
    for ex in result.extractions:
        # Offsets stay None when an extraction has no char_interval.
        start_pos = None
        end_pos = None
        if getattr(ex, "char_interval", None):
            start_pos = ex.char_interval.start_pos
            end_pos = ex.char_interval.end_pos

        rows.append({
            "class": ex.extraction_class,
            "text": ex.extraction_text,
            "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
            "start": start_pos,
            "end": end_pos,
        })
    return pd.DataFrame(rows)

def preview_result(title, result, html_name, max_rows=50):
    print("=" * 80)
    print(title)
    print("=" * 80)
    print(f"Total extractions: {len(result.extractions)}")

    df = extraction_rows(result)
    display(df.head(max_rows))
    display(HTML(f'<p><a href="{html_name}" target="_blank">Open interactive visualization: {html_name}</a></p>'))

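We define the core utility functions that drive the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and produces both JSONL and HTML outputs. We also define helper functions that convert the extraction results into tabular rows and preview them interactively in the notebook.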
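Once the results are tabular, the JSON-encoded `attributes` column can be decoded again for filtering. The sketch below uses plain dictionaries shaped like the rows that `extraction_rows` builds; the sample values and the `filter_by_attribute` helper are illustrative assumptions, not output from a real run and not part of the tutorial.

```python
import json


def filter_by_attribute(rows, key, value):
    """Keep rows whose JSON-encoded attributes contain key == value."""
    kept = []
    for row in rows:
        attributes = json.loads(row["attributes"] or "{}")
        if attributes.get(key) == value:
            kept.append(row)
    return kept


# Made-up rows in the same shape as the dictionaries extraction_rows appends.
sample_rows = [
    {"class": "party", "text": "Acme Corp",
     "attributes": json.dumps({"risk_level": "low"}), "start": 0, "end": 9},
    {"class": "penalty", "text": "2% monthly penalty",
     "attributes": json.dumps({"risk_level": "high"}), "start": None, "end": None},
]

high_risk = filter_by_attribute(sample_rows, "risk_level", "high")
print([row["text"] for row in high_risk])
```

With a real DataFrame, the same function could be fed `df.to_dict("records")`, which yields one dictionary per row.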

contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.

Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
   - party
   - obligation
   - deadline
   - payme... (payment)


View Original (English)

In this tutorial, we explore how to use Google’s LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. Also, we will build a reusable extraction pipeline that enables us to process a range of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract can identify entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. We also visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.

!pip -q install -U "langextract[openai]" pandas IPython

import os
import json
import textwrap
import getpass
import pandas as pd

OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

import langextract as lx
from IPython.display import display, HTML

We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access during runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.
MODEL_ID = "gpt-4o-mini"

def run_extraction(
    text_or_documents,
    prompt_description,
    examples,
    output_stem,
    model_id=MODEL_ID,
    extraction_passes=1,
    max_workers=4,
    max_char_buffer=1800,
):
    result = lx.extract(
        text_or_documents=text_or_documents,
        prompt_description=prompt_description,
        examples=examples,
        model_id=model_id,
        api_key=os.environ["OPENAI_API_KEY"],
        fence_output=True,
        use_schema_constraints=False,
        extraction_passes=extraction_passes,
        max_workers=max_workers,
        max_char_buffer=max_char_buffer,
    )

    jsonl_name = f"{output_stem}.jsonl"
    html_name = f"{output_stem}.html"

    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")

    html_content = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)

    return result, jsonl_name, html_name

def extraction_rows(result):
    rows = []
    for ex in result.extractions:
        start_pos = None
        end_pos = None
        if getattr(ex, "char_interval", None):
            start_pos = ex.char_interval.start_pos
            end_pos = ex.char_interval.end_pos

        rows.append({
            "class": ex.extraction_class,
            "text": ex.extraction_text,
            "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
            "start": start_pos,
            "end": end_pos,
        })
    return pd.DataFrame(rows)

def preview_result(title, result, html_name, max_rows=50):
    print("=" * 80)
    print(title)
    print("=" * 80)
    print(f"Total extractions: {len(result.extractions)}")

    df = extraction_rows(result)
    display(df.head(max_rows))
    display(HTML(f'<p><a href="{html_name}" target="_blank">Open interactive visualization: {html_name}</a></p>'))

We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.

Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
   - party
   - obligation
   - deadline
   - payment_term
   - penalty
   - termination_clause
   - governing_law
3. Add useful attributes:
   - party_name for obligations or payment terms when relevant
   - risk_level as low, medium, or high
   - category for the business meaning
4. Keep output grounded to the exact wording in the source.
5. Do not merge non-contiguous spans into one extraction.
""")

contract_examples = [
    lx.data.ExampleData(
        text=(
            "Acme Corp shall deliver the equipment by March 15, 2026. "
            "The Client must pay within 10 days of invoice receipt. "
            "Late payment incurs a 2% monthly penalty. "
            "This agreement is governed by the laws of Ontario."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="Acme Corp",
                attributes={"category": "supplier", "risk_level": "low"}
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall deliver the equipment",
                attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"}
            ),
            lx.data.Extraction(
                extraction_class="deadline",
                extraction_text="by March 15, 2026",
                attributes={"category": "delivery_deadline", "risk_level": "medium"}
            ),
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="The Client",
                attributes={"category": "customer", "risk_level": "low"}
            ),
            lx.data.Extraction(
                extraction_class="payment_term",
                extraction_text="must pay within 10 days of invoice receipt",
                attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"}
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="2% monthly penalty",
                attributes={"category": "late_payment", "risk_level": "high"}
            ),
            lx.data.Extraction(
                extraction_class="governing_law",
                extraction_text="laws of Ontario",
                attributes={"category": "legal_jurisdiction", "risk_level": "low"}
            ),
        ]
    )
]

contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""

contract_result, contract_jsonl, contract_html = run_extraction(
    text_or_documents=contract_text,
    prompt_description=contract_prompt,
    examples=contract_examples,
    output_stem="contract_risk_extraction",
    extraction_passes=2,
    max_workers=4,
    max_char_buffer=1400,
)

preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)

We build a contract intelligence extraction workflow by defining a detailed prompt and structured examples. We provide LangExtract with annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.

meeting_prompt = textwrap.dedent("""
Extract action items from meeting notes in order of appearance.

Rules:
1. Use exact text spans from the source. No paraphrasing in extraction_text.
2. Extract these classes when present:
   - assignee
   - action_item
   - due_date
   - blocker
   - decision
3. Add attributes:
   - priority as low, medium, or high
   - workstream when inferable from local context
   - owner for action_item when tied to a named assignee
4. Keep all spans grounded to the source text.
5. Preserve order of appearance.
""")

meeting_examples = [
    lx.data.ExampleData(
        text=(
            "Sarah will finalize the launch email by Friday. "
            "The team decided to postpone the webinar. "
            "Blocked by missing legal approval."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="assignee",
                extraction_text="Sarah",
                attributes={"priority": "medium", "workstream": "marketing"}
            ),
            lx.data.Extraction(
                extraction_class="action_item",
                extraction_text="will finalize the launch email",
                attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"}
            ),

langextract, openai, data extraction, pipeline building, natural language processing