Hacker News • 78 days ago

Interfaze: A new AI model specialized for high-precision tasks at scale

IMP

8/10

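Key summary

Interfaze is a new architecture that combines the flexibility of transformer models with the high accuracy of DNN/CNN models, targeting tasks such as OCR, vision, speech recognition, and structured output. Compared head-to-head with Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3, it leads on most of the 9 reported benchmarks, and it is positioned as keeping accuracy high while holding processing cost and response time low.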

Translated text

Interfaze: A new model architecture built for high accuracy at scale

tl;dr: Interfaze is a new model architecture that outperforms models like Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3 across 9 head-to-head benchmarks in OCR, vision, STT (speech-to-text), and structured output.

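Humans are inefficient at computer-level tasks. We make mistakes, but we are great at decision-making and understanding nuance. Imagine telling a human to read a 50-page PDF, map every word to another document with its XY position, and translate the whole thing into Chinese. You would get tons of mistakes, pay a lot to keep that human on payroll, and wait a long time for the result.

Transformer models are similar. They are amazing at nuance and human-level tasks, and they make mistakes like a human, but that is also what keeps them creative. We have been using the wrong models for the wrong tasks.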

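CNNs/DNNs have existed since the early 90s, from LeNet-5 to ResNet, and more recently CRNN-CTC. These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task-specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.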

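So why do so many of us still reach for transformers/LLMs for deterministic tasks? Because DNNs are not flexible. They are only as good as their training data, and they are not great at human-level nuance. They may be cheap to serve, but they are expensive to maintain and retrain for new tasks. Take a passport: a CNN can extract the date of birth with bounding boxes and a confidence score, but it cannot calculate the person's age.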
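The passport example can be made concrete. The sketch below assumes a hypothetical `FieldExtraction` shape (value, confidence, bounding box) standing in for what a CNN-style extractor might return; none of these names come from Interfaze. It illustrates the step a pure extractor leaves out: deriving an age from the extracted date of birth, gated on the confidence score.

```typescript
// Hypothetical output of a CNN-style field extractor (shape assumed for illustration).
interface FieldExtraction {
  value: string; // ISO date assumed, e.g. "1990-05-17"
  confidence: number; // 0..1
  box: { x: number; y: number; width: number; height: number };
}

// The reasoning step an extractor does not perform: date of birth -> age on a given day.
function ageOn(dateOfBirth: string, today: Date): number {
  const dob = new Date(dateOfBirth + "T00:00:00Z");
  if (isNaN(dob.getTime())) {
    throw new Error("invalid date of birth: " + dateOfBirth);
  }
  let age = today.getUTCFullYear() - dob.getUTCFullYear();
  const beforeBirthday =
    today.getUTCMonth() < dob.getUTCMonth() ||
    (today.getUTCMonth() === dob.getUTCMonth() &&
      today.getUTCDate() < dob.getUTCDate());
  if (beforeBirthday) {
    age -= 1;
  }
  return age;
}

// Only derive an age when the extractor was confident enough; otherwise return null
// so the caller can route the document to review.
function ageFromPassport(
  field: FieldExtraction,
  today: Date,
  minConfidence: number = 0.9
): number | null {
  return field.confidence >= minConfidence ? ageOn(field.value, today) : null;
}

const dobField: FieldExtraction = {
  value: "1990-05-17",
  confidence: 0.97,
  box: { x: 412, y: 188, width: 140, height: 22 },
};

// 34: on 1 January 2025 the 2025 birthday has not happened yet.
const passportAge = ageFromPassport(dobField, new Date("2025-01-01T00:00:00Z"));
```

The confidence threshold of 0.9 is an arbitrary illustrative default; a real workflow would tune it per field.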

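Introducing Interfaze: a new model architecture that merges the specialization of DNN/CNN models with omni-transformers, giving you the best of both worlds. That means high accuracy and low cost on deterministic tasks:

Vision (image and document, object and GUI detection)
Web extraction and search
Audio (STT and speaker diarization)
Translation
Video (coming soon)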

Model specs

Context window: 1M tokens
Max output tokens: 32k tokens
Input modalities: text, images, audio, files
Reasoning: available (default: disabled)

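Benchmark

While Pro-tier models like Claude Opus 4.7 and GPT 5.5 are the best generalist models on the market today for things like coding and complex reasoning, they are not commonly used for high-volume tasks like OCR or translation because of their high cost and slow response times. Interfaze is benchmarked against models in similar pricing tiers and feature sets, models that are optimized to squeeze out the most performance at the fastest speed while keeping cost low at scale.

Today, most people reach for two model categories for deterministic developer tasks:

Flash/mini models like Gemini-3-Flash, GPT-5.4-Mini, and Claude Sonnet 4.6, which offer the best balance between performance and price at scale.
Specialized providers like Reducto, Mistral OCR, or Whisper.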

Benchmark breakdown (scores in column order: Interfaze / Gemini-3-Flash / Claude-Sonnet-4.6 / GPT-5.4-Mini / Grok-4.3)

OCRBench V2: 70.7% / 55.8% / 54.7% / 52.7% / 54.7%
olmOCR: 85.7% / 75.3% / 73.9% / 80.1% / 81.9%
RefCOCO: 82.1% / 75.2% / 75.5% / 67.0% / 25.0%
VoxPopuli (WER) ↓: 2.4% / 4.0% / — / — / —
Spider 2.0-Lite: 52.9% / 45.2% / 49.6% / 26.7% / 45.9%
GPQA Diamond: 89.9% / 88.5% / 89.9% / 82.8% / 73.6%
MMMLU: 90.9% / 88.7% / 84.9% / 75.3% / 89.7%
MMMU-Pro: 71.1% / 67.6% / 46.3% / 40.4% / 68.7%
SOB Value Acc: 79.5% / 77.3% / 77.9% / 75.1% / 78.4%

(Note: ↓ = lower is better (word error rate). — = not scored (model has no native audio input). All other rows: higher is better. Each model is compared head-to-head across nine benchmarks: OCRBench V2, olmOCR, RefCOCO, VoxPopuli-Cleaned-AA, SOB Value, Spider-2.0-Lite, GPQA Diamond, MMMLU, and MMMU-Pro.)
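For readers unfamiliar with the VoxPopuli metric: word error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name is hypothetical, and it skips the text normalization (casing, punctuation) that real evaluations typically apply, so it should not be read as the benchmark's scoring code.

```typescript
// Word error rate: word-level Levenshtein distance divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] = edit distance between the reference words consumed so far
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) {
    prev.push(j);
  }

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substituted word out of five reference words gives a WER of 0.2 (20%).
const exampleWer = wordErrorRate(
  "the quick brown fox jumps",
  "the quick brown box jumps"
);
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.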

Interfaze leads in almost every benchmark.


Original text (English)

Interfaze: A new model architecture built for high accuracy at scale

tl;dr: Interfaze is a new model architecture that outperforms models like Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3 across 9 head-to-head benchmarks in OCR, vision, STT, and structured output.

Humans are inefficient at computer-level tasks. We make mistakes, but we're great at decision-making and understanding nuance. Imagine telling a human to read a 50-page PDF, map every word to another document with its XY position, and translate the whole thing into Chinese. You'd get tons of mistakes, pay a lot to keep that human on payroll, and wait a long time for the result.

Transformer models are similar. They're amazing at nuance and human-level tasks, and they make mistakes like a human, but that's also what keeps them creative. We've been using the wrong models for the wrong tasks.

CNNs/DNNs have existed since the early 90s, from LeNet-5 to ResNet, and more recently CRNN-CTC. These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.

So why do so many of us still go for transformers/LLMs for deterministic tasks? DNNs are not flexible. They're only as good as their training data, and they aren't great at human-level nuance. They might be cheap to serve but expensive to maintain and retrain for new tasks. Take a passport: a CNN can extract the date of birth with bounding boxes and a confidence score, but it can't calculate the person's age.

Introducing Interfaze

A new model architecture that merges the specialization of DNN/CNN models with omni-transformers, giving you the best of both worlds.
That means high accuracy and low cost on deterministic tasks:

Vision (image and document, object and GUI detection)
Web extraction and search
Audio (STT and speaker diarization)
Translation
Video (coming soon)

Model specs

Context window: 1M tokens
Max output tokens: 32k tokens
Input modalities: Text, Images, Audio, File
Reasoning: Available (default: disabled)

Benchmark

While Pro tier models like Claude Opus 4.7 and GPT 5.5 are the best generalist models in the market today for things like coding and complex reasoning tasks, they aren't commonly used for high volume tasks like OCR or translation due to high cost and slow response times. Interfaze is benchmarked against models in similar pricing tiers and feature sets that are optimized to squeeze the most performance out of the model at the fastest speed, while keeping cost low at scale.

Today, most people reach for two model categories for deterministic developer tasks:

Flash/mini models like Gemini-3-Flash, GPT-5.4-Mini and Claude Sonnet 4.6. The best balance you can get between performance and price at scale.
Specialized providers like Reducto, Mistral OCR or Whisper.

Breakdown

Benchmark: Interfaze / Gemini-3-Flash / Claude-Sonnet-4.6 / GPT-5.4-Mini / Grok-4.3
OCRBench V2: 70.7% / 55.8% / 54.7% / 52.7% / 54.7%
olmOCR: 85.7% / 75.3% / 73.9% / 80.1% / 81.9%
RefCOCO: 82.1% / 75.2% / 75.5% / 67.0% / 25.0%
VoxPopuli (WER) ↓: 2.4% / 4.0% / — / — / —
Spider 2.0-Lite: 52.9% / 45.2% / 49.6% / 26.7% / 45.9%
GPQA Diamond: 89.9% / 88.5% / 89.9% / 82.8% / 73.6%
MMMLU: 90.9% / 88.7% / 84.9% / 75.3% / 89.7%
MMMU-Pro: 71.1% / 67.6% / 46.3% / 40.4% / 68.7%
SOB Value Acc: 79.5% / 77.3% / 77.9% / 75.1% / 78.4%

↓ = lower is better (word error rate). — = not scored (model has no native audio input). All other rows: higher is better.

Each model is compared head-to-head across nine benchmarks: OCRBench V2, olmOCR, RefCOCO, VoxPopuli-Cleaned-AA, SOB Value, Spider-2.0-Lite, GPQA Diamond, MMMLU, and MMMU-Pro.
Interfaze leads in almost every benchmark, against both specialized models in each category and the generalist flash/mini models. Our goal isn't to replace LLMs. It's to specialize in deterministic tasks. The benchmarks focus on categories like OCR, object detection, and structured output, with a few general benchmarks like GPQA Diamond to show the level of problem-solving and understanding you'd expect from any transformer model. Interfaze is priced in a similar range as Gemini-3-Flash, at $1.50 per million input tokens and $3.50 per million output tokens.

OCR is our number one use case

Our number one use case from users has been OCR for images and complex, long PDFs. Interfaze outperforms OCR providers like Chandra OCR and Reducto, and generalist models like Gemini-3-Flash and GPT-5.4-Mini. It isn't just the task-specific CNN encoder doing a good job. It's the ability to lean on object detection for figures and graphics, or lean on the translation layers of the transformer all in a shared vector space.

Structured output is a big part of determinism

Most LLMs today are great at following a JSON schema, but pretty bad at filling it with accurate values. No public benchmark measures the accuracy of those values, so we released SOB (the Structured Output Benchmark) last week.

TL;DR: SOB gives the model the correct answer in its context, then asks it to generate a JSON output with data it already has. We measure who is the most accurate, with the fewest mistakes and hallucinations, across text, image, and audio modalities (all normalized to text).

Compared against the same flash/mini set used throughout this post. See the full SOB leaderboard for all 28 models, including frontier Pro-tier models like Gemini-3.1-Pro, GPT-5.5, and Claude-Opus-4.7.

There's still huge room for improving structured output without raising cost or compute. Follow us on X or LinkedIn to follow our research journey.
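To make the idea of "value accuracy" concrete, here is one way such a score could be computed: flatten both the expected JSON and the model's JSON into leaf values keyed by path, then count exact matches. This is a sketch under assumptions, not SOB's actual scoring method, which the post does not describe; the names `flattenLeaves` and `valueAccuracy` are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Flatten nested JSON into path -> leaf value, e.g. "items[0].qty" -> 2.
// Empty arrays and objects contribute no leaves.
function flattenLeaves(
  value: Json,
  path: string = "",
  out: { [path: string]: Json } = {}
): { [path: string]: Json } {
  if (Array.isArray(value)) {
    value.forEach((item, i) => flattenLeaves(item, path + "[" + i + "]", out));
  } else if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value)) {
      flattenLeaves(value[key], path ? path + "." + key : key, out);
    }
  } else {
    out[path] = value;
  }
  return out;
}

// Fraction of expected leaf values that the model output reproduces exactly.
// Missing paths and wrong values both count as errors; extra paths are ignored.
function valueAccuracy(expected: Json, actual: Json): number {
  const want = flattenLeaves(expected);
  const got = flattenLeaves(actual);
  const paths = Object.keys(want);
  if (paths.length === 0) {
    return 1;
  }
  let correct = 0;
  for (const path of paths) {
    const present = Object.prototype.hasOwnProperty.call(got, path);
    if (present && got[path] === want[path]) {
      correct += 1;
    }
  }
  return correct / paths.length;
}

// The schema is followed perfectly, but one of four values is wrong: 0.75.
const sampleAccuracy = valueAccuracy(
  { invoice: "INV-7", total: 120.5, items: [{ sku: "A1", qty: 2 }] },
  { invoice: "INV-7", total: 125, items: [{ sku: "A1", qty: 2 }] }
);
```

The example shows the gap the post describes: an output can be fully schema-valid while still carrying a wrong value. Exact matching is a deliberately strict choice here; a real benchmark may normalize numbers, dates, or whitespace before comparing.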
Multilingual performance beyond English

Interfaze has great multilingual performance across a wide range of languages.

Speech-to-text on par with specialized ASR providers

On VoxPopuli-Cleaned-AA, Interfaze comes in second on word error rate.

Speech-to-text inference speed

Interfaze transcribes 209 seconds of audio per second of compute, ~1.5× faster than Deepgram Nova-3, ~8× faster than Scribe v2, and over 11× faster than Gemini-3-Flash.

Here's how you get started

Set up your SDK

Interfaze speaks the Chat Completions API standard, so any AI SDK that supports OpenAI works out of the box: just point it at https://api.interfaze.ai/v1. Grab your API key from the Interfaze dashboard and drop it in.

```typescript
import OpenAI from "openai";

const interfaze = new OpenAI({
  baseURL: "https://api.interfaze.ai/v1",
  apiKey: "<your-api-key>",
});
```

The same interfaze client is reused in every example below.

Complex OCR + object detection

A magazine page with dense multi-column text and three illustrations. Interfaze runs OCR and object detection on the same image in one request, returning the full text plus pixel-coordinates for every figure, all under your schema.

```typescript
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const OCRObjectDetectionSchema = z.object({
  text: z.string().describe("all text in the image"),
  graphic_objects: z
    .array(
      z.object({
        description: z.string(),
        top_left_x: z.number(),
        top_left_y: z.number(),
        bottom_right_x: z.number(),
        bottom_right_y: z.number(),
      })
    )
    .describe("graphics objects found in the image"),
});

const response = await interfaze.chat.completions.create({
  model: "interfaze-beta",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the text and graphics from the im
```

New architecture · OCR · Vision models · Speech recognition · Benchmarks