Hacker News • 99일 전

타입 시스템과 신경망

IMP

7/10

핵심 요약

본 글은 대형 언어 모델(LLM)이 고도로 추상화된 프로그래밍 언어의 코드를 생성할 때 발생하는 타입 불일치 문제를 지적하며, 타입 검사를 학습 후 처리하는 기존 방식의 한계를 분석합니다. 오류 발생 시 재시도하거나 매 토큰마다 타입을 강제하는 제약 디코딩 방식은 비효율적이거나 근본적인 해결책이 되지 못합니다. 따라서 모델의 가중치를 업데이트하여 근본적으로 타입이 지정된 출력을 생성하도록 LLM 구조를 재설계해야 한다고 강조합니다.

번역된 본문

Bruno Gavranović의 2026년 4월 20일 게시글: 타입과 신경망 [이 글은 GLAIVE 블로그에도 교차 게시되었습니다.]

신경망은 Idris, Lean, Agda와 같이 고도로 일반화되고 프로그램의 올바름을 증명할 수 있는 프로그래밍 언어로 점점 더 많은 코드를 생성하는 데 사용되고 있습니다. 하지만 코드를 생성하는 대부분의 최신 프론티어 모델, 즉 대형 언어 모델(LLM, Large Language Models)은 학습 과정과 타입 검사(typechecking) 과정을 분리합니다. 이들은 고정된 타입의 출력인 List Token을 생성하도록 학습됩니다. 유효한 코드를 얻기 위해서는 학습 후 처리(post-training) 과정에서 이 출력을 여러 가지 임시적인 방법(ad-hoc ways)을 통해 특정 언어의 타입으로 파싱해야 합니다. 이러한 임시 방법들은 무엇일까요? 이 방법들이 제대로 작동합니까, 혹은 작동할 것이라 기대할 수 있을까요? 더 중요한 것은, LLM을 처음부터 다시 구축하여 타입이 지정된 출력(typed output)을 생성하도록 학습시킬 수 있을까요?

학습 후 타입 강제하기

오늘날 LLM은 주어진 학습 말뭉치에서 다음 토큰을 예측하도록 학습됩니다. 그 결과로 다음과 같은 타입의 함수가 만들어집니다.

LLM : List Token -> D ( List Token )

여기서 D(List Token)은 토큰 목록에 대한 분포를 나타냅니다. LLM은 입력 프롬프트를 소비하고 한 번에 하나의 토큰씩 이 분포를 생성합니다. 즉, 다음 토큰 분포에서 샘플링한 결과를 다시 모델에 피드백으로 제공하는 방식입니다. 이 블로그 글의 목적상, Token을 단순히 Maybe Char, 즉 모델이 생성할 수 있는 문자이거나 생성 과정의 종료를 알리는 STOP 토큰 중 하나라고 간주하겠습니다. 1

이 섹션의 접근 방식들은 모두 학습된 네트워크를 고정된 것으로 취급합니다. 오직 추론(inference) 절차만 수정하며, 다음 두 가지 축을 따라 분류됩니다.

세분성(Granularity): 타입 검사기에 얼마나 자주 문의합니까? 모델이 STOP 토큰을 내보낼 때입니까, 아니면 모든 토큰을 생성한 후입니까?
대역폭(Bandwidth): 타입 검사기는 무엇을 다시 전달합니까? 구조화된 오류 메시지입니까, 아니면 단순히 이진법적인 수용/거부(accept/reject) 신호입니까?

이러한 접근 방식들은 이 축들의 양극단에 있으며, 타입이 지정된 생성을 근본적으로 해결하지는 못하지만, 문제가 어디에 있는지 이해하는 데 도움이 되므로 여기서 설명합니다.

1. 시도; 컴파일; 오류 발생 시 반복 (Try; Compile; If Error Repeat)

이것은 대부분의 프로그래머가 하는 일입니다. 원시 텍스트를 편집기에 입력합니다. 컴파일러는 이를 구조화된 데이터로 처리하거나, 프로그래머가 다시 제출하기 전에 내면화해야 하는 오류를 반환합니다. 동일한 루프를 LLM으로 수행할 수 있습니다. 2 List Int 타입의 값을 생성하는 작업이 주어지면, 모델은 후보(예: "[ 1 , 2 , 3 ] STOP")를 생성하고 컨트롤러는 이를 타입 검사기에 전달합니다. 통과하면 끝입니다. 그렇지 않으면 오류를 피드백으로 제공하고 모델에게 다시 시도하라고 요청합니다.

우리의 두 가지 축을 따라 볼 때, 이 접근 방식은 낮은 세분성(타입 검사기는 STOP 토큰이 생성된 후에만 참조됨)과 높은 대역폭(오류는 모델이 추론할 수 있는 구조화된 메시지임)을 가집니다. 하지만 문제가 있습니다. 작업이 대신 Either Char Double 타입의 값을 생성하는 것이라고 가정해 봅시다. 모델이 "[ 1 , 2"로 시작한다면, 이 타입의 어떤 값도 [로 시작하지 않으므로 이미 복구할 수 없는 상태입니다. 전체 생성이 완료될 때까지 아무것도 알아채지 못합니다! 그 시점에서 전체 시퀀스가 거부되고 모델이 처음부터 다시 시작합니다. 이는 엄청난 낭비입니다. 루프를 조이는 것, 즉 각 확인 전에 생성하는 토큰 수를 줄이는 것으로 완화할 수 있습니다. 그리고 그 한계는 경험이 많은 종속 타입(dependently-typed) 프로그래머들이 하는 일입니다. 즉, 작은 움직임을 만들고 컴파일러의 응답을 본 다음 다음 움직임을 만드는 것입니다. 하지만 이 조임은 세분성 문제만 해결할 뿐입니다. 프로그래머와 달리 LLM은 세션 전반에 걸쳐 학습한 내용을 유지하지 않습니다. 이 접근 방식에서는 가중치(weights)를 업데이트하는 것이 없으므로, 다음 세션은 이전과 동일한 상태에서 시작됩니다.

2. 제약 디코딩 (Constrained Decoding)

다른 접근 방식은 각 토큰이 샘플링되기 전에 타입 검사기에 문의하고, 잘못된 타입의 결과를 초래할 수 있는 토큰을 마스킹하여 제외하는 것입니다. 예를 들어, 모델이 List Int 타입의 값을 생성 중이고 이미 "[ 1 , 2 , 3" 토큰을 생성했다면, 알파벳 문자에 대한 확률을 0으로 설정하고 나머지에서 샘플링합니다. 3 타입 추론(type inference)은 일반적으로 결정 불가능하다는 사실에도 불구하고, 이 접근 방식은...

원문 보기

원문 보기 (영어)

Bruno Gavranović Posted on April 20, 2026 Types and Neural Networks [This is cross posted to the GLAIVE blog ] Neural networks are used to generate increasingly more code in languages which enable highly generic and provably correct programming: Idris, Lean, and Agda, for example. However, most frontier models generating the code – Large Language Models – separate the process of training from the process of typechecking. They are trained to produce output of a fixed type: List Token . To get valid code, that output is then during post-training parsed, in multiple ad-hoc ways, into the types of a particular language. What are these ad-hoc ways? Do they work, and should we expect them to? And more importantly, can we rebuild LLMs from the ground up so that they’re trained to produce typed output? Enforcing types after training Today, LLMs are trained to predict the next token in a given training corpus, resulting in functions of type LLM : List Token -> D ( List Token ) where D (List Token) here denotes a distribution over a list of tokens. LLMs consume an input prompt and produce this distribution one token at a time – sampling from the next-token distribution and feeding the result back in. For the purposes of this blog post, we’ll consider Token to simply be Maybe Char , i.e. either a character the model is able to generate, or a STOP token which signals the termination of the generative process. 1 The approaches in this section all take this trained network as fixed. They modify only the inference procedure, and fall along two axes: Granularity: how often is the typechecker consulted? When the model emits a STOP token, or after every token? Bandwidth: what does the typechecker communicate back? A structured error message, or merely a binary accept/reject signal? These approaches sit at opposite extremes along these axes, and neither fundamentally solves typed generation, but we describe them here because they help us understand where the issue is. 1. Try; Compile; If Error Repeat This is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data , or returns an error the programmer has to internalise before resubmitting. The same loop can be performed with LLMs. 2 Given the task of producing a value of type List Int , the model generates a candidate – say, “ [ 1 , 2 , 3 ] STOP ” – and the controller hands it to a typechecker. If it passes, we’re done. If not, we feed the error back and ask the model to try again. Along our two axes, this approach has low granularity (the typechecker is consulted only after the STOP token is produced) and high bandwidth (errors are structured messages the model can reason about). But there is a problem. Suppose the task is instead to produce a value of Either Char Double , and the model begins with “ [ 1 , 2 ” – already unrecoverable, since no value of this type starts with [ . Nothing notices until the full generation completes! At that point the entire sequence is rejected and the model starts over. This is incredibly wasteful. It can be remedied by tightening the loop – generating fewer tokens before each check – and in the limit this is what experienced dependently-typed programmers do: make a small move, see the compiler’s response, make the next move. But tightening only addresses granularity. The LLM, unlike the programmer, does not carry what it learns across sessions: nothing in this approach updates the weights, meaning that the next session starts from the same state as the last. 2. Constrained decoding A different approach is to consult the typechecker before each token is sampled, and mask out any tokens that would lead to an ill-typed result. For instance, if the model is generating a value of type List Int and has already produced the tokens “ [ 1 , 2 , 3 ”, we set the probability for alphabetic characters to zero and sample from the remains. 3 Despite the fact that type inference is in general not decidable, this approach is well-known, and has been applied to JSON schemas and restricted fragments of various type systems. 4 Along our axes, this sits at the opposite corner: maximum granularity (every token), and minimum bandwidth (one bit per token). Here, output is guaranteed to typecheck! However, there is a big problem: this only prevents the model from saying certain things, and does not change what the model wants to say . To see this, consider the case where the model has learned to assign high probability to alphabetic characters in our example. Masking them out forces it to sample tokens it considers unlikely. At each step, the mask checks only whether a token can lead to a well-typed output, without accounting for how much of the model’s probability mass remains on each branch. This can trap generation into branches where only low-probability completions exist, producing output that typechecks but is increasingly nonsensical. 5 Like with the previous approach, no gradient flows through the mask, so the model cannot learn to adapt and assign a higher probability to densely populated branches, as the weights are never updated. Learning types during training The above approaches, as strange as it may seem, work. Frontier models using retry loops are making genuine progress: FrontierMath scores climbed from under 2% upon release to almost 50% in less than two years, and so did the scores for ARC-AGI-2 and SWE-bench . In chess, GPT-4 reaches roughly 1371 ELO (intermediate player) as a byproduct of general training. These are significant results. But in domains where we know how to train models that utilise structure during training , we see even more dramatic performance improvements. Chess is one of those. The system AlphaZero utilises the game structure during training and reaches superhuman ELO (>3400) with ~30x fewer parameters than GPT-4 (<60 million vs 1.8 trillion). And AlphaZero never makes an illegal move, compared to GPT-4 which makes one in 30% of games. What about dependently-typed programming languages? These are an incredibly rich domain, and one can wonder what happens if we tightly integrate their compilers into neural network training. Unfortunately, this hasn’t happened yet: we don’t really know how to do that . Most current models generating typed code are playing the game without being told its rules. What is surprising is how good they are. If chess is any indication of what happens to performance when you train using structure, encoding the rules of a language is worth figuring out. The only question is: how do you differentiate through type systems? How do you learn to produce a typed output, given that types are discrete, non-differentiable objects? Differentiating through structure To understand differentiation, let’s start with a simpler question. How do you write a function of the following type in Haskell, modelling a simple choice of output structure: a coproduct? f :: (x, p) -> Either a b f = ? If you try to write it down, no matter how creative you get, it will end up having the shape f (x, p) = if c (x, p) then Left (f_l (x, p)) else Right (f_r (x, p)) for some predicate c : (x, p) -> Bool and two maps f_l :: (x, p) -> a , f_r :: (x, p) -> b . There is no other option. 6 To produce a Left or a Right , you must commit to one or the other by partitioning the input via c . When it returns True you go left, otherwise you go right. This decomposition is not an accident of Haskell – it is one of the properties of the category \(\mathbf{Set}\) , which is an extensive category. 7 And this property is central to research on differentiating maps into coproducts, most notably CHAD and Higher Order AD of Higher Order Functions . They use this theorem, and seemingly – without retries or constrained decoding – show us how to build a network whose output is a structured type, at train time . But as we’ve seen above – these approaches require us, the programmer, to partition the input space, instead of having the neural net

대형 언어 모델 타입 시스템 코드 생성 제약 디코딩 신경망