r/singularity • 78 days ago

BDH: a memory architecture that stores context in weights instead of a KV cache

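Key summary

This post summarizes the concept and mechanics of BDH, a post-transformer architecture. BDH targets the KV cache memory cost that grows with context length in conventional transformers by storing information directly in the network's weights and reading it back through graph propagation. The author describes existing models as suffering from "anterograde amnesia" because they rely on short-term memory, and stresses that fixing this means linearizing attention while also expanding the key/query space to a very high dimension with sparse, non-negative patterns.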

Translated body

I've seen BDH come up in a few discussion threads, but I couldn't find a compact explanation of what the architecture actually claims. I watched Jan Chorowski's seminar and took notes, so I'm posting the short version here to save others the full watch.

I'm exploring post-transformer architectures, so treat this as my understanding of one architecture rather than a definitive take, and please correct anything I got wrong.

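I increasingly see "anterograde amnesia" used to describe transformer memory: the model cannot form new long-term memories and compensates with markdown notes. In other words, a transformer's memory is a combination of static pre-training context compressed into the weights and very short-term context (the current user session) encoded in the KV cache.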

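The attention part was the most interesting to me. Standard attention retrieves values by comparing a query against past keys. Jan's idea is to stop treating keys and queries as small abstract vectors. In the attached photo of the slide, he sets keys and queries equal to neuron activations in a high-dimensional space, so sigma (σ) becomes the accumulated connectivity matrix and reading memory becomes graph propagation.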
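To make the "accumulated matrix, read by propagation" idea concrete, here is a minimal sketch of unnormalized linear attention written as a running state. It is my own toy illustration, not code from the talk: the names `write`, `read`, and `sigma` are hypothetical, there is no normalization or decay, and the value vectors are kept tiny purely for readability.

```python
# Toy sketch: linear attention as a running "connectivity" state.
# All names here are illustrative assumptions, not BDH's actual implementation.

def write(sigma, key, value):
    """Accumulate the outer product key x value into the state, in place."""
    for i, ki in enumerate(key):
        if ki == 0.0:
            continue  # a sparse key only touches the rows where it is active
        row = sigma[i]
        for j, vj in enumerate(value):
            row[j] += ki * vj

def read(sigma, query):
    """Propagate the query's activations along the stored edges (query @ sigma)."""
    out = [0.0] * len(sigma[0])
    for i, qi in enumerate(query):
        if qi == 0.0:
            continue
        for j, w in enumerate(sigma[i]):
            out[j] += qi * w
    return out

def explicit_linear_attention(query, keys, values):
    """Reference: sum_t (query . key_t) * value_t, keeping every past token."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(q * ki for q, ki in zip(query, k))
        for j, vj in enumerate(v):
            out[j] += score * vj
    return out

# Non-negative keys and a non-negative query, standing in for neuron activations.
keys = [[1.0, 0.0, 0.0, 2.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 0.0, 0.0, 1.0]

sigma = [[0.0, 0.0] for _ in range(4)]  # fixed size, independent of token count
for k, v in zip(keys, values):
    write(sigma, k, v)

from_state = read(sigma, query)
from_history = explicit_linear_attention(query, keys, values)
print(from_state, from_history)
```

Both reads agree, but `sigma` never grows with the number of tokens, whereas the explicit version has to keep every past key and value around.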

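So this is not just linearizing attention as in a vanilla SSM (state space model), which trades performance for efficiency. His line was: "You cannot swap basically a non-linear attention layer for a linear attention layer and change nothing else in the model."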

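In other words, if you linearize attention, Jan's claim is that you also need to change the memory space. The key/query space becomes very large, sparse, and positive (neuron-like) because the model works with non-negative activations. Another slide claims `>10^7` key-query dimensions for BDH versus `~10^3` for Transformers. The short-term memory state is thus projected into a fixed, positive, very high-dimensional space, becoming much more expressive and manipulable than a KV cache.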
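A rough way to picture a "large, sparse, positive" code is a wide projection followed by a thresholded ReLU. The sketch below is an assumption on my part, not how BDH builds its keys: the random Gaussian projection, the toy sizes, and the threshold of 1.5 are all made up for illustration. With a unit-norm input and standard-normal weights, each pre-activation is itself standard normal, so this threshold should leave only a few percent of neurons active.

```python
import math
import random

def sparse_positive_code(x, weights, threshold):
    """Project x to len(weights) dimensions and keep only what clears the threshold."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) - threshold)
            for row in weights]

def normalize(x):
    """Scale x to unit Euclidean norm."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

rng = random.Random(0)              # seeded so the sketch is repeatable
INPUT_DIM, NUM_NEURONS = 16, 2000   # toy sizes; the talk quotes >10^7 for BDH

# Hypothetical random projection standing in for a learned encoder.
weights = [[rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
           for _ in range(NUM_NEURONS)]

x = normalize([rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)])
code = sparse_positive_code(x, weights, threshold=1.5)

active_fraction = sum(1 for c in code if c > 0.0) / NUM_NEURONS
print(f"{NUM_NEURONS} dims, {active_fraction:.1%} active, min={min(code)}")
```

The point is only the shape of the result: every entry is non-negative and most are exactly zero, which is what lets a key behave like a small set of active neurons.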

The practical issue is obvious: a full `Neurons x Neurons` (N x N) connectivity matrix is too large. The implementation uses low-rank factorization plus ReLU thresholding, which keeps the graph compressed and sparse instead of materializing `N x N`.
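As a minimal sketch of what "low-rank plus ReLU thresholding" could look like, the code below propagates activations through two thin factors and checks the result against the materialized matrix on a tiny example. The names `encoder`, `decoder`, and `threshold`, the shifted-ReLU form, and the rank of 256 in the parameter count are my assumptions; the talk only names the two techniques.

```python
# Toy sketch: propagate through a factored graph without building N x N.
# encoder (d x N), decoder (N x d), and threshold are hypothetical names.

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def relu_threshold(vector, threshold):
    """Shifted ReLU: anything at or below the threshold becomes zero."""
    return [max(0.0, x - threshold) for x in vector]

def propagate_low_rank(decoder, encoder, activations, threshold):
    """y = relu(decoder @ (encoder @ x) - threshold), costing O(N * d)."""
    latent = matvec(encoder, activations)                       # N -> d
    return relu_threshold(matvec(decoder, latent), threshold)   # d -> N

def materialize(decoder, encoder):
    """Build the full N x N product (only viable for tiny N)."""
    d = len(encoder)
    return [[sum(row[r] * encoder[r][j] for r in range(d))
             for j in range(len(encoder[0]))] for row in decoder]

def parameter_counts(n, rank):
    """Entries needed for a dense N x N matrix versus two rank-d factors."""
    return {"full": n * n, "low_rank": 2 * n * rank}

encoder = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 2.0, 0.0, 1.0]]     # d x N with d = 2, N = 4
decoder = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0],
           [0.5, 0.0]]               # N x d
x = [1.0, 0.0, 0.0, 1.0]             # sparse, non-negative activations

y_low_rank = propagate_low_rank(decoder, encoder, x, threshold=1.0)
y_full = relu_threshold(matvec(materialize(decoder, encoder), x), 1.0)
print(y_low_rank, y_full)
print(parameter_counts(10**7, 256))  # rank 256 is an arbitrary illustration
```

At N = 10^7 the dense matrix would need 10^14 entries, while two rank-256 factors need about 5 x 10^9, which is why materializing `N x N` is off the table.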

Other claims that seem important to put here but need follow-up:

* RNNs may have had the wrong memory/compute ratio: O(N^2) transition parameters but only O(N) state.
* BDH memory is more like a noisy fixed-size hash table: sparse keys write to a few buckets, collisions add noise, but memory does not grow one token at a time.
* Recovered graphs show modular, heavy-tailed-looking structure.
* A Europarl example shows a synapse activating after "US dollar" but not after "US".
* Repeated facts cause fewer active neurons and fewer writes over time, with roughly 6% active neurons dropping to about 2%.
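The hash-table analogy in the list above can be sketched with a fixed array of counters. This is closer to a counting Bloom filter than to BDH itself, and everything in it is a hypothetical stand-in: the class `NoisyHashMemory`, the bucket count, the SHA-256 bucket choice, and the scalar values.

```python
import hashlib

NUM_BUCKETS = 64       # fixed memory size, chosen arbitrarily
BUCKETS_PER_KEY = 3    # stands in for "a sparse key activates a few neurons"

def buckets_for(key):
    """Map a key to a few distinct bucket indices using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    picks = []
    for byte in digest:
        idx = byte % NUM_BUCKETS
        if idx not in picks:
            picks.append(idx)
        if len(picks) == BUCKETS_PER_KEY:
            break
    return picks

class NoisyHashMemory:
    """Fixed-size additive memory: writes overlap, so reads pick up noise."""

    def __init__(self):
        self.table = [0.0] * NUM_BUCKETS

    def write(self, key, value):
        for idx in buckets_for(key):
            self.table[idx] += value

    def read(self, key):
        idxs = buckets_for(key)
        return sum(self.table[i] for i in idxs) / len(idxs)

memory = NoisyHashMemory()
memory.write("fact-A", 1.0)
clean_read = memory.read("fact-A")    # nothing else stored yet

for i in range(20):                   # unrelated writes may collide with fact-A
    memory.write(f"filler-{i}", 1.0)
noisy_read = memory.read("fact-A")

print(clean_read, noisy_read, len(memory.table))
```

The table stays at 64 entries no matter how many keys are written. The cost shows up as collision noise on reads instead of as memory that grows per token.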

I would treat these results as interesting claims to inspect, not proof. The caveats matter:

* This is not a conversion of existing Transformer weights; Jan says BDH models train from scratch or at best distill.
* Long-term weights still use backprop, and the Hebbian-style part is... (truncated in the source)

Original (English)

I've seen BDH come up in a few discussion threads, but I couldn't find a compact explanation of what the architecture is actually claiming. I found jan chorowski's seminar and took notes, so posting the short version here in case it saves others the full watch.

I'm exploring post-transformer architectures, so treat this as my understanding of one architecture, please correct it and not a definitive take.

I read more and more anterograde amnesia to characterize transformers' memory as being unable to form new long-term memories as they compensate with markdown notes. So transformers' memory is a combination of static pre-training context compressed into the weights and very short-term context (current user session) encoded in KV-cache.

The attention part was the most interesting to me. Standard attention retrieves values by comparing a query to past keys. Jan's idea is to stop treating keys/queries as small abstract vectors. In the (attached) photo of the slide he sets keys and queries equal to neuron activations in high dimensional space, so sigma is the accumulated connectivity matrix and reading memory becomes graph propagation.

So it’s not just linearizing attention as in vanilla SSM, trading off performance for efficiency. His line was: You cannot swap basically a non-linear attention layer for a linear attention layer and change nothing else in the model.

In other words: if you linearize attention, Jan's claim is that you also need to change the memory space. The key/query space becomes very large, sparse, and positive/neuron-like because the model is working with non-negative activations. Another slide claims `>10^7` key-query dimensions for BDH versus `~10^3` for Transformers; the short-term memory states are thus projected to fixed, positive, and very high-dimensional spaces, becoming much more expressive and manipulable than KV cache.

The practical issue is obvious: a full `Neurons x Neurons` connectivity matrix is too large. The implementation uses low-rank factorization plus ReLU thresholding, keeping the graph compressed and sparse instead of materializing `N x N`.

Other claims that seem important to put here but need follow up:

* RNNs maybe had the wrong memory/compute ratio: O(N^2) transition parameters but only O(N) state
* BDH memory is more like a noisy fixed-size hash table: sparse keys write to a few buckets, collisions add noise, but memory does not grow one token at a time
* Recovered graphs show modular/heavy-tailed-looking structure
* A Europarl example shows a synapse activating after "US dollar" but not after "US"
* Repeated facts cause fewer active neurons /fewer writes over time, roughly 6% active neurons dropping to about 2%.

I would treat the results as interesting claims to inspect, not proof. The caveats matter:

* This is not a conversion of existing Transformer weights; jan says BDH models train from scratch or at best distill.
* Long-term weights still use backprop and the hebbian style part is
