MarkTechPost • 89 days ago

Qwen Team Open-Sources Qwen-Scope (SAEs) for Decoding the Internal Structure of LLMs

IMP

8/10

Key Summary

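Alibaba Cloud's Qwen team has released Qwen-Scope, an open-source suite of sparse autoencoders (SAEs) for interpreting and controlling the internal workings of large language models (LLMs). The tool decomposes a model's internal states into human-understandable concepts such as language or style, so it can serve as a debugging and development tool that steers model output at inference time without modifying any weights. This lets developers diagnose model misbehavior, and evaluate and correct it in the desired direction, without spending expensive compute.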

Translated Body

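Large language models (LLMs) are remarkably capable, yet their inner workings are frustratingly opaque. When a model misbehaves, for example by responding in the wrong language, repeating itself endlessly, or refusing safe requests, AI developers have very few tools to diagnose why the problem occurred at the level of internal computations. Qwen-Scope was built to solve exactly this problem.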

The Qwen team recently released Qwen-Scope, an open-source suite of sparse autoencoders (SAEs) trained on the Qwen3 and Qwen3.5 model families. The release comprises 14 groups of SAE weights across 7 model variants: five dense models (Qwen3-1.7B, Qwen3-8B, Qwen3.5-2B, Qwen3.5-9B, Qwen3.5-27B) and two mixture-of-experts (MoE) models (Qwen3-30B-A3B, Qwen3.5-35B-A3B).

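What Is a Sparse Autoencoder (SAE), and Why Should You Care?

Think of an SAE as a translation layer between raw neural network activations and human-understandable concepts. When an LLM processes text, it produces high-dimensional hidden states, vectors of thousands of numbers, that are very hard to interpret directly. An SAE is trained to decompose these activations into a large dictionary of sparse latent features, meaning that each input activates only a small fraction of the features. Each active feature tends to correspond to a human-interpretable concept such as a specific language, a writing style, or a safety-relevant behavior.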

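Concretely, Qwen-Scope trains a separate SAE for each backbone and transformer layer to reconstruct residual-stream activations from a sparse set of latent features. The SAE encoder maps each activation to an overcomplete latent representation, and a Top-k activation rule keeps only the k largest latent activations for reconstruction (k is set to either 50 or 100 in this release). For dense backbones, the SAE width is 16× the model's hidden size. For MoE backbones, the standard SAEs use a 32K width (16× expansion), and wider SAEs of up to 128K width (64× expansion) are also provided to capture finer-grained representation structure. The result is a layer-wise feature dictionary for every transformer layer across all seven backbones. One important technical detail: Qwen3.5-27B is the only backbone whose SAEs were trained on the instruct variant; the other six backbones all use base model checkpoints.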
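The Top-k rule described above can be illustrated with a toy forward pass. This is a minimal sketch under assumptions, not the released implementation: the function name `topk_sae_forward`, the weight layout, the random weights, and the tiny sizes are all hypothetical, and it applies the Top-k selection directly to the encoder pre-activations (whether the released SAEs also apply a ReLU or center the input first is not stated here).

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def topk_sae_forward(h, w_enc, b_enc, w_dec, b_dec, k):
    """Encode h, keep the k largest latents, and reconstruct h from them.

    Assumed layout: w_enc and w_dec both hold one row per latent feature,
    each row of length d_model. A w_dec row is that feature's direction.
    """
    # Pre-activations of the overcomplete latent layer.
    pre = [z + b for z, b in zip(matvec(w_enc, h), b_enc)]
    # Top-k rule: keep only the indices of the k largest pre-activations.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = {i: pre[i] for i in keep}
    # Reconstruction: decoder bias plus the weighted sum of kept directions.
    recon = list(b_dec)
    for i, activation in latents.items():
        for j, weight in enumerate(w_dec[i]):
            recon[j] += activation * weight
    return latents, recon

random.seed(0)
d_model, expansion, k = 8, 16, 4    # toy sizes; the release uses k = 50 or 100
n_latents = d_model * expansion     # 16x expansion, as for the dense backbones
w_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_latents)]
w_dec = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d_model
h = [random.gauss(0, 1) for _ in range(d_model)]

latents, recon = topk_sae_forward(h, w_enc, b_enc, w_dec, b_dec, k)
print(len(latents), "of", n_latents, "latents active")
```

With untrained random weights the reconstruction is meaningless; the point is only the shape of the computation: a wide latent layer of which exactly k entries survive for each input.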

Four Ways Qwen-Scope Changes the Development Workflow

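Inference-Time Steering

The most immediate application is steering: influencing a model's output without modifying its weights. The idea rests on a well-supported hypothesis that high-level behaviors are encoded as directions in the model's internal representation space. By adding a feature direction to the residual stream, or subtracting it, at inference time using h' ← h + αd (where h is the hidden state, d is the SAE feature direction, and α is the strength), engineers can push the model toward a desired behavior or suppress a specific one. The research team demonstrated two case studies on Qwen3 models. In the first, a model prompted in English unexpectedly mixed Chinese text into its output. Ranking SAE features by activation strength revealed a highly activated Chinese-language feature (id: 6159). Suppressing this feature during generation removed the language mixing entirely. In the second, activating a classical-Chinese feature (id: 36398) successfully steered a story-continuation task toward a classical literary style. Neither example required a single update to the model weights.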
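The steering update h' ← h + αd is a single vector addition, which the sketch below applies to a toy hidden state. The function names `steer` and `projection`, the four-dimensional vectors, and the α values are illustrative assumptions; in practice d would be a feature direction taken from an SAE and the update would be applied inside the model's residual stream during generation.

```python
def steer(h, d, alpha):
    """Return h' = h + alpha * d for one residual-stream vector."""
    if len(h) != len(d):
        raise ValueError("hidden state and feature direction must have the same size")
    return [h_i + alpha * d_i for h_i, d_i in zip(h, d)]

def projection(h, d):
    """Component of h along d (dot product divided by the norm of d)."""
    norm = sum(x * x for x in d) ** 0.5
    return sum(h_i * d_i for h_i, d_i in zip(h, d)) / norm

h = [0.5, -1.0, 2.0, 0.0]   # toy hidden state
d = [0.0, 0.0, 1.0, 0.0]    # toy unit-norm feature direction

amplified = steer(h, d, alpha=3.0)     # positive alpha pushes toward the feature
suppressed = steer(h, d, alpha=-2.0)   # negative alpha pushes away from it

print(projection(h, d), projection(amplified, d), projection(suppressed, d))
```

A positive α increases the hidden state's component along the feature direction and a negative α reduces it; here α = -2.0 cancels the component exactly because the toy state starts with a projection of 2.0 on a unit-norm direction.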
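Evaluation Analysis Without Running Models

Evaluating LLMs typically requires running many forward passes over large benchmark datasets, which is expensive in compute and time. Qwen-Scope proposes a cheaper alternative: using SAE feature activations as a representation-level proxy for benchmark analysis. The core insight is that when a model processes a benchmark sample, the SAE decomposes the corresponding activation into a sparse set of active features.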
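One way to picture this proxy: once each benchmark sample has been reduced to its set of active SAE feature ids, benchmarks can be compared with ordinary set operations. The sketch below is an illustration under assumptions, not the research team's metric: the helper names (`benchmark_features`, `coverage`, `jaccard`), the choice of set coverage and Jaccard overlap, and the toy feature ids are all hypothetical.

```python
def benchmark_features(samples):
    """Union of the SAE feature ids that are active on any sample."""
    features = set()
    for active_ids in samples:
        features.update(active_ids)
    return features

def coverage(target, other):
    """Fraction of target's features that also appear in other."""
    return len(target & other) / len(target) if target else 0.0

def jaccard(a, b):
    """Symmetric overlap: shared features divided by all features seen."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy data: each sample is the set of feature ids active on that sample.
bench_a = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
bench_b = [{2, 3}, {3, 4, 6}, {7, 8}]

feats_a = benchmark_features(bench_a)
feats_b = benchmark_features(bench_b)

print("share of A's features covered by B:", coverage(feats_a, feats_b))
print("Jaccard overlap of A and B:", jaccard(feats_a, feats_b))
```

Under this reading, a benchmark whose features are mostly covered by another is a candidate for omission, while two benchmarks with little overlap probe different capabilities.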


Original Text (English)

Large language models are remarkably capable, yet frustratingly opaque. When a model misbehaves (generating responses in the wrong language, repeating itself endlessly, or refusing safe requests), AI devs have very few tools to diagnose why it happened at the level of internal computations. That's the problem Qwen-Scope is built to solve.

Qwen Team just released Qwen-Scope, an open-source suite of sparse autoencoders (SAEs) trained on the Qwen3 and Qwen3.5 model families. The release comprises 14 groups of SAE weights across 7 model variants: five dense models (Qwen3-1.7B, Qwen3-8B, Qwen3.5-2B, Qwen3.5-9B, and Qwen3.5-27B) and two mixture-of-experts (MoE) models (Qwen3-30B-A3B and Qwen3.5-35B-A3B).

What is a Sparse Autoencoder, and Why Should You Care?

Think of a sparse autoencoder as a translation layer between raw neural network activations and human-understandable concepts. When an LLM processes text, it produces high-dimensional hidden states (vectors with thousands of numbers) that are difficult to interpret directly. An SAE learns to decompose these activations into a large dictionary of sparse latent features, where each input activates only a small subset of features. Each of those features tends to correspond to a specific, interpretable concept: a language, a style, a safety-relevant behavior.

Concretely, for each backbone and transformer layer, Qwen-Scope trains a separate SAE to reconstruct residual-stream activations using a sparse set of latent features. The SAE encoder maps each activation to an overcomplete latent representation, and a Top-k activation rule keeps only the largest k latent activations for reconstruction (with k set to either 50 or 100 in the release). For dense backbones, the SAE width scales to 16× the model hidden size; for MoE backbones, standard SAEs use 32K width (16× expansion), and wider SAEs up to 128K width (64× expansion) are also released to capture finer-grained representation structure. The result is a layer-wise feature dictionary for every transformer layer across all seven backbones. One important technical detail: Qwen3.5-27B is the only backbone whose SAEs are trained on the instruct variant; all other six backbones use their base model checkpoints.

Four Ways Qwen-Scope Changes the Development Workflow

1. Inference-Time Steering

The most immediate application is steering: influencing model output without modifying any model weights. The idea rests on a well-supported hypothesis: high-level behaviors are encoded as directions in the model's internal representation space. By adding or subtracting a feature direction from the residual stream at inference time using the formula h' ← h + αd, where h is the hidden state, d is the SAE feature direction, and α controls strength, engineers can push the model toward or away from specific behaviors.

The research team demonstrates two case studies on Qwen3 models. In the first, a model prompted in English unexpectedly mixes in Chinese text. Ranking SAE features by activation strength reveals a highly activated Chinese-language feature (id: 6159). Suppressing it during generation removes the language mixing entirely. In the second, activating a classical-Chinese feature (id: 36398) successfully steers a story-continuation task toward a classical literary style. Both examples required zero weight updates.

2. Evaluation Analysis Without Running Models

Evaluating LLMs typically means running many forward passes across large benchmark datasets, which is expensive in compute and time. Qwen-Scope proposes a cheaper alternative: using SAE feature activations as a representation-level proxy for benchmark analysis. The core insight is that when a model processes a benchmark sample, the SAE decomposes its activation into a sparse set of active features, each interpretable as a 'micro-capability.' A benchmark whose samples all activate the same features is redundant; two benchmarks that activate largely overlapping feature sets are similar.

The research team defines a feature redundancy metric that achieves a Spearman rank correlation of ρ ≈ 0.85 with performance-based redundancy across 17 widely-used benchmarks (including MMLU, GSM8K, MATH, EvalPlus, and GPQA-Diamond) without running a single model evaluation. The analysis also reveals that 63% of GSM8K's features are already covered by MATH, suggesting that evaluation suites containing MATH can safely omit GSM8K with minimal loss of discriminative information.

The framework also extends to inter-benchmark similarity: the research team measures feature overlap between pairs of benchmarks to determine whether they probe the same capabilities. After controlling for general model ability by partialing out MMLU scores, the partial Pearson correlation between feature overlap and performance-based similarity across 28 benchmark pairs improves to 75.5%, providing evidence that feature overlap captures benchmark-specific capability similarity rather than just general model quality. This has a direct practical implication: benchmarks with low mutual feature overlap probe distinct capabilities and should both be retained; benchmarks with high overlap are candidates for consolidation.

3. Data-Centric Workflows: Toxicity Classification and Safety Data Synthesis

SAE features also prove effective as lightweight classifiers. The research team builds a multilingual toxicity classifier across 13 languages using a simple two-stage pipeline: identify SAE features that fire more frequently on toxic examples than clean ones (on a small discovery set), then apply an OR-rule over those features on held-out test data, with no additional classifier head and no gradient-based fitting. On English, this achieves an F1 score above 0.90 on both Qwen3-1.7B and Qwen3-8B. The research team further shows that features discovered in English transfer meaningfully to other languages without rediscovery: performance declines with linguistic distance (strongest for European languages like Russian and French, weaker for Arabic, Chinese, and Amharic), and scaling to Qwen3-8B improves both the level and stability of cross-lingual transfer. Crucially, using only 10% of the original discovery data still recovers about 99% of classification performance, demonstrating strong data efficiency.

On the synthesis side, the research team introduces a feature-driven safety data synthesis pipeline: identify safety-relevant SAE features that are missing from existing supervision, generate prompt-completion pairs designed to activate those features, and verify retention in feature space. Under a matched budget, feature-driven synthesis achieves 99.74% coverage of the target safety feature set, compared to the substantially lower coverage achieved by natural sampling or random safety-related synthesis. Adding 4k feature-driven synthetic examples to 4k real safety examples produces a safety accuracy of 77.75, approaching the performance of training on 120k safety-only examples.

4. Post-Training: Supervised Fine-Tuning and Reinforcement Learning

Perhaps the most technically novel contribution is using SAE features as signals during training, not just inference. For supervised fine-tuning, the research team addresses unexpected code-switching, where multilingual LLMs spontaneously produce tokens in an unintended language. Their method, called Sparse Autoencoder-guided Supervised Fine-Tuning (SASFT), first identifies language-specific features via a monolinguality score, then introduces an auxiliary regularization loss that suppresses those feature activations during training on non-target-language data. Across five models spanning three model families (Gemma-2, Llama-3.1, and Qwen3) and three target languages (Chinese, Russian, and Korean), SASFT achieves over 50% reduction in c

Large language models, open source, sparse autoencoders, model interpretability, Qwen