Hacker News • 87 days ago

Refusal in Language Models Is Mediated by a Single Direction

IMP

8/10

Key Summary

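This study shows, across 13 popular open-source chat models, that the mechanism by which conversational large language models refuse harmful requests is mediated by a single one-dimensional subspace (a direction) inside the model. The researchers show that removing this direction disables the model's safeguards, while adding it makes the model refuse even harmless requests. This mechanistic interpretability work highlights the brittleness of current AI safety fine-tuning and suggests that understanding a model's internals can lead to practical techniques for controlling its behavior.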

Translated Text

Computer Science > Machine Learning arXiv:2406.11717 (cs) [Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]

Title: Refusal in Language Models Is Mediated by a Single Direction

Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

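Abstract: Conversational large language models (LLMs) are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.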

Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.

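Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.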
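The two operations the abstract describes, erasing a direction from an activation vector and adding a direction to it, come down to basic vector arithmetic. The following is a minimal sketch on toy data, not the authors' implementation: the helper names (`unit`, `dot`, `ablate_direction`, `add_direction`), the dimension `d_model = 8`, the scale `3.0`, and the randomly drawn `direction` are all assumptions made for illustration, and nothing here is connected to a real model.

```python
import math
import random


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(act, direction):
    """Erase the component of `act` that lies along `direction`.

    Computes act - (act . r_hat) * r_hat, i.e. an orthogonal projection
    onto the subspace perpendicular to the direction.
    """
    r_hat = unit(direction)
    coeff = dot(act, r_hat)
    return [a - coeff * r for a, r in zip(act, r_hat)]


def add_direction(act, direction, scale):
    """Shift `act` along the unit direction by `scale`."""
    r_hat = unit(direction)
    return [a + scale * r for a, r in zip(act, r_hat)]


# Toy stand-ins: a random "activation" and a random, hypothetical direction.
random.seed(0)
d_model = 8
act = [random.gauss(0, 1) for _ in range(d_model)]
direction = [random.gauss(0, 1) for _ in range(d_model)]
r_hat = unit(direction)

ablated = ablate_direction(act, direction)
shifted = add_direction(act, direction, scale=3.0)

# After ablation the projection onto the direction is numerically zero.
print(abs(dot(ablated, r_hat)) < 1e-9)
# After addition the projection grows by exactly the chosen scale.
print(abs(dot(shifted, r_hat) - (dot(act, r_hat) + 3.0)) < 1e-9)
```

Because the ablation is an orthogonal projection, every component of the vector perpendicular to the direction is left unchanged, which is consistent with the abstract's claim of minimal effect on other capabilities.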

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2406.11717 [cs.LG] (or arXiv:2406.11717v3 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2406.11717

Submission history: From Andy Arditi
[v1] Mon, 17 Jun 2024 16:36:12 UTC (237 KB)
[v2] Mon, 15 Jul 2024 11:53:41 UTC (183 KB)
[v3] Wed, 30 Oct 2024 18:57:07 UTC (194 KB)


AI safety, Mechanistic interpretability, Language models, Jailbreak attacks, Fine-tuning