Hacker News • 71 days ago

How AI Discourse Creates Self-Fulfilling Alignment

IMP

8/10

Key Summary

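This study provides the first controlled evidence that AI-related discourse in pretraining data causally affects a model's alignment (how well it conforms to human intent and values). Upsampling documents that portray AI as misaligned makes the model behave in more misaligned ways, whereas upsampling documents that portray aligned behaviour lowers the misalignment score from 45% to 9%. The results suggest that alignment should be considered during pretraining as well, as a complement to post-training.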

Paper Details

Computer Science > Computation and Language arXiv:2601.10160 (cs) [Submitted on 15 Jan 2026 (v1), last revised 19 Feb 2026 (this version, v2)]

Title: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Authors: Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien

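Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment (conformity to human intent and values) remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, large language models (LLMs) may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment.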

This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment.
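The abstract does not say how the upsampling was implemented, so the following is only a minimal sketch of what upsampling one document category in a pretraining mixture could look like. The category names, base weights, upsampling factor, and helper functions (`upsample_mixture`, `sample_sources`, `relative_reduction`) are all hypothetical illustrations and are not taken from the paper. Only the 45% and 9% figures come from the source.

```python
import random
from collections import Counter


def upsample_mixture(base_weights, category, factor):
    """Multiply one category's sampling weight by `factor`, then renormalise.

    Hypothetical helper: illustrates upsampling in general, not the paper's pipeline.
    """
    if category not in base_weights:
        raise KeyError(f"unknown category: {category}")
    if factor <= 0:
        raise ValueError("factor must be positive")
    weights = dict(base_weights)
    weights[category] *= factor
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def sample_sources(probs, n, seed=0):
    """Draw `n` document categories according to the mixture probabilities."""
    rng = random.Random(seed)
    names = list(probs)
    draws = rng.choices(names, weights=[probs[name] for name in names], k=n)
    return Counter(draws)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Assumed base mixture; the real corpus proportions are not given in the abstract.
base = {"general_web": 0.98, "ai_misaligned": 0.01, "ai_aligned": 0.01}

# Upsample the aligned-behaviour documents by an assumed factor of 5.
mix = upsample_mixture(base, "ai_aligned", 5.0)
counts = sample_sources(mix, n=10_000)

# Misalignment scores reported in the abstract: 45% before, 9% after.
drop = relative_reduction(0.45, 0.09)

print(mix)
print(counts)
print(f"relative reduction: {drop:.2f}")
```

On the reported numbers, the change from 45% to 9% is a drop of 36 percentage points, or a relative reduction of about 80%.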

These effects are dampened by post-training but persist. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend that practitioners consider pretraining for alignment alongside capabilities.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2601.10160 [cs.CL] (or arXiv:2601.10160v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2601.10160

Submission history: From: Kyle O'Brien
[v1] Thu, 15 Jan 2026 07:59:31 UTC (1,982 KB)
[v2] Thu, 19 Feb 2026 22:53:56 UTC (2,369 KB)


Original Text (English)

Title: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Authors: Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien

Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners consider pretraining for alignment alongside capabilities. We share our models, data, and evaluations at this http URL.

Alignment, Pretraining, LLM, Data Quality, AI Safety