Hacker News • 59 days ago

Rotary GPU: Exploring Local Execution of Large MoE Models Under Limited VRAM

IMP

8/10

Key Summary

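This paper proposes "Rotary GPU", a technique for running a large Mixture-of-Experts (MoE) model of roughly 35 billion parameters locally on a consumer laptop with only 8 GB of VRAM. In the reported experiment, the system sustained a decode throughput of 21.06 tokens per second while using about 6.3 GB of VRAM. The author presents the results as exploratory rather than definitive, but they suggest that large language models (LLMs) could be put to use in environments where hardware, security, or budget constraints make cloud infrastructure impractical.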

Translated Text

Computer Science > Performance, arXiv:2605.29135 (cs) [Submitted on 27 May 2026]

Title: Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Author: Myeong Jun Jo (조명준)

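Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints that limit their access to large accelerator clusters. As models continue to improve, deployment accessibility may matter as much as capability itself.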

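This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted by running a Qwen3.6-35B-A3B-class Mixture-of-Experts (MoE) model locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage, with an observed decode throughput of 21.06 tokens per second (t/s).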
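As a rough aid to reading these figures, the sketch below derives a few quantities from the numbers reported in the abstract: total decode time, per-token latency, and VRAM headroom. It also estimates the weight footprint of a 35-billion-parameter model at two precisions. The precisions (4-bit and 16-bit), the reading of "35B" as the total parameter count, and all function and constant names are assumptions made for illustration; the abstract does not state the precision or the residency mechanism actually used.

```python
# Back-of-the-envelope arithmetic based on the figures reported in the abstract.
# The bytes-per-parameter values are illustrative assumptions, NOT from the paper.

OUTPUT_TOKENS = 2048    # reported number of generated tokens
DECODE_TPS = 21.06      # reported decode throughput (tokens per second)
VRAM_USED_GB = 6.3      # reported approximate VRAM usage
VRAM_TOTAL_GB = 8.0     # VRAM of the RTX 4060 Laptop GPU
TOTAL_PARAMS = 35e9     # assumption: "35B" in the model name is the total parameter count


def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to decode `tokens` at a constant throughput."""
    return tokens / tokens_per_second


def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    total_s = decode_seconds(OUTPUT_TOKENS, DECODE_TPS)
    print(f"decode time for {OUTPUT_TOKENS} tokens: {total_s:.1f} s")
    print(f"latency per token: {1000.0 / DECODE_TPS:.1f} ms")
    print(f"VRAM headroom: {VRAM_TOTAL_GB - VRAM_USED_GB:.1f} GB")

    # Hypothetical precisions, chosen only to show the scale of the gap.
    for label, bytes_per_param in (("4-bit", 0.5), ("16-bit", 2.0)):
        size_gb = weight_footprint_gb(TOTAL_PARAMS, bytes_per_param)
        fits = "fits" if size_gb <= VRAM_TOTAL_GB else "does not fit"
        print(f"{label} weights: {size_gb:.1f} GB ({fits} in {VRAM_TOTAL_GB:.0f} GB)")
```

At the reported throughput, generating 2048 tokens takes roughly 97 seconds. Under the assumed precisions, the full weight set would exceed 8 GB of VRAM, which is consistent with the abstract's framing of the approach around accelerator residency rather than keeping the whole model on the GPU.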

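The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to constrained environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest that deployment accessibility deserves continued investigation as these models evolve.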

Comments: 10 pages, 3 figures. Also archived at Zenodo (DOI: https://doi.org/10.5281/zenodo.20406471). Related to Korean Patent Publication KR 10-2026-0070380.
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes: C.1.4; I.2.7
Cite as: arXiv:2605.29135 [cs.PF] (or arXiv:2605.29135v1 [cs.PF] for this version)
https://doi.org/10.48550/arXiv.2605.29135
Submission history: Myeong Jun Jo, [v1] Wed, 27 May 2026 21:57:36 UTC (12 KB)

Original Text (English)

Computer Science > Performance
arXiv:2605.29135 (cs) [Submitted on 27 May 2026]

Title: Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Authors: Myeong Jun Jo

Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters, and as models continue to improve, deployment accessibility may matter as much as capability itself. This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted using a Qwen3.6-35B-A3B-class Mixture-of-Experts model executed locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest deployment accessibility deserves continued investigation as these models evolve.

Comments: 10 pages, 3 figures. Also archived at Zenodo (DOI: https://doi.org/10.5281/zenodo.20406471). Related to Korean Patent Publication KR 10-2026-0070380
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes: C.1.4; I.2.7
Cite as: arXiv:2605.29135 [cs.PF] (or arXiv:2605.29135v1 [cs.PF] for this version)
https://doi.org/10.48550/arXiv.2605.29135
Related DOI: https://doi.org/10.5281/zenodo.20406471
Submission history: From: Myeong Jun Jo, [v1] Wed, 27 May 2026 21:57:36 UTC (12 KB)

local-LLM MoE memory-optimization on-device-AI GPU-acceleration