Hacker News • 114일 전

Rust 나이틀리에 도입된 꼬리 재귀 최적화 인터프리터 후기

IMP

7/10

핵심 요약

Rust 나이틀리 버전에 새롭게 추가된 'become' 키워드를 활용해 Uxn CPU 에뮬레이터를 꼬리 재귀(tail-call) 기반으로 구현한 경험기를 공유했습니다. 이 방식을 적용한 결과, 작성자의 기존 Rust 구현체는 물론 직접 작성한 ARM64 어셈블리 버전보다도 더 높은 성능을 달성하는 놀라운 결과를 얻었습니다. 이 글은 시스템 프로그래밍 및 에뮬레이터 개발에서 Rust의 꼬리 재귀 최적화가 가지는 실용성과 성능적 이점을 보여줍니다.

번역된 본문

Matt Keeter // 블로그 프로젝트 연구 블로그 정보 링크

지난주 저는 최근 나이틀리 Rust에 추가된(7개월 전이 최근 맞죠?) become 키워드를 사용하여 꼬리 재귀(tail-call) 인터프리터를 작성했습니다. 놀라울 정도로 쾌적한 경험이었으며, 결과적으로 만들어진 VM(가상 머신)은 제 이전 Rust 구현체와 직접 손으로 작성한 ARM64 어셈블리 구현체 모두보다 뛰어난 성능을 보였습니다. 꼬리 재귀 호출(tailcall) 기반 기법은 최근 유행하고 있습니다(이 개요 참조). 이 글은 단순하지만 결코 만만치 않은 시스템을 구현한 저의 경험 보고서라고 생각해 주십시오. 상황을 지켜보시는 분들을 위해 말씀드리자면, 이번 글은 Hundred Rabbits 생태계의 수많은 애플리케이션을 구동하는 Uxn CPU의 고성능 에뮬레이션 탐구 시리즈의 최신작입니다. 전체 스토리를 읽고 싶으시다면 다음 목록을 참고하세요:

원래의 Rust 구현체인 Raven에 대한 프로젝트 글
직접 작성한 ARM64 어셈블리 구현체가 Rust 코드의 성능을 뛰어넘는 내용을 다룬 '컴파일러 이기기'
1년 후 코드를 다시 검토하고 테스트 및 CI를 개선한 내용인 '테스트 스위트의 아름다움에 안내 받아'
ARM64 어셈블리 구현체를 x86으로 포팅한(Claude Code의 도움을 받아) 'raven-uxn을 위한 x86-64 백엔드'

LLM(대형 언어 모델) 실험이 논란이 된 것은 놀라운 일이 아니었습니다. 하지만 기쁘게도 이번 꼬리 재귀 코드는 모두 사람이 직접 작성했으며, 새 백엔드는 약간의 성능 저하만 감수하면 x86 어셈블리 백엔드를 대체할 수 있습니다. (이 블로그 글 역시 저의 개인적인 기준에 따라 전적으로 사람이 작성했습니다)

다음 몇 섹션은 이전 작업을 요약한 것이므로, 이미 읽으신 분들은 대충 훑어보고 'Rust에서의 꼬리 재귀 호출' 섹션으로 바로 건너뛰셔도 됩니다.

Uxn 에뮬레이션 기초 Uxn은 256개의 명령어를 가진 단순한 스택 머신입니다. 전체 CPU는 겨우 64KB가 넘는 공간을 차지하며, 몇 가지 메모리로 나뉩니다:

각각 1바이트의 인덱스를 갖는 두 개의 256바이트 스택
데이터와 프로그램 텍스트 모두에 사용되는 65536바이트(64KB)의 RAM
2바이트 프로그램 카운터
주변 기기에 사용되는 256바이트의 "장치 메모리"

가장 단순한 에뮬레이터는 RAM에서 프로그램 카운터 위치의 바이트를 읽은 다음, (프로그램 카운터를 업데이트할 수 있는) 명령어를 호출합니다:

fn run(core: &mut Uxn, dev: &mut Device, mut pc: u16) -> u16 { loop { let op = core.next(&mut pc); let Some(next) = core.op(op, dev, pc) else { break pc; }; pc = next; } }

impl Uxn { fn op( &mut self, op: u8, dev: &mut Device, pc: u16 ) -> Option { match op { op::BRK => self.brk(pc), op::INC => self.inc::<0b000>(pc), op::POP => self.pop::<0b000>(pc), op::NIP => self.nip::<0b000>(pc), op::SWP => self.swp::<0b000>(pc), // ... 기타 등등 op::ORA2kr => self.ora::<0b111>(pc), op::EOR2kr => self.eor::<0b111>(pc), op::SFT2kr => self.sft::<0b111>(pc), } } }

256개의 명령어가 있으며, 그중 상당수는 플래그(flags)를 통해 매개변수화됩니다. 다음은 스택의 최상단 바이트를 증가시키는 INC 명령어의 예시입니다:

impl Uxn { pub fn inc(&mut self, pc: u16) -> Option { let mut s = self.stack_view::(); let v = s.pop(); s.push(v.wrapping_add(1)); Some(pc) } }

모든 연산자(opcode) 구현은 주요 op 함수 내에 인라인 처리되어 있지만, 개선할 여지가 여전히 존재합니다. 일부 값은 레지스터 대신 메모리에 저장되며, 주요 op 선택 분기(branch)는 예측하기가 불가능합니다.

어셈블리의 스레드 코드 (Threaded code) 반면 어셈블리 구현체에서는 스레드 코드(구체적으로는 토큰 스레딩, token threading)를 대신 사용할 수 있습니다. 모든 CPU 상태를 레지스터에 저장한 다음, 각 명령어의 끝에서 다음 명령어로 점프(jump)하도록 합니다:

; x0 | 스택 포인터 ; x1 | 스택 인덱스 ; x4 | RAM 포인터 ; x5 | 프로그램 카운터 ; x8 | 연산자 테이블 _INC: ldrb w9, [x0, x1] ; 스택의 최상단에서 바이트 읽기 add w9, w9, #1 ; 1 증가 strb w9, [x0, x1] ; 다시 쓰기 ldrb w9, [x4, x5] ; RAM에서 다음 연산자 로드 add x5, x5, #1 ; 프로그램 카운터 1 증가 and x5, x5, #0xffff ; 프로그램 카운터 래핑(초기화) ldr x10, [x8, x9, lsl #3] ; 연산자 구현 주소 로드 br x10 ; 연산자 구현으로 점프

이 방식은 디스패치(dispatch) 작업을 각 연산자에 분산시킵니다.

원문 보기

원문 보기 (영어)

Matt Keeter // blog projects research blog about links A tail-call interpreter in (nightly) Rust Last week, I wrote a tail-call interpreter using the become keyword, which was recently added to nightly Rust (seven months ago is recent, right?). It was a surprisingly pleasant experience, and the resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly. Tailcall-based techniques have been all the rage recently (see this overview ); consider this my trip report implementing a simple but non-trivial system. For those keeping track at home, this is the latest in my exploration of high-performance emulation of the Uxn CPU , which runs a bunch of applications in the Hundred Rabbits ecosystem. If you want to read the whole saga, here's the list: Project writeup for Raven , my original Rust implementation Beating the compiler , in which I write an ARM64 assembly implementation which outperforms my Rust code Guided by the beauty of our test suite , in which I revisit the code a year later and improve its testing and CI An x86-64 backend for raven-uxn , in which I port the assembly implementation from ARM64 to x86 (with the help of Claude Code) Experimenting with LLMs proved controversial , which wasn't a surprise; I'm pleased to declare that all of the tail-call code is human-written, and the new backend can be used as a substitute for the x86 assembly backend at a minor performance penalty. (This blog post is also entirely human-written, per my personal standards ) The next few sections summarize previous work, so feel free to skim them if you've done the reading and jump straight to tailcalls in Rust . Basics of Uxn emulation Uxn is a simple stack machine with 256 instructions. The whole CPU has just over 64K of space, split between a few memories: Two 256-byte stacks, each with an index byte 65536 bytes of RAM, which is used for both data and program text A 2-byte program counter 256 bytes of "device memory", used for peripherals The simplest emulator reads a byte from RAM at the program counter, then calls into an instruction (which may update the program counter): fn run(core: &mut Uxn, dev: &mut Device, mut pc: u16) -> u16 { loop { let op = core.next(&mut pc); let Some(next) = core.op(op, dev, pc) else { break pc; }; pc = next; } } impl Uxn { fn op( &mut self, op: u8, dev: &mut Device, pc: u16 ) -> Option<u16> { match op { op::BRK => self.brk(pc), op::INC => self.inc::<0b000>(pc), op::POP => self.pop::<0b000>(pc), op::NIP => self.nip::<0b000>(pc), op::SWP => self.swp::<0b000>(pc), // ... etc op::ORA2kr => self.ora::<0b111>(pc), op::EOR2kr => self.eor::<0b111>(pc), op::SFT2kr => self.sft::<0b111>(pc), } } } There are 256 instructions, many of which are parameterized with flags. Here's the INC instruction, which increments the top byte on the stack: impl Uxn { pub fn inc<const FLAGS: u8>(&mut self, pc: u16) -> Option<u16> { let mut s = self.stack_view::<FLAGS>(); let v = s.pop(); s.push(v.wrapping_add(1)); Some(pc) } } All of the opcode implementations are inlined into the main op function, but there's room for improvement : some values are stored in memory rather than registers, and the main op selection branch is unpredictable. Threaded code in assembly In our assembly implementation, we can instead use threaded code (specifically token threading ). We store all of the CPU state in registers, then end each instruction with a jump to the subsequent instruction: ; x0 | stack pointer ; x1 | stack index ; x4 | ram pointer ; x5 | program counter ; x8 | opcode table _INC: ldrb w9, [x0, x1] ; read the byte from the top of the stack add w9, w9, #1 ; increment it strb w9, [x0, x1] ; write it back ldrb w9, [x4, x5] ; load the next opcode from RAM add x5, x5, #1 ; increment the program counter and x5, x5, #0xffff ; wrap the program counter ldr x10, [x8, x9, lsl #3] ; load the opcode implementation address br x10 ; jump to the opcode's implementation This distributes the dispatch operation across every opcode, making it easier for the branch predictor to learn sequences of opcodes in the program. Overall speedups were significant: 40-50% faster on ARM64, and about 2× faster on x86-64. Unfortunately, it requires maintaining about 2000 lines of code , and is incredibly unsafe . In my x86 port, I introduced an out-of-bounds write, which stomped on a few bytes outside of device RAM; the only symptom was that the fuzzer would segfault when exiting after running a very particular program. So, what's to be done? Tail calls in Rust We'd like to get the same behavior as our assembly implementation – VM state stored in registers, dispatch at the end of each opcode – without hand-writing every instruction in assembly. Fortunately, there is hope! The core idea has almost certainly been reinvented a bunch of times, but I first encountered the idea of tail-call interpreters in the Massey Meta Machine writeup , which was a mind-expanding read. There are two pieces: Store program state in function arguments, which are mapped to registers based on your system's calling convention End each function by calling the next function We could write this today in Rust; here's our inc function: const TABLE: FunctionTable = FunctionTable([ brk, inc::<0b000>, pop::<0b000>, nip::<0b000>, swp::<0b000>, rot::<0b000>, dup::<0b000>, // ...etc ]); fn inc<'a, const FLAGS: u8>( stack_data: &'a mut [u8; 256], stack_index: u8, rstack_data: &'a mut [u8; 256], rstack_index: u8, dev: &'a mut [u8; 256], ram: &'a mut [u8; 65536], mut pc: u16, vdev: &mut dyn Device, ) -> (Uxn<'a>, u16) { let mut core = Uxn { stack: Stack { data: stack_data, index: stack_index, }, ret: Stack { data: rstack_data, index: rstack_index, }, dev, ram, }; match core.inc::<FLAGS>(pc) { Some(pc) => { let op = core.next(&mut pc); TABLE.0[op as usize]( core.stack.data, core.stack.index, core.ret.data, core.ret.index, core.dev, core.ram, pc, vdev, ) } None => (core, pc) } } We want to reuse our existing Uxn opcode implementations, so we reconstruct the core: Uxn object at the beginning of the function, call its inc function, then deconstruct it again when calling the next operation. There's a lot of boilerplate, and it's tempting to just pass a &mut Uxn argument, but that removes the "state is stored in registers" benefit; we'll remove boilerplate with a macro later on. Unfortunately, there's a problem with this implementation: thread 'snapshots::tailcall::mandelbrot' (67889685) has overflowed its stack fatal runtime error: stack overflow, aborting error: test failed, to rerun pass `-p raven-varvara --test snapshots` Even in a release build, the compiler has not optimized out the stack. As we execute more and more operations, the stack gets deeper and deeper until it inevitably overflows. We need tell the compiler to generate a br (branch to register) instead of a bl (branch-and-link) instruction, and – more importantly – not to allocate any persistent space on the stack. In other words, we need a tail call . In nightly Rust, this is a one-word fix: match core.inc::<FLAGS>(pc) { Some(pc) => { let op = core.next(&mut pc); - TABLE.0[op as usize]( + become TABLE.0[op as usize]( core.stack.data, core.stack.index, core.ret.data, core.ret.index, core.dev, core.ram, pc, vdev, ) With this change in place, the Rust compiler makes a guarantee : When tail calling a function, instead of its stack frame being added to the stack, the stack frame of the caller is directly replaced with the callee’s. That's it, everything works! End of writeup! Implementation details Okay, okay, I've got a little more to say. First, I promised a macro to eliminate the boilerplate. As always, it's a horrifying thing to behold: macro_rules! tail_fn { ($name:ident $(::<$flags:ident>)?) => { tail_fn!($name $(::<$flags>)?[][vdev: &mut dyn Device]); }; ($name:ident $(::<$flags:ident>)?($($arg:ident: $ty:ty),*)) => { tail_fn!($name $(::<$flags>)?[$($arg: $ty),*][]); }; ($name:ident $(::<$flags:ident>)?[$($arg0:ident: $ty0:ty),*][$($arg1:id

Rust 에뮬레이터 꼬리 재귀 최적화 어셈블리 성능 최적화