ARM Codegen
<gba/codegen> compiles ARM instruction sequences at C++ consteval time,
installs them into executable RAM at runtime, and provides zero-overhead patching
to fill in runtime values without re-copying.
Quick start
The main power of codegen is patching: compile the ARM instruction sequence once, then replace runtime values (like loop counts, thresholds, or offsets) without re-copying.
#include <gba/codegen>
#include <gba/args>
#include <cstring>
using namespace gba::codegen;
using namespace gba::literals;
// 1. Define a template with named patch arguments
static constexpr auto add_const = arm_macro([](auto& b) {
b.add_imm(arm_reg::r0, arm_reg::r0, "c"_arg) // r0 = r0 + c
.bx(arm_reg::lr);
});
// 2. Install into executable RAM (once)
alignas(4) std::uint32_t code[add_const.size()] = {};
std::memcpy(code, add_const.data(), add_const.size_bytes());
// 3. Patch and call - reuse the same code buffer with different constants
constexpr auto patch = add_const.patcher<int(int)>();
auto add_10 = patch(code, "c"_arg = 10u);
int result = add_10(5); // 15 = 5 + 10
auto add_100 = patch(code, "c"_arg = 100u);
result = add_100(5); // 105 = 5 + 100
Named placeholders such as "c"_arg are filled at patch time.
No re-copy needed - the same code buffer switches from adding 10 to adding 100.
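To make "filled at patch time" concrete: in the ARM data-processing immediate encoding, an unrotated 8-bit immediate occupies bits 0-7 of the instruction word, so changing the constant amounts to rewriting one field in place. The sketch below models that on plain integers. `patch_imm8` is a hypothetical helper written for illustration, not part of <gba/codegen>, and it assumes the rotate field of the instruction stays zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of <gba/codegen>): overwrite the 8-bit
// immediate field (bits 0-7) of an ARM data-processing instruction word.
// Assumes the 4-bit rotate field (bits 8-11) is zero, so the value is used as-is.
constexpr std::uint32_t patch_imm8(std::uint32_t word, std::uint32_t value) {
    assert(value <= 0xFFu);  // only 0-255 fits the field
    return (word & ~std::uint32_t{0xFF}) | value;
}

// ADD r0, r0, #0 encodes as 0xE2800000; the immediate lives in the low byte.
constexpr std::uint32_t add_r0_r0_imm = 0xE2800000u;

static_assert(patch_imm8(add_r0_r0_imm, 10u) == 0xE280000Au, "add r0, r0, #10");
static_assert(patch_imm8(patch_imm8(add_r0_r0_imm, 10u), 100u) == 0xE2800064u,
              "re-patching replaces the old constant");
```

Only the low byte of one word changes between the two patches, which is why the rest of the buffer never has to be copied again.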
Building templates
arm_macro (preferred)
static constexpr auto tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r0, 42)
.bx(arm_reg::lr);
});
arm_macro infers the required capacity automatically.
All instruction encodings are validated at consteval time - invalid operands are
compile errors, not runtime surprises.
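The idea of turning bad operands into compile errors can be sketched without the library. The example below assumes nothing about how <gba/codegen> is implemented: `encode_mov_imm` is a hypothetical function, and it is written as a plain constexpr function so it also builds before C++20 (declaring it consteval in C++20 would force every call to be checked at compile time).

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical encoder (illustration only): MOV rd, #imm8 with condition AL.
// When evaluated in a constant expression, reaching a throw makes the
// expression non-constant, so the compiler rejects the bad operand.
constexpr std::uint32_t encode_mov_imm(unsigned rd, std::uint32_t imm) {
    if (rd > 15u)   throw std::out_of_range("register index must be 0-15");
    if (imm > 255u) throw std::out_of_range("immediate must be 0-255");
    return 0xE3A00000u | (rd << 12) | imm;
}

// Valid operands: evaluated entirely at compile time.
constexpr std::uint32_t mov_r0_42 = encode_mov_imm(0u, 42u);
static_assert(mov_r0_42 == 0xE3A0002Au, "mov r0, #42");

// Uncommenting the next line fails to compile, because 300 is out of range:
// constexpr std::uint32_t bad = encode_mov_imm(0u, 300u);
```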
arm_macro_builder<N> (explicit capacity)
Use when the capacity must be fixed at the call site, for example inside a constinit
variable or a constexpr template:
constexpr auto tpl = [] {
auto b = arm_macro_builder<4>{};
b.mov_imm(arm_reg::r0, 42).bx(arm_reg::lr);
return b.compile();
}();
b.mark() returns the current word index - useful for computing forward branch targets
before emitting the branch instruction.
compiled_block<N> accessors
| Member | Type | Description |
|---|---|---|
| data() | const arm_word* | Pointer to first instruction word |
| size() | std::size_t | Number of instruction words |
| size_bytes() | std::size_t | Byte count (size() * 4) |
| operator[] | arm_word | Read a single instruction word |
Patch arguments
Codegen supports two patching styles:
- named arguments: "name"_arg
- positional slots: imm_slot(n), s12_slot(n), b_slot(n), instr_slot(n)
Positional slots use an index n (0-31) that maps to a call-site argument.
| Slot | Instruction(s) | Value |
|---|---|---|
| imm_slot(n) | mov_imm, add_imm, sub_imm, orr_imm, and_imm, eor_imm, bic_imm, mvn_imm, rsb_imm, cmp_imm, tst_imm | 0-255 |
| s12_slot(n) | ldr_imm, str_imm | -4095 … +4095 |
| b_slot(n) | b_to, b_if | 24-bit signed word offset |
| instr_slot(n) | instruction(...) / word(...) / literal_word(...) | Any 32-bit word |
word_slot and literal_slot are aliases for instr_slot.
// Named patch args (primary)
static constexpr auto named_tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r0, "x"_arg)
.add_imm(arm_reg::r0, arm_reg::r0, "y"_arg)
.bx(arm_reg::lr);
});
// Positional slots (alternative)
static constexpr auto slot_tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r0, imm_slot(0)) // arg 0 -> 8-bit immediate
.ldr_imm(arm_reg::r1, arm_reg::r2, s12_slot(1)) // arg 1 -> +/-4095 byte offset
.instruction(instr_slot(2)) // arg 2 -> full 32-bit word
.bx(arm_reg::lr);
});
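The symmetric -4095 … +4095 range of s12_slot follows from the ARM LDR/STR immediate encoding, which stores a 12-bit magnitude plus a separate add/subtract bit rather than a two's-complement value. A minimal sketch of that field update follows; `patch_s12` is a hypothetical helper for illustration, not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of <gba/codegen>): write a signed byte offset
// into an ARM LDR/STR immediate-offset instruction word.
// Bits 0-11 hold the magnitude; bit 23 (U) selects add (1) or subtract (0).
constexpr std::uint32_t patch_s12(std::uint32_t word, std::int32_t offset) {
    assert(offset >= -4095 && offset <= 4095);
    const std::uint32_t magnitude =
        static_cast<std::uint32_t>(offset < 0 ? -offset : offset);
    const std::uint32_t u_bit = (offset >= 0) ? (1u << 23) : 0u;
    const std::uint32_t cleared = word & ~((1u << 23) | 0xFFFu);
    return cleared | u_bit | magnitude;
}

// LDR r1, [r2, #0] encodes as 0xE5921000.
static_assert(patch_s12(0xE5921000u, 8) == 0xE5921008u, "ldr r1, [r2, #8]");
static_assert(patch_s12(0xE5921000u, -4) == 0xE5121004u, "ldr r1, [r2, #-4]");
```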
Patching
The primary workflow uses compiled_block::patcher() with named arguments.
This keeps call sites self-documenting and order-independent.
Preferred: compiled_block::patcher() (named args)
static constexpr auto tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r0, "value"_arg).bx(arm_reg::lr);
});
constexpr auto patch = tpl.patcher<int()>();
alignas(4) std::uint32_t code[tpl.size()] = {};
std::memcpy(code, tpl.data(), tpl.size_bytes());
auto fn = patch(code, "value"_arg = 42u); // patch + typed function pointer
Named patch arguments are order-independent and self-documenting.
Zero-overhead variant: block_patcher<tpl> (positional)
Use this when you want fully compile-time patch metadata and positional patch values.
static constexpr auto tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r0, imm_slot(0)).bx(arm_reg::lr);
});
constexpr auto fn_patch = block_patcher<tpl>{}.typed<int()>();
auto fn = fn_patch(code, 42u);
Generic Runtime Dispatch: apply_patches<Sig>(...)
Use this generic runtime function when the block is not available as a constexpr at the call site,
or when patch arguments need to be packed into an array before they are applied.
Variadic form - arguments passed directly:
auto fn = apply_patches<int(int)>(tpl, code, tpl.size(), 42u);
Packed array form - pre-assembled argument array:
std::uint32_t args[] = {30u, 12u};
auto fn = apply_patches_packed<int(int)>(tpl, code, tpl.size(), args, 2);
Whole-instruction patching
Reserve an instruction word and replace it entirely at patch time. Use the checked helpers to build valid instruction values:
static constexpr auto op_tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r2, imm_slot(0))
.instruction(instr_slot(1)) // replaced at runtime
.bx(arm_reg::lr);
});
alignas(4) std::uint32_t code[op_tpl.size()] = {};
std::memcpy(code, op_tpl.data(), op_tpl.size_bytes());
// Pick the operation at runtime
auto add_fn = apply_patches<int(int)>(op_tpl, code, op_tpl.size(),
5u, add_reg_instr(arm_reg::r0, arm_reg::r0, arm_reg::r2));
auto sub_fn = apply_patches<int(int)>(op_tpl, code, op_tpl.size(),
5u, sub_reg_instr(arm_reg::r0, arm_reg::r0, arm_reg::r2));
Available checked instruction helpers:
nop_instr()
add_reg_instr(rd, rn, rm) sub_reg_instr(rd, rn, rm)
orr_reg_instr(rd, rn, rm) and_reg_instr(rd, rn, rm) eor_reg_instr(rd, rn, rm)
lsl_imm_instr(rd, rm, shift) lsr_imm_instr(rd, rm, shift)
mul_instr(rd, rm, rs)
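The value supplied for an instruction slot is an ordinary 32-bit ARM instruction word. As a rough illustration of the kind of word a helper such as add_reg_instr or sub_reg_instr would need to produce (the library's actual implementation is not shown here), the sketch below encodes register-form data-processing instructions by hand. `encode_dp_reg` and the opcode constants are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoder (illustration only): ARM data-processing instruction,
// register operand, no shift, condition AL, flags not updated.
constexpr std::uint32_t encode_dp_reg(std::uint32_t opcode, unsigned rd,
                                      unsigned rn, unsigned rm) {
    assert(opcode <= 15u && rd <= 15u && rn <= 15u && rm <= 15u);
    return 0xE0000000u | (opcode << 21) | (rn << 16) | (rd << 12) | rm;
}

constexpr std::uint32_t op_sub = 0x2u;  // SUB opcode field
constexpr std::uint32_t op_add = 0x4u;  // ADD opcode field

static_assert(encode_dp_reg(op_add, 0, 0, 2) == 0xE0800002u, "add r0, r0, r2");
static_assert(encode_dp_reg(op_sub, 0, 0, 2) == 0xE0400002u, "sub r0, r0, r2");
```

The two words differ only in the opcode field, so switching the operation at patch time is a single word store.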
Callback Patching: apply_word_patches(...)
When instruction word patches are generated dynamically at runtime, use the callback-based
apply_word_patches function instead of apply_patches. This is useful for multi-operation
switching or complex patch-value computation:
static constexpr auto op_tpl = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r2, imm_slot(0))
.instruction(instr_slot(1)) // replaced at runtime via callback
.bx(arm_reg::lr);
});
alignas(4) std::uint32_t code[op_tpl.size()] = {};
std::memcpy(code, op_tpl.data(), op_tpl.size_bytes());
// Use a callback to generate instruction words based on patch index
apply_word_patches(op_tpl, code, op_tpl.size(), [](std::size_t patch_idx) -> std::uint32_t {
// patch_idx == 1 here (the instruction slot)
// Return the desired instruction word
if (some_condition) {
return add_reg_instr(arm_reg::r0, arm_reg::r0, arm_reg::r2);
} else {
return sub_reg_instr(arm_reg::r0, arm_reg::r0, arm_reg::r2);
}
});
Instruction reference
All instructions are available as builder methods on arm_macro_builder<N> and
accepted by the arm_macro lambda.
Data movement
| Builder method | Effect |
|---|---|
| mov_imm(rd, imm8) | rd = imm8 (0-255) |
| mov_imm(rd, imm_slot(n)) | rd = arg[n] at patch time |
| mov_reg(rd, rm) | rd = rm |
Arithmetic
| Method | Effect | Patch variant |
|---|---|---|
| add_imm(rd, rn, imm8) | rd = rn + imm8 | imm_slot |
| add_reg(rd, rn, rm) | rd = rn + rm | |
| sub_imm(rd, rn, imm8) | rd = rn - imm8 | imm_slot |
| sub_reg(rd, rn, rm) | rd = rn - rm | |
| rsb_imm(rd, rn, imm8) | rd = imm8 - rn | imm_slot |
| rsb_reg(rd, rn, rm) | rd = rm - rn | |
| adc_imm(rd, rn, imm8) | rd = rn + imm8 + C | |
| adc_reg(rd, rn, rm) | rd = rn + rm + C | |
| sbc_imm(rd, rn, imm8) | rd = rn - imm8 - !C | |
| sbc_reg(rd, rn, rm) | rd = rn - rm - !C | |
Bitwise
| Method | Effect | Patch variant |
|---|---|---|
| orr_imm(rd, rn, imm8) | rd = rn \| imm8 | imm_slot |
| orr_reg(rd, rn, rm) | rd = rn \| rm | |
| and_imm(rd, rn, imm8) | rd = rn & imm8 | imm_slot |
| and_reg(rd, rn, rm) | rd = rn & rm | |
| eor_imm(rd, rn, imm8) | rd = rn ^ imm8 | imm_slot |
| eor_reg(rd, rn, rm) | rd = rn ^ rm | |
| bic_imm(rd, rn, imm8) | rd = rn & ~imm8 | imm_slot |
| bic_reg(rd, rn, rm) | rd = rn & ~rm | |
| mvn_imm(rd, imm8) | rd = ~imm8 | imm_slot |
| mvn_reg(rd, rm) | rd = ~rm | |
Shifts and rotates
| Method | Shift amount | Range |
|---|---|---|
| lsl_imm(rd, rm, shift) | Immediate | 0-31 |
| lsr_imm(rd, rm, shift) | Immediate | 1-32 |
| asr_imm(rd, rm, shift) | Immediate | 1-32 |
| ror_imm(rd, rm, shift) | Immediate | 1-31 |
| lsl_reg(rd, rm, rs) | Register rs | |
| lsr_reg(rd, rm, rs) | Register rs | |
| asr_reg(rd, rm, rs) | Register rs | |
| ror_reg(rd, rm, rs) | Register rs | |
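The differing immediate ranges come from the ARM shifter-operand encoding: the amount field is 5 bits wide, LSR and ASR reuse the zero value to mean a shift by 32, and a zero amount with ROR selects RRX instead. The sketch below shows this with a hypothetical `encode_shift_imm` function. It is an illustration of the encoding, not the library's code.

```cpp
#include <cassert>
#include <cstdint>

enum class shift_kind : std::uint32_t { lsl = 0, lsr = 1, asr = 2, ror = 3 };

// Hypothetical encoder (illustration only): MOV rd, rm, <shift> #amount.
// The 5-bit amount field cannot hold 32, so LSR/ASR #32 are stored as 0;
// ROR with a zero field means RRX, which is why ROR #0 and ROR #32 are excluded.
constexpr std::uint32_t encode_shift_imm(shift_kind kind, unsigned rd,
                                         unsigned rm, unsigned amount) {
    assert(rd <= 15u && rm <= 15u);
    assert(kind != shift_kind::lsl || amount <= 31u);
    assert(kind != shift_kind::ror || (amount >= 1u && amount <= 31u));
    assert((kind != shift_kind::lsr && kind != shift_kind::asr) ||
           (amount >= 1u && amount <= 32u));
    return 0xE1A00000u | (rd << 12) | ((amount & 31u) << 7) |
           (static_cast<std::uint32_t>(kind) << 5) | rm;
}

static_assert(encode_shift_imm(shift_kind::lsl, 0, 2, 2) == 0xE1A00102u,
              "lsl r0, r2, #2");
static_assert(encode_shift_imm(shift_kind::lsr, 0, 1, 32) == 0xE1A00021u,
              "lsr r0, r1, #32");
```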
Comparison / flag-setting
These set CPSR flags without writing a destination register.
| Method | Flags set on |
|---|---|
| cmp_imm(rn, imm8) / cmp_reg(rn, rm) | rn - operand |
| cmn_imm(rn, imm8) / cmn_reg(rn, rm) | rn + operand |
| tst_imm(rn, imm8) / tst_reg(rn, rm) | rn & operand |
| teq_imm(rn, imm8) / teq_reg(rn, rm) | rn ^ operand |
cmp_imm and tst_imm also accept imm_slot(n).
Memory - word and byte
| Method | Access |
|---|---|
| ldr_imm(rd, rn, offset) / str_imm(rd, rn, offset) | 32-bit word, offset -4095…+4095; accepts s12_slot |
| ldrb_imm(rd, rn, offset) / strb_imm(rd, rn, offset) | Unsigned byte, immediate offset |
| ldrb_reg(rd, rn, rm) / strb_reg(rd, rn, rm) | Unsigned byte, register offset |
Memory - halfword and signed forms
| Method | Access |
|---|---|
| ldrh_imm(rd, rn, offset) / strh_imm(rd, rn, offset) | Unsigned halfword, immediate offset |
| ldrh_reg(rd, rn, rm) / strh_reg(rd, rn, rm) | Unsigned halfword, register offset |
| ldrsb_imm(rd, rn, offset) / ldrsb_reg(rd, rn, rm) | Signed byte |
| ldrsh_imm(rd, rn, offset) / ldrsh_reg(rd, rn, rm) | Signed halfword |
Multi-register and stack
Build a register bitmask with reg_list(r0, r4, lr, ...).
| Method | ARM mnemonic |
|---|---|
| push(regs) | STMDB SP!, {regs} |
| pop(regs) | LDMIA SP!, {regs} |
| ldmia(rn, regs [,wb]) | LDMIA rn[!], {regs} |
| stmia(rn, regs [,wb]) | STMIA rn[!], {regs} |
| ldmib(rn, regs [,wb]) | LDMIB rn[!], {regs} |
| stmib(rn, regs [,wb]) | STMIB rn[!], {regs} |
| ldmda(rn, regs [,wb]) | LDMDA rn[!], {regs} |
| stmda(rn, regs [,wb]) | STMDA rn[!], {regs} |
| ldmdb(rn, regs [,wb]) | LDMDB rn[!], {regs} |
| stmdb(rn, regs [,wb]) | STMDB rn[!], {regs} |
b.push(reg_list(arm_reg::r4, arm_reg::r5, arm_reg::lr));
// ... body ...
b.pop(reg_list(arm_reg::r4, arm_reg::r5, arm_reg::pc));
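In the ARM block-transfer encoding, the register list is a 16-bit mask in which bit n selects register rn. The sketch below builds such a mask and the corresponding PUSH/POP words. `make_reg_mask`, `encode_push`, and `encode_pop` are hypothetical stand-ins for illustration, and the document does not confirm that reg_list uses this exact representation. Pushing lr and popping into pc, as in the snippet above, restores the saved registers and returns in one instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical stand-in for reg_list (illustration only): bit n of the mask
// selects register rn, matching the low 16 bits of LDM/STM instructions.
constexpr std::uint32_t make_reg_mask(std::initializer_list<unsigned> regs) {
    std::uint32_t mask = 0;
    for (unsigned r : regs) {
        assert(r <= 15u);
        mask |= (1u << r);
    }
    return mask;
}

constexpr unsigned r4 = 4, r5 = 5, lr = 14, pc = 15;

// PUSH {regs} is STMDB sp!, {regs}; POP {regs} is LDMIA sp!, {regs}.
constexpr std::uint32_t encode_push(std::uint32_t mask) { return 0xE92D0000u | mask; }
constexpr std::uint32_t encode_pop(std::uint32_t mask)  { return 0xE8BD0000u | mask; }

static_assert(encode_push(make_reg_mask({r4, r5, lr})) == 0xE92D4030u,
              "push {r4, r5, lr}");
static_assert(encode_pop(make_reg_mask({r4, r5, pc})) == 0xE8BD8030u,
              "pop {r4, r5, pc}");
```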
Multiply
ARM7TDMI constraint: rd must differ from rm.
| Method | Effect |
|---|---|
| mul(rd, rm, rs) | rd = rm * rs |
| mla(rd, rm, rs, rn) | rd = rm * rs + rn |
Branches
| Method | Effect |
|---|---|
| b_to(target) | Unconditional, by word index |
| b_to(b_slot(n)) | Patchable branch offset |
| b_if(cond, target) | Conditional, by word index |
| b_if(cond, b_slot(n)) | Patchable conditional branch |
| bl_to(target) | Branch with link |
| bx(rm) | Branch exchange - use for function returns |
| blx(rm) | Branch exchange with link |
arm_cond values:
eq ne cs/hs cc/lo mi pl vs vc hi ls ge lt gt le al
Branching patterns
b_to and b_if take a target word index - the index of the instruction you want
to jump to. Use b.mark() to read the current word index at any point during
construction:
// Loop: count down from r0 to zero
const auto loop_top = b.mark(); // remember top of loop
b.sub_imm(arm_reg::r0, arm_reg::r0, 1); // r0--
b.cmp_imm(arm_reg::r0, 0);
b.b_if(arm_cond::ne, loop_top); // branch back while r0 != 0
b.bx(arm_reg::lr);
For forward branches, emit the branch first, then record where the target lands:
b.cmp_imm(arm_reg::r0, 100);
const auto branch_instr = b.mark(); // index of the b_if we're about to emit
b.b_if(arm_cond::ge, 0); // target unknown yet - placeholder
b.add_imm(arm_reg::r0, arm_reg::r0, 5); // only reached when r0 < 100
// ... forward code goes here ...
Note: Forward branches where the target index is not yet known require
arm_macro_builder<N> with explicit capacity, since you need to emit the branch before you know the target. With arm_macro you can structure control flow so that all targets are emitted before the branch (back-branches) or known from b.mark() arithmetic.
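Word-index targets are a convenience over the hardware encoding, which stores a PC-relative offset. The sketch below shows the arithmetic a builder has to perform to turn a pair of word indexes into an ARM branch word. `encode_branch` and the condition constants are hypothetical names, and this is an illustration of the instruction format rather than the library's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoder (illustration only): ARM B<cond> from word indexes.
// `from` is the index of the branch instruction itself, `to` the target index.
// The CPU reads PC as the branch address + 8 bytes (2 words), so the stored
// 24-bit word offset is to - (from + 2).
constexpr std::uint32_t encode_branch(std::uint32_t cond, std::int32_t from,
                                      std::int32_t to) {
    assert(cond <= 0xEu);
    const std::int32_t offset = to - (from + 2);
    assert(offset >= -(1 << 23) && offset < (1 << 23));
    return (cond << 28) | 0x0A000000u |
           (static_cast<std::uint32_t>(offset) & 0x00FFFFFFu);
}

constexpr std::uint32_t cond_ne = 0x1u;
constexpr std::uint32_t cond_al = 0xEu;

// Backward branch: instruction at index 2 jumping to index 0 (a loop top).
static_assert(encode_branch(cond_ne, 2, 0) == 0x1AFFFFFCu,
              "bne from index 2 to index 0");
// Forward branch: instruction at index 1 jumping to index 4.
static_assert(encode_branch(cond_al, 1, 4) == 0xEA000001u,
              "b from index 1 to index 4");
```

The first static_assert corresponds to the countdown loop above when its sub_imm is at index 0: the b_if sits at index 2 and branches back to index 0.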
AAPCS calling convention
Generated leaf functions receive and return values through the standard ARM AAPCS convention used on GBA. No special setup is needed - just cast the destination pointer to the right type.
| Role | Register |
|---|---|
| Argument 0 | r0 |
| Argument 1 | r1 |
| Argument 2 | r2 |
| Argument 3 | r3 |
| Return value | r0 |
Register-form instructions (add_reg, sub_reg, mul, …) operate directly on
call-time arguments without any patch slots.
Examples
Patched constant (simplest case)
This mirrors the Quick start pattern, using a positional imm_slot(0) in place of the named argument - a patched constant is added to a call-time argument:
static constexpr auto add_const = arm_macro([](auto& b) {
b.add_imm(arm_reg::r0, arm_reg::r0, imm_slot(0))
.bx(arm_reg::lr);
});
alignas(4) std::uint32_t code[add_const.size()] = {};
std::memcpy(code, add_const.data(), add_const.size_bytes());
constexpr block_patcher<add_const> patch{};
auto fn = patch.entry<int(int)>(code, 42u);
int result = fn(8); // 50 = 8 + 42
Function with two call-time arguments
Both arguments come through AAPCS registers; no patching needed:
static constexpr auto add_fn = arm_macro([](auto& b) {
b.add_reg(arm_reg::r0, arm_reg::r0, arm_reg::r1)
.bx(arm_reg::lr);
});
alignas(4) std::uint32_t code[add_fn.size()] = {};
std::memcpy(code, add_fn.data(), add_fn.size_bytes());
auto fn = reinterpret_cast<int (*)(int, int)>(code);
int result = fn(30, 12); // 42
Loop with patched iteration count
Count down from a call-time start value in steps of a patched size; the patched step determines how many iterations run:
// int countdown_by_step(int start) - counts down with a patched step size
static constexpr auto countdown_loop = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r1, 0); // count = 0
const auto loop_start = b.mark(); // loop top: index 1
b.sub_imm(arm_reg::r0, arm_reg::r0, imm_slot(0)); // start -= step_size (patched)
b.add_imm(arm_reg::r1, arm_reg::r1, 1); // count++
b.cmp_imm(arm_reg::r0, 0); // if start <= 0, exit
b.b_if(arm_cond::gt, loop_start); // if start > 0, loop
b.mov_reg(arm_reg::r0, arm_reg::r1); // return count
b.bx(arm_reg::lr);
});
alignas(4) std::uint32_t code[countdown_loop.size()] = {};
std::memcpy(code, countdown_loop.data(), countdown_loop.size_bytes());
constexpr block_patcher<countdown_loop> patch{};
// Patch step size = 1
auto count_by_1 = patch.entry<int(int)>(code, 1u);
int loops_by_1 = count_by_1(10); // 10 iterations: 10, 9, 8, ..., 1, 0
// Re-patch: step size = 2 (no re-copy needed!)
auto count_by_2 = patch.entry<int(int)>(code, 2u);
int loops_by_2 = count_by_2(10); // 5 iterations: 10, 8, 6, 4, 2, 0
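The expected iteration counts can be checked against a plain C++ model of the same control flow. `countdown_model` below is a hypothetical reference function written for this purpose; it mirrors the instruction sequence (subtract, count, compare, branch while positive) and runs on any host compiler. A step of 0 is excluded because the generated loop would never terminate for a positive start value.

```cpp
#include <cassert>

// Reference model (illustration only) of the generated loop: subtract the
// step, count the iteration, and repeat while the remaining value is positive.
// Like the ARM code, the body always runs at least once.
constexpr int countdown_model(int start, int step) {
    assert(step > 0);  // a zero step would never terminate for start > 0
    int count = 0;
    do {
        start -= step;
        ++count;
    } while (start > 0);
    return count;
}

static_assert(countdown_model(10, 1) == 10, "step 1");
static_assert(countdown_model(10, 2) == 5, "step 2");
static_assert(countdown_model(10, 3) == 4, "step 3: 7, 4, 1, -2");
```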
Mixed: call-time arguments and patch-time constant
// x * 4 + c - x is a call-time argument, c is patched in
static constexpr auto scale_add = arm_macro([](auto& b) {
b.add_reg(arm_reg::r0, arm_reg::r0, arm_reg::r0) // *2
.add_reg(arm_reg::r0, arm_reg::r0, arm_reg::r0) // *4
.add_imm(arm_reg::r0, arm_reg::r0, imm_slot(0)) // + c
.bx(arm_reg::lr);
});
constexpr block_patcher<scale_add> patch{};
alignas(4) std::uint32_t code[scale_add.size()] = {};
std::memcpy(code, scale_add.data(), scale_add.size_bytes());
auto fn = patch.entry<int(int)>(code, 2u); // 4x + 2
int r = fn(10); // 42
Callee-save register pattern
// int compute(int a, int b, int c) - (a * b) + (c << 2)
static constexpr auto compute = arm_macro([](auto& b) {
b.push(reg_list(arm_reg::r4, arm_reg::lr));
b.mul(arm_reg::r4, arm_reg::r0, arm_reg::r1); // r4 = a * b (r4 != r0)
b.lsl_imm(arm_reg::r0, arm_reg::r2, 2); // r0 = c << 2
b.add_reg(arm_reg::r0, arm_reg::r4, arm_reg::r0);
b.pop(reg_list(arm_reg::r4, arm_reg::pc));
});
Conditional loop with comparison
// Count iterations from `start` until value reaches `limit`
static constexpr auto count_loop = arm_macro([](auto& b) {
b.mov_imm(arm_reg::r2, 0); // count = 0; index 0
// loop top: index 1
b.cmp_reg(arm_reg::r0, arm_reg::r1);
b.b_if(arm_cond::ge, 6); // exit if r0 >= limit; index 2
b.add_imm(arm_reg::r0, arm_reg::r0, 1);// r0++; index 3
b.add_imm(arm_reg::r2, arm_reg::r2, 1);// count++; index 4
b.b_to(1); // back to loop top; index 5
b.mov_reg(arm_reg::r0, arm_reg::r2); // exit: return count; index 6
b.bx(arm_reg::lr);
});
Patchable threshold
// Returns value * 2 if below threshold, value + 10 otherwise
static constexpr auto threshold_fn = arm_macro([](auto& b) {
b.cmp_imm(arm_reg::r0, imm_slot(0)); // index 0
b.b_if(arm_cond::ge, 4); // index 1 - skip to else
b.add_reg(arm_reg::r0, arm_reg::r0, arm_reg::r0); // *2; index 2
b.b_to(5); // index 3 - skip else
b.add_imm(arm_reg::r0, arm_reg::r0, 10); // +10 (else); index 4
b.bx(arm_reg::lr); // index 5
});
alignas(4) std::uint32_t code[threshold_fn.size()] = {};
std::memcpy(code, threshold_fn.data(), threshold_fn.size_bytes());
// Install with threshold = 50; re-patch any time without re-copying
constexpr block_patcher<threshold_fn> patch{};
auto fn = patch.entry<int(int)>(code, 50u);
Halfword OAM update (GBA sprite system)
// void update_sprite(volatile std::uint16_t* oam, int x, int y)
static constexpr auto update_sprite = arm_macro([](auto& b) {
// attr0: clear Y field, insert new Y
b.ldrh_imm(arm_reg::r3, arm_reg::r0, 0);
b.bic_imm(arm_reg::r3, arm_reg::r3, 0xFF);
b.orr_reg(arm_reg::r3, arm_reg::r3, arm_reg::r2);
b.strh_imm(arm_reg::r3, arm_reg::r0, 0);
// attr1: clear the low 8 bits of the X field, insert new X (assumes 0 <= x <= 255; bit 8 of X is left unchanged)
b.ldrh_imm(arm_reg::r3, arm_reg::r0, 2);
b.bic_imm(arm_reg::r3, arm_reg::r3, 0xFF);
b.orr_reg(arm_reg::r3, arm_reg::r3, arm_reg::r1);
b.strh_imm(arm_reg::r3, arm_reg::r0, 2);
b.bx(arm_reg::lr);
});
Safety notes
- The destination buffer must be word-aligned (alignas(4)) and located in executable RAM (IWRAM or EWRAM on GBA).
- Encoding errors (immediate out of range, invalid register combination) are compile errors in consteval context.
- b_to/b_if targets are in instruction words, not bytes.
- mul/mla: rd ≠ rm (ARM7TDMI hardware constraint).
- These APIs cover leaf-function patterns (AAPCS r0-r3 arguments, r0 return). Stack-passed arguments, calls to other functions, and floating-point are not abstracted.