The Compiler Processing Unit (CPUA): Reimagining Compilation as a Hardware Problem
The Compiler Processing Unit (CPUA) is a pure graph/dataflow accelerator with obscene memory bandwidth. Below is a concrete, buildable spec you can prototype on an FPGA and later harden into an ASIC.
(CPUA here = Compiler Processing Unit Accelerator, not the host CPU.)
1) High-level objective
Accelerate the mid-end and backend of compilers (LLVM/MLIR/GCC) by offloading:
- Bitset-style dataflow analyses (reachability, liveness, dominators, alias approximations)
- CFG/SSA graph transforms (construction, renaming, φ-resolution, SCCs)
- Instruction selection via tree/DAG covering
- Register allocation (hybrid linear-scan + bounded ILP for hot blocks)
- List scheduling with resource/latency tables
All tuned for pointer-heavy, irregular workloads where latency + bandwidth beat FLOPs.
2) Board & package
- Form factor: PCIe Gen5 x16 card; optional CXL 3.0 Type-2 device for shared-memory attach.
- On-package memory: 32–128 GB HBM3/3e (≥1.5 TB/s aggregate BW). This is the main performance lever.
- On-die SRAM: 64–256 MB distributed across clusters for hot sets/bitsets/queues.
- Coherency: With CXL, expose a coherent window; with PCIe, use ATS/PRI or bounce buffers.
3) Compute fabric (no tensor units—just graph/dataflow)
3.1 MIMD micro-cores
- Count: 256–1024 tiny out-of-order-ish or wide-issue VLIW cores.
- ISA: 64-bit scalar + rich bit-manipulation, byte-lane ops, and predication.
- Threading: 4–8 HW threads/core (latency hiding).
- Local scratch: 128–512 KB/core SRAM with 1-cycle bitset ops (e.g., 2048-bit wide AND/OR/XOR/POPCOUNT via microcoded wide slices).
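To make the wide-slice model concrete, here is a minimal host-side sketch of one 2048-bit slice as 32 contiguous 64-bit lanes; the type and method names are illustrative, not part of any device ABI.
// bitset_slice.hpp (illustrative host-side model of one 2048-bit slice)
#include <array>
#include <bit>
#include <cstdint>

struct BitsetSlice {                       // 2048 bits = 32 x 64-bit lanes
  std::array<uint64_t, 32> w{};

  BitsetSlice operator&(const BitsetSlice& o) const {   // one-cycle wide AND on device
    BitsetSlice r;
    for (int i = 0; i < 32; ++i) r.w[i] = w[i] & o.w[i];
    return r;
  }
  BitsetSlice operator|(const BitsetSlice& o) const {   // wide OR
    BitsetSlice r;
    for (int i = 0; i < 32; ++i) r.w[i] = w[i] | o.w[i];
    return r;
  }
  int popcount() const {                   // block-popcount primitive
    int n = 0;
    for (uint64_t x : w) n += std::popcount(x);
    return n;
  }
};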
3.2 Fixed-function graph/dataflow blocks
All are programmable by tables/descriptors:
- Bitset Engine (BSE):
  - Primitive: OUT = f(IN, TRANSFER, MEET) over very wide bitsets (1k–64k bits); see the host-side sketch after this list.
  - Accels: block-popcount, streaming AND/OR, sparse-run compression, worklist scheduler in hardware.
- CFG/SSA Kernel (CSK):
  - Dominators (Lengauer-Tarjan), post-doms, DF, RPO/PO numbering.
  - SSA construct/rename, φ-pruning, copy-coalescing hints.
- SCC/Topo Unit (STU):
  - Tarjan/Kosaraju variants, Kahn topological sort, loop nesting tree build.
- DAG Cover/Matcher (DCM):
  - Bottom-up tree/DAG pattern matching with programmable rules (imported from LLVM .td / TableGen).
  - Cost tables in SRAM; supports chain rules, commutativity, target predicates.
- Register Allocation Engine (RAE):
  - Linear-scan with live-range splitting & spill heuristics.
  - Optional bounded ILP/SAT micro-solver for hot blocks (timeboxed, e.g., 50–200 µs).
- List Scheduler (LSU):
  - Resource tables (ports, FU counts, latencies), anti/output deps; tie-breaking by critical path/slack.
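As a host-side reference for the BSE primitive, the sketch below models one transfer-plus-meet step, assuming the classic GEN/KILL form OUT = GEN ∪ (IN \ KILL) for the transfer function and AND or OR for the meet; all names are illustrative.
// bse_reference.cpp (host-side model of one BSE step: OUT = f(IN, TRANSFER, MEET))
#include <cstdint>
#include <vector>

struct Transfer { std::vector<uint64_t> gen, kill; };   // per-row GEN/KILL masks

enum class Meet { And, Or };

// Applies OUT = (IN & ~KILL) | GEN, then folds in the meet operand.
// Returns true if the row changed, i.e. its successors must be re-enqueued.
bool bse_step(std::vector<uint64_t>& row,
              const Transfer& t,
              const std::vector<uint64_t>& meet_operand,
              Meet meet) {
  bool changed = false;
  for (size_t i = 0; i < row.size(); ++i) {
    uint64_t out = (row[i] & ~t.kill[i]) | t.gen[i];
    out = (meet == Meet::And) ? (out & meet_operand[i])
                              : (out | meet_operand[i]);
    if (out != row[i]) { row[i] = out; changed = true; }
  }
  return changed;
}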
4) Memory hierarchy (the star of the show)
- HBM arena allocator: IR objects (basic blocks, instructions, operands) packed in SoA layouts for stride-friendly prefetching.
- Pointer compression: 32-bit region-relative handles; accelerators understand handles natively.
- Bitset store: Row-major chunks aligned to 512B; hardware supports range masks (operate on subsets).
- Symbol/str tables: Hardware hash + de-dup to accelerate LTO symbol churn.
- DMA descriptors: Scatter/gather for IR import/export; batch many functions/modules at once.
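A minimal sketch of how the 32-bit region-relative handles and SoA regions might look to host code, assuming one contiguous HBM-backed region per array; the names here are illustrative.
// handles.hpp (illustrative: 32-bit region-relative handles into the HBM arena)
#include <cstdint>

using Handle = uint32_t;                 // offset (in elements) from the region base

template <typename T>
struct Region {                          // one SoA array inside the device arena
  T*       base;                         // host mapping of the HBM region
  uint32_t count;

  T&       operator[](Handle h)       { return base[h]; }
  const T& operator[](Handle h) const { return base[h]; }
};

// Example: an operand slot stores a 32-bit handle into the value table
// instead of a 64-bit host pointer, halving operand footprint and letting
// the accelerators dereference it natively.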
5) Programming & host integration
5.1 Offload API (C header)
// libcpua.h (Compiler Processing Unit Accelerator)
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpua_mod_t;
typedef uint64_t cpua_func_t;

typedef enum {
  CPUA_OK = 0, CPUA_EBUSY, CPUA_EINVAL, CPUA_ENOMEM, CPUA_ETIME
} cpua_status_t;

typedef struct {
  uint32_t passes_bitmap;   // e.g., bit0=GVN, bit1=SROA, ...
  uint32_t flags;           // debug, deterministic, etc.
} cpua_pipeline_t;

typedef struct {
  const void* ir_blob;      // compact SoA IR (see §6.1)
  size_t      ir_bytes;
  uint32_t    target_id;    // e.g., x86_64_zen4, armv9_neon, rv64gc
  uint32_t    opt_level;    // 0..3
} cpua_module_desc_t;

cpua_status_t cpua_init(void);
cpua_status_t cpua_load_module(const cpua_module_desc_t*, cpua_mod_t* out);
cpua_status_t cpua_run_pipeline(cpua_mod_t, const cpua_pipeline_t*);
cpua_status_t cpua_inst_select(cpua_mod_t);
cpua_status_t cpua_reg_alloc(cpua_mod_t, uint32_t strategy_flags);
cpua_status_t cpua_schedule(cpua_mod_t, uint32_t uarch_id);
cpua_status_t cpua_emit_obj(cpua_mod_t, void** out_ptr, size_t* out_bytes);
void          cpua_free_module(cpua_mod_t);
5.2 LLVM/MLIR glue (outline)
- LLVM Pass: CPUAOffloadPass buckets functions by size/shape, serializes IR into Compact-IR (SoA), and launches pipelines via libcpua.
- Target hooks: extend TargetTransformInfo to advertise the DCM rules and RAE knobs the device supports.
- Fallback path: if the board is busy/absent, run stock LLVM passes.
Minimal skeleton:
// Pseudocode (error handling and host fallback elided)
PreservedAnalyses CPUAOffloadPass::run(Module &M, ModuleAnalysisManager &AM) {
  auto Bins = bucketize(M);                                  // group funcs by size/CFG shape
  for (auto &Bin : Bins) {
    cpua_module_desc_t Desc = serialize_to_compact_ir(Bin);  // pack SoA Compact-IR (§6.1)
    cpua_mod_t Mod;
    cpua_load_module(&Desc, &Mod);
    cpua_run_pipeline(Mod, &PipelineSpec);                   // mid-end passes selected by bitmap
    cpua_inst_select(Mod);
    cpua_reg_alloc(Mod, RA_LINEAR_SCAN | RA_BOUNDED_ILP);
    cpua_schedule(Mod, UArchId);
    void *Obj = nullptr; size_t ObjBytes = 0;
    cpua_emit_obj(Mod, &Obj, &ObjBytes);
    link_replace(M, Obj, ObjBytes);                          // splice device output back into the module
    cpua_free_module(Mod);
  }
  return PreservedAnalyses::none();
}
6) Data formats the hardware loves
6.1 Compact-IR (host-pack, device-native)
- Instruction SoA (see the struct sketch after this list):
  - arrays: opcodes[], opnd0[], opnd1[], opnd2[], flags[]
  - operands point to a value table: defs/uses, type class, size.
- CFG: CSR-like arrays: bb_offsets[], succ_idx[], succ_list[].
- Liveness bitsets: per-BB variable bitsets packed to 512-byte stripes.
- Pattern tables: compiled from TableGen/MLIR patterns → DCM rules.
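Putting the pieces above together, one plausible host-side view of a single function in Compact-IR is sketched below; the exact widths and field names are assumptions, not a fixed blob format.
// compact_ir.hpp (illustrative host-side view of one function in Compact-IR)
#include <cstdint>

struct CompactFunction {
  // Instruction SoA: index i describes instruction i.
  const uint16_t* opcodes;     // opcodes[]
  const uint32_t* opnd0;       // handles into the value table
  const uint32_t* opnd1;
  const uint32_t* opnd2;
  const uint16_t* flags;
  uint32_t        num_insts;

  // Value table: def/use info, type class, size per value.
  const uint32_t* value_table;
  uint32_t        num_values;

  // CFG in CSR form: successors of block b are
  // succ_list[succ_idx[b] .. succ_idx[b+1]).
  const uint32_t* bb_offsets;  // first instruction of each block
  const uint32_t* succ_idx;
  const uint32_t* succ_list;
  uint32_t        num_blocks;

  // Liveness bitsets, packed to 512-byte stripes per block.
  const uint64_t* live_in;
  const uint64_t* live_out;
  uint32_t        bitset_words_per_bb;
};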
6.2 Command descriptors
- Ring buffer of commands (RUN_DATAFLOW, BUILD_DOM, DAG_COVER, REG_ALLOC, SCHEDULE, EMIT), each with handle ranges so the hardware can parallelize across functions.
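One way the ring entries could be laid out is sketched below, assuming a fixed-size descriptor per command with a handle range over functions; the opcodes mirror the list above and everything else is illustrative.
// cpua_cmd.hpp (illustrative command-ring descriptor)
#include <cstdint>

enum class CmdOp : uint32_t {
  RUN_DATAFLOW, BUILD_DOM, DAG_COVER, REG_ALLOC, SCHEDULE, EMIT
};

struct CmdDesc {
  CmdOp    op;
  uint32_t flags;          // deterministic mode, debug snapshots, ...
  uint64_t module;         // cpua_mod_t handle
  uint32_t func_begin;     // first function handle in the batch
  uint32_t func_end;       // one past the last; hardware fans out across this range
  uint64_t aux;            // e.g., pipeline bitmap or pattern-table handle
  uint64_t completion;     // doorbell / fence value written when the command retires
};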
7) Microarch scheduling model
- Worklist hardware: lock-free queues per cluster; push/pop for IR blocks and bitset deltas.
- Deterministic option: token-based commit order for reproducibility builds.
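A simplified software model of the deterministic option, assuming each work item carries a sequence token and side effects are committed strictly in token order no matter which cluster finishes first; this is a sketch, not the hardware protocol.
// deterministic_commit.cpp (illustrative token-ordered commit)
#include <cstdint>
#include <functional>
#include <map>

class TokenCommitter {
  uint64_t next_ = 0;                               // next token allowed to commit
  std::map<uint64_t, std::function<void()>> pending_;
public:
  // Workers finish out of order but register their side effects here.
  void finish(uint64_t token, std::function<void()> commit) {
    pending_[token] = std::move(commit);
    while (!pending_.empty() && pending_.begin()->first == next_) {
      pending_.begin()->second();                   // apply in fixed token order
      pending_.erase(pending_.begin());
      ++next_;
    }
  }
};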
8) What gets faster (realistically)
- Lexing/regex/DFA: 5–15× (table-driven engines + SRAM).
- Bitset dataflow (liveness, reaching defs, GVN/DCE aids): 3–8× (wide bit ops + HBM).
- Dominators/DF/SSA: 2–4× (graph kernels in CSK).
- Instruction selection (tree/DAG cover): 2–5× (DCM rules over SoA IR).
- Register allocation: 1.5–3× avg; 5× on pathological hot blocks via bounded solver.
- ThinLTO/whole-program: end-to-end 3–10× on huge monorepos by parallelizing thousands of functions and removing memory stalls.
9) Prototyping path (today, on FPGA)
Board: Xilinx Alveo U55C/U280 or Intel PAC D5005.
Map:
- BSE: HLS-friendly 2048-bit lanes, BRAM-backed bitsets, run-length compression.
- CSK/STU: RTL for stack/union-find heavy kernels (Tarjan, LT).
- DCM: microcoded matcher with CAM for opcode→rule fanout.
- RAE: linear-scan in RTL; optional soft-core for hot-block ILP (or offload it to the host for the prototype).
- DMA: AXI4 with SG lists; ring descriptors.
HLS module stubs (illustrative):
// bitset_engine.hls.cpp (Vitis HLS pseudocode)
// cmd_t, MAX_ROWS, OP_QUIT, apply_transfer(), and enqueue_work() are assumed
// to be defined alongside this kernel.
#include <ap_int.h>
#include <hls_stream.h>

void bitset_meet_apply(
    hls::stream<cmd_t>& cmds,
    ap_uint<2048> bitset_mem[MAX_ROWS]) {
#pragma HLS INTERFACE m_axi depth=... port=bitset_mem ...
#pragma HLS PIPELINE II=1
  while (true) {
    cmd_t c = cmds.read();
    if (c.op == OP_QUIT) break;
    for (int row = c.row0; row < c.row1; ++row) {
#pragma HLS LOOP_FLATTEN off
      ap_uint<2048> in   = bitset_mem[row];
      ap_uint<2048> out  = apply_transfer(in, c.transfer_mask);  // GEN/KILL-style transfer
      ap_uint<2048> meet = out & c.meet_mask;                    // example MEET = AND
      if (meet != in) {                 // row changed: store and propagate
        bitset_mem[row] = meet;
        enqueue_work(row);              // pushes successors onto the worklist
      }
    }
  }
}
// domtree_ltarjan.rtl.sv (sketch)
module domtree_ltarjan(
input clk, rst,
input cmd_valid, input [DESC_W-1:0] desc,
output cmd_ready,
// CSR-like graph ports
...
);
// Implement LT with path compression + buckets in SRAM
endmodule
10) Software: example pass recipes
10.1 Liveness + RA
- Host computes def/use maps (SoA).
- BSE runs iterative liveness to a fixed point (host-side reference sketch after this list).
- RAE performs linear-scan with spill heuristics using live intervals from BSE; optional hot-block window → bounded ILP solver.
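For reference, a minimal host-side version of the liveness fixed point the BSE would run, assuming per-block DEF/USE bitsets stored as plain uint64_t words and CSR-style successor/predecessor lists; this mirrors the classic backward dataflow equations.
// liveness_reference.cpp (backward liveness to a fixed point over bitset words)
#include <cstdint>
#include <deque>
#include <vector>

// live_out[b] = union of live_in[s] over successors s
// live_in[b]  = use[b] | (live_out[b] & ~def[b])
void solve_liveness(uint32_t num_blocks, uint32_t words,
                    const std::vector<std::vector<uint32_t>>& succs,
                    const std::vector<std::vector<uint32_t>>& preds,
                    const std::vector<uint64_t>& def,   // num_blocks * words each
                    const std::vector<uint64_t>& use,
                    std::vector<uint64_t>& live_in,
                    std::vector<uint64_t>& live_out) {
  std::deque<uint32_t> work;
  for (uint32_t b = 0; b < num_blocks; ++b) work.push_back(b);
  while (!work.empty()) {
    uint32_t b = work.front(); work.pop_front();
    bool changed = false;
    for (uint32_t w = 0; w < words; ++w) {
      uint64_t out = 0;
      for (uint32_t s : succs[b]) out |= live_in[s * words + w];
      uint64_t in = use[b * words + w] | (out & ~def[b * words + w]);
      live_out[b * words + w] = out;
      if (in != live_in[b * words + w]) { live_in[b * words + w] = in; changed = true; }
    }
    if (changed)
      for (uint32_t p : preds[b]) work.push_back(p);    // re-examine predecessors
  }
}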
10.2 Instruction selection
- Host emits DCM rules (compiled from target backend).
- DCM performs bottom-up matching; emits selected target ops with cost-minimizing covers; preserves debug/value maps for later passes.
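The flavor of matching the DCM performs can be shown with a toy bottom-up tree-cover DP in the BURS style: each rule rewrites an opcode whose children already reduced to given nonterminals, and every node keeps the cheapest cost per nonterminal (chain rules and DAG sharing omitted). This is a model of the idea, not the device rule format.
// dag_cover_toy.cpp (bottom-up tree covering with per-nonterminal costs)
#include <cstdint>
#include <limits>
#include <vector>

constexpr int kNonterms = 4;                     // e.g., REG, IMM, ADDR, STMT

struct Node { uint16_t opcode; std::vector<int> kids; };

struct Rule {
  uint16_t opcode;                               // root opcode this rule matches
  std::vector<int> kid_nts;                      // required nonterminal per child
  int produces;                                  // nonterminal this rule yields
  int cost;                                      // target cost (from the cost tables)
};

// best[n][nt] = cheapest cost to cover the subtree rooted at node n as nonterminal nt.
void cover(const std::vector<Node>& nodes, const std::vector<Rule>& rules,
           int n, std::vector<std::vector<int>>& best) {
  constexpr int INF = std::numeric_limits<int>::max() / 2;
  best[n].assign(kNonterms, INF);
  for (int k : nodes[n].kids) cover(nodes, rules, k, best);    // match children first
  for (const Rule& r : rules) {
    if (r.opcode != nodes[n].opcode ||
        r.kid_nts.size() != nodes[n].kids.size()) continue;
    long long c = r.cost;
    for (size_t i = 0; i < r.kid_nts.size(); ++i)
      c += best[nodes[n].kids[i]][r.kid_nts[i]];               // cost of covering each child
    if (c < best[n][r.produces]) best[n][r.produces] = (int)c;
  }
}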
11) Determinism & tooling
- Repro mode: deterministic queues + fixed seeds for tie-break.
- Perf counters: per-pass cycles, HBM reads/writes, queue pressure, spill counts.
- Debug: IR snapshot before/after each kernel; on-device checksum for run equivalence.
12) BOM & bring-up (prototype)
- FPGA board (~$8–12k), dev tools (Vitis/Quartus), server with PCIe Gen4/5.
- Driver: user-space via VFIO (Linux), doorbells + MMIO for rings.
- Host side: libcpua.so, LLVM plugin, MLIR dialect to declare acceleratable regions (optional).
13) Bench & success criteria
- Targets: LLVM clang -O2/-O3 on the LLVM test-suite, SPEC CPU, a large C++ monorepo, Rust crates.
- KPIs:
  - Wall-clock compile time reduction (module, LTO, full build).
  - Bitset GUPS (giga-updates/s) on dataflow kernels.
  - Spills reduced (%) and schedule quality (proxying IPC on known µarch models).
  - Power/perf vs. host-only AVX-512 bitset kernels.
14) Risks & mitigations
- IR marshalling cost → pack SoA + batch hundreds of functions per DMA; persistent module residency for ThinLTO.
- Pattern drift across backends → auto-generate DCM rules from TableGen/MLIR backends during build.
- RA corner cases → timebox to hardware; auto-fallback per-block to host RA if constraints violated.
You don’t need ML/tensor units. A memory-rich, MIMD + fixed-function graph/dataflow accelerator with HBM will move the needle on real compilers:
- Build the Bitset Engine, CFG/SSA kernels, DAG Cover, Reg-Alloc, and List Scheduler in hardware.
- Feed it SoA-packed IR over PCIe/CXL.
- Expect 3–10× wall-clock wins on big builds, with 2–8× on the heaviest passes.