The Compiler Processing Unit (CPUA): Reimagining Compilation as a Hardware Problem

The Compiler Processing Unit (CPUA) is a pure graph/dataflow accelerator with massive memory bandwidth. Below is a concrete, buildable spec you can prototype on an FPGA and later harden into an ASIC.

(CPUA here = Compiler Processing Unit, not the host CPU.)

1) High-level objective

Accelerate the mid-end and backend of compilers (LLVM/MLIR/GCC) by offloading:

  • Bitset-style dataflow analyses (reachability, liveness, dominators, alias approximations)
  • CFG/SSA graph transforms (construction, renaming, φ-resolution, SCCs)
  • Instruction selection via tree/DAG covering
  • Register allocation (hybrid linear-scan + bounded ILP for hot blocks)
  • List scheduling with resource/latency tables
All of the above is tuned for pointer-heavy, irregular workloads where latency + bandwidth beat FLOPs.

2) Board & package

  • Form factor: PCIe Gen5 x16 card; optional CXL 3.0 Type-2 device for shared-memory attach.
  • On-package memory: 32–128 GB HBM3/3e (≥1.5 TB/s aggregate BW). This is the main performance lever.
  • On-die SRAM: 64–256 MB distributed across clusters for hot sets/bitsets/queues.
  • Coherency: With CXL, expose a coherent window; with PCIe, use ATS/PRI or bounce buffers.

3) Compute fabric (no tensor units—just graph/dataflow)

3.1 MIMD micro-cores
  • Count: 256–1024 tiny out-of-order-ish or wide-issue VLIW cores.
  • ISA: 64-bit scalar + rich bit-manipulation, byte-lane ops, and predication.
  • Threading: 4–8 HW threads/core (latency hiding).
  • Local scratch: 128–512 KB/core SRAM with 1-cycle bitset ops (e.g., 2048-bit wide AND/OR/XOR/POPCOUNT via microcoded wide slices).
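
To make the 1-cycle bitset claim concrete, here is a minimal software reference model of one 2048-bit slice; the Bitset2048 name and the 32×64-bit word layout are illustrative, not part of the spec:

// bitset2048_ref.cpp -- software reference model of one wide SRAM slice
// (illustrative; the hardware does each op in ~1 cycle per slice)
#include <array>
#include <bit>      // std::popcount (C++20)
#include <cstdint>

struct Bitset2048 {
  std::array<uint64_t, 32> w{};   // 32 x 64 = 2048 bits

  Bitset2048 operator&(const Bitset2048& o) const {
    Bitset2048 r;
    for (int i = 0; i < 32; ++i) r.w[i] = w[i] & o.w[i];
    return r;
  }
  Bitset2048 operator|(const Bitset2048& o) const {   // XOR is analogous
    Bitset2048 r;
    for (int i = 0; i < 32; ++i) r.w[i] = w[i] | o.w[i];
    return r;
  }
  int popcount() const {                              // block-popcount
    int n = 0;
    for (uint64_t x : w) n += std::popcount(x);
    return n;
  }
};
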
3.2 Fixed-function graph/dataflow blocks

All are programmable by tables/descriptors:

  • Bitset Engine (BSE):
    • Primitive: OUT = f(IN, TRANSFER, MEET) over very wide bitsets (1k–64k bits).
    • Accels: block-popcount, streaming AND/OR, sparse-run compression, worklist scheduler in hardware. (A software reference model appears after this list.)
  • CFG/SSA Kernel (CSK):
    • Dominators (Lengauer–Tarjan), post-dominators, dominance frontiers, RPO/PO numbering.
    • SSA construct/rename, φ-pruning, copy-coalescing hints.
  • SCC/Topo Unit (STU):
    • Tarjan/Kosaraju variants, Kahn topological sort, loop nesting tree build.
  • DAG Cover/Matcher (DCM):
    • Bottom-up tree/DAG pattern matching with programmable rules (imports from LLVM .td / TableGen).
    • Cost tables in SRAM; supports chain rules, commutativity, target predicates.
  • Register Allocation Engine (RAE):
    • Linear-scan with live-range splitting & spill heuristics;
    • Optional bounded ILP/SAT micro-solver for hot blocks (timeboxed, e.g., 50–200 µs).
  • List Scheduler (LSU):
    • Resource tables (ports, FU counts, latencies), anti/output deps; tie-breaking by critical path/slack.
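
As a software reference for the BSE's fixed-point loop (promised under the BSE bullet), the sketch below runs a reaching-definitions-style forward analysis with MEET = OR; sets are a single 64-bit word here, where the hardware uses 1k–64k-bit rows:

// bse_ref.cpp -- software reference for the BSE worklist loop (toy 64-bit sets)
#include <cstdint>
#include <deque>
#include <vector>

struct Block {
  uint64_t gen = 0, kill = 0;   // transfer: OUT = (IN & ~kill) | gen
  std::vector<int> succs;
};

// Forward may-analysis (reaching definitions), MEET = union (OR).
std::vector<uint64_t> solve(const std::vector<Block>& g) {
  std::vector<uint64_t> in(g.size(), 0), out(g.size(), 0);
  std::deque<int> work;
  for (int i = 0; i < (int)g.size(); ++i) work.push_back(i);
  while (!work.empty()) {
    int b = work.front(); work.pop_front();
    uint64_t o = (in[b] & ~g[b].kill) | g[b].gen;   // apply transfer
    if (o == out[b]) continue;                      // no delta: done
    out[b] = o;
    for (int s : g[b].succs) {                      // push successors
      uint64_t ni = in[s] | o;                      // MEET = OR
      if (ni != in[s]) { in[s] = ni; work.push_back(s); }
    }
  }
  return out;
}
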

4) Memory hierarchy (the star of the show)

  • HBM arena allocator: IR objects (basic blocks, instructions, operands) packed in SoA layouts for stride-friendly prefetching.
  • Pointer compression: 32-bit region-relative handles; accelerators understand handles natively (see the sketch after this list).
  • Bitset store: Row-major chunks aligned to 512B; hardware supports range masks (operate on subsets).
  • Symbol/str tables: Hardware hash + de-dup to accelerate LTO symbol churn.
  • DMA descriptors: Scatter/gather for IR import/export; batch many functions/modules at once.
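
The handle scheme in miniature (the 8-bit-region/24-bit-offset split below is an assumption; any split that covers the arena works):

// handle.h -- 32-bit region-relative handles (sketch)
#include <cstdint>

struct Handle {
  uint32_t bits;   // [31:24] region id, [23:0] offset in 8-byte units (128 MB/region)
  static Handle make(uint32_t region, uint32_t off8) {
    return Handle{(region << 24) | (off8 & 0xFFFFFFu)};
  }
  uint32_t region() const { return bits >> 24; }
  uint32_t byte_offset() const { return (bits & 0xFFFFFFu) * 8; }
};

// Decompression: region-id-indexed base table, kept in on-device SRAM.
inline void* deref(void* const region_base[], Handle h) {
  return static_cast<char*>(region_base[h.region()]) + h.byte_offset();
}
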

5) Programming & host integration

5.1 Offload API (C header)
// libcpua.h (Compiler Processing Unit Accelerator)
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpua_mod_t;
typedef uint64_t cpua_func_t;

typedef enum { 
  CPUA_OK=0, CPUA_EBUSY, CPUA_EINVAL, CPUA_ENOMEM, CPUA_ETIME 
} cpua_status_t;

typedef struct {
  uint32_t passes_bitmap;  // e.g., bit0=GVN, bit1=SROA, ...
  uint32_t flags;          // debug, deterministic, etc.
} cpua_pipeline_t;

typedef struct {
  const void* ir_blob;     // compact SoA IR (see §6.1)
  size_t      ir_bytes;
  uint32_t    target_id;   // e.g., x86_64_zen4, armv9_neon, rv64gc
  uint32_t    opt_level;   // 0..3
} cpua_module_desc_t;

cpua_status_t cpua_init(void);
cpua_status_t cpua_load_module(const cpua_module_desc_t*, cpua_mod_t* out);
cpua_status_t cpua_run_pipeline(cpua_mod_t, const cpua_pipeline_t*);
cpua_status_t cpua_inst_select(cpua_mod_t);
cpua_status_t cpua_reg_alloc(cpua_mod_t, uint32_t strategy_flags);
cpua_status_t cpua_schedule(cpua_mod_t, uint32_t uarch_id);
cpua_status_t cpua_emit_obj(cpua_mod_t, void** out_ptr, size_t* out_bytes);
void          cpua_free_module(cpua_mod_t);
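
A minimal host-side usage sketch; error handling is abbreviated, the IR blob is assumed to already be in Compact-IR form (§6.1), and the zero-valued target/µarch IDs are placeholders:

// example_offload.c -- minimal libcpua call sequence (sketch)
#include <stdio.h>
#include "libcpua.h"

int compile_blob(const void* blob, size_t bytes) {
  if (cpua_init() != CPUA_OK) return -1;            // no board: host fallback

  cpua_module_desc_t desc = {
    .ir_blob   = blob,
    .ir_bytes  = bytes,
    .target_id = 0,   // placeholder, e.g. x86_64_zen4
    .opt_level = 2,
  };
  cpua_mod_t mod;
  if (cpua_load_module(&desc, &mod) != CPUA_OK) return -1;

  cpua_pipeline_t pipe = { .passes_bitmap = 0x3, .flags = 0 };  // GVN | SROA
  cpua_run_pipeline(mod, &pipe);
  cpua_inst_select(mod);
  cpua_reg_alloc(mod, 0 /* default strategy */);
  cpua_schedule(mod, 0 /* µarch id placeholder */);

  void* obj = NULL; size_t obj_bytes = 0;
  cpua_status_t st = cpua_emit_obj(mod, &obj, &obj_bytes);
  if (st == CPUA_OK) printf("emitted %zu bytes\n", obj_bytes);
  cpua_free_module(mod);
  return st == CPUA_OK ? 0 : -1;
}
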
5.2 LLVM/MLIR glue (outline)
  • LLVM pass: CPUAOffloadPass buckets functions by size/shape, serializes IR into Compact-IR (SoA), and launches pipelines via libcpua.
  • Target hooks: extend TargetTransformInfo to advertise DCM rules and RAE knobs the device supports.
  • Fallback path: If the board is busy/absent, run stock LLVM passes.

Minimal skeleton:

// Pseudocode
PreservedAnalyses CPUAOffloadPass::run(Module &M, ModuleAnalysisManager &AM) {
  auto Bins = bucketize(M);                  // group funcs by size/CFG shape
  for (auto &Bin : Bins) {
    cpua_module_desc_t Desc = serialize_to_compact_ir(Bin);  // §6.1 SoA blob
    cpua_mod_t Mod;
    if (cpua_load_module(&Desc, &Mod) != CPUA_OK)
      continue;                              // board busy/absent: stock passes
    cpua_run_pipeline(Mod, &PipelineSpec);
    cpua_inst_select(Mod);
    cpua_reg_alloc(Mod, RA_LINEAR_SCAN | RA_BOUNDED_ILP);
    cpua_schedule(Mod, UArchId);
    void *Obj; size_t ObjBytes;
    if (cpua_emit_obj(Mod, &Obj, &ObjBytes) == CPUA_OK)
      link_replace(Bin, Obj, ObjBytes);      // splice device output back into M
    cpua_free_module(Mod);
  }
  return PreservedAnalyses::none();
}

6) Data formats the hardware loves

6.1 Compact-IR (host-pack, device-native)
  • Instruction SoA:
    • arrays: opcodes[], opnd0[], opnd1[], opnd2[], flags[]
    • operands point to value table: defs/uses, type class, size.
  • CFG: CSR-like arrays: bb_offsets[], succ_idx[], succ_list[] (see the struct sketch after this list).
  • Liveness bitsets: per-BB variable bitsets packed to 512-byte stripes.
  • Pattern tables: compiled from TableGen/MLIR patterns → DCM rules.
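
In C terms, the layout above looks roughly like this (the struct packaging is illustrative; the array names follow the list):

// compact_ir.h -- Compact-IR module layout (sketch)
#include <stdint.h>

typedef struct {
  // Instruction SoA: parallel arrays, one entry per instruction.
  uint16_t* opcodes;
  uint32_t* opnd0;      // indices into the value table
  uint32_t* opnd1;
  uint32_t* opnd2;
  uint8_t*  flags;
  uint32_t  n_insts;

  // Value table: defs/uses, type class, size.
  uint32_t* def_inst;
  uint8_t*  type_class;
  uint8_t*  value_size;
  uint32_t  n_values;

  // CFG, CSR form: successors of block i live at
  //   succ_list[bb_offsets[i] .. bb_offsets[i+1]).
  uint32_t* bb_offsets; // n_blocks + 1 entries
  uint32_t* succ_idx;   // per-edge index array (as listed above)
  uint32_t* succ_list;
  uint32_t  n_blocks;
} compact_ir_t;
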
6.2 Command descriptors
  • Ring buffer of commands (RUN_DATAFLOW, BUILD_DOM, DAG_COVER, REG_ALLOC, SCHEDULE, EMIT), each with handle ranges so the hardware can parallelize across functions.
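
One plausible slot encoding for that ring; the field layout is an assumption, only the opcode set comes from the bullet above:

// cmd_ring.h -- command descriptor slot (sketch)
#include <stdint.h>

typedef enum {
  CMD_RUN_DATAFLOW, CMD_BUILD_DOM, CMD_DAG_COVER,
  CMD_REG_ALLOC, CMD_SCHEDULE, CMD_EMIT
} cpua_cmd_op_t;

typedef struct {          // one 32-byte ring slot
  uint32_t op;            // cpua_cmd_op_t
  uint32_t flags;
  uint64_t func_first;    // handle range: the hardware fans one command
  uint64_t func_last;     //   out across many functions in parallel
  uint64_t table;         // transfer/meet, rule, or resource table handle
} cpua_cmd_t;
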

7) Microarch scheduling model

  • Worklist hardware: lock-free queues per cluster; push/pop for IR blocks and bitset deltas.
  • Deterministic option: token-based commit order for reproducibility builds.
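
The deterministic option in miniature: work gets a token at push time and results retire strictly in token order, so output is reproducible no matter which cluster finishes first (all names illustrative):

// det_commit.cpp -- token-ordered commit (sketch)
#include <cstdint>
#include <map>
#include <utility>

struct Result { uint64_t payload; };  // stand-in for an IR delta

class DetCommitter {
  uint64_t next_push_ = 0;            // token handed out at push time
  uint64_t next_commit_ = 0;          // token that may retire next
  std::map<uint64_t, Result> done_;   // finished out of order, parked

public:
  uint64_t push() { return next_push_++; }

  // Workers call this on completion; commit() fires in token order only.
  template <class Sink>
  void finish(uint64_t token, Result r, Sink&& commit) {
    done_.emplace(token, std::move(r));
    while (!done_.empty() && done_.begin()->first == next_commit_) {
      commit(std::move(done_.begin()->second));
      done_.erase(done_.begin());
      ++next_commit_;
    }
  }
};
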

8) What gets faster (realistically)

  • Lexing/regex/DFA: 5–15× (table-driven engines + SRAM).
  • Bitset dataflow (liveness, reaching defs, GVN/DCE aids): 3–8× (wide bit ops + HBM).
  • Dominators/DF/SSA: 2–4× (graph kernels in CSK).
  • Instruction selection (tree/DAG cover): 2–5× (DCM rules over SoA IR).
  • Register allocation: 1.5–3× avg; 5× on pathological hot blocks via bounded solver.
  • ThinLTO/whole-program: end-to-end 3–10× on huge monorepos by parallelizing thousands of functions and removing memory stalls.

9) Prototyping path (today, on FPGA)

Board: Xilinx Alveo U55C/U280 or Intel PAC D5005.
Map:

  • BSE: HLS-friendly 2048-bit lanes, BRAM-backed bitsets, run-length compression.
  • CSK/STU: RTL for stack/union-find heavy kernels (Tarjan, LT).
  • DCM: microcoded matcher with CAM for opcode→rule fanout.
  • RAE: linear-scan in RTL; optional soft-core for hot-block ILP (or fall back to the host in the prototype).
  • DMA: AXI4 with SG lists; ring descriptors.

HLS module stubs (illustrative):

// bitset_engine.hls.cpp (Vitis HLS pseudocode)
#include <ap_int.h>
#include <hls_stream.h>

void bitset_meet_apply(
  hls::stream<cmd_t>& cmds,
  ap_uint<2048> bitset_mem[MAX_ROWS]) {

  #pragma HLS INTERFACE m_axi depth=... port=bitset_mem ...
  #pragma HLS PIPELINE II=1

  while (true) {
    cmd_t c = cmds.read();
    if (c.op == OP_QUIT) break;
    for (int row=c.row0; row<c.row1; ++row) {
      #pragma HLS LOOP_FLATTEN off
      ap_uint<2048> in  = bitset_mem[row];
      ap_uint<2048> out = apply_transfer(in, c.transfer_mask);
      ap_uint<2048> meet = out & c.meet_mask;    // example MEET = AND
      if (meet != in) {
        bitset_mem[row] = meet;
        enqueue_work(row); // pushes successors
      }
    }
  }
}
// domtree_ltarjan.rtl.sv (sketch)
module domtree_ltarjan(
  input clk, rst,
  input cmd_valid, input [DESC_W-1:0] desc,
  output cmd_ready,
  // CSR-like graph ports
  ...
);
// Implement LT with path compression + buckets in SRAM
endmodule

10) Software: example pass recipes

10.1 Liveness + RA
  1. Host computes def/use maps (SoA).
  2. BSE runs iterative liveness to fixed point.
  3. RAE performs linear-scan with spill heuristics using live intervals from BSE; an optional hot-block window goes to the bounded ILP solver (sketch below).
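
The sketch referenced in step 3: classic linear scan over the BSE-produced intervals, spilling whichever live interval ends furthest away (splitting and the bounded-ILP window are omitted for brevity):

// linear_scan.cpp -- step 3 in miniature (no splitting, no ILP window)
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

struct Interval { int start, end, vreg; int preg = -1; bool spilled = false; };

void linear_scan(std::vector<Interval>& iv, int num_regs) {
  std::sort(iv.begin(), iv.end(),
            [](const Interval& a, const Interval& b) { return a.start < b.start; });
  auto by_end = [](Interval* a, Interval* b) {
    return a->end != b->end ? a->end < b->end : a->vreg < b->vreg;
  };
  std::set<Interval*, decltype(by_end)> active(by_end);
  std::vector<int> free_regs;
  for (int r = num_regs - 1; r >= 0; --r) free_regs.push_back(r);

  for (auto& cur : iv) {
    // Expire intervals that ended before cur starts; reclaim their registers.
    while (!active.empty() && (*active.begin())->end < cur.start) {
      free_regs.push_back((*active.begin())->preg);
      active.erase(active.begin());
    }
    if (!free_regs.empty()) {
      cur.preg = free_regs.back(); free_regs.pop_back();
      active.insert(&cur);
    } else {
      Interval* victim = *std::prev(active.end());   // furthest-ending interval
      if (victim->end > cur.end) {                   // steal its register
        cur.preg = victim->preg;
        victim->preg = -1; victim->spilled = true;
        active.erase(std::prev(active.end()));
        active.insert(&cur);
      } else {
        cur.spilled = true;                          // spill cur itself
      }
    }
  }
}
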
10.2 Instruction selection
  • Host emits DCM rules (compiled from target backend).
  • DCM performs bottom-up matching; emits selected target ops with cost-minimizing covers; preserves debug/value maps for later passes.
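
In miniature, the DCM's matching looks like the toy below: two rules (a generic one-op cover and a fused multiply-add) compete on cost at each node of an expression tree. Real rule tables are compiled from TableGen and add predicates, chain rules, and commutativity:

// dag_cover.cpp -- bottom-up tree cover in miniature (toy rules)
#include <climits>
#include <vector>

enum Op { ADD, MUL, LOAD, CONST };

struct Node  { Op op; int lhs = -1, rhs = -1; };       // index-linked tree
struct Match { int cost; const char* rule; };

// Returns the min cost to cover node n; best[n] records the winning rule.
// Call with best pre-sized to t.size() and costs initialized to INT_MAX.
int cover(const std::vector<Node>& t, int n, std::vector<Match>& best) {
  if (best[n].cost != INT_MAX) return best[n].cost;    // memoized
  const Node& nd = t[n];
  auto child = [&](int c) { return c < 0 ? 0 : cover(t, c, best); };

  // Rule 1: generic single-op cover, cost 1 plus the children's covers.
  best[n] = { 1 + child(nd.lhs) + child(nd.rhs), "generic" };

  // Rule 2: fused multiply-add covers ADD(MUL(a,b), c) in one instruction.
  if (nd.op == ADD && nd.lhs >= 0 && t[nd.lhs].op == MUL) {
    const Node& m = t[nd.lhs];
    int c2 = 1 + child(m.lhs) + child(m.rhs) + child(nd.rhs);
    if (c2 < best[n].cost) best[n] = { c2, "madd" };
  }
  return best[n].cost;
}
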

11) Determinism & tooling

  • Repro mode: deterministic queues + fixed seeds for tie-breaking.
  • Perf counters: per-pass cycles, HBM reads/writes, queue pressure, spill counts.
  • Debug: IR snapshot before/after each kernel; on-device checksum for run equivalence.
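
The run-equivalence checksum can be order-independent, so a device run and a host run compare equal even when they commit in different internal orders; a sketch with an arbitrary mixing constant:

// run_checksum.cpp -- order-independent IR checksum (sketch)
#include <cstdint>
#include <vector>

struct Inst { uint16_t opcode; uint32_t opnd0, opnd1, opnd2; };

uint64_t checksum(const std::vector<Inst>& ir) {
  uint64_t sum = 0;
  for (const Inst& i : ir) {
    uint64_t h = i.opcode;
    h = h * 0x9E3779B97F4A7C15ull ^ i.opnd0;   // splitmix-style mixing
    h = h * 0x9E3779B97F4A7C15ull ^ i.opnd1;
    h = h * 0x9E3779B97F4A7C15ull ^ i.opnd2;
    sum += h;                                  // '+' makes order irrelevant
  }
  return sum;
}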

12) BOM & bring-up (prototype)

  • FPGA board (~$8–12k), dev tools (Vitis/Quartus), server with PCIe Gen4/5.
  • Driver: user-space via VFIO (Linux), doorbells + MMIO for rings.
  • Host side: libcpua.so, LLVM plugin, MLIR dialect to declare acceleratable regions (optional).

13) Bench & success criteria

  • Targets: clang -O2/-O3 on the LLVM test-suite, SPEC CPU, a large C++ monorepo, and Rust crates.
  • KPIs:
    • Wall-clock compile time reduction (module, LTO, full build).
    • Bitset GUPS (giga-updates/s) on dataflow kernels.
    • Spills reduced (%) and schedule quality (proxying IPC on known µarch models).
    • Power/perf vs. host-only AVX-512 bitset kernels.

14) Risks & mitigations

  • IR marshalling cost → pack SoA + batch hundreds of functions per DMA; persistent module residency for ThinLTO.
  • Pattern drift across backends → auto-generate DCM rules from TableGen/MLIR backends during build.
  • RA corner cases → timebox to hardware; auto-fallback per-block to host RA if constraints violated.

You don’t need ML/tensor units. A memory-rich, MIMD + fixed-function graph/dataflow accelerator with HBM will move the needle on real compilers:

  • Build the Bitset Engine, CFG/SSA kernels, DAG Cover, Reg-Alloc, and List Scheduler in hardware.
  • Feed it SoA-packed IR over PCIe/CXL.
  • Expect 3–10× wall-clock wins on big builds, with 2–8× on the heaviest passes.