Plug-In Ternary Power: Designing a Trinary GPU-Class Accelerator for PCs
1) What “ternary” are we building?
- Number system: Balanced ternary (trits ∈ {−1, 0, +1}).
  Why: natural signed representation, simpler normalization, fewer long carry chains, and elegant multiply/accumulate for AI with weights in {−1,0,+1}.
- Logic encoding (on-chip): Choose one and stick to it end-to-end:
- Multi-Level Voltage (MLV): V−, V0, V+. (Fastest route on silicon; needs good analog margins + calibration.)
- Current-mode ternary (three distinct currents; great at speed, more analog design work).
- 1-of-3 (ternary unary) wires (robust, but triples wires; use for short/critical control paths only).
For a first-generation card, I’d pick MLV on chip and binary on the external bus, translating at the bridge.
2) Card from 10,000 feet (blocks & data flow)
[Host CPU/RAM]
│ PCIe Gen4/5 x8/x16 (binary)
▼
┌─────────────────────────────────────────────────────────┐
│ TRINARY ACCELERATOR (PCIe add-in) │
│ │
│ A. PCIe & Bridge │
│ • PCIe Controller + DMA engines (binary) │
│ • Pack/Unpack & Encode/Decode: bits ⇄ trits │
│ • Command Queues, Doorbells, Interrupts │
│ │
│ B. Control Complex │
│ • Small binary MCU (RISC-V/Arm) for mgmt, firmware │
│ • Calibration DAC/ADC for V−/V0/V+ thresholds │
│ • Telemetry: temps, voltages, error rates │
│ │
│ C. Ternary Compute Arrays │
│ • Many-core TPUs* with ternary ALUs, MACs, SFUs │
│ (*TPU = Ternary Processing Unit here) │
│ • Scratchpad TRAM (ternary SRAM or emulated) │
│ • SIMD/SIMT style lanes; warp/wavefront scheduler │
│ │
│ D. Ternary Memory Complex │
│ • On-package TRAM banks (true 3-state cells if avail) │
│ else dual-bit SRAM emulating trits (pragmatic) │
│ • Optional HBM (binary) with on-the-fly ternary codec │
│ │
│ E. Interconnect │
│ • Ternary NoC (3-level links + repeaters) │
│ • Crossbar or mesh; reduction trees for dot-products │
└─────────────────────────────────────────────────────────┘
Philosophy: keep the outside world boring (PCIe + a normal driver), and put all the ternary magic behind a bridge that packs/unpacks, schedules kernels, and manages calibration.
3) Host interface & software stack (what you’d actually code)
3.1 Driver & runtime
- Kernel driver (Linux/Windows):
  - Maps BAR registers (queues, doorbells, status).
  - Provides IOCTLs for memory alloc, ternary kernel launch, DMA.
  - Interrupts or eventfd for completion.
- User-space runtime (libtrix):
  - Ternary memory allocators (pinned host buffers + on-card TRAM).
  - Data packers: pack_trits(), unpack_trits().
  - Kernel launcher: trixLaunch(grid, block, args...).
  - Stream API (like CUDA streams) for overlap of compute and DMA.
- DSL / compiler front-end:
  - Option A (fastest to ship): an LLVM dialect for balanced ternary with intrinsics (btrit, btrit_vec<N>, btrit_dot).
  - Option B: a Pythonic DSL (NumPy-like) that JITs via MLIR to the ISA.
3.2 Programming model
- SIMD/SIMT kernels running on TPUs:
  - Intrinsics (reference semantics sketched at the end of this section):
    - tadd3(a,b) → balanced ternary add (−1,0,+1 with end-around carry rules).
    - tmul3(a,b) → ternary multiply (table-based, 1 cycle).
    - tmac3(acc, a, b) → fused ternary MAC with saturate/normalize.
    - tsgn(x) → sign/tritization (map real to {−1,0,+1}).
    - tdot3(vecA, vecB) → vector dot product with ternary partials + reduction tree.
  - Collectives: warp-level ternary reductions, prefix ops.
- Key accelerators:
  - TNN/TBMM (Ternary Neural Networks / Block Matrix Multiply).
  - Ternary convolution, graph ops on {−1,0,+1} edge weights.
  - Balanced-ternary big-int primitives (fast, carry-limited add/sub).
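To pin down what these intrinsics compute, here is a minimal Python reference model of tsgn, tmul3, tmac3, and tdot3 (illustrative only: trits are plain ints in {−1,0,+1}, real lanes evaluate elements in parallel, and tadd3's carry rules are covered in section 5):

    def tsgn(x, t0=-0.5, t1=0.5):
        """Tritize a real value: x < t0 -> -1, x > t1 -> +1, else 0."""
        return -1 if x < t0 else (1 if x > t1 else 0)

    def tmul3(a, b):
        """Ternary multiply: a 9-entry LUT in hardware, a plain product here."""
        return a * b  # a, b in {-1,0,+1} -> product in {-1,0,+1}

    def tmac3(acc, a, b):
        """Fused MAC; acc is a wider (binary or wide-ternary) accumulator."""
        return acc + tmul3(a, b)

    def tdot3(vec_a, vec_b):
        """Dot product: ternary partials, reduced by a tree in hardware."""
        acc = 0
        for a, b in zip(vec_a, vec_b):
            acc = tmac3(acc, a, b)
        return acc

The deliberately plain integer accumulator mirrors the note in section 5 that dot-product accumulators can be higher radix than the operands.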
4) Encoding & packing (binary world ⇄ trits)
You need efficient packing to keep PCIe from being the bottleneck.
- Information-theoretic note: 1 trit ≈ log2(3) ≈ 1.585 bits.
- Practical pack: pack 5 trits → 8 bits (since 3^5=243 < 256).
Packing cost: 1.6 bits/trit vs the ~1.585-bit ideal, i.e. ~99.1% of the information capacity is used.
Host-side reference pack (conceptual):
# packs an array of trits in {-1,0,+1} (map to {0,1,2}) into bytes, 5 trits per byte
def pack5(trits):  # len(trits) must be a multiple of 5
    out = bytearray()
    for i in range(0, len(trits), 5):
        q = 0
        for t in trits[i:i+5]:
            q = q*3 + (t+1)   # map -1,0,1 -> 0,1,2
        out.append(q)         # 0..242
    return bytes(out)

def unpack5(bytes_in):
    trits = []
    for b in bytes_in:
        q = b
        v = [0]*5
        for i in range(4, -1, -1):
            v[i] = (q % 3) - 1  # map 0,1,2 -> -1,0,1
            q //= 3
        trits.extend(v)
    return trits
On the card, the bridge reverses this pack into lane-width trit vectors for the compute arrays.
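A quick round-trip check of the reference packer (in this sketch, padding the trit stream to a multiple of 5 is the caller's job):

    import random

    trits = [random.choice((-1, 0, 1)) for _ in range(20)]  # multiple of 5
    packed = pack5(trits)
    assert len(packed) == len(trits) // 5  # 5 trits per byte
    assert unpack5(packed) == trits        # lossless round trip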
5) Ternary arithmetic (balanced) essentials
- Digits: d_i ∈ {−1, 0, +1}; value = Σ_i d_i·3^i.
- Addition: carry tends to localize; you can use signed-digit end-around carry rules that shorten worst-case propagation.
- Multiply: 9-entry LUT per trit pair; multi-trit multiply accumulates with ternary compressors (3:1, 4:1 trees).
- Normalization: if a digit falls outside {−1,0,1}, push ±1 to the next position (cheap in hardware).
- Dot-product: perfect for TNN: weights/activations in {−1,0,+1}. Accumulator can be higher radix (binary or wide ternary).
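To make the digit and normalization rules concrete, here is a small Python sketch: integer ↔ balanced-ternary conversion plus an adder that pushes ±1 carries to the next position (digit lists are least-significant-first):

    def to_balanced(n):
        """Integer -> balanced-ternary digits, least significant first."""
        digits = []
        while n != 0:
            r = ((n + 1) % 3) - 1        # remainder forced into {-1, 0, +1}
            digits.append(r)
            n = (n - r) // 3
        return digits or [0]

    def from_balanced(digits):
        """Value = sum over i of d_i * 3^i."""
        return sum(d * 3**i for i, d in enumerate(digits))

    def tadd(a, b):
        """Digit-wise add; out-of-range digits push a +/-1 carry upward."""
        out, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            d = ((s + 1) % 3) - 1        # normalize the digit into {-1, 0, +1}
            carry = (s - d) // 3         # carry never exceeds +/-1
            out.append(d)
        if carry:
            out.append(carry)
        return out

    assert from_balanced(tadd(to_balanced(17), to_balanced(-5))) == 12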
6) Microarchitecture of a Ternary Processing Unit (TPU)
- Front-end: warp scheduler (binary FSM); instruction fetch from a small I-cache (binary); instruction words carry ternary opcodes/fields to the decode stage.
- Decode: maps opcodes to ternary ALU/MAC control signals.
- Execution lanes:
- T-ALU: add/sub/abs/compare in balanced ternary.
- T-MAC: (a*b)+acc with 1-cycle LUT multiply + 2-3 stage ternary compressor tree.
  - T-SFU: signum, clip, ternary activation (e.g., tanh → {−1,0,+1} via thresholds).
- Register file: multi-ported; each register holds W trits (e.g., 128).
- Lane width example: 64 lanes × 128-trit vectors = 8192 trits/inst.
- Scratchpad TRAM: 64–256 KB per core cluster; banks with ternary sense amps.
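A toy model of one T-MAC step makes the datapath explicit: the 9-entry multiply LUT per lane, then a grouped reduction standing in for the 3:1 compressor stages (sizes here are illustrative):

    # 9-entry multiply LUT, enumerated the way hardware would store it.
    TMUL_LUT = {(a, b): a * b for a in (-1, 0, 1) for b in (-1, 0, 1)}

    def tmac_lane(acc, a_vec, b_vec):
        """acc += dot(a_vec, b_vec): LUT partials, then 3:1 tree reduction."""
        partials = [TMUL_LUT[(a, b)] for a, b in zip(a_vec, b_vec)]
        while len(partials) > 1:  # each pass models one compressor stage
            partials = [sum(partials[i:i+3]) for i in range(0, len(partials), 3)]
        return acc + partials[0]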
7) Memory options (realistic → ambitious)
- Emulated ternary SRAM (ship-now option)
Store each trit in 2 binary bits (00→−1, 01→0, 10→+1, 11→illegal/ECC).
Pros: easy, fast, tolerant. Cons: wastes area. (Modeled in the sketch below.)
- True 3-state SRAM
8T/10T cell variants with multi-threshold inverters and a ternary sense amp.
Pros: density win, “real ternary.” Cons: device & PVT margin work.
- Non-volatile multi-level (PCM/ReRAM) with 3 stable levels for cheap, dense TRAM (higher latency, great density).
ECC: Use ternary BCH or Hamming-like codes; or store a parity trit per K trits; in emulated mode, use binary ECC over packed bytes.
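The ship-now encoding is trivial to model, and the illegal 11 state doubles as a free corruption detector (a sketch, not production ECC):

    # 2-bit emulated trit cells: 00 -> -1, 01 -> 0, 10 -> +1, 11 -> illegal.
    ENC = {-1: 0b00, 0: 0b01, +1: 0b10}
    DEC = {v: k for k, v in ENC.items()}

    def encode_trits(trits):
        """Pack trits into an int, 2 bits per trit, LSB-first."""
        word = 0
        for i, t in enumerate(trits):
            word |= ENC[t] << (2 * i)
        return word

    def decode_trits(word, n):
        """Unpack n trits; the illegal 11 pattern flags corruption."""
        out = []
        for i in range(n):
            bits = (word >> (2 * i)) & 0b11
            if bits == 0b11:
                raise ValueError(f"illegal cell state at trit {i} (ECC event)")
            out.append(DEC[bits])
        return out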
8) Calibration & reliability (the analog reality)
- Per-die DACs set V−, V0, V+ references; on-die ADCs and PRBS test patterns measure error rates.
- Temperature drift compensation: background calibration while compute runs (dither a spare lane).
- Margins: aim for ≥6σ separation between levels at worst corner; dynamic voltage swing scaling under load.
- Soft-error handling: retry, scrub, and auto-re-tritize noisy buffers (re-threshold to {-1,0,1}).
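Conceptually, the calibration loop maintains decision thresholds halfway between adjacent references; re-tritizing a noisy buffer is then a pair of compares (the voltages below are placeholders, not silicon numbers):

    V_NEG, V_ZERO, V_POS = -0.4, 0.0, +0.4   # calibrated level references
    TH_LO = (V_NEG + V_ZERO) / 2             # decision point: -1 vs 0
    TH_HI = (V_ZERO + V_POS) / 2             # decision point: 0 vs +1

    def retritize(sample_v):
        """Map a noisy analog sample back to a clean trit."""
        return -1 if sample_v < TH_LO else (1 if sample_v > TH_HI else 0)

    assert (retritize(-0.35), retritize(0.05), retritize(0.31)) == (-1, 0, 1)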
9) PCIe bridge & queues (how work gets onto the card)
- Submission queues in host memory (ring buffers). Each entry (struct sketch below):
  - op: copy H→D, copy D→H, kernel, memset(trit), barrier
  - ptrs: src/dst host/device addresses
  - sizes: packed bytes & trit counts
  - grid/block: execution geometry
  - flags: stream id, priority, fences
- Doorbells: mmio write to poke the bridge.
- Completions: MSI-X interrupts or polled CQE.
Zero-copy option: pin a host buffer; bridge does on-the-fly ternary decode into the compute fabric.
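One plausible host-side layout for that entry, using Python's struct module (field widths and ordering are illustrative, not a frozen ABI):

    import struct
    from enum import IntEnum

    class Op(IntEnum):
        COPY_HTD = 0
        COPY_DTH = 1
        KERNEL = 2
        MEMSET_TRIT = 3
        BARRIER = 4

    # 56-byte entry: op, flags, src, dst, byte count, trit count, grid, block.
    SQE = struct.Struct("<II QQ QQ HHH HHH I")  # little-endian, fixed width

    def make_sqe(op, flags, src, dst, nbytes, ntrits, grid, block):
        """grid and block are (x, y, z) triples; the trailing word is reserved."""
        return SQE.pack(op, flags, src, dst, nbytes, ntrits, *grid, *block, 0)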
10) ISA sketch (enough to write a compiler)
- Scalar & vector forms (suffix .vN for vector length).
- Selected ops (pseudocode mnemonics):
  - TADD rd, ra, rb         ; ternary add
  - TSUB rd, ra, rb         ; ternary subtract
  - TMUL rd, ra, rb         ; ternary multiply
  - TMAC rd, ra, rb, rc     ; rd = ra*rb + rc
  - TDOT rd, ra, rb, #len   ; dot product, reduced into rd
  - TSGN rd, ra, #t0, #t1   ; thresholds to {−1,0,+1}
  - TSEL rd, ra, rb, rc     ; ternary select
  - TLD rd, [addr] / TST [addr], rs  ; load/store TRAM
  - TSHF rd, ra, #k         ; ternary shift by powers of 3
  - TBAR #scope             ; sync
  - TLUT rd, ra, #imm243    ; 5-trit LUT immediate (tiny tables)
- Predication: the tp register holds a trit; −1/0/+1 can drive 3-way predicated paths (nice!). See the TSEL/TSHF sketch below.
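A Python reference for two of the less obvious ops, under one plausible reading of the mnemonics (the predicate operand for TSEL and the LSB-first digit lists from the section-5 sketch are assumptions):

    def TSEL(tp, ra, rb, rc):
        """Ternary select: the predicate trit tp picks one of three sources."""
        return {-1: ra, 0: rb, +1: rc}[tp]

    def TSHF(digits, k):
        """Shift balanced-ternary digits (LSB-first) by k: multiply by 3**k."""
        return [0] * k + list(digits) if k >= 0 else list(digits[-k:])

    # Three-way predication: one trit steers among three code paths.
    assert TSEL(-1, "neg", "zero", "pos") == "neg"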
11) What workloads fly?
- Ternary Neural Nets (weights & activations in {−1,0,+1}):
Drastic memory & MAC energy savings; dot-products are native.
- Graph analytics (sign/neutral edges): highly compressible.
- Search/IR with ternary signatures (−1 = anti-feature).
- Big-integer & exact arithmetic using balanced-ternary tricks (carry control).
- Symbolic systems where −1/0/+1 semantics are first-class (constraint solvers).
12) Bring-up plan (from “this week” to “silicon”)
Phase A — Developer card using FPGAs (pragmatic MVP)
- PCIe FPGA board (e.g., Gen4 x8), HBM as backing store.
- Implement ternary ALUs in LUTs; TRAM emulated (2 bits/trit).
- Do binary PCIe, binary HBM, ternary only inside the fabric.
- Ship the driver + runtime + compiler intrinsics now.
(You’ll already accelerate TNNs and custom ternary kernels.)
Phase B — Mixed-signal prototype
- Add a small analog tile that supports 3-level cells and sense amps.
Use it as an on-card scratchpad to validate margins, ECC, calibration.
Phase C — Custom ASIC
- Full ternary arrays, ternary NoC, true TRAM banks, on-package HBM.
PCIe/CXL on the edge; clock/power domains, DVFS per array.
13) How it looks to a programmer (minimal but real)
// Host-side pseudo-API
TrixDevice dev = trixOpen(0);
size_t n = 1<<20; // number of trits
Trit *h_a = trixHostAlloc(n);
Trit *h_b = trixHostAlloc(n);
Trit *h_c = trixHostAlloc(n);
// fill with {-1,0,1}
init_random_ternary(h_a, n);
init_random_ternary(h_b, n);
// Device buffers
TrixBuf d_a = trixAlloc(dev, n);
TrixBuf d_b = trixAlloc(dev, n);
TrixBuf d_c = trixAlloc(dev, n);
// H→D copies (packs 5 trits/byte automatically unless you pre-pack)
trixMemcpyHTD(dev, d_a, h_a, n);
trixMemcpyHTD(dev, d_b, h_b, n);
// Launch: c = a ⊙ b dot for blocks of 256 (toy)
dim3 grid = {(unsigned)(n/256),1,1};
dim3 block = {256,1,1};
trixLaunch(dev, "tdot_kernel", grid, block, d_c, d_a, d_b, (int)n);
// D→H
trixMemcpyDTH(dev, h_c, d_c, n);
A ternary kernel (in a DSL) might look like:
kernel tdot_kernel(out Trit* c, const Trit* a, const Trit* b, int n) {
    int i = tid();
    Trit acc = 0;
    for (int k = i*256; k < (i+1)*256 && k < n; ++k) {
        acc = tmac3(acc, a[k], b[k]);  // acc += a[k]*b[k]
    }
    c[i] = acc;
}
14) Physical design choices (pros/cons)
| Choice | Pros | Cons |
|---|---|---|
| MLV ternary CMOS | Fast, compact, keeps wires few | Analog margins, PVT calibration |
| 1-of-3 wires | Simple sensing, robust | 3× wires, area/power up |
| Emulated TRAM (2b/trit) | Ship now, easy ECC | Area/energy tax, not “pure” ternary |
| True TRAM (SRAM) | Density win, purity | New cell dev, sense amps, risk |
| PCM/ReRAM (3-level) | Density, NVM | Latency, endurance modeling |
15) Performance ballparks (intuition, not hype)
- Packing: at 5 trits per byte, PCIe Gen4 x8 (~16 GB/s unidirectional) carries ~80 Gtrits/s of effective payload.
- Array: 64 lanes × 128-trit wide × 1.5 GHz → 12.3 Ttrit-ops/s per core (TADD/TMUL).
With 16 such cores: ~200 Ttrit-ops/s (not counting MAC fusion).
- AI benefit: ternary TNNs often trade a few percent of accuracy for >10× energy-per-bit gains; this card makes that trade very attractive.
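The arithmetic behind those numbers, as a check (assumption-laden back-of-envelope, not a benchmark):

    pcie_bytes_per_s = 16e9                      # Gen4 x8, unidirectional, approx.
    print(pcie_bytes_per_s * 5 / 1e9)            # 5 trits/byte -> 80.0 Gtrits/s

    lanes, width, clock_hz, cores = 64, 128, 1.5e9, 16
    per_core = lanes * width * clock_hz          # trit-ops per second, one core
    print(per_core / 1e12, cores * per_core / 1e12)  # -> ~12.3 and ~196.6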
16) Tooling you’ll want to ship
- trix-asm (for microbenchmarks), trix-objdump, trix-prof (per-kernel counters: occupancy, stall reasons, error-rate).
- calibd (daemon) to nudge V−/V0/V+ with temp/load.
- Converter tools:
  - float_to_trit: quantize tensors using learned thresholds (sketched below).
  - ternary_onnx_convert: port TNNs into a ternary-ops graph.
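The quantizer could follow the standard ternary-weight recipe: zero out small values, keep the signs of large ones. A sketch with a single learned threshold t per tensor (the helper mirrors the tool name above but is hypothetical):

    import numpy as np

    def float_to_trit(x, t):
        """Quantize floats to {-1, 0, +1}: |x| <= t -> 0, else sign(x)."""
        return np.where(np.abs(x) <= t, 0, np.sign(x)).astype(np.int8)

    w = np.array([0.8, -0.05, 0.3, -0.9])
    print(float_to_trit(w, t=0.2))  # -> [ 1  0  1 -1]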
17) Risks & mitigations
- Analog drift / noise: continuous background calibration, ECC, and auto-re-thresholding passes.
- Compiler maturity: start with intrinsics + MLIR dialect; keep ISA small and regular.
- PCIe bottlenecks: aggressive packing, device-side pipelines that keep data ternary once on card.
- Market fit: lead with TNN inference, graph analytics, and security-friendly ternary signatures.
18) Shortest path to something you can build this year
- FPGA dev card with: PCIe bridge, ternary ALUs, emulated TRAM.
- Release SDK (driver + C++/Python API + intrinsics).
- Demo models: ternary MobileNet-like, ternary BERT-ish classifier with {−1,0,1} projections.
- Iterate toward mixed-signal tile proving true 3-level cells + sense amp.
- Lock an ASIC once the toolchain and kernels stabilize.
TL;DR (design in one paragraph)
Build a PCIe accelerator that treats the host as binary and does all ternary internally. Use a bridge that packs/unpacks trits (e.g., 5→8), a calibrated ternary compute fabric (balanced ternary ALUs/MACs) with (emulated first, true later) TRAM, and a CUDA-like runtime so programmers launch ternary kernels without wrestling analog details. Lead with ternary neural nets and dot-product-heavy kernels, then grow into a general ternary ISA as the silicon matures.