
Plug-In Ternary Power: Designing a Trinary GPU-Class Accelerator for PCs

Computers have always spoken in the language of 1s and 0s—but what if they could also think in threes? This article explores the design of a trinary accelerator card: a plug-in device that brings balanced-ternary computing into a standard PC. From hardware architecture and memory options to programming models and real-world workloads, it shows how adding “trits” alongside bits could unlock faster, more efficient ways of running neural networks, data-intensive tasks, and even entirely new forms of computation.

Example: Setun Computer

Origins

  • Developed: 1958 at Moscow State University by Nikolay Brusentsov and his team.
  • Motivation: Binary wasn’t the only possible logic system. Ternary was attractive because:
    • Balanced ternary (−1, 0, +1) is mathematically elegant.
    • It can represent signed numbers naturally without separate sign bits.
    • In some hardware, ternary can reduce the number of storage elements and interconnects.

Design

  • Logic base: Balanced ternary system (digits: −1, 0, +1).
  • Word length: 18 trits (roughly equal in information to ~28 bits).
  • Memory: Magnetic core memory, adapted to hold three states instead of two.
  • Performance: Around 3,000 operations per second—modest by today’s standards, but efficient compared to many binary contemporaries of similar size.
  • Instruction set: Included arithmetic in balanced ternary, branching, and subroutine support. Arithmetic was more concise—multiplication and division could sometimes be done with fewer steps than binary.

Application in Space / Aerospace

  • Context: The Soviet space program in the late 1950s and early 1960s was experimenting with compact, reliable computing for satellites and control systems.
  • Claimed Use: A modified Setun-like design was reportedly adapted for certain spacecraft subsystems, where efficiency and reduced memory requirements were valuable.
    • Ternary storage could, in theory, reduce wiring and core count by up to 37% compared to binary for the same representational range.
    • It also offered natural signed math, simplifying control computations.
  • Evidence: While Setun itself was primarily an academic and industrial machine, its design principles influenced onboard processors in Soviet space devices, where ternary circuits were tested for telemetry and control. Documentation is scarce (much was classified), but multiple Soviet-era accounts mention its aerospace role.

Legacy

  • Short-lived: Only about 50–60 Setun machines were ever produced.
  • Replaced: By binary machines in the 1960s, due to Western dominance in binary components and standardization.
  • Modern view: Computer scientists still admire it as a “cleaner” system. Brusentsov even built a later version called Setun-70 (a ternary virtual machine), running ternary logic on binary hardware.

Key Point: The Russians really did build a working trinary digital computer—the Setun—and while it wasn’t mass-deployed in every satellite, the concept directly informed space-use computers where efficiency and compactness were critical. It remains the only serious, production-level ternary system in computing history.

1) What “ternary” are we building?

  • Number system: Balanced ternary (trits ∈ {−1, 0, +1}).
    Why: natural signed representation, simpler normalization, fewer long carry chains, elegant multiply/accumulate for AI with weights in {−1,0,+1}.
  • Logic encoding (on-chip): Choose one and stick to it end-to-end:
    • Multi-Level Voltage (MLV): V−, V0, V+. (Fastest route on silicon; needs good analog margins + calibration.)
    • Current-mode ternary (three distinct currents; great at speed, more analog design work).
    • 1-of-3 (ternary unary) wires (robust, but triples wires; use for short/critical control paths only).

For a first-generation card, I’d pick MLV on chip and binary on the external bus, translating at the bridge.
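
To make the number system concrete, here's a tiny host-side sketch (plain Python, illustrative only) that converts a signed integer to balanced-ternary digits and back. Note that the sign falls out of the digits themselves; there is no separate sign bit.

# Convert a signed integer to balanced-ternary digits (least-significant first) and back.
def to_balanced_ternary(n):
    digits = []
    while n != 0:
        r = n % 3                 # remainder in {0, 1, 2}
        n //= 3
        if r == 2:                # a 2 becomes -1 plus a carry into the next position
            r = -1
            n += 1
        digits.append(r)
    return digits or [0]

def from_balanced_ternary(digits):
    value = 0
    for d in reversed(digits):    # digits are stored least-significant first
        value = value * 3 + d
    return value

assert from_balanced_ternary(to_balanced_ternary(-42)) == -42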


2) Card from 10,000 feet (blocks & data flow)

 [Host CPU/RAM]
      │  PCIe Gen4/5 x8/x16 (binary)
      ▼
 ┌─────────────────────────────────────────────────────────┐
 │            TRINARY ACCELERATOR (PCIe add-in)            │
 │                                                         │
 │  A. PCIe & Bridge                                       │
 │   • PCIe Controller + DMA engines (binary)              │
 │   • Pack/Unpack & Encode/Decode: bits ⇄ trits           │
 │   • Command Queues, Doorbells, Interrupts               │
 │                                                         │
 │  B. Control Complex                                     │
 │   • Small binary MCU (RISC-V/Arm) for mgmt, firmware    │
 │   • Calibration DAC/ADC for V−/V0/V+ thresholds         │
 │   • Telemetry: temps, voltages, error rates             │
 │                                                         │
 │  C. Ternary Compute Arrays                              │
 │   • Many-core TPUs* with ternary ALUs, MACs, SFUs       │
 │     (*TPU = Ternary Processing Unit here)               │
 │   • Scratchpad TRAM (ternary SRAM or emulated)          │
 │   • SIMD/SIMT style lanes; warp/wavefront scheduler     │
 │                                                         │
 │  D. Ternary Memory Complex                              │
 │   • On-package TRAM banks (true 3-state cells if avail) │
 │     else dual-bit SRAM emulating trits (pragmatic)      │
 │   • Optional HBM (binary) with on-the-fly ternary codec │
 │                                                         │
 │  E. Interconnect                                        │
 │   • Ternary NoC (3-level links + repeaters)             │
 │   • Crossbar or mesh; reduction trees for dot-products  │
 └─────────────────────────────────────────────────────────┘

Philosophy: keep the outside world boring (PCIe + a normal driver), and put all the ternary magic behind a bridge that packs/unpacks, schedules kernels, and manages calibration.


3) Host interface & software stack (what you’d actually code)

3.1 Driver & runtime

  • Kernel driver (Linux/Windows):
    • Maps BAR registers (queues, doorbells, status).
    • Provides IOCTLs for memory alloc, ternary kernel launch, DMA.
    • Interrupts or eventfd for completion.
  • User-space runtime (libtrix):
    • Ternary memory allocators (pinned host buffers + on-card TRAM).
    • Data packers: pack_trits(), unpack_trits().
    • Kernel launcher: trixLaunch(grid, block, args...).
    • Stream API (like CUDA streams) for overlap of compute and DMA.
  • DSL / compiler front-end:
    • Option A (fastest to ship): LLVM dialect for balanced ternary with intrinsics (btrit, btrit_vec<N>, btrit_dot).
    • Option B: A Pythonic DSL (NumPy-like) that JITs via MLIR to the ISA.

3.2 Programming model

  • SIMD/SIMT kernels running on TPUs:
    • Intrinsics (reference semantics are sketched in Python right after this list):
      • tadd3(a,b) → balanced ternary add (−1,0,+1 with end-around carry rules).
      • tmul3(a,b) → ternary multiply (table-based in 1 cycle).
      • tmac3(acc, a, b) → fused ternary MAC with saturate/normalize.
      • tsgn(x) → sign/tritization (map real to {−1,0,+1}).
      • tdot3(vecA, vecB) → vector dot product with ternary partials + reduction tree.
    • Collectives: warp-level ternary reductions, prefix ops.
  • Key accelerators:
    • TNN/TBMM (Ternary Neural Networks / Block Matrix Multiply).
    • Ternary convolution, graph ops on {−1,0,+1} edge weights.
    • Balanced-ternary big-int primitives (fast carry-limited add/sub).
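
Here's a minimal Python reference model for the intrinsics' assumed semantics (host-side only; the names mirror the list above, but the real lane behavior, saturation, and accumulator width are hardware decisions not shown here):

# Assumed reference semantics for the ternary intrinsics (not the device ISA itself).
def tmul3(a, b):
    return a * b                                      # {-1,0,+1} x {-1,0,+1} stays in {-1,0,+1}

def tsgn(x, t0=-0.5, t1=0.5):
    return -1 if x < t0 else (1 if x > t1 else 0)     # tritize a real value via two thresholds

def tmac3(acc, a, b):
    return acc + tmul3(a, b)                          # accumulator modeled as a plain wide integer

def tdot3(vec_a, vec_b):
    acc = 0
    for a, b in zip(vec_a, vec_b):
        acc = tmac3(acc, a, b)
    return acc

assert tdot3([1, -1, 0, 1], [1, 1, 0, -1]) == -1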

4) Encoding & packing (binary world ⇄ trits)

You need efficient packing to keep PCIe from being the bottleneck.

  • Information-theoretic note: 1 trit ≈ log2(3) ≈ 1.585 bits.
  • Practical pack: pack 5 trits → 8 bits (since 3^5=243 < 256).
    Packing ratio: 1.6 bits/trit (vs the 1.585-bit ideal), so ~99% of each byte's capacity carries information.

Host-side reference pack (conceptual):

# packs an array of trits in {-1,0,+1} (map to {0,1,2}) into bytes, 5 trits per byte
def pack5(trits):  # len(trits) multiple of 5
    out = bytearray()
    for i in range(0, len(trits), 5):
        q = 0
        for t in trits[i:i+5]:
            q = q*3 + (t+1)      # map -1,0,1 -> 0,1,2
        out.append(q)            # 0..242
    return bytes(out)

def unpack5(bytes_in):
    trits = []
    for b in bytes_in:
        q = b
        v = [0]*5
        for i in range(4,-1,-1):
            v[i] = (q % 3) - 1   # map 0,1,2 -> -1,0,1
            q //= 3
        trits.extend(v)
    return trits

On the card, the bridge reverses this pack into lane-width trit vectors for the compute arrays.


5) Ternary arithmetic (balanced) essentials

  • Digits: d_i ∈ {−1, 0, +1}; value = Σ_i d_i · 3^i.
  • Addition: carry tends to localize; you can use signed-digit end-around carry rules that shorten worst-case propagation.
  • Multiply: 9-entry LUT per trit pair; multi-trit multiply accumulates with ternary compressors (3:1, 4:1 trees).
  • Normalization: if a digit falls outside {−1,0,1}, push ±1 to the next position (cheap in hardware).
  • Dot-product: perfect for TNN: weights/activations in {−1,0,+1}. Accumulator can be higher radix (binary or wide ternary).
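
A small digit-level sketch of those addition/normalization rules (assumed behavior; digits are least-significant first, and any position that leaves {−1,0,+1} pushes ±1 into the next one):

# Digit-wise balanced-ternary addition: each position sums to -3..+3 and is
# renormalized into {-1,0,+1} by pushing a +/-1 carry into the next position.
def btern_add(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    out, carry = [], 0
    for i in range(n):
        s = a[i] + b[i] + carry        # s in -3..+3
        carry = 0
        if s > 1:
            s -= 3; carry = 1          # e.g. +2 -> digit -1, carry +1
        elif s < -1:
            s += 3; carry = -1         # e.g. -2 -> digit +1, carry -1
        out.append(s)
    if carry:
        out.append(carry)
    return out

# 5 is [-1,-1,1] and 7 is [1,-1,1] (least-significant digit first); their sum is 12.
assert sum(d * 3**i for i, d in enumerate(btern_add([-1, -1, 1], [1, -1, 1]))) == 12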

6) Microarchitecture of a Ternary Processing Unit (TPU)

  • Front-end: warp scheduler (binary FSM), instruction fetch from a small I-cache (binary), instruction words carry ternary opcodes/fields to the decode.
  • Decode: maps opcodes to ternary ALU/MAC control signals.
  • Execution lanes:
    • T-ALU: add/sub/abs/compare in balanced ternary.
    • T-MAC: (a*b)+acc with 1-cycle LUT multiply + 2-3 stage ternary compressor tree.
    • T-SFU: signum, clip, ternary activation (e.g., tanh→{-1,0,1} via thresholds).
  • Register file: multi-ported; each register holds W trits (e.g., 128).
  • Lane width example: 64 lanes × 128-trit vectors = 8192 trits/inst.
  • Scratchpad TRAM: 64–256 KB per core cluster; banks with ternary sense amps.

7) Memory options (realistic → ambitious)

  1. Emulated ternary SRAM (ship-now option)
    Store each trit in 2 binary bits (00→−1, 01→0, 10→+1, 11→illegal/ECC).
    Pros: easy, fast, tolerant. Cons: wastes area.
  2. True 3-state SRAM
    8T/10T cell variants with multi-threshold inverters and a ternary sense amp.
    Pros: density win, “real ternary.” Cons: device & PVT margin work.
  3. Non-volatile multi-level (PCM/ReRAM) with 3 stable levels for cheap dense TRAM (higher latency, great density).

ECC: Use ternary BCH or Hamming-like codes; or store a parity trit per K trits; in emulated mode, use binary ECC over packed bytes.
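
For the ship-now option (1), the 2-bits-per-trit emulation is easy to model; here is a sketch of the assumed mapping, with the fourth code point treated as an error symbol:

# Emulated ternary storage: each trit occupies 2 bits (00 -> -1, 01 -> 0, 10 -> +1).
# The unused code 11 is reserved as an illegal/poison marker that ECC or scrubbing can catch.
ENC = {-1: 0b00, 0: 0b01, +1: 0b10}
DEC = {v: k for k, v in ENC.items()}

def encode_trits(trits):
    word = 0
    for i, t in enumerate(trits):
        word |= ENC[t] << (2 * i)
    return word

def decode_trits(word, count):
    out = []
    for i in range(count):
        code = (word >> (2 * i)) & 0b11
        if code == 0b11:
            raise ValueError(f"illegal trit code at position {i}")  # card would scrub/correct here
        out.append(DEC[code])
    return out

assert decode_trits(encode_trits([-1, 0, 1, 1]), 4) == [-1, 0, 1, 1]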


8) Calibration & reliability (the analog reality)

  • Per-die DACs set V−, V0, V+ references; on-die ADCs and PRBS test patterns measure error rates.
  • Temperature drift compensation: background calibration while compute runs (dither a spare lane).
  • Margins: aim for ≥6σ separation between levels at worst corner; dynamic voltage swing scaling under load.
  • Soft-error handling: retry, scrub, and auto-re-tritize noisy buffers (re-threshold to {-1,0,1}).

9) PCIe bridge & queues (how work gets onto the card)

  • Submission queues in host memory (ring buffers). Each entry:
    • op: (copy H→D, copy D→H, kernel, memset(trit), barrier)
    • ptrs: src/dst host/device addresses
    • sizes: packed bytes & trit counts
    • grid/block: execution geometry
    • flags: stream id, priority, fences
  • Doorbells: mmio write to poke the bridge.
  • Completions: MSI-X interrupts or polled CQE.
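
One purely illustrative binary layout for such an entry, packed host-side with Python's struct module (the 64-byte size, field order, and opcode numbering are assumptions, not a spec; flags would carry stream id/priority/fence bits):

import struct

# Hypothetical 64-byte submission-queue entry: opcode, flags, reserved word,
# src/dst addresses, packed byte count, trit count, grid and block geometry.
SQE_FMT = "<HHIQQQQIIIIII"   # little-endian, 64 bytes total
OP_KERNEL = 3                # example opcode value (assumed)

def make_sqe(op, flags, src, dst, nbytes, ntrits, grid, block):
    gx, gy, gz = grid
    bx, by, bz = block
    return struct.pack(SQE_FMT, op, flags, 0, src, dst, nbytes, ntrits,
                       gx, gy, gz, bx, by, bz)

entry = make_sqe(OP_KERNEL, 0, 0x1000, 0x2000, 4096, 4096 * 5, (4096, 1, 1), (256, 1, 1))
assert len(entry) == struct.calcsize(SQE_FMT) == 64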

Zero-copy option: pin a host buffer; bridge does on-the-fly ternary decode into the compute fabric.


10) ISA sketch (enough to write a compiler)

  • Scalar & vector forms (suffix .vN for vector length).
  • Selected ops (pseudocode mnemonics):
    • TADD rd, ra, rb ; ternary add
    • TSUB rd, ra, rb
    • TMUL rd, ra, rb ; ternary multiply
    • TMAC rd, ra, rb, rc ; rd = ra*rb + rc
    • TDOT rd, ra, rb, #len ; reduce into rd
    • TSGN rd, ra, #t0,#t1 ; thresholds to {-1,0,1}
    • TSEL rd, ra, rb, rc ; ternary select
    • TLD rd, [addr] / TST [addr], rs ; load/store TRAM
    • TSHF rd, ra, #k ; ternary shift by powers of 3
    • TBAR #scope ; sync
    • TLUT rd, ra, #imm243 ; 5-trit LUT immediate (tiny tables)
  • Predication: tp register holds a trit; −1/0/+1 can drive 3-way predicated paths (nice!)
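
As one concrete example, a plausible reading of TSEL's semantics under a trit predicate (an assumption; the real ISA would pin this down) shows why 3-way predication is attractive: an if/elif/else collapses into a single select per lane.

# Assumed semantics for TSEL rd, ra, rb, rc with predicate trit tp:
# tp == -1 selects ra, tp == 0 selects rb, tp == +1 selects rc.
def tsel(tp, ra, rb, rc):
    return {-1: ra, 0: rb, +1: rc}[tp]

assert [tsel(p, "neg", "zero", "pos") for p in (-1, 0, 1)] == ["neg", "zero", "pos"]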

11) What workloads fly?

  • Ternary Neural Nets (weights & activations in {−1,0,+1}):
    Drastic memory & MAC energy savings; dot-products are native.
  • Graph analytics (sign/neutral edges): highly compressible.
  • Search/IR with ternary signatures (−1 = anti-feature).
  • Big-integer & exact arithmetic using balanced-ternary tricks (carry control).
  • Symbolic systems where −1/0/+1 semantics are first-class (constraint solvers).

12) Bring-up plan (from “this week” to “silicon”)

Phase A — Developer card using FPGAs (pragmatic MVP)

  • PCIe FPGA board (e.g., Gen4 x8), HBM as backing store.
  • Implement ternary ALUs in LUTs; TRAM emulated (2 bits/trit).
  • Do binary PCIe, binary HBM, ternary only inside the fabric.
  • Ship the driver + runtime + compiler intrinsics now.
    (You’ll already accelerate TNNs and custom ternary kernels.)

Phase B — Mixed-signal prototype

  • Add a small analog tile that supports 3-level cells and sense amps.
    Use it as an on-card scratchpad to validate margins, ECC, calibration.

Phase C — Custom ASIC

  • Full ternary arrays, ternary NoC, true TRAM banks, on-package HBM.
    PCIe/CXL on the edge; clock/power domains, DVFS per array.

13) How it looks to a programmer (minimal but real)

// Host-side pseudo-API
TrixDevice dev = trixOpen(0);
size_t n = 1<<20; // number of trits
Trit *h_a = trixHostAlloc(n);
Trit *h_b = trixHostAlloc(n);
Trit *h_c = trixHostAlloc(n);

// fill with {-1,0,1}
init_random_ternary(h_a, n);
init_random_ternary(h_b, n);

// Device buffers
TrixBuf d_a = trixAlloc(dev, n);
TrixBuf d_b = trixAlloc(dev, n);
TrixBuf d_c = trixAlloc(dev, n);

// H→D copies (packs 5 trits/byte automatically unless you pre-pack)
trixMemcpyHTD(dev, d_a, h_a, n);
trixMemcpyHTD(dev, d_b, h_b, n);

// Launch: c = a ⊙ b dot for blocks of 256 (toy)
dim3 grid = {(unsigned)(n/256),1,1};
dim3 block = {256,1,1};
trixLaunch(dev, "tdot_kernel", grid, block, d_c, d_a, d_b, (int)n);

// D→H
trixMemcpyDTH(dev, h_c, d_c, n);

A ternary kernel (in a DSL) might look like:

kernel tdot_kernel(out Trit* c, const Trit* a, const Trit* b, int n) {
  int i = tid();
  Trit acc = 0;
  for (int k = i*256; k < (i+1)*256 && k < n; ++k) {
    acc = tmac3(acc, a[k], b[k]); // acc += a[k]*b[k]
  }
  c[i] = acc;
}

14) Physical design choices (pros/cons)

Choice                   | Pros                            | Cons
MLV ternary CMOS         | Fast, compact, keeps wires few  | Analog margins, PVT calibration
1-of-3 wires             | Simple sensing, robust          | 3× wires, area/power up
Emulated TRAM (2b/trit)  | Ship now, easy ECC              | Area/energy tax, not “pure” ternary
True TRAM (SRAM)         | Density win, purity             | New cell dev, sense amps, risk
PCM/ReRAM (3-level)      | Density, NVM                    | Latency, endurance modeling

15) Performance ballparks (intuition, not hype)

  • Packing: at 5 trits per byte, PCIe Gen4 x8 (~16 GB/s unidirectional) delivers ~80 Gtrits/s of effective payload.
  • Array: 64 lanes × 128-trit wide × 1.5 GHz → 12.3 Ttrit-ops/s per core (TADD/TMUL).
    With 16 such cores: ~200 Ttrit-ops/s (not counting MAC fusion).
  • AI benefit: ternary TNNs often trade a few % accuracy for >10× energy/bit gains; this card makes that trade very attractive.
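
The arithmetic behind those figures, so the assumptions are easy to adjust:

import math

# Back-of-envelope numbers from the bullets above (all inputs are assumptions).
pcie_bytes_per_s = 16e9                          # PCIe Gen4 x8, ~16 GB/s unidirectional
trits_per_byte   = 5                             # 5-trits-in-8-bits packing
print(pcie_bytes_per_s * trits_per_byte / 1e9)   # ~80 Gtrits/s of payload over the link

lanes, vec_trits, clock_hz, cores = 64, 128, 1.5e9, 16
per_core = lanes * vec_trits * clock_hz          # trit-ops/s per core
print(per_core / 1e12, per_core * cores / 1e12)  # ~12.3 and ~197 Ttrit-ops/s

print(math.log2(3))                              # ~1.585 bits of information per trit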

16) Tooling you’ll want to ship

  • trix-asm (for microbenchmarks), trix-objdump, trix-prof (per-kernel counters: occupancy, stall reasons, error-rate).
  • calibd (daemon) to nudge V−/V0/V+ with temp/load.
  • Converter tools:
    • float_to_trit: quantize tensors using learned thresholds.
    • ternary_onnx_convert: port TNNs into ternary ops graph.
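
A minimal sketch of what float_to_trit might do (assumed behavior: a single symmetric threshold around zero; the shipped tool would learn the threshold per tensor instead of using a fixed fraction):

# Quantize a list of floats to {-1, 0, +1} with one symmetric threshold.
def float_to_trit(values, threshold_fraction=0.05):
    scale = max(abs(v) for v in values) or 1.0    # guard against an all-zero tensor
    t = threshold_fraction * scale
    return [(-1 if v < -t else (1 if v > t else 0)) for v in values]

assert float_to_trit([0.9, -0.7, 0.01, 0.0]) == [1, -1, 0, 0]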

17) Risks & mitigations

  • Analog drift / noise: continuous background calibration, ECC, and auto-re-thresholding passes.
  • Compiler maturity: start with intrinsics + MLIR dialect; keep ISA small and regular.
  • PCIe bottlenecks: aggressive packing, device-side pipelines that keep data ternary once on card.
  • Market fit: lead with TNN inference, graph analytics, and security-friendly ternary signatures.

18) Shortest path to something you can build this year

  1. FPGA dev card with: PCIe bridge, ternary ALUs, emulated TRAM.
  2. Release SDK (driver + C++/Python API + intrinsics).
  3. Demo models: ternary MobileNet-like, ternary BERT-ish classifier with {−1,0,1} projections.
  4. Iterate toward mixed-signal tile proving true 3-level cells + sense amp.
  5. Lock an ASIC once the toolchain and kernels stabilize.

TL;DR (design in one paragraph)

Build a PCIe accelerator that treats the host as binary and does all ternary internally. Use a bridge that packs/unpacks trits (e.g., 5→8), a calibrated ternary compute fabric (balanced ternary ALUs/MACs) with (emulated first, true later) TRAM, and a CUDA-like runtime so programmers launch ternary kernels without wrestling analog details. Lead with ternary neural nets and dot-product-heavy kernels, then grow into a general ternary ISA as the silicon matures.