How the RWKV Neural Network Works: A Step-by-Step Breakdown for AI Engineers

  • Thu, Jul 2025

1. High-Level Philosophy

RWKV solves a core limitation in traditional Transformers: quadratic attention cost with sequence length. Instead of computing full attention matrices, it introduces a mechanism to simulate attention using RNN-like recurrence, while retaining Transformer expressivity. It’s “Transformers without attention.”

RWKV achieves this by:

  • Replacing attention with a time-mixed weighted key-value operation (inspired by RNNs).
  • Preserving positional structure via time decay and recurrence, enabling long-context memory.
  • Enabling linear-time inference and constant memory per token (huge for deployment).

2. Architecture Overview

RWKV’s architecture mirrors the Transformer layer stack: multiple blocks of LayerNorm, feedforward, and attention-like units — but the attention is replaced with the custom RWKV time-mixing mechanism.

Each RWKV block contains:

  • LayerNorm1
  • Time-Mix Block (replaces attention)
  • LayerNorm2
  • Channel-Mix Block (similar to FFN)
# Simplified block layout (pre-norm residuals)
x = x + time_mix_block(layer_norm1(x), state)
x = x + channel_mix_block(layer_norm2(x))
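
For concreteness, here is a runnable NumPy version of the same layout; time_mix_block and channel_mix_block are placeholders for the mechanisms described in the next sections, and the names are illustrative rather than the official implementation:

import numpy as np

def layer_norm(x, eps=1e-5):
    # Channel-wise normalization (learned scale/bias omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rwkv_block(x, state, time_mix_block, channel_mix_block):
    # Pre-norm residual layout: normalize, mix, add back to the residual stream.
    mix_out, state = time_mix_block(layer_norm(x), state)
    x = x + mix_out
    x = x + channel_mix_block(layer_norm(x))
    return x, state

# Toy usage with identity placeholders for the two mixing sub-blocks:
x, state = np.random.randn(16), None
x, state = rwkv_block(x, state, lambda h, s: (h, s), lambda h: h)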

3. Time-Mix (RWKV “Attention”)

This is the crown jewel. Time-Mix simulates attention using exponential moving averages (EMA) over key-value pairs.

3.1. What It Replaces:

Instead of computing:

softmax(QKᵀ)V

…RWKV computes a recurrent, decay-weighted sum over past key-value pairs (simplified, omitting the current-token bonus term):

wkv_t = Σᵢ≤t exp(−(t−i)·w + k_i) · v_i  /  Σᵢ≤t exp(−(t−i)·w + k_i)

3.2. Core Idea:

Let’s define:

  • x_t: input at timestep t
  • k_t, v_t: key and value for timestep t
  • w: learnable time-decay (per layer/channel)
  • r_t: receptance gate (learned from x_t)
  • state_k, state_v: running decayed sums (normalizer and weighted values) carried across timesteps

Then:

k̄_t = EMA of exp(k_t) with decay w          (running normalizer)
v̄_t = EMA of exp(k_t) · v_t with decay w    (running sum of decay-weighted values)

out_t = r_t * (v̄_t / k̄_t)

This resembles attention over a compressed memory, with the exponential kernel replacing softmax attention.

❗ Hidden trick: RWKV tracks these decayed sums in log space, keeping a running maximum exponent so the exponentials never overflow or underflow in very long sequences — a key to numerical stability.
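
Here is a minimal per-token sketch of that stable recurrence. The state carries a running numerator, denominator, and maximum exponent; the real kernel additionally gives the current token a learned "bonus" weight (often written u), which is omitted here:

import numpy as np

def wkv_step(k_t, v_t, w, state):
    # state = (num, den, max_exp): running weighted sum of values, running
    # normalizer, and the running maximum exponent used for stability.
    num, den, max_exp = state
    decayed_max = max_exp - w                   # old exponents decay by w each step
    m = np.maximum(decayed_max, k_t)            # new shared maximum exponent
    a = np.exp(decayed_max - m)                 # rescale old contributions
    b = np.exp(k_t - m)                         # weight of the current token
    num = a * num + b * v_t
    den = a * den + b
    out = num / den                             # distance-weighted average of values
    return out, (num, den, m)

# Toy usage: stream three tokens through a channel group of size 4.
C = 4
state = (np.zeros(C), np.zeros(C), np.full(C, -np.inf))
w = np.full(C, 0.5)                             # per-channel decay (illustrative)
for _ in range(3):
    out, state = wkv_step(np.random.randn(C), np.random.randn(C), w, state)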


4. Receptance Gate (RNN-style)

Each time step computes a gating function over the result of time-mixing:

r_t = sigmoid(W_r * x_t)

This functions like an LSTM or GRU gate, controlling how much of the mixed value gets passed forward.

Receptance = “how much to receive from the past”
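
As a sketch (treating W_r as a plain projection; the real model also token-shifts x_t before projecting, as noted in Section 5):

import numpy as np

def receptance(x_t, W_r):
    # Per-channel gate in (0, 1): how much of the time-mixed past to pass on.
    return 1.0 / (1.0 + np.exp(-(W_r @ x_t)))

# Example: gate the time-mix output elementwise.
rng = np.random.default_rng(0)
x_t, W_r, wkv_t = rng.normal(size=8), rng.normal(size=(8, 8)), rng.normal(size=8)
out_t = receptance(x_t, W_r) * wkv_t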


5. Channel-Mix Block (FFN Analog)

This part acts like the standard Transformer FFN:

x_proj = W1 * x_t
x_act  = relu(x_proj) ** 2    # RWKV uses a squared ReLU here
out    = W2 * x_act

But unlike traditional FFNs, RWKV injects time-shifted mixing of previous token values here too, using the same time_mix interpolation parameters as the Time-Mix block.

⚠️ Secret detail: TimeMix and ChannelMix both use time offsets (time_mix parameters) to interpolate between current and previous hidden states. This gives RWKV its soft memory quality.
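
A minimal sketch of channel mixing with that token-shift interpolation. The structure (squared ReLU plus a receptance gate, with mu interpolation weights) follows the public RWKV-4 code; the exact names and sizes here are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mix(x_t, x_prev, mu_k, mu_r, W_k, W_v, W_r):
    # Token shift: interpolate between the current and previous hidden state.
    xk = mu_k * x_t + (1.0 - mu_k) * x_prev
    xr = mu_r * x_t + (1.0 - mu_r) * x_prev
    k = np.maximum(W_k @ xk, 0.0) ** 2      # squared ReLU
    r = sigmoid(W_r @ xr)                   # receptance gate
    return r * (W_v @ k)                    # gated FFN output

# Toy usage: hidden size 8, expansion factor 4 (sizes are illustrative).
rng = np.random.default_rng(0)
x_t, x_prev, mu_k, mu_r = rng.normal(size=8), rng.normal(size=8), 0.5, 0.5
W_k, W_v, W_r = rng.normal(size=(32, 8)), rng.normal(size=(8, 32)), rng.normal(size=(8, 8))
out = channel_mix(x_t, x_prev, mu_k, mu_r, W_k, W_v, W_r)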


6. Positional Encoding via Time Decay

Unlike Transformers, RWKV uses no explicit positional encoding vectors. Instead, position is encoded implicitly via the decay weights.

The EMA weights simulate distance-aware attention, since earlier tokens decay and affect the output less than recent tokens.

💡 This means RWKV can model thousands of tokens with a small memory footprint, ideal for long-context tasks (chat history, logs, code).
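
To see the effect, you can print the weight a token d steps in the past receives for a few decay rates; small decays keep long-range context, large decays forget quickly (the numbers below are illustrative only):

import numpy as np

# Effective weight of a token that is `d` steps in the past, for several decays.
distances = np.arange(0, 8)
for w in (0.1, 0.5, 2.0):                  # per-channel decay rates (illustrative)
    weights = np.exp(-w * distances)       # exponential falloff with distance
    print(f"w={w}: {np.round(weights, 3)}")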


7. Hidden State = Persistent Memory

RWKV’s hidden state is:

  • A fixed-size vector (or matrix) per layer
  • Updated token by token, like in an RNN
  • Saved and reused during inference (like kv-cache in Transformers, but linear)

In practice:

state = update_fn(state, x_t)

This gives RWKV constant memory per token (not proportional to sequence length), making it highly efficient in inference — especially on low-resource devices (phones, edge AI).
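
Concretely, streaming inference is just a loop that threads this fixed-size state through every call. The sketch below uses a stand-in rwkv_forward so the shape of the loop is visible; both function names are illustrative, not the real API:

import numpy as np

def init_state(num_layers, hidden):
    # One (num, den, max_exp, x_prev) tuple per layer; fixed size forever.
    return [(np.zeros(hidden), np.zeros(hidden),
             np.full(hidden, -np.inf), np.zeros(hidden)) for _ in range(num_layers)]

def rwkv_forward(token_embedding, state):
    # Stand-in for a full model step: here we just echo the embedding.
    # A real step would run every block and return vocabulary logits.
    return token_embedding, state

rng = np.random.default_rng(0)
state = init_state(num_layers=2, hidden=8)
for _ in range(1000):                              # arbitrarily long stream
    logits, state = rwkv_forward(rng.normal(size=8), state)
# Memory used by `state` is identical after 10 tokens or 10 million.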


8. Training Tricks

RWKV is trained like a Transformer — in parallel with batches of sequences.

But during inference, it switches to RNN mode, processing one token at a time with a carried hidden state.

To make that possible, RWKV layers are trained with:

  • TimeMix formulated so the decayed sums for a whole sequence can be computed in parallel
  • Causal masking and per-channel decay factors that reproduce the sequential update exactly (see the check below)
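
One way to see this equivalence: the same exponentially decayed sums can be computed either token by token (inference) or for every position of a sequence at once (training), and the two give identical results. A tiny NumPy check, ignoring the log-space stability trick for brevity:

import numpy as np

T, w = 6, 0.5
k = np.random.randn(T)
v = np.random.randn(T)

# Sequential (RNN mode): running decayed sums, one token at a time.
num = den = 0.0
seq_out = []
for t in range(T):
    num = np.exp(-w) * num + np.exp(k[t]) * v[t]
    den = np.exp(-w) * den + np.exp(k[t])
    seq_out.append(num / den)

# Parallel (training mode): build the full decay-weighted sum for each position.
par_out = []
for t in range(T):
    weights = np.exp(-w * (t - np.arange(t + 1)) + k[:t + 1])
    par_out.append(weights @ v[:t + 1] / weights.sum())

assert np.allclose(seq_out, par_out)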

9. Lesser-Known Secrets and Innovations

✅ RWKV’s EMA kernel is stable in FP16:

Unlike most RNNs and attention mechanisms, RWKV’s use of log-space time decay lets it operate in half precision with very long contexts.

✅ It supports infinite context in theory:

Since the memory is recurrent and does not truncate based on fixed sequence windows, it can process indefinite-length streams with only hidden state carryover.

✅ No softmax, no QK dot-product:

This avoids many of the bottlenecks of Transformer inference — you don’t need to store a growing kv-cache or do quadratic matrix ops.

✅ Parallel Training, Serial Inference:

RWKV can be trained like a Transformer (GPU-efficient) but run like an RNN (RAM-efficient). This duality is core to its practicality.

✅ Efficient on CPU and Edge:

RWKV runs well even on CPUs and embedded GPUs, unlike Transformers, which typically need a dedicated GPU to be viable at comparable speeds.


10. Summary of Step-by-Step Execution (Per Token)

Let’s walk through RWKV inference per token:

for each token x_t:
    # 1. LayerNorm on x_t
    x_ln1 = layer_norm1(x_t)

    # 2. TimeMix (pseudo-attention): update the decayed running sums
    k_t, v_t, r_t = compute_kvr(x_ln1)
    state_v = decay(state_v) + exp(k_t) * v_t    # weighted values (numerator)
    state_k = decay(state_k) + exp(k_t)          # normalizer (denominator)
    mix_out = r_t * (state_v / state_k)

    # 3. Residual Add
    x_res1 = x_t + mix_out

    # 4. LayerNorm
    x_ln2 = layer_norm2(x_res1)

    # 5. ChannelMix (Feedforward)
    cm_out = channel_mix(x_ln2)

    # 6. Final Residual Add
    x_next = x_res1 + cm_out

    # 7. Carry the updated state (state_k, state_v) forward to the next token

    # Output: logits from the final projection of the last layer's x_next
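
Putting the pieces together, here is a runnable single-layer NumPy sketch of that loop. It follows the simplified formulation used in this article: token shift and the current-token bonus weight are omitted, and all weights are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
C = 16                                        # hidden size
Wk, Wv, Wr = (rng.normal(0, 0.02, (C, C)) for _ in range(3))
W1, W2 = rng.normal(0, 0.02, (4 * C, C)), rng.normal(0, 0.02, (C, 4 * C))
w = np.full(C, 0.5)                           # per-channel time decay

def ln(x):                                    # LayerNorm without learned params
    return (x - x.mean()) / np.sqrt(x.var() + 1e-5)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Per-layer state: WKV numerator, denominator and running max exponent.
num, den, mx = np.zeros(C), np.zeros(C), np.full(C, -np.inf)

for x in rng.normal(size=(10, C)):            # ten stand-in token embeddings
    # 1-2. LayerNorm + TimeMix (pseudo-attention), stable log-space update
    xa = ln(x)
    k, v, r = Wk @ xa, Wv @ xa, sigmoid(Wr @ xa)
    m = np.maximum(mx - w, k)
    a, b = np.exp(mx - w - m), np.exp(k - m)
    num, den, mx = a * num + b * v, a * den + b, m
    x = x + r * (num / den)                   # 3. residual add

    # 4-6. LayerNorm + ChannelMix (squared-ReLU FFN) + residual add
    x = x + W2 @ (np.maximum(W1 @ ln(x), 0) ** 2)

    # 7. num, den, mx now hold the updated hidden state for the next token
# Output head (projection of x to vocabulary logits) omitted.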

11. Why RWKV Matters

  • It enables tiny, efficient models (RWKV-65M runs on Raspberry Pi)
  • It supports context lengths > 100K tokens
  • It’s open, trainable, and has LoRA & PEFT support
  • It has architectural simplicity — just Linear + EMA logic — no multi-heads, no softmax
  • It offers a foundation for persistent digital cognition: long-running agents and services that carry a compact state across sessions