live · phi-3-mini · webgpu

A real transformer, laid bare.

Three-point-eight billion parameters running in your browser. Every neuron, every attention head, every activation — drawn one-to-one from real GPU tensors. No mockup. No diorama. The model itself.

01 · The Promise

No mockups. The actual model.

Most "AI visualizations" you see online are decoration. Animated dots that pulse to a fake rhythm. Particles that don't connect to anything. A nice-looking metaphor with no model behind the curtain.

Neuropulse is the opposite. Every brightness, every line, every motion you see is a direct readout of a real WebGPU buffer in a real Phi-3-mini forward pass. When the model thinks about your prompt, you watch it think — not a representation of it.

Strict 1:1. Every pixel a function of a real tensor.

It runs entirely on your machine. The 3.8B-parameter model is loaded into your GPU's memory, the attention math is done in WGSL compute shaders, and the next-token logits are sampled in your tab. There is no server. There is no API key. Close the tab and the inference stops.
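The last step of that loop — turning the final logits into a next token, right in the tab — can be sketched in plain TypeScript. The function names below are illustrative, not Neuropulse's actual API:

```typescript
// Minimal sketch of next-token sampling from raw logits, as it could run
// in the tab after the final GPU dispatch. Illustrative only.

function softmax(logits: number[], temperature = 1.0): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);            // subtract max for numeric stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function sampleToken(
  logits: number[],
  temperature = 1.0,
  rand: () => number = Math.random
): number {
  const probs = softmax(logits, temperature);
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;                    // guard against float rounding
}
```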

model
Phi-3-mini, q4f16_1
3.8B parameters · same weights Microsoft ships

runtime
WebGPU compute shaders
13 pipelines · 22 buffers · 292 dispatches per token

privacy
Your GPU only
zero server calls · nothing leaves your machine

02 · What you're watching

Every part of the model, labeled.

The 3D scene is not a metaphor. Each glowing element corresponds to a specific tensor in Phi-3-mini's compute graph. The 3,072 points of the residual stream are laid out by a PCA of the model's own layer-0 qkv_proj weights — so dims that get read into attention together end up near each other — and on every step the brightness of each point is the live value of that residual dimension. If you hover an attention head, the brightness you see is that head's output magnitude.
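One plausible way to turn a signed activation into a brightness in [0, 1] — the exact mapping Neuropulse uses isn't stated here, so treat this as an assumed sketch — is to squash the magnitude through tanh against a running scale:

```typescript
// Assumed sketch: map a signed residual activation to a [0, 1] brightness.
// `scale` stands in for a running magnitude estimate (e.g. an EMA of
// |activation|) so the display stays calibrated as values drift per token.
function brightness(value: number, scale: number): number {
  return Math.tanh(Math.abs(value) / scale); // 0 = dark, approaches 1 as |value| grows
}
```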

fig. 1 — the anatomy of a single forward pass: attention heads · FFN slab · residual stream · KV cache · LM head → next token

03 · Validation

Cross-checked against reference Phi-3.

"Strict 1:1" is a strong claim, so it has to be falsifiable. Neuropulse ships with a built-in test suite that diffs the WebGPU implementation against a reference HuggingFace fp16 Phi-3-mini on a fixed set of prompts cached as reference.json. Click the wrench icon inside the demo to run it — the actual numbers from your GPU print to your browser console.

═══ What the suite checks ═══
GPU: q4f16_1 Phi-3-mini   Reference: HF fp16 Phi-3-mini
 
[1] Tokenizer — GPU input ids match HF byte-for-byte on every prompt
[2] Hidden states — full 3,072-dim residual diffed at layers 0, 4, 8, 12, 16, 20, 24, 28, 31
[3] Attention (layer 31) — online softmax cross-checked against an explicit-softmax reference path
[4] Logits — top-k probabilities + Jensen–Shannon divergence vs HF on a 15-prompt sweep, teacher-forced for 5 steps each
[5] Long context — 290-token prompt, 10 decode steps, top-1 matched against HF
[6] Sampler — 5,000-sample empirical distribution vs softmax, JSD < 1e-2
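Check [3] rests on an identity worth spelling out: a streaming pass that keeps a running max and a rescaled denominator must reproduce the explicit two-pass softmax exactly (up to float error). A sketch of that cross-check — illustrative, not the WGSL kernel itself:

```typescript
// Explicit two-pass softmax: find the max, then exponentiate and normalize.
function explicitSoftmax(scores: number[]): number[] {
  const m = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - m));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / z);
}

// Online softmax: one streaming pass over the scores, tracking a running
// max `m` and a denominator `z` kept relative to the current max. Whenever
// the max grows, the old partial sum is rescaled by exp(mOld - mNew).
function onlineSoftmax(scores: number[]): number[] {
  let m = -Infinity;
  let z = 0;
  for (const s of scores) {
    const mNew = Math.max(m, s);
    z = z * Math.exp(m - mNew) + Math.exp(s - mNew);
    m = mNew;
  }
  return scores.map((s) => Math.exp(s - m) / z);
}
```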

What you should expect: tiny deltas at the hidden-state level (the cost of int4 quantization, not implementation drift) and identical top-1 tokens against the fp16 reference on the test set. That last bit is the bar that matters for a faithful rendering — and it's the one you can re-run yourself, on your own machine, in under a minute.
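The Jensen–Shannon divergence used in checks [4] and [6] is just symmetrized KL against the midpoint mixture, so a minimal reference implementation is a few lines (natural log, so identical distributions give 0 and disjoint ones give ln 2):

```typescript
// KL divergence D(p || q) for discrete distributions, natural log.
// Terms with p[i] = 0 contribute nothing by convention.
function kl(p: number[], q: number[]): number {
  let d = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) d += p[i] * Math.log(p[i] / q[i]);
  }
  return d;
}

// Jensen-Shannon divergence: symmetric KL against the mixture m = (p+q)/2.
function jsd(p: number[], q: number[]): number {
  const m = p.map((pi, i) => 0.5 * (pi + q[i]));
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}
```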


04 · The Stack

How it's built.

Four pieces. No frameworks for the inference path, no dependency soup, no clever tricks hiding the model from you.

  1. WebGPU compute & WGSL 13 pipelines, 22 buffers, 292 dispatches per token. Quantization: q4f16_1. Hand-written attention and FFN kernels.
  2. MLC Phi-3-mini weights The same weights as mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC, fetched directly from HuggingFace and cached in the browser's Cache API.
  3. Three.js scene Plain WebGLRenderer. No bloom, no particles, no decorative shaders. Every pixel pulls from a real tensor on every frame.
  4. PCA layout from the model's own weights Residual points are placed by PCA of layer 0's qkv_proj.weight columns; FFN points by PCA of down_proj.weight. Dims that get read or written together end up near each other, so the geometry is shaped by the model itself, not by hand.
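The layout idea in piece 4 — project each weight column onto its top principal directions and use the coordinates as positions — can be sketched with a small power-iteration PCA. This is a simplified stand-in (real `qkv_proj` columns are 3,072-dimensional, and everything below is illustrative):

```typescript
// Sketch: place points by projecting centered weight columns onto the
// top-2 principal components, found by power iteration with deflation.
type Vec = number[];

function centerColumns(cols: Vec[]): Vec[] {
  const d = cols[0].length;
  const mean = new Array(d).fill(0);
  for (const c of cols) for (let i = 0; i < d; i++) mean[i] += c[i] / cols.length;
  return cols.map((c) => c.map((v, i) => v - mean[i]));
}

function dot(a: Vec, b: Vec): number {
  return a.reduce((s, ai, i) => s + ai * b[i], 0);
}

// Power iteration on the covariance, applied implicitly:
// v <- sum over columns of (c . v) c, then normalize.
function topComponent(cols: Vec[], iters = 200): Vec {
  const d = cols[0].length;
  let v: Vec = new Array(d).fill(0).map((_, i) => Math.sin(i + 1)); // fixed start
  for (let t = 0; t < iters; t++) {
    const next = new Array(d).fill(0);
    for (const c of cols) {
      const s = dot(c, v);
      for (let i = 0; i < d; i++) next[i] += s * c[i];
    }
    const n = Math.hypot(...next);
    v = next.map((x) => x / n);
  }
  return v;
}

function pcaLayout2d(cols: Vec[]): [number, number][] {
  const centered = centerColumns(cols);
  const pc1 = topComponent(centered);
  // Deflate: remove each column's pc1 component, then find pc2 the same way.
  const deflated = centered.map((c) => {
    const s = dot(c, pc1);
    return c.map((v, i) => v - s * pc1[i]);
  });
  const pc2 = topComponent(deflated);
  return centered.map((c) => [dot(c, pc1), dot(c, pc2)]);
}
```

Columns that point in similar directions get similar projections, which is exactly why dims that are read or written together land near each other in the scene.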

Now you

See it for yourself.

Open Neuropulse, feed it a prompt, and watch a model think. The first load downloads about 2 GB of weights into your browser cache; subsequent visits start instantly.

Launch Neuropulse