
HBM4 (High Bandwidth Memory 4) Timing Diagram & Protocol Analysis

HBM4 (High Bandwidth Memory 4)

HBM4 is the fourth generation of the JEDEC High Bandwidth Memory standard. It stacks multiple DRAM dies on a base logic die using Through-Silicon Vias (TSVs), providing an ultra-wide 2048-bit interface per stack. HBM4 targets AI/ML accelerators, HPC, data-centre GPUs, and advanced networking ASICs where memory bandwidth is the primary bottleneck. Compared to HBM3E, HBM4 doubles the independent channel count (up to 32), increases per-pin data rates beyond 9.6 Gb/s, and introduces a customisable base-die logic interface for tighter SoC integration.

📌 HBM4 Key Features

  • Ultra-Wide Bus: 2048-bit data interface per stack (64 DQ per channel × 32 channels)
  • High Per-Pin Rate: ≥ 9.6 Gb/s per pin (roadmap to 12+ Gb/s)
  • Peak Bandwidth: > 2 TB/s per stack (2048 bits @ 9.6 Gb/s ≈ 2.4 TB/s; see the sketch after this list)
  • 32 Independent Channels: Doubled from HBM3/3E's 16 channels for finer-grained parallelism
  • Stacked Architecture: 12–16 DRAM dies + 1 base logic die, bonded with TSVs/micro-bumps
  • Inline ECC: On-die error correction at 256-bit data granularity
  • Customisable Base Die: Allows SoC vendors to co-design logic on the HBM base die
  • DDR Signalling: Double-data-rate with source-synchronous DQS strobes
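
The headline bandwidth figure falls out of simple arithmetic. A minimal sketch in Python, using only the nominal width and per-pin rate from the list above:

```python
# Peak stack bandwidth = interface width (bits) x per-pin rate (Gb/s) / 8 bits-per-byte.
# Figures are the nominal values from the feature list above, not silicon measurements.

BUS_WIDTH_BITS = 2048          # 32 channels x 64 DQ per channel
PIN_RATE_GBPS = 9.6            # per-pin data rate in Gb/s (roadmap: 12+)

def stack_bandwidth_gbs(width_bits: int, pin_rate_gbps: float) -> float:
    """Peak stack bandwidth in GB/s."""
    return width_bits * pin_rate_gbps / 8

print(f"{stack_bandwidth_gbs(BUS_WIDTH_BITS, PIN_RATE_GBPS):.0f} GB/s")  # 2458 GB/s, ~2.4 TB/s
print(f"{stack_bandwidth_gbs(BUS_WIDTH_BITS, 12.0):.0f} GB/s")           # 3072 GB/s at 12 Gb/s
```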

🔄 HBM4 Timing Waveform

📖 HBM4 Read Operation Sequence

1. Activate (ACT): The controller issues an ACT command with the target bank and row address. The DRAM opens the row into the sense amplifiers (tRCD latency).
2. Read (RD): After tRCD, a column-read command is sent with the column address. CAS latency (CL) elapses before data appears.
3. Data Burst: The DRAM drives DQ and toggles RDQS (the read data strobe) source-synchronously. BL16 = 16 data beats (8 CK cycles, DDR).
4. ECC: Inline ECC bits are transmitted alongside or immediately after the data burst for on-the-fly error detection and correction.
5. Precharge: If auto-precharge is enabled (RDA), the bank is closed automatically; otherwise PRE must be issued explicitly (see the timing sketch below).
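
A minimal timing model of this sequence, using the "typical" values from the timing-parameter table later on this page. The command clock at half the per-pin rate is purely an illustrative assumption (the real CK-to-strobe ratio is implementation-dependent):

```python
# Minimal timing model of a BL16 read: ACT -> tRCD -> RD -> CL -> data burst.
# tRCD and CL come from the timing table below; the 4.8 GHz command clock
# (tCK ~ 0.208 ns, i.e. CK = per-pin rate / 2 for DDR) is assumed for illustration.

PIN_RATE_GBPS = 9.6
TCK_NS = 2 / PIN_RATE_GBPS     # ~0.208 ns per CK under the DDR assumption
TRCD_NS = 14.0                 # ACT -> RD
CL_NCK = 36                    # RD -> first data beat (mid-range of 32-40 nCK)
BL = 16                        # beats per burst

t_act = 0.0
t_rd = t_act + TRCD_NS                 # earliest legal RD after ACT
t_data0 = t_rd + CL_NCK * TCK_NS       # first DQ beat; RDQS starts toggling
t_burst = BL / 2 * TCK_NS              # BL16 = 8 CK cycles of DDR data

print(f"RD issued at       {t_rd:6.2f} ns")
print(f"First data beat at {t_data0:6.2f} ns")    # ~21.5 ns
print(f"Burst occupies     {t_burst:6.2f} ns of the data bus")
```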

⚡ HBM Generation Comparison

| Feature | HBM2 | HBM2E | HBM3 | HBM3E | HBM4 |
|---|---|---|---|---|---|
| Per-Pin Rate | 2.0 Gb/s | 3.6 Gb/s | 6.4 Gb/s | 9.6 Gb/s | 9.6–12+ Gb/s |
| Bus Width | 1024-bit | 1024-bit | 1024-bit | 1024-bit | 2048-bit |
| Channels | 8 | 8 | 16 | 16 | 32 |
| Stack BW | 256 GB/s | 460 GB/s | 819 GB/s | 1.2 TB/s | > 2 TB/s |
| Die Stack | 4–8 Hi | 8 Hi | 8–12 Hi | 8–12 Hi | 12–16 Hi |
| Capacity/Stack | 8 GB | 16 GB | 24 GB | 36 GB | 48–64 GB |
| ECC | Optional | Optional | Inline | Inline | Inline (enhanced) |
| JEDEC Std | JESD235B | JESD235C | JESD238 | JESD238A | JESD270-4 |
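
The Stack BW row is derivable from the two rows above it: BW (GB/s) = pin rate (Gb/s) × bus width (bits) / 8. A quick cross-check of the table:

```python
# Recomputing the "Stack BW" column from the per-pin rate and bus width rows.
generations = {            # name: (pin rate in Gb/s, bus width in bits)
    "HBM2":  (2.0, 1024),
    "HBM2E": (3.6, 1024),
    "HBM3":  (6.4, 1024),
    "HBM3E": (9.6, 1024),
    "HBM4":  (9.6, 2048),
}
for name, (rate, width) in generations.items():
    print(f"{name:6s} {rate * width / 8:7.1f} GB/s")
# HBM2 256.0, HBM2E 460.8, HBM3 819.2, HBM3E 1228.8, HBM4 2457.6 -> matches the table
```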

๐Ÿ—๏ธ HBM4 Physical Architecture

A. Base Logic Die: Contains PHYs, command decoders, refresh logic, the ECC engine, and potentially customisable compute logic (the HBM4 "open base die" initiative). Connected to the host SoC via a silicon interposer or direct CoWoS packaging.
B. DRAM Die Stack (12–16 Hi): Each die provides 2 channels (HBM4: 32 channels total from 16 dies; see the sketch after this list). Dies are connected vertically via TSVs and bonded with micro-bumps (~20 µm pitch).
C. Interposer / Packaging: 2.5D CoWoS (Chip-on-Wafer-on-Substrate) or similar advanced packaging connects HBM stacks to the GPU/ASIC through short silicon traces (~100 µm).
D. Thermal: HBM stacks dissipate 15–20 W each. The narrow TSV pitch and stacked architecture demand heat spreaders and advanced TIM (Thermal Interface Material).
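
A back-of-envelope sketch of the stack arithmetic in item B. The per-die density is an assumed figure, chosen only to land in the 48–64 GB range quoted in the comparison table:

```python
# Stack topology arithmetic: channels scale with die count (2 per DRAM die),
# capacity with per-die density. GB_PER_DIE is an illustrative assumption.
DIES_PER_STACK = 16      # 16-Hi configuration
CHANNELS_PER_DIE = 2
GB_PER_DIE = 3           # assumed density; 4 GB/die would give 64 GB

print(f"Channels/stack: {DIES_PER_STACK * CHANNELS_PER_DIE}")   # 32
print(f"Capacity/stack: {DIES_PER_STACK * GB_PER_DIE} GB")      # 48 GB
```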

🔌 HBM4 Signal Interface (Per Channel)

| Signal | Width | Direction | Description |
|---|---|---|---|
| DQ | 64 bits | Bidirectional | Data bus: 64 DQ pins per channel (split into two 32-bit pseudo-channels) |
| DQS / RDQS / WDQS | 8 pairs | Source-sync | Read/write data strobes, differential, DDR-aligned to DQ |
| DM / DBI | 8 bits | Input | Data mask / data bus inversion for write operations |
| CMD / CA | ~8 bits | Input | Command/address bus: row, column, bank, activate, read, write |
| CK_t / CK_c | 1 pair | Input | Differential clock: commands sampled on the CK_t rising edge |
| ECC | 8 bits | Bidirectional | Inline ECC bits per 256-bit data word |
| AERR_n | 1 bit | Output | Asynchronous error alert from DRAM to controller |
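
For pin-count bookkeeping, the table maps naturally onto a small data structure. A sketch with names and widths taken from the table (differential pairs counted as two pins; a bookkeeping aid, not a ballout):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    pins: int          # physical pins, with differential pairs already doubled
    direction: str

# Per-channel signal list from the table above.
CHANNEL_SIGNALS = [
    Signal("DQ", 64, "bidir"),
    Signal("DQS (RD/WR)", 16, "source-sync"),   # 8 differential pairs
    Signal("DM/DBI", 8, "input"),
    Signal("CMD/CA", 8, "input"),
    Signal("CK_t/CK_c", 2, "input"),            # 1 differential pair
    Signal("ECC", 8, "bidir"),
    Signal("AERR_n", 1, "output"),
]

print(sum(s.pins for s in CHANNEL_SIGNALS), "signal pins per channel")  # 107
```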

โฑ๏ธ HBM4 Key Timing Parameters

| Parameter | Symbol | Typical Value | Description |
|---|---|---|---|
| CAS Latency | tCL / CL | ~32–40 nCK | Column access to first data out |
| RAS-to-CAS Delay | tRCD | ~14 ns | Activate to read/write command |
| Row Precharge | tRP | ~14 ns | Precharge to next activate |
| Row Active Time | tRAS | ~32 ns | Minimum activate-to-precharge time |
| Refresh Interval | tREFI | ~3.9 µs | Average interval between refresh commands |
| Refresh Cycle | tRFC | ~260 ns | Refresh command to next activate |
| Write Latency | tWL / WL | ~16–20 nCK | Write command to first data in |
| Burst Length | BL | 16 / 32 | Data beats per access (BL16 = 8 CK DDR cycles) |
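
Two quantities worth deriving from these numbers: the CAS latency in wall-clock time, and the bandwidth fraction lost to refresh. The clock period here reuses the illustrative 4.8 GHz assumption from the read-sequence sketch above:

```python
# (a) CL in wall-clock ns, (b) refresh overhead: each tREFI window loses
# roughly tRFC to the refresh operation, a coarse upper bound on lost bandwidth.
TCK_NS = 2 / 9.6        # ns per CK (illustrative assumption, not a spec value)
CL_NCK = 36             # mid-range of the 32-40 nCK typical value
TREFI_NS = 3900.0
TRFC_NS = 260.0

print(f"CL ~ {CL_NCK * TCK_NS:.1f} ns")               # ~7.5 ns
print(f"Refresh overhead ~ {TRFC_NS / TREFI_NS:.1%}")  # ~6.7%
```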

🎯 HBM4 Target Applications

  • 🤖 AI / ML Training: LLM training demands > 2 TB/s per GPU (NVIDIA B200, AMD MI400)
  • 🖥️ HPC / Supercomputing: Scientific simulations requiring massive memory bandwidth
  • 🌐 Networking ASICs: 800G/1.6T switch chips with deep packet buffers and flow tables
  • ⚙️ FPGA Accelerators: High-bandwidth compute offload with HBM-attached FPGA fabrics

🆚 HBM4 vs Other Memory Technologies

| Feature | HBM4 | GDDR7 | DDR5 | LPDDR5X |
|---|---|---|---|---|
| Bus Width | 2048-bit/stack | 32-bit/chip | 64-bit/ch | 32-bit/ch |
| Per-Pin Rate | 9.6–12 Gb/s | 36–40 Gb/s | 4.8–8.4 Gb/s | 8.5 Gb/s |
| BW (typical config) | > 2 TB/s | ~1.5 TB/s | ~60 GB/s | ~68 GB/s |
| Packaging | 2.5D CoWoS (TSV) | Standard BGA | Standard DIMM | Package-on-Package |
| Power Efficiency | ~3.9 pJ/bit | ~8 pJ/bit | ~12 pJ/bit | ~6 pJ/bit |
| Capacity/Device | 48–64 GB | 2–4 GB | 8–64 GB | 8–16 GB |
| Use Case | AI GPU, HPC | Gaming GPU | Server, desktop | Mobile, laptop |
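
The pJ/bit row only becomes comparable once multiplied by bandwidth: interface power ≈ BW × 8 bits/byte × energy-per-bit. A sketch using the table's figures; these are full-streaming upper bounds, not sustained workload power:

```python
# Gb/s x pJ/bit yields mW (1e9 b/s x 1e-12 J/b = 1e-3 W), hence the /1000.
techs = {                  # name: (typical-config BW in GB/s, pJ/bit)
    "HBM4":    (2458, 3.9),
    "GDDR7":   (1500, 8.0),
    "DDR5":    (60, 12.0),
    "LPDDR5X": (68, 6.0),
}
for name, (bw_gbs, pj_bit) in techs.items():
    watts = bw_gbs * 8 * pj_bit / 1000
    print(f"{name:8s} ~{watts:5.1f} W at full bandwidth")
# HBM4 ~76.7 W, GDDR7 ~96.0 W, DDR5 ~5.8 W, LPDDR5X ~3.3 W
```

Despite the least efficient per-bit figure here, GDDR7 draws the power it does only because of its bandwidth; at equal bandwidth, HBM4's ~3.9 pJ/bit is roughly half the GDDR7 transfer energy.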