HBM4 (High Bandwidth Memory 4)
HBM4 is the fourth generation of the JEDEC High Bandwidth Memory standard. It stacks multiple DRAM dies on a base logic die using Through-Silicon Vias (TSVs) and exposes a 2048-bit interface per stack. HBM4 targets AI/ML accelerators, HPC, data-centre GPUs, and advanced networking ASICs where memory bandwidth is the primary bottleneck. Compared to HBM3E, HBM4 doubles the interface width (1024 to 2048 bits) and the independent channel count (16 to 32), and allows a customisable base-die logic interface for tighter SoC integration. The JEDEC specification (JESD270-4) defines per-pin rates up to 8 Gb/s, which gives about 2 TB/s per stack; vendor roadmaps target 9.6 Gb/s and above.
HBM4 Key Features
- Ultra-Wide Bus: 2048-bit data interface per stack (64 DQ per channel × 32 channels; each channel splits into two 32-bit pseudo-channels)
- High Per-Pin Rate: up to 8 Gb/s per pin in the JEDEC specification; vendor roadmaps target 9.6 to 12+ Gb/s
- Peak Bandwidth: about 2 TB/s per stack at 8 Gb/s (2048 bits × 8 Gb/s ÷ 8 bits per byte); about 2.46 TB/s at 9.6 Gb/s
- 32 Independent Channels: Doubled from HBM3/3E's 16 channels for finer-grained parallelism
- Stacked Architecture: 12–16 DRAM dies + 1 base logic die, bonded with TSVs/micro-bumps
- Inline ECC: On-die error correction per 256-bit data granularity
- Customisable Base Die: Allows SoC vendors to co-design logic on the HBM base die
- DDR Signalling: Double-data-rate with source-synchronous DQS strobes
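The bus-width and bandwidth figures in the list above follow from simple arithmetic. As a worked check, taking 64 DQ per channel as in the signal interface table, and writing $W$ for the interface width in bits and $R$ for the per-pin rate in Gb/s (symbols introduced here purely for illustration):

$$
W = N_{\text{ch}} \times W_{\text{ch}} = 32 \times 64 = 2048 \ \text{bits}
$$

$$
BW_{\text{stack}} = \frac{W \times R}{8} = \frac{2048 \times 9.6}{8} \ \text{GB/s} = 2457.6 \ \text{GB/s} \approx 2.46 \ \text{TB/s}
$$

$$
BW_{\text{ch}} = \frac{W_{\text{ch}} \times R}{8} = \frac{64 \times 9.6}{8} \ \text{GB/s} = 76.8 \ \text{GB/s}
$$

At $R = 8$ Gb/s the same formula gives $2048$ GB/s per stack, which is where a figure of roughly 2 TB/s comes from. These are peak numbers that ignore refresh, bank conflicts, and bus turnaround.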
[Figure: HBM4 timing waveform]
[Figure: HBM4 read operation sequence]
HBM Generation Comparison
| Feature | HBM2 | HBM2E | HBM3 | HBM3E | HBM4 |
|---|---|---|---|---|---|
| Per-Pin Rate | 2.0 Gb/s | 3.6 Gb/s | 6.4 Gb/s | 9.6 Gb/s | 8 Gb/s (JEDEC); 9.6–12+ Gb/s (vendor targets) |
| Bus Width | 1024-bit | 1024-bit | 1024-bit | 1024-bit | 2048-bit |
| Channels | 8 | 8 | 16 | 16 | 32 |
| Stack BW | 256 GB/s | 460 GB/s | 819 GB/s | 1.2 TB/s | > 2 TB/s |
| Die Stack | 4–8 Hi | 8 Hi | 8–12 Hi | 8–12 Hi | 12–16 Hi |
| Capacity/Stack | 8 GB | 16 GB | 24 GB | 36 GB | 48–64 GB |
| ECC | Optional | Optional | Inline | Inline | Inline (enhanced) |
| JEDEC Std | JESD235B | JESD235C | JESD238 | JESD238A | JESD270-4 |
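The Stack BW row can be recomputed from the Per-Pin Rate and Bus Width rows. The sketch below does that as a sanity check; the function and dictionary names are illustrative, and the HBM4 entry assumes the 9.6 Gb/s figure rather than any higher roadmap rate.

```python
# Recompute peak per-stack bandwidth from per-pin rate and bus width.
# Rates and widths are copied from the comparison table; the HBM4 entry
# assumes 9.6 Gb/s (an assumption for this sketch, not a measured value).

GENERATIONS = {
    # name: (per-pin rate in Gb/s, bus width in bits)
    "HBM2": (2.0, 1024),
    "HBM2E": (3.6, 1024),
    "HBM3": (6.4, 1024),
    "HBM3E": (9.6, 1024),
    "HBM4": (9.6, 2048),
}


def stack_bandwidth_gbs(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: pins times Gb/s per pin, divided by 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8


for name, (rate, width) in GENERATIONS.items():
    print(f"{name:6s} {stack_bandwidth_gbs(rate, width):8.1f} GB/s")
```

The results (256.0, 460.8, 819.2, 1228.8 and 2457.6 GB/s) line up with the rounded values in the table. All of them are theoretical peaks.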
[Figure: HBM4 physical architecture]
HBM4 Signal Interface (Per Channel)
| Signal | Width | Direction | Description |
|---|---|---|---|
| DQ | 64 bits | Bidirectional | Data bus: 64 DQ pins per channel (two 32-bit pseudo-channels in pseudo-channel mode) |
| DQS / RDQS / WDQS | 8 pairs | Source-sync | Read/write data strobes, differential, DDR-aligned to DQ |
| DM / DBI | 8 bits | Input | Data mask / data bus inversion for write operations |
| CMD / CA | ~8 bits | Input | Command/address bus: row, column, bank, activate, read, write |
| CK_t / CK_c | 1 pair | Input | Differential clock; commands sampled on CK_t rising edge |
| ECC | 8 bits | Bidirectional | Inline ECC bits per 256-bit data word |
| AERR_n | 1 bit | Output | Asynchronous error alert from DRAM to controller |
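To get a feel for why HBM needs an interposer rather than package pins, the widths in the table above can be tallied into a rough signal count. This is only a sketch: it treats the "~8 bits" CMD / CA entry as exactly 8, counts each differential pair as two wires, and leaves out power, ground, reset and test signals. The `Signal` class and constant names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    wires: int  # physical wires; a differential pair counts as 2


# Widths taken from the per-channel signal table.
CHANNEL_SIGNALS = [
    Signal("DQ", 64),
    Signal("DQS/RDQS/WDQS", 8 * 2),  # 8 differential pairs
    Signal("DM/DBI", 8),
    Signal("CMD/CA", 8),             # table says "~8 bits"; assumed exactly 8
    Signal("CK_t/CK_c", 2),          # 1 differential pair
    Signal("ECC", 8),
    Signal("AERR_n", 1),
]

CHANNELS_PER_STACK = 32


def wires_per_channel(signals=CHANNEL_SIGNALS) -> int:
    """Sum the signal wires listed for one channel."""
    return sum(s.wires for s in signals)


print(wires_per_channel())                       # 107
print(wires_per_channel() * CHANNELS_PER_STACK)  # 3424
```

Even this simplified count lands at several thousand signal connections per stack, which is consistent with the 2.5D packaging listed for HBM4 elsewhere on this page.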
HBM4 Key Timing Parameters
| Parameter | Symbol | Typical Value | Description |
|---|---|---|---|
| CAS Latency | tCL / CL | ~32–40 nCK | Column-access to first data out |
| RAS-to-CAS Delay | tRCD | ~14 ns | Activate to read/write command |
| Row Precharge | tRP | ~14 ns | Precharge to next activate |
| Row Active Time | tRAS | ~32 ns | Minimum activate to precharge |
| Refresh Period | tREFI | ~3.9 µs | Average interval between refresh commands |
| Refresh Cycle | tRFC | ~260 ns | Refresh command to next activate |
| Write Latency | tWL / WL | ~16–20 nCK | Write command to first data in |
| Burst Length | BL | 16 / 32 | Data beats per access (BL16 = 8 CK DDR cycles) |
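The parameters above combine into two quantities a controller designer usually cares about: read latency for each row-buffer state, and the share of time lost to refresh. The sketch below uses the typical values from the table. The command clock period is not given in the table, so `t_ck_ns=0.5` (a 2 GHz clock) is a purely hypothetical choice, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    t_ck_ns: float             # command clock period: ASSUMED, not in the table
    cl_nck: int = 36           # CAS latency, midpoint of the ~32-40 nCK range
    t_rcd_ns: float = 14.0     # activate to read/write
    t_rp_ns: float = 14.0      # precharge to next activate
    t_refi_ns: float = 3900.0  # ~3.9 us average refresh interval
    t_rfc_ns: float = 260.0    # refresh cycle time


def read_latency_ns(t: Timing, row_state: str) -> float:
    """First command to first data for one read.

    row_state: "hit" (target row open), "closed" (bank precharged),
    or "conflict" (a different row is open and must be precharged first).
    """
    cas_ns = t.cl_nck * t.t_ck_ns
    extra_ns = {
        "hit": 0.0,
        "closed": t.t_rcd_ns,
        "conflict": t.t_rp_ns + t.t_rcd_ns,
    }
    if row_state not in extra_ns:
        raise ValueError(f"unknown row state: {row_state!r}")
    return extra_ns[row_state] + cas_ns


def refresh_overhead(t: Timing) -> float:
    """Fraction of time spent in refresh, assuming one tRFC per tREFI."""
    return t.t_rfc_ns / t.t_refi_ns


timing = Timing(t_ck_ns=0.5)  # hypothetical 2 GHz command clock
for state in ("hit", "closed", "conflict"):
    print(f"{state:8s} {read_latency_ns(timing, state):5.1f} ns")
print(f"refresh overhead: {refresh_overhead(timing):.1%}")
```

With the assumed clock, this gives 18 ns for a row hit, 32 ns for a closed row and 46 ns for a row conflict, with about 6.7% of time spent in refresh. The model ignores tRAS, command-bus contention and per-bank refresh, so real figures will differ.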
HBM4 Target Applications
- LLM training demands > 2 TB/s per GPU (e.g. NVIDIA Rubin, AMD MI400)
- Scientific simulations requiring massive memory bandwidth
- 800G/1.6T switch chips with deep packet buffers and flow tables
- High-bandwidth compute offload with HBM-attached FPGA fabrics
HBM4 vs Other Memory Technologies
| Feature | HBM4 | GDDR7 | DDR5 | LPDDR5X |
|---|---|---|---|---|
| Bus Width | 2048-bit/stack | 32-bit/chip | 64-bit/ch | 32-bit/ch |
| Per-Pin Rate | 9.6–12 Gb/s | 36–40 Gb/s | 4.8–8.4 Gb/s | 8.5 Gb/s |
| BW (typical config) | > 2 TB/s | ~1.5 TB/s | ~60 GB/s | ~68 GB/s |
| Packaging | 2.5D CoWoS (TSV) | Standard BGA | Standard DIMM | Package-on-Package |
| Power Efficiency | ~3.9 pJ/bit | ~8 pJ/bit | ~12 pJ/bit | ~6 pJ/bit |
| Capacity/Device | 48–64 GB | 2–4 GB | 8–64 GB | 8–16 GB |
| Use Case | AI GPU, HPC | Gaming GPU | Server, desktop | Mobile, laptop |
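The Power Efficiency and BW rows can be combined into an approximate interface power at full bandwidth, which makes the pJ/bit figures easier to compare. The sketch below multiplies energy per bit by bits per second; the names are illustrative, the "> 2 TB/s" HBM4 entry is rounded down to 2000 GB/s, and the table values themselves are approximate.

```python
# (name, energy in pJ/bit, bandwidth in GB/s) taken from the table above.
# "> 2 TB/s" is rounded down to 2000 GB/s for the HBM4 row (an assumption).
MEMORIES = [
    ("HBM4", 3.9, 2000.0),
    ("GDDR7", 8.0, 1500.0),
    ("DDR5", 12.0, 60.0),
    ("LPDDR5X", 6.0, 68.0),
]


def interface_power_w(pj_per_bit: float, bandwidth_gbs: float) -> float:
    """Power in watts = energy per bit (J) times bits transferred per second."""
    bits_per_second = bandwidth_gbs * 1e9 * 8
    return pj_per_bit * 1e-12 * bits_per_second


for name, pj, bw in MEMORIES:
    print(f"{name:8s} {interface_power_w(pj, bw):6.1f} W at {bw:.0f} GB/s")
```

Under these assumptions HBM4 moves 2 TB/s for roughly 62 W, while the GDDR7 configuration spends about 96 W on 1.5 TB/s. Treat the results as order-of-magnitude estimates.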