
Disaggregated GPU Memory Pools
Bill Koss - CEO and President of Corespan Systems
AI data centers are leaving enormous amounts of GPU memory stranded behind server boundaries, and a disaggregated vRAM pool is how Corespan turns that stranded memory into usable capacity at the system level.
Why Scaling Past a Single 8-GPU Server Breaks
The industry default scaling pattern is to start with a single server containing 8 tightly coupled GPUs and then add more 8-GPU boxes as demand grows. Once you step from that first node to 2 or 3 hosts with 8 GPUs each, most jobs are suddenly forced to traverse the network for cross-GPU communication, which dramatically reduces effective performance compared to the original single-node configuration. Even with high-speed Ethernet or InfiniBand, latency and limited bisection bandwidth turn many synchronizations and collective operations into bottlenecks, so aggregate FLOPS rise while step-time and throughput barely improve. The result is an expensive illusion of scale where you own 2–3× the accelerators but see only marginal gains on real-world training and inference workloads.
Because traditional architectures bind GPUs and memory to specific hosts, each new 8‑GPU box becomes its own island; workloads that span nodes now pay both the network penalty and suffer from fragmented vRAM. This is exactly where the CAPEX paradox, static silos, ghost capacity, and forced over‑provisioning show up most clearly in production clusters.
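To make the network tax concrete, the back-of-envelope calculation below compares one gradient synchronization inside an 8-GPU node against one that spans two nodes. It is a toy model: the parameter count, link bandwidths, and ring all-reduce cost formula are illustrative assumptions, not measurements from any specific cluster.

```python
# Back-of-envelope estimate of one ring all-reduce (gradient sync),
# ignoring latency and compute/communication overlap. All figures are
# illustrative assumptions, not measurements.

def ring_allreduce_seconds(grad_bytes: float, n_gpus: int, link_bytes_per_s: float) -> float:
    """Classic ring all-reduce moves roughly 2*(N-1)/N of the payload per GPU."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bytes_per_s

grad_bytes = 14e9  # e.g. ~7B parameters in fp16 (assumed)

# Inside one 8-GPU node: NVLink-class links, assumed ~900 GB/s per GPU
intra_node = ring_allreduce_seconds(grad_bytes, 8, 900e9)

# Across two nodes: the ring now crosses a 400G NIC, roughly 50 GB/s per host
cross_node = ring_allreduce_seconds(grad_bytes, 16, 50e9)

print(f"8 GPUs, one node  : {intra_node * 1e3:6.1f} ms per sync")   # ~27 ms
print(f"16 GPUs, two nodes: {cross_node * 1e3:6.1f} ms per sync")   # ~525 ms
```

In this toy model, doubling the accelerator count makes each synchronization roughly twenty times slower once it has to cross the network, which is exactly the expensive illusion of scale described above.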
Why GPU Memory Is Stranded Today
Most AI infrastructure still treats GPUs as fixed accessories of a single host, which creates four structural problems.
- CAPEX Paradox: Enterprises spend millions on accelerators, yet can’t fluidly point those GPUs at the workloads that need them most, depressing realized ROI.
- Static Silos: Server-centric designs pin GPUs and their local memory to specific hosts, so a model on the “wrong” box cannot reach adjacent idle accelerators.
- Ghost Capacity: GPUs and their vRAM sit idle while nearby jobs starve, not due to a GPU shortage but due to rigid composition boundaries.
- Forced Over‑Provisioning: Because you can’t share capacity, you overbuy larger GPU SKUs and extra nodes to cover peaks, inflating TCO while average utilization stays low.
As AI spend shifts from training to inference and cost-per-token becomes the defining metric, these inefficiencies become untenable.
What A Disaggregated vRAM Pool Looks Like
Corespan’s DynamicXcelerator platform builds a unified GPU memory pool by decoupling accelerators from individual servers and presenting them as interconnect-attached resources.
- Unified vRAM Pool: Memory from 8 to 32 GPUs can be aggregated into a single logical vRAM pool, eliminating host DRAM spillover and reducing PCIe paging overhead by up to 80%.
- PCIe-Native Disaggregation: A Fabric Interface Card (FIC 2500) provides PCIe Gen 5 lanes over photonics into a Photonic Resource Unit (PRU 2500) that hosts GPUs and other xPUs.
- Photonic Fabric Bandwidth: Each FIC 2500 offers 32×100G fabric-side bandwidth via co-packaged optics, feeding into PRU 2500 chassis with up to twelve PCIe 5.0 slots for GPUs, RDMA NICs, FPGAs, and NVMe.
- Scalable Host Attach: The FIC 2500 can connect at x16 or x8, enabling shared access to pooled accelerators without vendor lock-in on either the hosts or the GPUs.
In a specific example, 32 RTX 6000 Blackwell GPUs with 96 GB of GDDR7 each can be composed into a 3.0 TB memory pool behind a single host, and multiple such hosts can tap into the larger fabric-wide vRAM pool as needed.
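As a quick sanity check on that example, the arithmetic below restates the pool sizing and contrasts it with the same GPUs locked inside isolated 8-GPU hosts. The helper function is plain arithmetic for illustration, not part of any Corespan API.

```python
# Sanity check on the sizing in the example above.

def pool_capacity_tb(gpu_count: int, vram_gb_per_gpu: int) -> float:
    """Logical pool size in TB for a given GPU count and per-GPU vRAM in GB."""
    return gpu_count * vram_gb_per_gpu / 1024

# 32 RTX 6000 Blackwell-class GPUs, 96 GB GDDR7 each -> 3.0 TB pool
print(f"{pool_capacity_tb(32, 96):.1f} TB logical vRAM pool")

# The same 32 GPUs split across four isolated 8-GPU hosts leave each host
# with only 768 GB of reachable vRAM, so anything larger spills or fails.
print(f"{8 * 96} GB reachable per isolated 8-GPU host")
```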
Direct Device DMA, Not Network Bottlenecks
At the heart of the architecture is direct device DMA across the fabric, which keeps GPUs busy instead of waiting on slow, CPU-mediated or network-bound data paths.
- Direct Device DMA: GPUs, xPUs, and FPGAs can perform peer-to-peer transfers without host CPU involvement, eliminating bounce buffers and slashing data-movement latency (see the sketch at the end of this section).
- PCIe Fabric Scale: Within a node, the PCIe fabric scales to 32 GPUs, reducing the dependency on large RDMA clusters and keeping many training and inference jobs local.
- Beyond 8 GPUs Without the Network Tax: Instead of replicating 8‑GPU islands and forcing cross-node traffic through the LAN, Corespan extends PCIe semantics over photonics so additional GPUs appear as part of the same fabric, not as remote boxes.
- Dedicated DMA Pools: Corespan dedicates DMA pools to streaming I/O (data ingest, shuffle, and checkpoints) so inference vRAM stays hot while background transfers run uninterrupted.
This allows operators to scale from 8 GPUs to 24 or 32 GPUs' worth of vRAM and compute capacity while preserving the low-latency characteristics of the PCIe interconnect, rather than falling off the performance cliff associated with traditional multi-host scaling.
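For readers who want to see the direct-DMA idea in code, the sketch below uses standard CUDA peer access through CuPy to move a buffer GPU-to-GPU without a host bounce buffer. This is not Corespan-specific code: it assumes two peer-capable GPUs visible to the local CUDA runtime, and it simply illustrates the data path that the photonic PCIe fabric extends to fabric-attached devices.

```python
# Minimal sketch of GPU-to-GPU DMA using standard CUDA peer access via CuPy.
import cupy as cp

SRC, DST = 0, 1
nbytes = 256 * 1024 * 1024  # 256 MB test buffer

with cp.cuda.Device(SRC):
    src = cp.ones(nbytes, dtype=cp.uint8)       # payload lives on GPU 0

with cp.cuda.Device(DST):
    dst = cp.empty(nbytes, dtype=cp.uint8)      # destination on GPU 1
    # Grant GPU 1 direct access to GPU 0's memory when the topology allows it
    if cp.cuda.runtime.deviceCanAccessPeer(DST, SRC):
        cp.cuda.runtime.deviceEnablePeerAccess(SRC)

# Device-to-device copy: with peer access enabled this moves directly over
# the PCIe/NVLink path, never staging through a host bounce buffer.
cp.cuda.runtime.memcpyPeer(dst.data.ptr, DST, src.data.ptr, SRC, nbytes)
cp.cuda.Device(DST).synchronize()
```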
System-Level Benefits For AI Operators
Pooling disaggregated GPU memory is not just a clever topology trick; it changes the economics and operational model of the data center.
- Zero Stranded Capacity: Memory is allocated from a shared pool, eliminating idle vRAM on underutilized GPUs and reducing SKU over-provisioning costs by 30–40%.
- Higher Cluster Utilization: Workloads can be dynamically composed with the exact mix of GPUs, RDMA, and storage they require, driving significantly higher realized utilization of existing fleets.
- Simplified Programming: A unified address space is exposed to frameworks, eliminating manual per-GPU sharding and enabling global caching, prefetch, and checkpoint policies across the pool (see the sketch at the end of this section).
- Host Simplification: GPU-heavy capacity moves into PRU 2500 chassis, allowing operators to standardize, simplify, and refresh hosts independently of accelerators.
- Flexible Resource Mix: PRU 2500 can host GPUs, RDMA NICs, FPGAs, xPUs, and NVMe scratchpads with TBs of storage, so operators can dial in the exact resource composition per service.
Instead of accepting that performance must fall off as you scale past a single 8‑GPU server, Corespan lets AI operators extend the benefits of tight, local coupling across a photonic PCIe fabric and a disaggregated vRAM pool.
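To see why a unified address space simplifies programming, contrast the bookkeeping below. The vram_pool.allocate call shown in the comment is a hypothetical placeholder for whatever pooled allocation interface a framework would expose; it is not a documented Corespan API.

```python
# Hypothetical contrast between manual per-GPU sharding and a single
# pooled allocation against a unified address space.

SHARD_GB = 96          # per-GPU vRAM in the earlier example
WEIGHTS_GB = 1536      # a model that cannot fit on any single 8-GPU host

# Today: application code (or the framework) must plan explicit placements
# and remember which shard lives on which device.
manual_plan = [(f"cuda:{i}", SHARD_GB) for i in range(WEIGHTS_GB // SHARD_GB)]
print(f"manual plan: {len(manual_plan)} explicit device placements")

# With a unified address space the same request collapses to one
# policy-driven allocation (hypothetical sketch, not a real API):
#
#   buf = vram_pool.allocate(size_gb=WEIGHTS_GB, policy="bandwidth-optimal")
#
# Global caching, prefetch, and checkpoint policies then operate on `buf`
# with no per-GPU bookkeeping in application code.
```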
How Corespan Composer Makes It Consumable
Disaggregation only pays off if it is easy to consume, which is why Corespan wraps the photonic fabric and DMA pools in a software control plane called the Corespan Composer.
- Unified Control Plane: Corespan Composer manages system composition and fabric control, attaching and detaching disaggregated physical resources to hosts as workloads change.
- Hardware-Agnostic Orchestration: Composer abstracts silicon diversity across GPUs, ASICs, CPUs, and tiered memory, letting enterprises optimize for cost, latency, or power without infrastructure lock-in.
- API-Driven Composition: Shared pools of compute, acceleration, storage, and interconnect can be programmatically assigned to services, letting platforms automatically right-size GPU and vRAM allocations per model (see the sketch below).
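As a hedged illustration of what API-driven composition could look like, the sketch below posts a hypothetical composition request to a Composer-style control plane. The URL, endpoint, payload fields, and policy names are invented for the example; the text above only establishes that composition is programmable, not the shape of the API.

```python
# Hedged sketch of programmatic composition against a Composer-style
# control plane. Endpoint, fields, and policies are hypothetical.
import requests

COMPOSER = "https://composer.example.internal/api/v1"   # placeholder URL

inference_service = {
    "host": "host-17",
    "resources": {
        "gpus": {"model": "rtx6000-blackwell", "count": 12},   # ~1.1 TB of pooled vRAM
        "rdma_nics": 2,
        "nvme_scratch_tb": 8,
    },
    "policy": "release-when-idle",   # hand capacity back to the pool off-peak
}

resp = requests.post(f"{COMPOSER}/compositions", json=inference_service, timeout=30)
resp.raise_for_status()
print("composition id:", resp.json().get("id"))

# A scheduler could later update the same composition to grow or shrink the
# GPU count as traffic shifts, instead of re-racking or over-provisioning.
```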
As AI infrastructure enters a heterogeneous, inference-heavy era, a disaggregated GPU memory pool built on photonic PCIe fabric and direct DMA gives operators a way to scale beyond the single 8‑GPU server without surrendering performance to the interconnect.