Enfabrica: Improving AI GPU Interaction with Accelerated Compute Fabric (ACF)

Massive datasets fuel AI models, driving training and enabling accurate predictions. This relentless demand for data strains traditional compute and networking architectures, prompting the need for innovative solutions.

Current AI Networking Architecture

AI networking today relies on a hierarchical structure of interconnected components:

  • GPUs for parallel data processing.
  • PCIe Switches that link multiple GPUs within a server.
  • RDMA NICs (Remote Direct Memory Access Network Interface Cards) that move data directly between GPU memory on different servers, reducing CPU involvement and speeding up transfers.
  • Network Switches that form the leaf-spine network backbone, connecting servers and carrying data center traffic.

Although functional, this approach has significant limitations that impede the scalability and efficiency of AI workloads:

  • Inter-GPU Communication Bottlenecks: As GPU counts grow, cross-server traffic must traverse multiple switch tiers, adding latency and reducing throughput (see the back-of-the-envelope sketch after this list).
  • Limited Bandwidth and Resilience: Current architectures struggle to meet AI workloads’ growing bandwidth demands, and single points of failure can disrupt training jobs, leading to costly restarts.
  • Lack of Composability: Traditional architectures’ rigidity limits support for diverse AI applications requiring different compute and memory resources, stifling innovation.
  • Escalating Total Cost of Ownership (TCO): Scaling AI infrastructure with traditional components increases TCO due to hardware costs, power consumption, and cooling needs.
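
To make the bottleneck point concrete, here is a rough, illustrative Python model of the cross-server GPU-to-GPU path. The per-hop latency figures are placeholder assumptions chosen purely for illustration, not measured or vendor numbers; the point is the hop count, not the absolute values.

```python
# Rough, illustrative model of a cross-server GPU-to-GPU path in a
# classic hierarchy vs. a converged fabric. Per-hop latencies are
# placeholder assumptions, not measurements or vendor figures.

HOP_LATENCY_US = {
    "pcie_switch": 0.3,   # assumed PCIe switch traversal
    "nic": 1.0,           # assumed NIC processing
    "leaf_switch": 0.5,   # assumed Ethernet leaf switch
    "spine_switch": 0.5,  # assumed Ethernet spine switch
    "acf": 1.0,           # assumed single converged-device traversal
}

# GPU -> PCIe switch -> NIC -> leaf -> spine -> leaf -> NIC -> PCIe switch -> GPU
hierarchical_path = [
    "pcie_switch", "nic", "leaf_switch", "spine_switch",
    "leaf_switch", "nic", "pcie_switch",
]

# A converged device collapses PCIe switching, RDMA NIC, and first-tier
# switching into one traversal on each side of the spine.
converged_path = ["acf", "spine_switch", "acf"]

def path_latency_us(path):
    return sum(HOP_LATENCY_US[hop] for hop in path)

for name, path in [("hierarchical", hierarchical_path),
                   ("converged", converged_path)]:
    print(f"{name:12s}: {len(path)} hops, ~{path_latency_us(path):.1f} us")
```

Even with generous assumptions, the hierarchical path crosses roughly twice as many devices, and every extra tier is another queue where congestion can build.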

Enfabrica’s Solution: The Accelerated Compute Fabric

Enfabrica’s Accelerated Compute Fabric (ACF) technology marks a significant departure from this conventional approach. ACF introduces the MegaNIC concept, merging PCIe switching, RDMA, and first-tier network switching into a single, high-bandwidth, resilient device.

ACF integrates multiple high-speed Ethernet NICs, interconnected by internal crossbar switches, into a single high-bandwidth, non-blocking fabric. The design separates packet-header processing from payload transfer: NIC logic handles headers and forwarding decisions, while payloads move directly between endpoints via DMA, minimizing latency and keeping data movement efficient for AI workloads.
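
The header/payload split can be pictured with a small conceptual model in Python. This is only a sketch of the idea, not Enfabrica’s data path or API: the forwarding logic inspects just the header, while the payload is handed off by reference, standing in for a zero-copy DMA transfer between endpoints.

```python
from dataclasses import dataclass

# Conceptual model of header/payload separation, not Enfabrica's actual
# data path: headers flow through forwarding logic, payloads are handed
# off by reference (standing in for a zero-copy DMA transfer).

@dataclass
class Packet:
    dst: str            # destination endpoint ID, read from the header
    payload: bytearray  # bulk data; never copied by the forwarding path

class Fabric:
    def __init__(self):
        self.endpoints = {}   # endpoint ID -> receive buffer list

    def attach(self, name):
        self.endpoints[name] = []

    def forward(self, pkt):
        # Control path: inspect only the header to pick the destination.
        dst_buffers = self.endpoints[pkt.dst]
        # Data path: hand the payload over by reference, the way a DMA
        # engine moves bytes without the CPU touching them.
        dst_buffers.append(pkt.payload)

fabric = Fabric()
fabric.attach("gpu0")
fabric.attach("gpu1")
fabric.forward(Packet(dst="gpu1", payload=bytearray(b"tensor shard")))
print(fabric.endpoints["gpu1"])  # [bytearray(b'tensor shard')]
```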

ACF’s architecture includes:

  • Converged PCIe and Ethernet Crossbar: Combining PCIe switching and Ethernet networking in a single crossbar gives GPUs a low-latency data path to one another and to the rest of the network.
  • Massive Bandwidth and Path Diversity: ACF offers up to 3.2 terabits per second on the network side and 5 terabits per second on the host/accelerator side, delivering high throughput along with multiple paths that mitigate the impact of component failures.
  • Programmable Transport and Congestion Control: ACF runs its transport layer on a standard CPU, so congestion-control mechanisms can be customized to specific workloads (a toy example follows this list).
  • Composability and Heterogeneity: ACF supports diverse compute and memory resources, including GPUs, CPUs, storage, and CXL-attached memory, enabling systems tailored to specific AI applications.
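
As a flavor of what a programmable congestion-control loop can look like, here is a minimal AIMD (additive-increase, multiplicative-decrease) sketch of the textbook kind such a transport could tune or swap out. The constants and the ack/loss trace are arbitrary, and nothing here is Enfabrica’s implementation.

```python
# Minimal AIMD (additive-increase, multiplicative-decrease) congestion
# window, the classic control loop behind TCP-style transports. The
# constants and the ack/loss trace below are arbitrary illustrations.

def aimd(events, window=10.0, increase=1.0, decrease=0.5, floor=1.0):
    """Yield the congestion window after each ack ('a') or loss ('l')."""
    for event in events:
        if event == "a":
            window += increase                      # probe for bandwidth
        else:
            window = max(floor, window * decrease)  # back off on loss
        yield window

# Ten clean acks, one loss event, then recovery.
trace = "aaaaaaaaaal" + "aaaa"
for step, w in enumerate(aimd(trace), 1):
    print(f"step {step:2d}: cwnd = {w:.1f}")
```

Swapping in a different policy, say a delay-based backoff, would be a matter of replacing this loop; that flexibility is what the programmable-transport claim refers to.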

At AI Field Day 5, Enfabrica’s CEO, Rochan Sankar, noted, “The role of a PCI networking card has no relevance in AI going forward,” because each GPU connects directly to all Ethernet interfaces on the chip, expanding its throughput to the fabric’s full 3.2 Tbps.
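
A quick back-of-the-envelope check on those numbers; the 400 GbE port width and the eight-GPU server are assumptions made here purely for illustration:

```python
# Back-of-the-envelope throughput math for the figures quoted above.
# The 400 GbE port width and the 8-GPU server size are illustrative
# assumptions, not Enfabrica specifications.

FABRIC_TBPS = 3.2       # network-side bandwidth quoted for ACF
PORT_GBPS = 400         # assumed Ethernet port speed
GPUS_PER_SERVER = 8     # assumed accelerators behind one device

ports = FABRIC_TBPS * 1000 / PORT_GBPS
fair_share_gbps = FABRIC_TBPS * 1000 / GPUS_PER_SERVER

print(f"{FABRIC_TBPS} Tbps = {ports:.0f} x {PORT_GBPS} GbE ports")
print(f"fair share across {GPUS_PER_SERVER} GPUs: {fair_share_gbps:.0f} Gbps each,")
print(f"but any single GPU can burst across all ports, up to {FABRIC_TBPS} Tbps,")
print("instead of being pinned to one dedicated NIC")
```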

Potential Disadvantages of Enfabrica’s Solution

While compelling, Enfabrica’s solution has potential drawbacks:

  • Hardware Dependency: ACF requires server design modifications, making it incompatible with current off-the-shelf systems and potentially slowing adoption for organizations with existing infrastructure investments.
  • Single Point of Failure: Despite its multipath architecture, the ACF device itself is a single point of failure: if it fails, every GPU behind it loses connectivity, although the design minimizes the likelihood of this happening.
  • Limited Compatibility: Enfabrica prioritizes compatibility with InfiniBand verbs and RoCE over the emerging Ultra Ethernet standard, which addresses immediate scalability needs but leaves support for future transport standards an open question.

Why This Matters

AI workloads, particularly large language models, require enormous data movement, processing, and storage. High-bandwidth, low-latency architectures are crucial to avoid performance bottlenecks.

Enfabrica, focused on revolutionizing network infrastructure for AI, proposes a shift in approach. Instead of treating networking as peripheral, Enfabrica positions it at the heart of AI computing, recognizing its critical role in performance and scalability.

Enfabrica’s core value proposition addresses key AI networking challenges:

  • Reduced TCO: Consolidating the PCIe switch, RDMA NICs, and first-tier network switch into a single device cuts hardware count, power draw, and cooling needs, lowering the cost of AI infrastructure.
  • Improved Performance: High bandwidth, low latency, and multipath data movement keep GPUs fed with data, speeding up both training and inference.
  • Improved Resilience: Path diversity and failure recovery reduce downtime in large-scale AI deployments, where a single failed component can otherwise force a costly job restart.
  • Future-Proofing AI Infrastructure: Support for heterogeneous resources and a programmable transport layer let the fabric adapt to evolving AI workloads and emerging technologies.

Enfabrica’s ACF represents a significant advancement in AI networking, enabling increasingly complex and demanding AI applications. As AI evolves, solutions like Enfabrica’s will play a crucial role in unlocking AI’s full potential and shaping computing’s future. Contact us via live chat or email sales@dataplugs.com to learn more about our GPU Dedicated Server Plans.