AI Computing Architecture Evolution: Scale-Up vs. Scale-Out and the Choice of Optical Modules

Date 04/02/2026

This article explains what Scale-Up and Scale-Out are, how their goals differ, and how to choose optical modules for each.

As large-scale models reach trillions of parameters and the size of AI clusters approaches hundreds of thousands of GPUs, the underlying computing infrastructure is undergoing a significant architectural shift. The two mainstream network models for AI computing—Scale-Up and Scale-Out—are advancing along different technical paths, but in practice, they are gradually converging at key points.

This change is essentially the result of repeated trade-offs between "low latency and high scalability" and "high reliability and cost control." At the same time, the continuous evolution of optical module technologies (such as LPO, NPO, and CPO) is increasingly shaping system design.

Scale-Up vs. Scale-Out: Differentiation and Trade-offs Between Two Paths

In the design of AI computing clusters, Scale-Up and Scale-Out have different goals from the outset, and their evaluation criteria are not aligned along the same dimension. They can be understood as two divergent approaches.

Scale-Up: Tightly Coupled Architecture Emphasizing Low Latency and Strong Consistency

Scale-Up is closer to extending the computing power of a single node; in essence, it is "horizontal integration" within the existing GPU architecture. By building tightly coupled computing units, it abstracts the local memory of multiple GPUs into a unified logical address space, so that a compute core accessing remote memory sees behavior close to that of local HBM. This near-memory-semantics approach aims not only to expand scale but also to minimize the distance between computation and data, bringing latency down to the sub-microsecond range while maintaining high consistency and stability.

Scale-Out: Distributed Computing Model Oriented Towards Scalability

In contrast, Scale-Out pieces together dispersed computing power by organizing a large number of nodes into a cohesive whole over a network. Typical implementations are based on a two- or three-layer Clos architecture that connects tens of thousands of GPUs or more into the same training cluster to support data parallelism, pipeline parallelism, and other training modes. This approach prioritizes scalability and cost control: the system stays available and can grow with demand while remaining compatible with existing hardware and software.

In general, Scale-Up improves "individual capability" to achieve top performance and low latency, while Scale-Out targets "scale efficiency" so that very large clusters run smoothly. The former focuses on the core units of high-performance computing; the latter forms the foundational network for large-scale training and inference. The two modes often coexist and complement each other in real-world AI infrastructure.

Current Evolution: Scale-Up Architecture Changes and Scale-Out Simplification Trends

In recent years, Scale-Up and Scale-Out have been evolving rapidly, and leading manufacturers have figured out how to implement them in practice. The goal is to boost performance while simplifying complexity.

Scale-Up: From Chassis-Based to Multi-Form Coexistence, 224G Interconnect Becomes a Key Node

Early Scale-Up solutions mostly took the form of "chassis SuperNodes," which consolidated computing resources into a single system connected by high-speed links. These solutions emphasized stability and consistency and typically combined a cable backplane with L1/L2 switching: interconnects inside the chassis were primarily electrical to ensure high bandwidth density, while expansion across chassis relied on optical interconnects.

224G optical interconnects are gradually becoming the mainstream choice for Scale-Up scenarios. From a cost perspective, the price of 1.6T optical modules is expected to remain roughly 1.2 to 1.4 times that of 800G modules over the next three to five years, a manageable premium for both LPO and traditional DSP architectures. By contrast, building the same bandwidth from 112G lanes requires more lanes and more fiber, which raises total cost. Taking bandwidth, power consumption, and cabling complexity together, the 224G solution is the better long-term choice.
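The lane and cost argument above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes that 224G and 112G lanes carry about 200 Gb/s and 100 Gb/s of payload respectively, and it reads the 1.2x to 1.4x figure as a per-module price ratio. The function names are hypothetical and chosen for illustration.

```python
import math


def lanes_needed(target_gbps: float, lane_payload_gbps: float) -> int:
    """Number of lanes required to carry target_gbps of payload."""
    return math.ceil(target_gbps / lane_payload_gbps)


def relative_cost_per_bit(price_ratio: float, bandwidth_ratio: float = 2.0) -> float:
    """Cost per bit of a 1.6T module relative to an 800G module.

    price_ratio: 1.6T module price divided by 800G module price (assumed per module).
    bandwidth_ratio: 1.6T / 800G = 2.0.
    """
    return price_ratio / bandwidth_ratio


# Assumption: ~200 Gb/s payload per 224G lane, ~100 Gb/s per 112G lane.
lanes_224g = lanes_needed(1600, 200)   # lanes for 1.6T built from 224G lanes
lanes_112g = lanes_needed(1600, 100)   # lanes for 1.6T built from 112G lanes

cost_low = relative_cost_per_bit(1.2)   # lower end of the quoted price range
cost_high = relative_cost_per_bit(1.4)  # upper end of the quoted price range

print(f"1.6T needs {lanes_224g} lanes at 224G vs {lanes_112g} lanes at 112G")
print(f"Relative cost per bit vs 800G: {cost_low:.2f} to {cost_high:.2f}")
```

Under these assumptions, the same 1.6T of bandwidth takes half as many lanes at 224G, and the cost per bit falls below that of 800G even at the upper end of the price range.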

QSFPTEK 1.6T Optical Module

The industry has followed this trend and started developing related products, such as QSFPTEK's OSFP series modules that support 1.6T rates (e.g., 2×DR4 and 2×FR4 specifications). These modules are primarily aimed at the high-bandwidth, low-latency requirements of future supernode architectures and prepare data centers for the next stage of growth.

Scale-Out: From Three Layers to Two Layers, Network Structure Gradually Flattens

A significant change in scale-out architectures is the increasing emphasis on simplification in the overall design. The previously common three-layer Clos network is gradually being replaced by a two-layer Clos architecture. By reducing the number of intermediate layers, the network path becomes more direct, reducing system complexity. Under this design, even training clusters targeting hundreds of thousands of GPUs can maintain high scalability while keeping network costs under control.

This change is largely due to improvements in switching chip capabilities. As port counts keep growing, for example with switch radix reaching 512 ports, a two-layer Clos network can directly support more than 100,000 nodes, and in some design models around 130,000 accelerator cards, without an extra core switching layer. Large AI clusters can therefore maintain bandwidth density and a clearer network topology as they scale.
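The figure of roughly 130,000 cards is consistent with the standard capacity formula for a non-blocking folded Clos (fat-tree) network. The sketch below assumes identical switches, a 1:1 split of ports between downlinks and uplinks on every non-top tier, and one switch port per accelerator. The function name is hypothetical, and real designs may use oversubscription or multiple ports per card.

```python
def clos_max_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking folded Clos built from identical switches.

    Each switch below the top tier uses half its ports downward and half upward,
    which gives 2 * (radix / 2) ** tiers endpoints.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be a positive even number")
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    return 2 * (radix // 2) ** tiers


# Radix 512, as mentioned in the text.
two_tier_512 = clos_max_endpoints(512, 2)

# A smaller radix (64, chosen only for comparison) shows why a third layer used to be needed.
two_tier_64 = clos_max_endpoints(64, 2)
three_tier_64 = clos_max_endpoints(64, 3)

print(f"radix 512, 2 tiers: {two_tier_512} endpoints")
print(f"radix 64,  2 tiers: {two_tier_64} endpoints")
print(f"radix 64,  3 tiers: {three_tier_64} endpoints")
```

With radix 512, two tiers already reach 131,072 endpoints, whereas a radix-64 switch needs three tiers to get past a few thousand.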

Multi-Plane Networks: Alleviating Bandwidth Pressure Through Multi-Plane Design

Building on this foundation, some manufacturers have begun exploring new network organization methods. For example, the Multi-Plane architecture proposed in related research is a typical approach. In this scheme, the AI-NIC simultaneously accesses different Clos network planes via multiple high-speed ports; for instance, it uses four 200G interfaces to connect four independent network planes, and on the sending side, it distributes the same data stream across the planes using a round-robin method.

On the receiving end, a data processing mechanism that supports out-of-order writes reassembles data from different paths, so that a single link doesn't become a bottleneck. This approach improves bandwidth utilization on a single GPU in Scale-Out scenarios, achieving an overall utilization rate of over 95% in some test models.
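The send and receive behavior described above can be illustrated with a toy model. This is a minimal sketch, not a real NIC implementation: chunks stand in for packets, a shuffled list stands in for planes that deliver at different times, and all function names are hypothetical. It shows round-robin distribution across four planes and offset-based writes that tolerate any arrival order.

```python
import random

NUM_PLANES = 4  # from the text: four 200G interfaces, one per Clos plane


def spray_round_robin(message: bytes, chunk_size: int, num_planes: int = NUM_PLANES):
    """Split a message into chunks and assign them to planes in round-robin order.

    Returns one list per plane; each entry is (offset, payload).
    """
    planes = [[] for _ in range(num_planes)]
    for index, offset in enumerate(range(0, len(message), chunk_size)):
        payload = message[offset:offset + chunk_size]
        planes[index % num_planes].append((offset, payload))
    return planes


def receive_out_of_order(arrivals, total_len: int) -> bytes:
    """Write each chunk at its own offset, regardless of arrival order."""
    buffer = bytearray(total_len)
    received = 0
    for offset, payload in arrivals:
        buffer[offset:offset + len(payload)] = payload
        received += len(payload)
    if received != total_len:
        raise ValueError("some chunks are missing")
    return bytes(buffer)


message = bytes(range(256)) * 4                     # 1024 bytes of sample data
planes = spray_round_robin(message, chunk_size=64)  # 16 chunks over 4 planes

# Simulate planes delivering at different times by shuffling the arrival order.
arrivals = [chunk for plane in planes for chunk in plane]
random.Random(0).shuffle(arrivals)

result = receive_out_of_order(arrivals, len(message))
print("chunks per plane:", [len(p) for p in planes])
print("reassembled correctly:", result == message)
```

Because every chunk carries its destination offset, the receiver needs no reordering buffer, which is what lets the four planes be used in parallel without a single link becoming the bottleneck.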

From a trend perspective, the focus of Scale-Out is not just on increasing scale, but on achieving a more reasonable balance between bandwidth, efficiency, and cost in ultra-large-scale AI clusters through structural optimization and network design innovation.

Optical Module Selection Trends: LPO/NPO Gains Popularity, CPO Gradually Declines

In both Scale-Up and Scale-Out architectures, optical modules are not only the connection medium but also directly affect latency, power consumption, and link stability. As a result, the two architectures increasingly favor different optical module designs. The differences between LPO, NPO, and traditional DSP-based solutions are becoming clearer, while interest in CPO has declined somewhat.

Scale-Up: Lower Latency Solutions Prefer LPO/NPO

In Scale-Up scenarios, the goal is to reduce latency and power consumption, which makes DSP-free LPO or NPO solutions more attractive. These optical modules simplify the signal-processing path and so use much less power. For example, an 800G LPO module consumes approximately 6W, compared with around 15W for a typical DSP-equipped module.
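To put the 6W versus 15W figures in context, the sketch below computes the saving per module and scales it to a deployment. The module count of 10,000 is a hypothetical value chosen only for illustration, as is the function name; real savings also depend on cooling overhead and on how much equalization work moves to the host.

```python
def module_power_saving(dsp_watts: float, lpo_watts: float, module_count: int) -> dict:
    """Compare per-module and total power of DSP-based and LPO modules."""
    saving_per_module = dsp_watts - lpo_watts
    return {
        "per_module_w": saving_per_module,
        "percent": 100.0 * saving_per_module / dsp_watts,
        "total_kw": saving_per_module * module_count / 1000.0,
    }


# 15 W and 6 W come from the text; 10,000 modules is an assumed deployment size.
saving = module_power_saving(dsp_watts=15.0, lpo_watts=6.0, module_count=10_000)

print(f"Saving per module: {saving['per_module_w']:.0f} W ({saving['percent']:.0f}%)")
print(f"Saving across the deployment: {saving['total_kw']:.0f} kW")
```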

QSFPTEK 800G Optical Module

Removing the DSP from the data path also reduces link latency. One-way latency can be shortened by tens of nanoseconds, and the reduction in round-trip latency is larger still. These characteristics make LPO/NPO modules better suited to the "short-range, high-consistency" requirements of Scale-Up, and some LPO products optimized for AI data centers are designed around these metrics.

Scale-Out: DSP Architecture is Still the Mainstream Choice

In Scale-Out environments, the situation is quite different. Optical modules with digital signal processors (DSPs) handle large network scales and long transmission distances better, so they remain the preferred choice.

First, in terms of error control, DSPs can support more complex forward error correction (FEC) mechanisms, giving the link a lower post-correction error rate and making them suitable for long-distance transmission. LPO/NPO modules have relatively limited capabilities in this regard and often fail to meet stability requirements for cross-data-center or 10 km-class links.

Second, large-scale clusters typically mix equipment from multiple vendors, which places greater demands on interoperability. DSP optical modules based on standard protocols (such as IEEE 802.3) are more mature in cross-vendor compatibility, while LPO/NPO standards are still evolving and the ecosystem is not yet fully unified.

Overall, the two technical paths have become clear: Scale-Up usually uses LPO/NPO to achieve lower latency and power consumption, while Scale-Out continues to rely on DSP architecture to ensure long-distance transmission and system compatibility. Despite their different implementations, both share the same direction in bandwidth evolution: moving towards 224G per-lane speeds.

The Real Challenges of Integration and Implementation: Opportunities and Limitations Coexist

When the high reliability of scale-up and the large-scale scalability of scale-out begin to intersect, the direction of "converged networks" seems natural. However, there are still several unavoidable hurdles between experimentation and large-scale deployment.

Architectural Differences: Difficulty in Unifying Memory Semantics and Message Semantics

The two systems operate on different abstractions. Scale-Up is built on "memory semantics": a GPU reads and writes remote memory directly, much as it would its own, which keeps communication between devices fast. Scale-Out, on the other hand, is built on "message semantics," using mechanisms such as RDMA to send and receive data; it is an explicit communication model. Because of these basic design differences, integrating the two is more complicated than simply making them interoperate.
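The contrast between the two models can be made concrete with a toy example. This is a conceptual sketch only: it does not model any real GPU fabric or RDMA API, and the classes `UnifiedMemory` and `MessageFabric` are hypothetical names invented for illustration. The first presents several GPUs' memory as one flat address space with load/store access; the second requires an explicit send and a matching receive.

```python
from collections import deque


class UnifiedMemory:
    """Memory semantics: one flat address space spanning several GPUs' local memory."""

    def __init__(self, num_gpus: int, words_per_gpu: int):
        self.words_per_gpu = words_per_gpu
        self.local = [[0] * words_per_gpu for _ in range(num_gpus)]

    def _locate(self, addr: int):
        # Map a global address to (owning GPU, offset within that GPU).
        return divmod(addr, self.words_per_gpu)

    def store(self, addr: int, value: int) -> None:
        gpu, offset = self._locate(addr)
        self.local[gpu][offset] = value

    def load(self, addr: int) -> int:
        gpu, offset = self._locate(addr)
        return self.local[gpu][offset]


class MessageFabric:
    """Message semantics: data moves only through explicit send and receive calls."""

    def __init__(self, num_nodes: int):
        self.inbox = [deque() for _ in range(num_nodes)]

    def send(self, dst: int, payload) -> None:
        self.inbox[dst].append(payload)

    def recv(self, node: int):
        return self.inbox[node].popleft()


# Memory semantics: the caller uses an address and never names the remote GPU.
mem = UnifiedMemory(num_gpus=2, words_per_gpu=1024)
mem.store(1500, 42)       # address 1500 falls in GPU 1, offset 476
loaded = mem.load(1500)

# Message semantics: the sender names the destination, and the receiver must ask for the data.
fabric = MessageFabric(num_nodes=2)
fabric.send(1, {"tensor_id": 7, "data": [1, 2, 3]})
msg = fabric.recv(1)

print("loaded via memory semantics:", loaded)
print("received via message semantics:", msg["data"])
```

In the first model the interconnect is invisible to the program, whereas in the second the communication is part of the program's control flow, which is one reason a single fabric serving both is hard to design.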

Inconsistent Reliability Paths: How to Introduce Stable Links

To ensure a design is reliable, Scale-Up typically uses a "cable backplane + switching structure" to reduce the risk of link failures and keep the critical path within a high-stability range. In contrast, Scale-Out systems usually use optical modules for interconnection. To achieve convergence, some Scale-Out traffic needs to be diverted to Scale-Up's highly reliable links, requiring the reconstruction of the existing network topology. For clusters with tens of thousands or even hundreds of thousands of cards, such adjustments are inherently complex engineering projects.

Cross-Vendor Collaboration: Standards and Ecosystems Remain Fragmented

Converged networks are not just an architectural issue; they also depend on industry collaboration. Switching chips, GPUs, and optical modules need closer alignment on standards. For example, cross-vendor interoperability of LPO solutions still depends on unified specifications being refined. At the same time, memory-semantic switching fabrics need to adapt to the interface implementations of different GPU vendors. Leading vendors currently focus on their own ecosystems, and protocols and implementation paths are not fully unified, which has slowed the pace of convergence to some extent.

Around these issues, some key questions are emerging: Does LPO have the opportunity to achieve a breakthrough in interoperability and become a more universal optical interconnect solution? Can memory semantics and message semantics be integrated at the software level to enable the system to achieve both low latency and high scalability?

What is certain is that the competition for computing infrastructure has entered a deeper stage. Every adjustment to optical module technology, network structure, and industry collaboration methods will affect the outcome. Rather than simply choosing between two paths, a more realistic direction may be to enable different systems to operate collaboratively within the same infrastructure. In line with this trend, ideas similar to "Scale-Across" have also been proposed as a transitional form connecting the two modes.