Overview

The client is a company specializing in professional software for video processing, video generation, and related services. As its AI video generation business grew explosively, the company's workload surged. However, the GPU resources purchased by its internal teams were scattered across regions nationwide, so utilization was low and a large share of capacity sat idle. The company's technical director therefore decided to consolidate all internal GPU resources to enable on-demand allocation and efficient utilization.

Challenges

An initial investigation confirmed that the client's GPU hardware was a complex mix: the latest H200 and H100 GPUs, alongside 20 servers equipped with NVIDIA Tesla V100 GPUs in the old corporate headquarters data center. This heterogeneous environment presented three major challenges:

1. Complex network design.

The integration required a hybrid network combining InfiniBand EDR (V100) and high-bandwidth InfiniBand NDR (H100/H200), which placed extremely high demands on network architecture design and stability.

2. High-speed optical connectivity compatibility.

The ports on the client's existing NVIDIA SB7890 Switch were 100G QSFP28, while the ports on the new NVIDIA Quantum™-2 QM9790 Switch were 800G OSFP. Technical limitations prevented the 800G OSFP ports from stepping down directly to 100G to interoperate with the 100G QSFP28 interfaces.

3. Cost control and asset retention.

The client required that the original optical transceivers and network architecture be retained as far as possible, with new hardware additions and procurement costs kept under strict control.
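The compatibility problem in challenge 2 can be pictured as a set intersection over the link rates each switch port supports. The sketch below is only an illustration: the `SUPPORTED` table and the `negotiated_rate` helper are hypothetical names, and the rate sets are assumptions taken from the description in this case study, not from vendor datasheets.

```python
# Nominal 4x InfiniBand link rates in Gb/s.
RATE_GBPS = {"EDR": 100, "HDR": 200, "NDR": 400}

# ASSUMPTION: rate sets follow this case study's description only
# (the QM9790 is treated as unable to link at EDR); they are not
# copied from vendor datasheets.
SUPPORTED = {
    "SB7890": {"EDR"},          # existing switch, 100G QSFP28 ports
    "QM8790": {"EDR", "HDR"},   # HDR switch, 200G QSFP56 ports
    "QM9790": {"HDR", "NDR"},   # NDR switch, 800G OSFP ports
}


def negotiated_rate(switch_a, switch_b):
    """Return the fastest rate both switches share, or None if there is none."""
    common = SUPPORTED[switch_a] & SUPPORTED[switch_b]
    if not common:
        return None
    return max(common, key=lambda rate: RATE_GBPS[rate])


if __name__ == "__main__":
    for a, b in [("QM9790", "SB7890"), ("QM9790", "QM8790"), ("QM8790", "SB7890")]:
        print(a, "<->", b, ":", negotiated_rate(a, b))
```

Under these assumptions the QM9790 and SB7890 share no common rate, while each of them shares one with an HDR-class switch.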

Solution

To meet the client's GPU resource integration needs and address these network challenges, AICPLIGHT proposed a 200G transit connection solution that connects the InfiniBand EDR and NDR networks while preserving the original network structure.

We used the NVIDIA Quantum™ QM8790 Switch (HDR rate) as the core transit layer. This switch connects downwards to the InfiniBand EDR network (V100 servers) and upwards to the InfiniBand NDR network (H100/H200 servers), achieving unified management of all heterogeneous GPU resources company-wide.

By deploying 400G OSFP to 2x200G QSFP56 AOC cables, we successfully enabled the 800G OSFP ports on the NVIDIA Quantum™-2 QM9790 Switch to operate stably at the InfiniBand HDR (200G) rate, thereby achieving reliable interconnection with the QM8790 Switch.
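The bandwidth figures behind this cabling choice follow from nominal per-lane rates. The sketch below works through the arithmetic, assuming four lanes per InfiniBand link and two links carried by each OSFP cage (consistent with an 800G OSFP port splitting into 2x200G at HDR). The helper names `link_gbps` and `osfp_cage_gbps` are hypothetical, and the rates are nominal rather than exact signaling rates.

```python
# Nominal per-lane data rates in Gb/s for each InfiniBand generation.
LANE_GBPS = {"EDR": 25, "HDR": 50, "NDR": 100}

LANES_PER_LINK = 4        # a standard 4x InfiniBand link
LINKS_PER_OSFP_CAGE = 2   # ASSUMPTION: each OSFP cage carries two 4x links


def link_gbps(generation):
    """Nominal bandwidth of one 4x link at the given generation."""
    return LANE_GBPS[generation] * LANES_PER_LINK


def osfp_cage_gbps(generation):
    """Nominal aggregate bandwidth of one twin-link OSFP cage."""
    return link_gbps(generation) * LINKS_PER_OSFP_CAGE


if __name__ == "__main__":
    for gen in ("EDR", "HDR", "NDR"):
        print(gen, "link:", link_gbps(gen), "Gb/s,",
              "OSFP cage:", osfp_cage_gbps(gen), "Gb/s")
```

Read this way, one OSFP cage running at HDR carries 400G in total, delivered as two 200G QSFP56 links toward the QM8790, while the same cage at NDR carries 800G.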

The main benefit of this solution is that it protects the client's existing assets. The InfiniBand NDR section serving the H100 and H200 networks remained completely unchanged. In addition, the connection between the QM8790 Switch and the SB7890 Switch fully retained the original NVIDIA MFA1A00-Exxx 100G AOCs, which kept the cost of new hardware under control.

To ensure the client could complete the GPU resource integration and data center migration quickly, AICPLIGHT dispatched technical personnel to work on site. We finalized the converged network architecture design and drew up a dedicated periodic maintenance plan for the client. Most importantly, we were involved throughout the data center migration, contributing data center management expertise that helped the client avoid numerous potential risks and ensured a smooth business transition.

Advantages

AICPLIGHT's core value lies in its ability to integrate complex heterogeneous networks, its cost optimization capability, and its end-to-end professional services. With the 200G transit connection solution, we overcame the challenges of hybrid InfiniBand EDR/NDR networking and the connection incompatibility between 800G and 100G interfaces. This pooled the client's scattered heterogeneous GPU resources and raised compute utilization. At the same time, by retaining as many of the original network assets as possible, we balanced the technology upgrade against cost control. AICPLIGHT was also closely involved in the client's entire data center migration, where our experience helped mitigate potential risks, and we provided a customized long-term maintenance plan. Together, these measures keep the integrated AI compute network running continuously, stably, and efficiently, giving the client reliable infrastructure to support its rapid business growth.