Cisco sets a foundation for AI network infrastructure

Cisco adds two new high-end programmable Silicon One devices that can support massive GPU clusters for AI/ML workloads.

By Michael Cooney, Senior Editor, Network World

Cisco is taking the wraps off new high-end programmable Silicon One processors aimed at underpinning large-scale artificial intelligence (AI)/machine learning (ML) infrastructure for enterprises and hyperscalers.

The company has added the 5nm 51.2Tbps Silicon One G200 and 25.6Tbps G202 to its now 13-member Silicon One family that can be customized for routing or switching from a single chipset, eliminating the need for different silicon architectures for each network function. This is accomplished with a common operating system, P4 programmable forwarding code, and an SDK.

The new devices, positioned at the top of the Silicon One family, bring networking enhancements that make them ideal for demanding AI/ML deployments or other highly distributed applications, according to Rakesh Chopra, a Cisco Fellow in the vendor’s Common Hardware Group.

“We are going through this huge shift in the industry where we used to build these sorts of reasonably small high-performance compute clusters that seemed large at the time but nothing compared to the absolutely huge deployments required for AI/ML,” Chopra said. AI/ML models have grown from needing a few GPUs to needing tens of thousands linked in parallel and in series. “The number of GPUs and the scale of the network is unheard of.”

The new Silicon One enhancements include a P4-programmable parallel-packet processor capable of launching more than 435 billion lookups per second.

“We have a fully shared packet buffer where every port has full access to the packet buffer regardless of what’s going on,” Chopra said. This is in contrast with allocating buffers to individual input and output ports, which means the buffer you get depends on which port the packets go to. “That means that you’re less capable of riding through traffic bursts and more likely to drop a packet, which really decreases AI/ML performance,” he said.
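The difference Chopra describes can be illustrated with a toy model. The sketch below is not Cisco's buffer-management logic; the function names, the buffer size, and the burst sizes are all hypothetical, and the model ignores packets draining over time. It only compares how many packets overflow when the same total buffer is split evenly per port versus shared by all ports.

```python
def drops_partitioned(burst_per_port, total_buffer):
    """Packets dropped when the buffer is split evenly across ports (hypothetical model)."""
    per_port = total_buffer // len(burst_per_port)
    # Each port can only absorb its own fixed slice of the buffer.
    return sum(max(0, burst - per_port) for burst in burst_per_port)


def drops_shared(burst_per_port, total_buffer):
    """Packets dropped when every port can draw on the whole buffer (hypothetical model)."""
    # Only the aggregate burst matters, not which port it arrives on.
    return max(0, sum(burst_per_port) - total_buffer)


# Illustrative numbers only: one port sees a large burst, the others are nearly idle.
bursts = [900, 50, 30, 20]   # packets arriving at each of four ports
buffer_size = 1000           # total buffer capacity, in packets

print(drops_partitioned(bursts, buffer_size))  # 650
print(drops_shared(bursts, buffer_size))       # 0
```

In this example the partitioned design drops most of the burst on the busy port while three quarters of the buffer sits unused, which is the behavior the fully shared buffer is meant to avoid.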

In addition, each Silicon One device can support 512 Ethernet ports, letting customers build an AI/ML cluster of 32K 400G GPUs with 40% fewer switches than other silicon devices would need to support that cluster, Chopra said.
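To see how port count can translate into switch count, here is a rough back-of-the-envelope sketch. It is not Cisco's published methodology. It assumes a fully non-blocking folded Clos (leaf-spine) fabric, that the 512 ports run at 100G so each 400G GPU uses four of them, and that the comparison device is a hypothetical 256-port switch with 200G ports. The function name `clos_switch_count` is made up for this illustration.

```python
import math


def clos_switch_count(endpoint_ports, radix):
    """Return (tiers, switches) for the smallest non-blocking folded Clos.

    Assumes each leaf uses half its ports for endpoints and half for uplinks.
    """
    half = radix // 2
    if endpoint_ports > radix ** 3 // 4:
        raise ValueError("more than three tiers needed; not modeled here")
    leaves = math.ceil(endpoint_ports / half)
    if leaves <= radix:
        # Two tiers: every spine can reach every leaf directly.
        spines = math.ceil(leaves * half / radix)
        return 2, leaves + spines
    # Three tiers: an aggregation layer sits between leaves and core.
    aggs = leaves
    cores = math.ceil(aggs * half / radix)
    return 3, leaves + aggs + cores


gpus = 32 * 1024  # 32K GPUs, each with 400G of network bandwidth

# Assumption: 512 x 100G ports, so each GPU consumes four ports.
tiers_512, switches_512 = clos_switch_count(gpus * 4, 512)
# Assumption: hypothetical alternative with 256 x 200G ports, two ports per GPU.
tiers_256, switches_256 = clos_switch_count(gpus * 2, 256)

print(tiers_512, switches_512)  # 2 768
print(tiers_256, switches_256)  # 3 1280
print(f"{1 - switches_512 / switches_256:.0%} fewer switches")  # 40% fewer switches
```

Under these assumptions the higher-radix device fits the cluster into two tiers while the lower-radix one needs three, which is where the savings come from. Different port speeds or oversubscription ratios would change the numbers.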

Core to the Silicon One system is its support for enhanced Ethernet features such as improved flow control and congestion awareness and avoidance.

The system also includes advanced load-balancing capabilities and “packet-spraying” that spreads traffic across multiple GPUs or switches to avoid congestion and improve latency. Hardware-based link-failure recovery also helps ensure the network operates at peak efficiency, the company stated.
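The idea behind packet spraying can be sketched by comparing it with conventional per-flow hashing. The example below is a simplified illustration, not Cisco's implementation: the flow names, packet counts, the use of CRC32 as the hash, and the round-robin spraying policy are all assumptions. It ignores packet reordering, which a real fabric has to handle.

```python
import zlib


def hashed_link_loads(flows, num_links):
    """Per-flow hashing: every packet of a flow follows the same link."""
    loads = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        loads[link] += packets
    return loads


def sprayed_link_loads(flows, num_links):
    """Packet spraying: packets are spread round-robin over all links."""
    loads = [0] * num_links
    next_link = 0
    for packets in flows.values():
        for _ in range(packets):
            loads[next_link] += 1
            next_link = (next_link + 1) % num_links
    return loads


# Hypothetical workload: a few large flows, typical of AI/ML training traffic.
flows = {"gpu0-gpu8": 1000, "gpu1-gpu9": 1000, "gpu2-gpu10": 1000}
links = 4

print("hashed :", hashed_link_loads(flows, links))
print("sprayed:", sprayed_link_loads(flows, links))
```

With only a handful of large flows, hashing can place two flows on the same link and leave another idle, while spraying keeps every link within one packet of the others.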

Combining these enhanced Ethernet technologies and taking them a step further ultimately lets customers set up what Cisco calls a Scheduled Fabric.

In a Scheduled Fabric, the physical components (chips, optics, switches) are tied together like one big modular chassis and communicate with each other to provide optimal scheduling behavior, Chopra said. “Ultimately what it translates to is much higher bandwidth throughput, especially for flows like AI/ML, which lets you get much lower job-completion time, which means that your GPUs run much more efficiently.”

With Silicon One devices and software, customers can deploy as many or as few of these features as they need, Chopra said.

Cisco is part of a growing AI networking market, which includes Broadcom, Marvell, Arista and others and is expected to hit $10B by 2027, up from the $2B it is worth today, according to a recent blog from the 650 Group.

“AI networks have already been thriving for the past two years. In fact, we have been tracking AI/ML networking for nearly two years and see AI/ML as a massive opportunity for networking and one of the main drivers for data-center networking growth in our forecasts,” the 650 blog stated. “The key to AI/ML’s impact on networking is the tremendous amount of bandwidth AI models need to train, new workloads, and the powerful inference solutions that appear in the market. In addition, many verticals will go through multiple digitization efforts because of AI during the next 10 years.”

The Cisco Silicon One G200 and G202 are being tested by unidentified customers now and are available on a sampled basis, according to Chopra.

Michael Cooney is a Senior Editor with Network World who has written about the IT world for more than 25 years. He can be reached at michael_cooney@foundryco.com.