Cerebras Systems is adding six new AI data centers in North America and Europe. This will increase inference capacity to over 40 million tokens per second. The new facilities will be established in Dallas, Minneapolis, Oklahoma City, Montreal, New York, and France, with 85% of the total capacity based in the United States. According to James Wang, director of product marketing at Cerebras, the company’s aim this year is to meet the expected surge in demand for inference tokens driven by new AI models like Llama 4 and DeepSeek.
The inference capacity will increase from 2 million to over 40 million tokens per second by Q4 2025 across the planned eight data centers.
An Nvidia NVL72 rack with 72 B200 chips would likely provide about 270,000 tokens per second. Roughly 150 such racks, on the order of 10,000 to 11,000 B200 chips, would therefore be needed to reach 40 million tokens per second.
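As a rough sanity check on that sizing (assuming the ~270,000 tokens per second per NVL72 rack estimate above, which is not a published Nvidia figure), a short Python sketch:

# Rough sizing: NVL72-class racks needed to reach 40 million tokens/second,
# assuming ~270,000 tokens/s per rack (an estimate, not a published Nvidia spec).
TARGET_TOKENS_PER_SEC = 40_000_000   # Cerebras' stated Q4 2025 capacity goal
TOKENS_PER_RACK = 270_000            # assumed per-rack B200 NVL72 throughput
GPUS_PER_RACK = 72

racks_needed = TARGET_TOKENS_PER_SEC / TOKENS_PER_RACK
gpus_needed = racks_needed * GPUS_PER_RACK
print(f"Racks needed: {racks_needed:.0f}")      # ~148 racks
print(f"B200 GPUs needed: {gpus_needed:,.0f}")  # ~10,667 GPUs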
Cerebras wafer-scale chips have only 44 gigabytes of SRAM on the wafer (for the WSE-3). If the AI model being run does not fit and execute on the wafer, performance will be far worse than on Nvidia B200 and B300 systems, which have advanced NVLink communications.
If the AI model and workload map well onto the Cerebras wafer-scale chips, then better performance is possible in some use cases.
Performance and Power Usage Comparison
1. Cerebras WSE-3 (CS-3 System)
System Specs: The Cerebras CS-3, powered by the WSE-3, delivers 125 petaflops of AI compute and consumes 23 kW per system. It occupies 15U of rack space.
Rack Configuration: A standard 42U rack can fit approximately 2 CS-3 systems (15U * 2 = 30U, leaving space for cabling and cooling).
Performance per Rack: 2 * 125 petaflops = 250 petaflops
Power per Rack: 2 * 23 kW = 46 kW
2. Nvidia B200 (NVL72 Rack)
Rack Specs: The NVL72 rack contains 72 B200 GPUs, delivering about 360 petaflops of comparable AI compute (per industry data) and consuming approximately 132 kW.
Performance per Rack: 360 petaflops
Power per Rack: 132 kW
3. Scaling Across Multiple Racks
We’ll scale the per-rack values linearly for 10, 100, and 1,000 racks.
Cerebras CS-3:
1 Rack: 250 petaflops, 46 kW
10 Racks: 10 * 250 = 2,500 petaflops, 10 * 46 = 460 kW
100 Racks: 100 * 250 = 25,000 petaflops, 100 * 46 = 4,600 kW
1,000 Racks: 1,000 * 250 = 250,000 petaflops, 1,000 * 46 = 46,000 kW
Nvidia NVL72:
1 Rack: 360 petaflops, 132 kW
10 Racks: 10 * 360 = 3,600 petaflops, 10 * 132 = 1,320 kW
100 Racks: 100 * 360 = 36,000 petaflops, 100 * 132 = 13,200 kW
1,000 Racks: 1,000 * 360 = 360,000 petaflops, 1,000 * 132 = 132,000 kW
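The scaling above is simple multiplication; a minimal Python sketch using the per-rack figures from this comparison (real deployments would add networking, cooling, and facility overhead):

# Linear scaling of per-rack compute and power for both systems.
RACK_SPECS = {
    "Cerebras CS-3 rack (2 systems)": {"petaflops": 250, "kw": 46},
    "Nvidia NVL72 rack (72 B200s)":   {"petaflops": 360, "kw": 132},
}
for racks in (1, 10, 100, 1000):
    for name, spec in RACK_SPECS.items():
        pf = spec["petaflops"] * racks
        kw = spec["kw"] * racks
        print(f"{racks:>5} racks | {name}: {pf:,} petaflops, {kw:,} kW")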
4. Power Efficiency
Cerebras: 250 petaflops / 46 kW = 5.43 petaflops per kW
Nvidia: 360 petaflops / 132 kW = 2.73 petaflops per kW
Insight: Nvidia offers higher raw compute per rack (360 vs. 250 petaflops), but Cerebras is nearly twice as power-efficient (5.43 vs. 2.73 petaflops per kW), making it more suitable for energy-constrained data centers.
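Those efficiency figures are simply petaflops divided by kilowatts per rack:

# Power efficiency: petaflops of AI compute per kW of rack power.
cerebras_eff = 250 / 46    # ~5.43 petaflops per kW
nvidia_eff = 360 / 132     # ~2.73 petaflops per kW
print(f"Cerebras: {cerebras_eff:.2f} petaflops/kW")
print(f"Nvidia:   {nvidia_eff:.2f} petaflops/kW")
print(f"Ratio: {cerebras_eff / nvidia_eff:.2f}x")  # ~1.99x in Cerebras' favor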
Inference Performance: Tokens per Second
Inference performance is measured in tokens per second, a metric particularly relevant for real-time AI tasks such as serving large language models.
Cerebras Inference:
For Llama 3.1 8B: 1,800 tokens per second
For Llama 3.1 70B: 450 tokens per second
Cerebras claims this is 20 times faster than Nvidia GPU-based solutions for similar models.
This is per CS-3 system, so a rack with 2 CS-3 systems could theoretically double this throughput, depending on workload parallelization (e.g., 3,600 tokens/sec for Llama 3.1 8B).
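A minimal sketch of that per-rack estimate, assuming the published per-system figures scale linearly across two CS-3 systems (an idealization; real serving stacks rarely scale perfectly):

# Idealized per-rack inference throughput for 2 CS-3 systems per rack.
PER_SYSTEM_TOKENS_PER_SEC = {"Llama 3.1 8B": 1800, "Llama 3.1 70B": 450}
SYSTEMS_PER_RACK = 2
for model, tps in PER_SYSTEM_TOKENS_PER_SEC.items():
    print(f"{model}: {tps * SYSTEMS_PER_RACK:,} tokens/s per rack")
# Llama 3.1 8B: 3,600 tokens/s per rack
# Llama 3.1 70B: 900 tokens/s per rack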
Nvidia B200:
Specific tokens per second data for the B200 or NVL72 rack is not widely published. However, based on Cerebras’ claim of 20x superiority, a comparable Nvidia GPU solution might deliver approximately 90 tokens per second for Llama 3.1 8B and 22.5 tokens per second for Llama 3.1 70B per GPU or system. Scaling to 72 GPUs in an NVL72 rack could increase this, but exact figures depend on optimizations and are unavailable here.
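Working backward from the 20x claim gives a rough, unverified estimate of Nvidia throughput (not a published Nvidia figure; actual B200 results depend heavily on batching, quantization, and serving software):

# Implied Nvidia throughput if Cerebras' "20x faster" claim holds,
# naively aggregated across 72 GPUs in an NVL72 rack. Estimates only.
CEREBRAS_TPS = {"Llama 3.1 8B": 1800, "Llama 3.1 70B": 450}
SPEEDUP_CLAIM = 20
GPUS_PER_RACK = 72
for model, tps in CEREBRAS_TPS.items():
    per_gpu = tps / SPEEDUP_CLAIM
    per_rack = per_gpu * GPUS_PER_RACK
    print(f"{model}: ~{per_gpu:g} tokens/s per GPU, ~{per_rack:,.0f} tokens/s per rack")
# Llama 3.1 8B: ~90 tokens/s per GPU, ~6,480 tokens/s per rack
# Llama 3.1 70B: ~22.5 tokens/s per GPU, ~1,620 tokens/s per rack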
Data Center Performance in 2025
Scalability:
Cerebras: Supports clustering up to 2,048 CS-3 systems (e.g., Condor Galaxy 3 uses 64 CS-3s), potentially spanning 1,024 racks at 2 systems per rack. This enables massive AI supercomputers.
Nvidia: NVL72 racks are designed for large-scale training and can scale to very large clusters given sufficient power and networking infrastructure.
Interconnect Bandwidth:
Cerebras’ WSE-3 offers 27 petabytes per second of on-wafer bandwidth, claimed to exceed the NVL72’s interconnect by over 200x. This could enhance scaling efficiency for distributed training or inference.
Future Trends:
Nvidia: The Blackwell Ultra B300, announced as 1.5x faster than the B200, could push NVL72 successors to ~540 petaflops per rack by 2025.
Cerebras: Future WSE iterations may emerge, maintaining their edge in efficiency and inference.
2025 Outlook: Both systems will remain competitive, with Nvidia favoring raw compute and Cerebras excelling in efficiency and bandwidth-heavy workloads.
Cost Considerations
Inference Costs:
Cerebras: Priced at 10 cents per million tokens for Llama 3.1 8B and 60 cents per million tokens for Llama 3.1 70B, claimed to be a fraction of GPU-based costs.
Nvidia: Specific inference costs for B200 are unavailable, but GPU solutions are typically more expensive per token due to lower throughput.
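To put the published Cerebras rates in context, here is an illustrative monthly cost at a hypothetical workload of 1 billion tokens per day (the workload size is an assumption for illustration only):

# Illustrative monthly serving cost at Cerebras' published inference prices.
PRICE_PER_MILLION_TOKENS = {"Llama 3.1 8B": 0.10, "Llama 3.1 70B": 0.60}  # USD
TOKENS_PER_DAY = 1_000_000_000   # assumed workload size
DAYS_PER_MONTH = 30
for model, price in PRICE_PER_MILLION_TOKENS.items():
    monthly = (TOKENS_PER_DAY / 1_000_000) * price * DAYS_PER_MONTH
    print(f"{model}: ${monthly:,.0f} per month")
# Llama 3.1 8B: $3,000 per month
# Llama 3.1 70B: $18,000 per month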
Training Costs:
Hardware acquisition, power, and cooling dominate training costs. Cerebras’ lower power usage (e.g., 46 kW vs. 132 kW per rack) could reduce operational expenses, though initial hardware costs are not disclosed here.
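To make the power difference concrete, here is a rough annual electricity cost per rack at an assumed industrial rate of $0.08 per kWh, running continuously (the rate is an assumption, and cooling overhead/PUE is ignored):

# Rough annual electricity cost per rack; illustration only.
HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.08  # USD, assumed industrial rate
for name, kw in {"Cerebras rack (2x CS-3)": 46, "Nvidia NVL72 rack": 132}.items():
    annual = kw * HOURS_PER_YEAR * PRICE_PER_KWH
    print(f"{name}: ~${annual:,.0f} per year in electricity")
# Cerebras rack (2x CS-3): ~$32,237 per year
# Nvidia NVL72 rack: ~$92,506 per year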