
The government needs large clusters of AI-enabling Graphics Processing Units (GPUs) dedicated to missions that matter to national security and to accelerating AI in service to our citizens. We currently have none.

For years, I have engaged with colleagues in and around government on the importance of dedicated, large-scale GPU infrastructure. The reasons are clear. Federal agencies should have the capacity to securely train and deploy models of their choosing, including the leading open-source models and eventually their own proprietary models. And they should be able to do this on their own data. This is not only a matter of security but also of cultivating AI talent. Operating large clusters of GPUs would enable the government to attract and develop experts capable of tackling the most pressing AI challenges.

Furthermore, real AI leadership requires hands-on experience with large-scale systems. While some highly capable experts in government can analyze and synthesize AI developments, there is no substitute for direct access to substantial GPU clusters. This experience is essential for informing policy decisions and participating credibly in discussions with industry leaders.

From a national security perspective, running AI models in secure environments is non-negotiable, making government-controlled clusters indispensable. Moreover, there is a legitimate concern about ceding all thought leadership in AI to a few powerful commercial players. The rise of open-source AI offers a counterbalance, but only if there is strong government support to ensure its proliferation.

Despite this need, the current state of government-owned clusters is limited. If we define a large GPU cluster as 20,000 or more GPUs, we have none. Publicly available information indicates that neither the Department of Defense (DoD) nor the intelligence community operates clusters of significant size. While the Department of Energy (DOE) does manage a few noteworthy clusters, their accessibility to other federal agencies remains limited.

Below are some examples. I hope they are not a comprehensive reflection of the current state, but they may well be:

The National Labs operate the largest clusters known in government. These are great systems helping advance science along multiple fronts.

The DoD and Intelligence Community may have systems that are not publicly known; that is the nature of their business. But the only one with publicly available information is a system they merely have access to, and it is not really theirs:

  • MITRE operates a DGX SuperPOD, under a project called the Federal AI Sandbox, which includes 256 Nvidia H100 GPUs.

Compare that with the following snapshot of the kinds of clusters that are in place today or being built in the commercial sector:

  • Meta operates two clusters with 24,576 Nvidia H100 GPUs each, and is on a path to complete a 350,000 H100 GPU cluster by the end of 2024.
  • AWS has an architecture of UltraClusters, with 20,000 Nvidia H100 GPUs per cluster and multiple clusters. This enables scalable service to AWS customers needing cluster compute on demand. It is hard to know the total number of Nvidia H100s in use, with some estimates putting it over 500,000.
  • Microsoft Azure offers large-scale GPU clusters to customers. It is hard to know the total number of GPUs, but their architecture appears to be clusters of 14,400 Nvidia chips each, with multiple clusters and a possible total of 300,000 GPUs.
  • Google’s GPU use is built into scalable clusters distributed across their datacenters. They operate at least two clusters of 27,000 Nvidia GPUs. We should note that Google is an absolute pioneer in all aspects of AI and has also fielded an architecture based not on GPUs but on their own purpose-built Tensor Processing Units (TPUs). They do not disclose how many of these are in their architecture, but most assume there are tens of thousands distributed across their data centers.
  • xAI launched a GPU cluster named Colossus in September 2024 to train its Grok-3 model. It is the largest single GPU cluster in the world (the organizations listed above with more chips really operate clusters of clusters), with 100,000 Nvidia H100 GPUs. The system cost around $4 billion and went from concept to operation in 120 days, with setup of the system taking only 19 days, something Nvidia’s Jensen Huang said would take the average data center four years to do.

My view of the above? Congress should consider significantly ramping up funding for GPU clusters in government. Two approaches:

  • Fund a 400,000-GPU Nvidia H200 cluster focused on furthering national security missions and advancing open-source AI models.
  • Fund an approach of clusters of clusters: 20,000-GPU clusters at each military service and major IC agency, plus DoJ and major federal civilian agencies like the National Institutes of Health.

This is, of course, all meant to be a discussion starter. I know the costs of this would be huge. My estimate is that this is a $16 billion effort if sized the way I recommend. It may be that we start slower with a $4 billion effort to begin, and ramp up as we see the benefits to mission, national security, and the economy.
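To make the arithmetic behind that estimate explicit, here is a minimal back-of-envelope sketch in Python. It assumes an all-in cost of roughly $40,000 per installed GPU, derived from xAI's reported ~$4 billion for a 100,000-GPU cluster, and, for the clusters-of-clusters option, an illustrative count of 20 recipient agencies. Both figures are my assumptions for illustration, not official numbers.

```python
# Back-of-envelope cost sketch for the two options above.
# Assumption: ~$40,000 all-in per installed GPU, inferred from xAI's
# reported ~$4B for a 100,000-GPU cluster. Not an official figure.

COST_PER_GPU_USD = 4_000_000_000 / 100_000  # ~$40k per GPU, all-in

def cluster_cost(gpu_count: int) -> float:
    """Estimated all-in cost in USD for a cluster of gpu_count GPUs."""
    return gpu_count * COST_PER_GPU_USD

# Option 1: a single 400,000-GPU national security cluster.
print(f"Single 400k-GPU cluster: ${cluster_cost(400_000) / 1e9:.0f}B")  # ~$16B

# Option 2: clusters of clusters -- 20,000 GPUs at each of an assumed
# 20 agencies (military services, major IC agencies, DoJ, large fed civ
# agencies such as NIH). The agency count is illustrative only.
agencies = 20
print(f"20 x 20k-GPU clusters: ${cluster_cost(20_000 * agencies) / 1e9:.0f}B")  # ~$16B
```

At that assumed per-GPU rate, both options land near the $16 billion figure; a $4 billion starter effort corresponds to roughly 100,000 GPUs, comparable in scale to Colossus.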

Join us in Slack to discuss.

Bob Gourley
