
Guest post: Your infrastructure shouldn’t live in a 'black box'*

3 factors AI leaders should consider that can make or break their infrastructure – and ambitions

AI leaders have one of the hardest jobs in the world today: how do you build the best AI systems and products for your business in a new and rapidly changing environment? AI is a hot topic in the market, but the challenges of managing the infrastructure required to train and deploy AI models are daunting. In this guest post, Peter Salanki, CTO at CoreWeave, discusses the critical factors AI leaders must consider to avoid letting their infrastructure live in a 'black box' and offers insights into the three key elements needed to succeed.

The reality of scaling AI: Prepare for failure

Over the last five years, I've spent countless hours talking with forward-thinking enterprises and some of the largest AI labs on the planet about the challenges of running AI supercomputers at scale. While the supercomputers required to fuel AI are powerful, they require diligent management to achieve and maintain peak performance and seamless operation. Operating on the bleeding edge of technology means constantly pushing systems toward the brink and beyond, so AI leaders must prepare for failure at some point.

This is no industry secret. The Llama Team at Meta called out this challenge in their latest paper: “The complexity and potential failure scenarios of 16K GPU training surpass those of much larger CPU clusters that we have operated. Moreover, the synchronous nature of training makes it less fault-tolerant – a single GPU failure may require a restart of the entire job.”

The downstream effects of cluster fragility point to two areas where today's AI infrastructure solutions need to improve. First, AI enterprises need more transparency into the infrastructure that powers their products.

Legacy cloud providers deliver infrastructure in ‘black box’ environments, in which everything is abstracted away from the customer. When something breaks down, users are not informed of what's wrong. They have to submit a service ticket to get support, but even if the problem is fixed, users still don’t have much insight into what happened. The result: longer, costly downtimes that enterprises can do little to prevent.

Second, as infrastructure scales, so do the potential points of failure if left unmanaged. The largest cluster we run at CoreWeave to date is 32K GPUs, which means as many as 320,000 possible points of failure that require diligent upkeep. Teams running infrastructure on-prem or with a legacy cloud provider are ultimately left to handle these challenges themselves. The time, expertise, cost, and effort required scale even faster than the potential failure points, setting teams back months – if not years – in their AI ambitions.

AI enterprises are not in the business of building data centers; they’re focused on building and future-proofing their businesses. That’s why it’s mission-critical for them to be able to stabilize their AI infrastructure without sacrificing performance. 

What AI enterprises need to stay on the cutting edge

Clusters are neither static nor perfect. Keeping them healthy involves a continuous cycle of integrating, monitoring, evaluating, and updating nodes to maintain the integrity, performance, and scalability of AI systems over time.

For AI enterprises to lead this race and get to market faster, infrastructure must be visible and manageable through an automated lifecycle management platform. This entails three critical features: validation, health checks, and observability.

  1. A world-class validation platform 

    Speed to market depends on how fast a platform can bring up and deliver healthy infrastructure into the hands of clients. It's not just about “plugging in” GPUs quickly; it's about the quality and readiness of the infrastructure that's delivered. During bring-up, an effective validation platform can have a material impact on getting AI research teams training and deploying models closer to “day 1”, saving valuable time and sparing engineering resources from chasing preventable issues after the infrastructure is handed over.

  2. Continuous health checking at runtime 

    Continuous health checking shifts the burden of managing complex infrastructure from the client to the cloud provider, so enterprise teams spend less time on DevOps and more time training and serving models. When the cloud provider owns and automates this process in the software layer, it can deliver economies of scale across its entire fleet rather than being limited to the learnings from a single cluster.

    With proactive health checking, cloud providers can find and begin solving problems before the client knows they exist, improving reliability and uptime as measured by the length of uninterrupted training jobs and the time to restart training after a failure (a minimal sketch of such a node-level probe follows this list).

  3. Deeper observability in AI clusters 

    Regardless of who manages the health checks and cluster operations, AI enterprises still need visibility into what's going on in their clusters. A lack of observability – even when the cloud provider handles the failures – limits what teams can learn from those failures and ultimately slows innovation.

    There are countless reasons why jobs fail. Sophisticated observability tools provide transparency into what happened and allow infrastructure teams to adapt and avoid problems in the future. We're already seeing innovation in this area with NVIDIA BlueField DPUs, greater use of Baseboard Management Controllers (BMCs), and the autonomous self-healing capabilities of NVIDIA Quantum-2 InfiniBand (the second sketch below shows one simple form this node-level visibility can take).
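
To make validation and continuous health checking a little more concrete, here is a minimal sketch of the kind of per-node probe a provider might run at bring-up and then repeat on a schedule at runtime. It assumes the nvidia-ml-py (pynvml) Python bindings; the expected GPU count, the temperature threshold, and the report_unhealthy() hook are illustrative placeholders rather than CoreWeave's actual tooling, and a production fleet would check far more (NVLink, NICs, storage, NCCL bandwidth, and so on).

    # Minimal sketch of a per-node GPU health probe of the kind a provider
    # might run at cluster bring-up and again on a schedule at runtime.
    # Assumes the nvidia-ml-py bindings (import pynvml); the expected GPU
    # count, temperature threshold, and report_unhealthy() hook are
    # illustrative placeholders, not CoreWeave's actual checks.
    import pynvml

    EXPECTED_GPUS = 8      # e.g. one 8-GPU node; adjust per hardware SKU
    MAX_TEMP_C = 85        # illustrative thermal threshold

    def report_unhealthy(reason: str) -> None:
        # Placeholder: a real fleet would page the provider's lifecycle
        # management system and cordon the node before jobs land on it.
        print(f"UNHEALTHY: {reason}")

    def check_node() -> bool:
        healthy = True
        pynvml.nvmlInit()
        try:
            count = pynvml.nvmlDeviceGetCount()
            if count != EXPECTED_GPUS:
                report_unhealthy(f"expected {EXPECTED_GPUS} GPUs, found {count}")
                healthy = False
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU)
                if temp > MAX_TEMP_C:
                    report_unhealthy(f"GPU {i} running at {temp} C")
                    healthy = False
                try:
                    ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                        handle,
                        pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                        pynvml.NVML_VOLATILE_ECC)
                except pynvml.NVMLError:
                    continue  # ECC counters not exposed on this device
                if ecc > 0:
                    report_unhealthy(f"GPU {i} uncorrected ECC errors: {ecc}")
                    healthy = False
        finally:
            pynvml.nvmlShutdown()
        return healthy

    if __name__ == "__main__":
        raise SystemExit(0 if check_node() else 1)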
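
On the observability side, the simplest customer-facing form of visibility is exposing the same node-level signals as metrics that an enterprise's own dashboards and alerting can scrape. The second sketch below assumes the prometheus_client and pynvml packages and publishes per-GPU temperature and uncorrected ECC counts in Prometheus text format; the metric names, port, and scrape interval are illustrative, not an actual CoreWeave or NVIDIA interface.

    # Illustrative node-level metrics exporter: publishes per-GPU temperature
    # and uncorrected ECC counts in Prometheus text format so cluster
    # dashboards can trend them. Metric names, labels, port, and scrape
    # interval are placeholders.
    import time

    import pynvml
    from prometheus_client import Gauge, start_http_server

    GPU_TEMP = Gauge("node_gpu_temperature_celsius",
                     "GPU die temperature", ["gpu"])
    GPU_ECC = Gauge("node_gpu_uncorrected_ecc_errors",
                    "Volatile uncorrected ECC error count", ["gpu"])

    def scrape_once() -> None:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            GPU_TEMP.labels(gpu=str(i)).set(
                pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU))
            try:
                GPU_ECC.labels(gpu=str(i)).set(
                    pynvml.nvmlDeviceGetTotalEccErrors(
                        handle,
                        pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                        pynvml.NVML_VOLATILE_ECC))
            except pynvml.NVMLError:
                pass  # ECC counters not exposed on this device

    if __name__ == "__main__":
        pynvml.nvmlInit()
        start_http_server(9101)   # Prometheus scrapes this endpoint (/metrics)
        try:
            while True:
                scrape_once()
                time.sleep(15)    # illustrative scrape interval
        finally:
            pynvml.nvmlShutdown()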

How to get there – faster

In other words, infrastructure shouldn't live in a ‘black box.’ However, building that kind of transparent, well-managed infrastructure yourself is a long road, one that enterprises today don't have the luxury of time to travel alone.

A deep technical partnership between clients and cloud providers is essential to incorporate end-to-end fleet lifecycle management and observability tools. Andrej Karpathy, co-founder of OpenAI, said this best:

"LLM training runs are significant stress-tests of an overall fault tolerance of a large computing system. When you're shopping around for your compute, think about a lot more than just FLOPs and cost. Think about the whole service from hardware to software across storage, networking, and compute. Think about whether the team maintaining it looks like the Avengers and whether you could become best friends."

With the right partner, enterprise leaders can spend less time, money, and resources managing AI infrastructure and more time doing what they do best: building, training, and delivering models that change the world.

*This post was written by Peter Salanki, CTO at CoreWeave, specially for Turing Post. We thank CoreWeave for their insights and ongoing support of Turing Post.
