Meta Unveils Details of Two New 24K GPU AI Clusters

Meta has unveiled the specifications of two new data center-scale clusters, each equipped with 24,576 GPUs and designed for training its Llama 3 large language model. The clusters build on Meta’s AI Research SuperCluster (RSC), introduced in 2022. The disclosed details cover hardware, network infrastructure, storage, design, performance, and software.

The new clusters are intended to advance AI research and development in fields such as natural language processing, speech recognition, and image generation. Each cluster has 24,576 Nvidia H100 Tensor Core GPUs, a notable increase over the original RSC, which featured 16,000 Nvidia A100 GPUs.
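The size comparison can be made concrete with a few lines of arithmetic. The sketch below uses only the GPU counts quoted in this article; the variable and function names are illustrative, and the result compares GPU counts only, not performance, since the H100 and A100 are different generations of hardware.

```python
# GPU counts as quoted in the article (names are illustrative).
NEW_CLUSTER_GPUS = 24_576  # Nvidia H100 GPUs in each new cluster
RSC_GPUS = 16_000          # Nvidia A100 GPUs in the original RSC
NUM_NEW_CLUSTERS = 2


def gpu_count_ratio(new: int, old: int) -> float:
    """Return how many times larger `new` is than `old`, by GPU count."""
    return new / old


ratio = gpu_count_ratio(NEW_CLUSTER_GPUS, RSC_GPUS)
total_new = NEW_CLUSTER_GPUS * NUM_NEW_CLUSTERS

print(f"Each new cluster vs. RSC (GPU count): {ratio:.3f}x")
print(f"GPUs across both new clusters: {total_new:,}")
```

By GPU count alone, each new cluster is roughly one and a half times the size of RSC.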

The added capacity lets the new clusters support larger and more complex models than RSC could, which matters for Meta’s generative AI product development.

Meta plans to expand its infrastructure to include 350,000 Nvidia H100s by the end of 2024. Its stated goal is a portfolio with compute power equivalent to nearly 600,000 H100s, a sign of how heavily the company is investing in scaling its AI capabilities.
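To put these fleet figures next to the two clusters, here is a rough sketch of the arithmetic. It assumes, purely for illustration, that the two 24,576-GPU clusters count toward the 350,000 H100 target (the article does not say so) and treats "nearly 600,000" as an approximate upper figure. All names are hypothetical.

```python
# Figures quoted in the article (names are illustrative).
PLANNED_H100S = 350_000           # H100s planned by the end of 2024
TOTAL_H100_EQUIVALENTS = 600_000  # "nearly 600,000", so an approximate ceiling
CLUSTER_GPUS = 24_576             # H100s in each of the two new clusters

# Assumption: both clusters are part of the 350,000 H100 target.
two_clusters = 2 * CLUSTER_GPUS
share_of_planned = two_clusters / PLANNED_H100S

# Compute that would have to come from sources other than the planned H100s.
other_equivalents = TOTAL_H100_EQUIVALENTS - PLANNED_H100S

print(f"Two clusters as a share of planned H100s: {share_of_planned:.1%}")
print(f"H100-equivalents from other hardware: about {other_equivalents:,}")
```

Under that assumption, the two clusters would account for roughly one seventh of the planned H100 fleet.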

The two clusters use different network infrastructures. One is built on a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric, while the other uses an Nvidia Quantum-2 InfiniBand fabric. Running both designs lets Meta evaluate different interconnect technologies for large-scale training.

Both clusters are based on Grand Teton, Meta’s in-house open GPU hardware platform, which is built to handle large AI workloads. Compared with its predecessor, Zion-EX, Grand Teton offers 4x the host-to-GPU bandwidth, 2x the compute and data network bandwidth, and 2x the power envelope.

The clusters also use Meta’s Open Rack power and rack architecture. Open Rack v3 hardware allows power shelves to be installed anywhere in the rack, which gives Meta more flexibility in how it lays out the data center.

Meta customized the number of servers per rack to balance throughput capacity, rack count reduction, and power efficiency. On the storage and software side, the company worked with Hammerspace to develop a parallel network file system (NFS) and made enhancements to its PyTorch AI framework to support large-scale GPU training.

 
