Virga HPC cluster designed for AI workloads

Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) has built a high-performance computing (HPC) cluster designed for artificial intelligence (AI) workloads. The cluster, named Virga, is powered by Dell XE9640 AI rack servers and is intended to support a range of research domains, including the diagnosis and treatment of cystic fibrosis.

The Virga cluster, housed at the CDC Hume Data Center in Canberra, marks the first deployment of its kind in Australia. The cluster comprises 14 racks and employs a BeeGFS flash backing store. Its name, “Virga,” is inspired by a meteorological phenomenon where rain evaporates before reaching the ground, symbolizing CSIRO’s extensive research in cloud and rain physics.

Direct liquid cooling is a standout feature of the Dell XE9640 servers used in the Virga cluster. Because it removes heat at the component rather than by moving large volumes of air, it typically draws less electricity than traditional air cooling. This efficiency aligns with CSIRO’s commitment to sustainability in its technological advancements.

Professor Elanor Huntington of CSIRO emphasized the broad application of AI across the organization’s research endeavors. “AI is integral to numerous fields of research at CSIRO, from developing advanced flexible printed solar panels to predicting wildfires, measuring wheat yields, and creating vaccines. High-performance computing systems like Virga are also vital to our robotics and sensing initiatives and are crucial to the National Robotics Strategy aimed at enhancing the competitiveness and productivity of Australian industry,” she stated.

The initiative to establish the Virga HPC system began with a $14.5 million tender in November 2022, intended to replace CSIRO’s existing Bracewell cluster, which also used Dell server nodes. Dell secured the contract in July 2023 with a bid of $16.3 million.

Angela Fox, Senior Vice President and Managing Director for Dell Technologies Australia and New Zealand, highlighted the transformative potential of the Virga cluster. “With Dell PowerEdge servers at its core, Virga will facilitate groundbreaking Australian scientific research through its AI capabilities while being more sustainable and energy-efficient than previous generation clusters,” she remarked.

Each PowerEdge XE9640 server has a 2RU chassis and two 4th Gen Intel Xeon Platinum 8452Y processors with 36 cores each. The servers take either four Intel Data Center GPU Max accelerators or four Nvidia H100 GPUs, and CSIRO opted for 448 of the latter. Nvidia’s InfiniBand NDR serves as the high-speed interconnect, complemented by up to 500 GB of DRAM and four 61.44 TB NVMe SSDs per node, along with 96 GB of high-bandwidth memory per GPU. This configuration gives approximately 246 TB of flash storage per node (4 × 61.44 TB = 245.76 TB). Although the SSD supplier was not disclosed, Solidigm was the only vendor publicly offering 61.44 TB SSDs at the time, with its QLC (4 bits/cell) D5-P5336 model.
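The per-node totals follow directly from the figures quoted above. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative choices, not anything published by Dell or CSIRO.

```python
# Per-node figures as quoted in the article.
SSDS_PER_NODE = 4
SSD_CAPACITY_TB = 61.44
GPUS_PER_NODE = 4
HBM_PER_GPU_GB = 96


def flash_per_node_tb(ssds=SSDS_PER_NODE, capacity_tb=SSD_CAPACITY_TB):
    """Raw NVMe flash capacity in one node, in TB."""
    return ssds * capacity_tb


def hbm_per_node_gb(gpus=GPUS_PER_NODE, hbm_gb=HBM_PER_GPU_GB):
    """Total GPU high-bandwidth memory in one node, in GB."""
    return gpus * hbm_gb


if __name__ == "__main__":
    print(f"Flash per node: {flash_per_node_tb():.2f} TB")  # 245.76 TB
    print(f"HBM per node:   {hbm_per_node_gb()} GB")        # 384 GB
```

These are raw capacities; usable space after formatting and any redundancy scheme would be lower.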

A Dell spokesperson noted the enduring partnership between Dell and CSIRO in developing robust storage solutions. “The Virga Supercomputer will integrate with the existing Dell HPC BeeGFS storage, which Dell commissioned in 2019. This collaboration in 2018 was founded on a mutual understanding of the extensive data volume, velocity, and variety that scientists would harness to advance scientific discovery, and the longevity of this solution continues to be evident,” the spokesperson explained.

The Virga cluster uses Nvidia’s Transformer Engine library to accelerate AI workloads, enabling the training of large models within days or even hours, according to CSIRO. The system has a total of 60,000 cores and ranks 72nd on the Top500 list, with a maximal achieved performance (Rmax) of 14.94 PFLOPS and a theoretical peak performance (Rpeak) of 18.46 PFLOPS. While the exact node count remains undisclosed, the 14 racks can accommodate up to 280 2RU slots, some of which are allocated to ancillary equipment. With 448 H100 GPUs in total and four GPUs per server, this implies 112 nodes. Each H100 GPU features 14,592 FP32 CUDA Cores and 576 Tensor Cores.
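The node-count inference and the gap between Rmax and Rpeak can be checked with a few lines of Python. This is a rough sketch built only on the numbers quoted in the article; the function names are illustrative, and the per-rack figure assumes nodes are spread evenly across the racks, which the article does not state.

```python
# Cluster-level figures as quoted in the article.
TOTAL_GPUS = 448
GPUS_PER_NODE = 4
RACKS = 14
SLOTS_2RU_TOTAL = 280
RMAX_PFLOPS = 14.94
RPEAK_PFLOPS = 18.46


def implied_nodes(total_gpus, gpus_per_node):
    """Node count implied by a fixed number of GPUs per node."""
    nodes, remainder = divmod(total_gpus, gpus_per_node)
    if remainder:
        raise ValueError("GPU total is not a multiple of GPUs per node")
    return nodes


def hpl_efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved on the Top500 benchmark."""
    return rmax / rpeak


if __name__ == "__main__":
    nodes = implied_nodes(TOTAL_GPUS, GPUS_PER_NODE)
    print(f"Implied nodes:       {nodes}")                    # 112
    print(f"Nodes per rack:      {nodes / RACKS:.0f}")        # 8, if evenly spread
    print(f"2RU slots left over: {SLOTS_2RU_TOTAL - nodes}")  # 168
    print(f"Rmax / Rpeak:        "
          f"{hpl_efficiency(RMAX_PFLOPS, RPEAK_PFLOPS):.1%}")  # 80.9%
```

On these figures the GPU nodes would occupy well under half of the available 2RU slots, which is consistent with racks that also hold networking, storage and cooling equipment.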

Dr. Jason Dowling of CSIRO’s Australian e-Health Research Centre highlighted the transformative impact of the new HPC facilities on medical research. “The advanced HPC facilities will empower researchers at our Australian e-Health Research Centre to train and validate new computational models, aiding the development of translational software for medical image analysis, including image classification, segmentation, reconstruction, registration, synthesis, and automated radiology reporting,” he said.

One notable project benefiting from the Virga cluster is a collaboration with Queensland Children’s Hospital, focusing on training AI models to diagnose pathology from MRI lung scans in children with cystic fibrosis. This project exemplifies the potential of the Virga cluster to drive significant advancements in medical research and treatment.
