HPC in the cloud
Cloud Play

Cloud computing is most definitely here, but does it have a role in HPC? We discuss changes in HPC workloads that cloud computing can address effectively.
I'm not a big Bob Dylan fan, but when it comes to HPC, "The Times They Are a-Changin'." Some of this change is going to the cloud, not because it would be really cool, but rather because HPC has existing workloads that fit well into cloud computing and can save money over traditional solutions. Perhaps more importantly, I think HPC has evolved to include non-traditional workloads and is adapting to meet those workloads – in many cases using cloud computing to do so. I will explain by giving two examples.
1. Massively Concurrent Runs
At one HPC center that I'm familiar with, users periodically submit 25,000 to 30,000 jobs as part of a parameter sweep; that is, they run the same application with 25,000 to 30,000 different data sets. Often the application is a Matlab script, a Python script, an R script, a Perl script, or something similarly serial (i.e., it runs well on a single core). The same script is run with thousands of different input files, which means thousands of jobs need to run at the same time. These applications typically finish fairly quickly – perhaps in a couple of minutes – and usually do not produce a great deal of data.
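To make the pattern concrete, here is a minimal sketch of such a sweep in Python. The directory name, file naming, and analyze() function are hypothetical stand-ins for the user's serial application; on a real cluster the loop would typically be expressed as a scheduler job array, but the structure is the same – one serial function mapped over thousands of independent inputs.

# Hypothetical parameter-sweep driver. On a cluster this loop would normally
# become a scheduler job array; a local process pool is used here only to
# illustrate the embarrassingly parallel structure of the workload.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(input_file: Path) -> str:
    # Stand-in for the serial Matlab/Python/R/Perl application: read one
    # input file, compute for a couple of minutes, write a small result.
    result = input_file.with_suffix(".out")
    result.write_text(f"processed {input_file.name}\n")
    return result.name

if __name__ == "__main__":
    # Thousands of independent data sets, e.g., case_00000.dat ... case_24999.dat
    inputs = sorted(Path("sweep_inputs").glob("case_*.dat"))

    # Every task is independent, so throughput scales with available cores --
    # which is why core count matters more to these users than per-core speed.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(analyze, inputs, chunksize=32))
    # Post-processing starts only after all results exist.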
A closely related set of researchers studies operating systems and security, running different simulations with different inputs. For example, they might run 20,000 instances of an OS – primarily the kernel – and explore exploits against it. As with the previous set of researchers, the goal is to run a huge number of simulations as quickly as possible to find new ideas about how to protect an OS and its kernel. The run times are not very long, but each test must be run against a single OS instance. Consequently, they run thousands of jobs at the same time, look through the results, and continue with their research.
What is important to both sets of researchers is having all of the jobs run at nearly the same time so they can examine the results and either focus on a small subset of the data with more granular input data sets or broaden the search space with yet more data sets. Either way – more detail or more options – the result is a need for more cores. This same HPC center has users asking for 50,000 and 100,000 cores to run their applications. The coin of the realm for these researchers is core count, not per-core performance.
Another interesting aspect of these researchers is that they don't run these massive job sets all of the time. They create the input data sets, build the job array, and then run it. Once the jobs are done, however, it takes time to process the output, understand it, and determine the next step. What is important to these researchers is having all of the results before this post-processing begins; otherwise, they have to wait days for the remaining jobs to finish before post-processing can take place.
Getting more efficiency from the hardware is not the issue, because faster hardware only improves the research time a little. Reducing the run time of each job from 120 seconds to 100 seconds wouldn't really improve research productivity. What improves productivity is having all of the jobs run at the same time.
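A back-of-the-envelope calculation makes the point; the job count and run times below are illustrative, not figures from that center:

import math

def wall_time_minutes(jobs: int, cores: int, runtime_s: float) -> float:
    # Jobs run in back-to-back waves, each wave filling all available cores.
    waves = math.ceil(jobs / cores)
    return waves * runtime_s / 60.0

jobs = 25_000
# Faster hardware (120 s -> 100 s per job) on 1,000 cores saves about 8 minutes:
print(wall_time_minutes(jobs, 1_000, 120))   # 50.0
print(wall_time_minutes(jobs, 1_000, 100))   # ~41.7
# Enough cores to run everything at once finishes in a single 2-minute wave:
print(wall_time_minutes(jobs, 25_000, 120))  # 2.0

In other words, per-core speedups shave minutes off each wave, but only more cores collapse the waves into one.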
I originally thought this scenario was confined to my experience with a particular HPC center, but I was wrong. I've spoken to several people, and they all have similar workload characteristics with varying sizes (several hundred to 50,000 cores). Although this might not describe your particular workload, a number of centers fit this scenario, and this number is growing rapidly.
2. Web Services
Another popular scenario in HPC centers that I've seen is the increasing need for hosting servers for classes or training, for websites (internal and external), and for other general research-related computing in which the applications are not parallel or might not even be "scientific." I heard one person refer to this as "Ash and Trash computing," probably because it involves running non-traditional HPC workloads; however, it's becoming fairly common.
Consider an HPC center with training courses or classes that need access to a number of systems. A simple example is a class in parallel computing with 30 students. The students might each need a number of cores for the course, and they wouldn't be pushing the performance of the systems; however, the data center still needs to provide systems for the class. If each student needs 20 cores, that's 600 cores just for a single course.
The need for dedicated web servers for research is also increasing. These websites go beyond classic personal pages. Researchers want, and need, to put their work on a website that allows them to share results, interact with other researchers, and showcase their research. An increasing number of web-based research tools are available, such as nanoHUB [1] and Galaxy [2]. I know of one HPC center that has close to 20 Galaxy servers, each tuned to a specific research project.

HPC centers are discovering that it makes much more sense to handle these non-traditional workloads themselves. The reasons are varied, but in general, HPC centers understand research better than the departments that worry about mail servers, databases, and ERP applications. These enterprise computing functions are critical to the organization as a whole, but research and HPC require a different kind of service. Moreover, HPC centers can react much more rapidly to requests than the enterprise IT department.
Time for a Change
HPC is being asked to take on new roles to meet the needs of researchers. These needs include applications that require a tremendous number of cores but not a great deal of per-core performance, as well as applications such as web servers, classroom and training support, and web-based tools that are not traditional HPC applications. These workloads fit into the HPC world much better than they fit into the enterprise world.
These changes are everywhere. They may not be a large force, and they might not be as pervasive in your particular HPC center, but they are happening and they are growing more rapidly than traditional workloads. Consequently, I've started to refer to this new generation of computing as Research Computing. If you like, research computing is a superset of traditional HPC, or traditional HPC is a subset of research computing. I also like to think of research computing as adding components, techniques, and technology for solving problems that traditional HPC cannot or might not solve. One of these technologies is cloud computing.