HPC in the cloud

Cloud Computing for Research Computing

I freely admit that I scoffed at the use of cloud computing to solve traditional HPC problems when it started coming into vogue. The idea of taking perfectly good hardware and layering virtualization on top of it with a new set of tools and APIs – in a data center that I didn't control or have access to – seemed abhorrent. As I saw HPC morphing into research computing, though, I began to realize that cloud computing is a tool for solving problems that I could not easily solve before.

The first example – in which the users need a massive number of cores and need them to run at the same time – can be solved by classic HPC systems with a large number of cores, a reasonably fast network for data traffic (10GigE), and the associated clustering software. Job arrays can be written to start and schedule 25,000 jobs. With a general-purpose cluster, jobs that have to run at the same time will likely have to wait for a large number of nodes to finish their jobs before enough cores are free to launch the queued jobs. For potentially long periods of time, the nodes will be idle, wasting CPU time waiting for cores to become available. Couple this with the short period of time users need the cores, and you have even more wasted CPU cycles. Many HPC centers have struggled with this inefficient scenario.

Perhaps a more efficient way of providing resources for these researchers and their workload is to take a standard server, virtualize it, and oversubscribe the server, providing more virtual machines (VMs) than the server has physical cores. For example, a four-socket server that has 16 cores per socket has a total of 64 cores per server, and you can run perhaps 128 or 256 VMs on the system as long as each VM has enough memory to run the application. Remember that for this type of workload, performance is not the most important metric. If, by virtualizing the server, you lose 10% to 30% in performance, you are not really negatively affecting the research. In fact, you might be enhancing the research because you can easily provide enough resources in a short period of time without wasting CPU cycles waiting for the physical cores to be available. Moreover, virtualizing the server allows a much smaller number of physical servers to be purchased to meet the needs of these workloads.

Another possibility is to push these workloads into the cloud. Because performance is not the most important concern, cloud computing resources could be very appropriate. The researchers don't need a large number of cores all of the time in this scenario, so buying dedicated hardware, even if it is virtualized, might not be the most efficient use of resources. Also, don't forget that many of these workloads don't do a great deal of I/O, so data movement to and from the cloud could have very little effect on performance. Running these applications in the cloud (e.g., Amazon or Google) might be a much more cost-effective approach than providing local resources, even if they are virtualized.

Recently, Cycle Computing [3] announced that they started up 10,600 VM instances [4] inside Amazon EC2. It took two hours to configure the instances and nine hours to run (a total of 11 hours) and cost US$ 4,362. This is $0.4115 per instance for 11 hours, or $0.037 per instance per hour.
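The per-instance figures follow directly from the reported totals. A minimal sketch of that arithmetic in Python, using only the numbers quoted above (the variable names are my own):

```python
# Figures reported for the Cycle Computing run on Amazon EC2.
instances = 10_600
setup_hours = 2
run_hours = 9
total_cost_usd = 4_362

total_hours = setup_hours + run_hours            # 11 hours end to end
cost_per_instance = total_cost_usd / instances   # one instance for the whole run
cost_per_instance_hour = cost_per_instance / total_hours

print(f"{cost_per_instance:.4f} USD per instance for {total_hours} hours")
print(f"{cost_per_instance_hour:.3f} USD per instance per hour")
```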

For the researcher that needs to run 25,000 instances at the same time, on the basis of Cycle's experience, I'll assume it takes two hours to start up these instances. I'll also assume it takes 15 minutes to run the application if all jobs start at the same time (0.25 hours). The total time is 2.25 hours for 25,000 instances. At a price of $0.037 per instance per hour, the resulting total cost is US$ 2,081.25. Now, assume that the researchers do this three times a week for an entire year (a total of 156 runs). The total for the year is then US$ 324,675. At first blush, this seems like enough money to buy your own on-premises system using oversubscribed virtualized machines. Or is it?
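The yearly cloud figure can be reproduced with a few lines of Python. This is only a sketch of the estimate above: the function name `cloud_run_cost` is hypothetical, and the two-hour startup time and the flat hourly rate are the assumptions already stated in the text.

```python
def cloud_run_cost(instances, startup_hours, run_hours, usd_per_instance_hour):
    """Cost of one run in which every instance is billed for startup plus run time."""
    return instances * (startup_hours + run_hours) * usd_per_instance_hour

# 25,000 instances, 2 hours to start, 15 minutes to run, $0.037 per instance-hour
cost_per_run = cloud_run_cost(25_000, 2.0, 0.25, 0.037)

runs_per_year = 3 * 52                 # three runs a week for a year
cost_per_year = cost_per_run * runs_per_year

print(f"per run:  US$ {cost_per_run:,.2f}")
print(f"per year: US$ {cost_per_year:,.2f}")
```

Because the startup time dominates the 2.25 hours billed per run, any reduction in provisioning time lowers the yearly total almost proportionally.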

For comparison, assume the building block is a four-socket AMD node with 16 cores per socket (64 physical cores). Also assume that you oversubscribe the physical cores 3:1, producing 192 VMs per physical server. Furthermore, assume that each VM needs at least 2GB of memory. The 192 VMs then need at least 384GB in total, so I'll configure about 512GB of memory per node.
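The node sizing can be sketched as follows. The core count, oversubscription ratio, and per-VM memory come from the text; rounding the memory up to the next power of two to reach 512GB is my assumption about how the configured size was chosen.

```python
sockets = 4
cores_per_socket = 16
oversubscription = 3      # VMs per physical core (3:1)
mem_per_vm_gb = 2

physical_cores = sockets * cores_per_socket
vms_per_node = physical_cores * oversubscription
min_memory_gb = vms_per_node * mem_per_vm_gb

# Assumption: memory is bought in power-of-two sizes, so round up.
configured_memory_gb = 1 << (min_memory_gb - 1).bit_length()

print(physical_cores, vms_per_node, min_memory_gb, configured_memory_gb)
```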

Using Dell's handy online configuration tool [5], I configured a 2U server that meets the specifications and has a price of about US$ 14,500. The power usage [6] for such a node under load is about 992W (almost 1kW), and the idle draw is 434W. Assuming power costs $0.14/kWh, the power cost for a single system is about US$ 535 per year (8,721 hours at idle, 39 hours at peak load). Therefore, the cost to buy and operate a single node over one year is roughly US$ 15,000. Using the yearly cost from the Cycle Computing example, you can afford to buy roughly 21 systems. Using 192 VMs per server, you only end up with 4,032 VMs, whereas with Cycle Computing, you get 25,000 instances, even including the two hours to configure all of them when they are needed.
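As a rough check on the per-node figures, the sketch below recomputes the yearly power cost from the load and idle wattages and then asks how many nodes the yearly cloud budget would buy. The function name `yearly_power_cost` is hypothetical, and the electricity price is treated as a rate per kilowatt-hour.

```python
HOURS_PER_YEAR = 365 * 24          # 8,760 hours

def yearly_power_cost(load_w, idle_w, load_hours, usd_per_kwh):
    """Energy cost for one node that is busy for load_hours and idle otherwise."""
    idle_hours = HOURS_PER_YEAR - load_hours
    kwh = (load_w * load_hours + idle_w * idle_hours) / 1000
    return kwh * usd_per_kwh

load_hours = 156 * 0.25            # 156 runs of 15 minutes = 39 hours
power_cost = yearly_power_cost(992, 434, load_hours, 0.14)

node_price = 14_500
node_year_cost = node_price + power_cost

yearly_cloud_budget = 324_675      # yearly cloud cost from the earlier estimate
affordable_nodes = int(yearly_cloud_budget // node_year_cost)
affordable_vms = affordable_nodes * 192

print(f"power per node per year: US$ {power_cost:,.0f}")
print(f"nodes: {affordable_nodes}, VMs: {affordable_vms}")
```

Almost all of the power cost comes from the idle hours, which is why the 39 hours of actual use barely register in the total.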

To match the number of VMs needed (25,000), you need about 131 servers. The purchase cost for these is US$ 1,899,500. Rounding the power cost to US$ 500 per node, the yearly power bill is US$ 65,500. Over one year, this works out to a total of US$ 1,965,000. Over three years, the total is US$ 2,096,000. On the other hand, using cloud computing via Cycle Computing, the price for one year is US$ 324,675; over the three years, the price is US$ 974,025. Cloud computing works out to a little less than half the cost of a dedicated system for these workloads.
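The three-year comparison can be sketched in Python as well. The totals quoted above correspond to a per-node power cost of US$ 500 per year, so that is what the sketch assumes; passing 535 instead raises the on-premises total slightly without changing the conclusion. The function name `on_prem_cost` is hypothetical.

```python
import math

def on_prem_cost(vms_needed, vms_per_node, node_price, power_per_node_year, years):
    """Return (node count, purchase plus power cost) for a dedicated system."""
    nodes = math.ceil(vms_needed / vms_per_node)
    total = nodes * (node_price + power_per_node_year * years)
    return nodes, total

years = 3
nodes, on_prem_total = on_prem_cost(25_000, 192, 14_500, 500, years)
cloud_total = 324_675 * years      # yearly cloud cost from the earlier estimate
ratio = cloud_total / on_prem_total

print(f"nodes needed: {nodes}")
print(f"on-premises, {years} years: US$ {on_prem_total:,}")
print(f"cloud, {years} years:       US$ {cloud_total:,}")
print(f"cloud / on-premises:  {ratio:.2f}")
```

The purchase price is paid once while the cloud bill recurs, so the gap narrows as the number of years grows; the sketch makes it easy to try a longer hardware lifetime.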

This is a very simplified analysis, because you could argue that the idle systems could run someone else's jobs. The point of the comparison, however, is to determine whether it is better to buy dedicated virtualized systems to run 25,000 jobs at the same time, 156 times a year, or to use the cloud. I think this rudimentary comparison still shows that this particular workload runs more cost-efficiently in the cloud than on on-premises resources, even with oversubscribed virtual machines.

Summary

Although the title of this article is about HPC in the cloud, it's really about two things: the evolution of HPC into research computing and how cloud computing can be used to solve research computing problems. At first, it was fairly easy to dismiss cloud computing for traditional HPC workloads. The "HP," after all, stands for "high performance," and doing anything to reduce performance is counterproductive. You are paying more and getting less. However, new workloads are being added to HPC all the time that might be very different from the classic MPI applications in HPC and have different characteristics. The amount of computation in these new workloads is increasing at an alarming rate – so much so that I think HPC is giving way to RC (research computing).

In this article, I gave two examples of new workloads that are helping to morph HPC into RC. The first example is an application class that needs to run on thousands of cores serially, doesn't run very long, and doesn't do a great deal of I/O, but all instances of the application need to run at the same time. The applications are varied, but they share these common aspects, particularly the need to run all the applications at about the same time.

Until a few years ago, I didn't hear too much about these applications, but in recent years, they've become more and more common at HPC centers. Improving the per-core performance will not help overall productivity because the applications run so quickly. What really improves productivity is running all instances of the application at the same time. This makes the researcher much more productive than having just a few applications run at a time.

In the second example, applications run on the web are being used for data post-processing, as well as data creation. These applications need web servers on which to share and investigate data and research results. In the past, these applications had to be run on IT department web servers, although they are really RC applications, and the IT departments don't really know how to handle these requests because their mission is a bit different. Consequently, these applications are increasingly run by the research computing team.

The second theme of this article is that many of the workloads in RC can be tackled by cloud computing that is not necessarily on-premises (rather, in the public cloud). The characteristics of some of the workloads are such that putting them in the cloud can save money relative to running them on traditional HPC hardware. In many cases, it can also save time, because you can very quickly spin up a set of resources larger than anything you might have in the HPC center. Moreover, moving these workloads to the cloud can make your traditional HPC systems more efficient because you do not have large applications blocking the queues.

I consider cloud computing a tool or technique for solving research computing problems. Nothing more or less. It's not a panacea, nor should it be ignored. Issues that must be addressed include data movement and security, but it also can save you money and make your traditional HPC resources stretch further. If you examine your workloads and their characteristics carefully, I think you will be surprised how many can be run easily in the cloud.

Infos

[1] nanoHUB: https://nanohub.org/
[2] Galaxy: http://galaxyproject.org/
[3] Cycle Computing: http://www.cyclecomputing.com/
[4] Cycle Computing spins up 10,600 instances in Amazon's cloud: http://www.networkworld.com/news/2013/020713-cycle-computing-266512.html
[5] Dell configuration tool: http://configure.us.dell.com/dellstore/config.aspx?oc=bemtx5b&model_id=poweredge-r815&c=us&l=en&s=bsd&cs=04
[6] Dell Energy Smart Solution Advisor: http://essa.us.dell.com/DellStarOnline/DCCP.aspx?c=us&l=en&s=corp&Template=6945c07e-3be7-47aa-b318-18f9052df893
