The fundamentals of an HPC cluster
Layer 2: Architecture and Tools
The next layer of software adds tools to help reduce cluster problems and make it easier to administer. Using the basic software mentioned in the previous section, you can run parallel applications, but you might run into difficulties as you scale your system, including:
- Running commands on each node (parallel shell)
- Configuring identical nodes (package skew)
- Keeping the same time on each node (NTP)
- Running more than one job (job scheduler/resource manager)
These issues arise as you scale the cluster, but even for a small two-node cluster, they can become problems.
First, you need to be able to run the same command on every node, so you don't have to SSH to each and every node. One solution would be to write a simple shell script that takes the command-line arguments as the "command" and then runs the command on each node using SSH. However, what happens if you only want to run the command on a subset of the nodes? What you really need is something called a parallel shell.
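The simple script described above might look like the following minimal sketch. It assumes passwordless SSH to every node and a hypothetical file, nodes.txt, with one hostname per line; the names run_on_nodes, NODEFILE, and SSH_CMD are made up for illustration.

```shell
#!/bin/sh
# Minimal sketch: run one command on every node listed in a file.
# Assumptions (not from the article): NODEFILE holds one hostname per line,
# and passwordless SSH to each node already works.
NODEFILE="${NODEFILE:-nodes.txt}"
SSH_CMD="${SSH_CMD:-ssh}"   # overridable, e.g. set to "echo" for a dry run

run_on_nodes() (
    # Read hostnames on file descriptor 3 so the remote command's stdin
    # cannot swallow the rest of the node list.
    while read -r node <&3; do
        [ -n "$node" ] || continue          # skip blank lines
        # Prefix each output line with the node name.
        $SSH_CMD "$node" "$@" 2>&1 | sed "s/^/$node: /"
    done 3< "$NODEFILE"
)

# Usage: run_on_nodes uname -r
```

The loop visits the nodes one after another, so it gets slow on larger clusters, and choosing a subset means editing the node file by hand. Those are the gaps a real parallel shell fills.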
Several parallel shell tools [4] are available, and the most common is pdsh [5], which lets you run the same command across each node. However, simply having a parallel shell doesn't mean the cluster will magically solve all problems, so you have to develop some procedures and processes. More specifically, you can use a parallel shell to overcome the second issue: package skew.
Package skew can cause lots of problems for HPC admins. If you have an application that runs fine one day, but when you try it again the next day it won't run, you have to start looking for reasons why. Perhaps during the 24-hour period, a node that had been down suddenly comes back to life, and you start running applications on it. That node might not have the same packages or the same versions of software as the other nodes. As a result, applications can fail, and they can fail in weird ways. Using a parallel shell, you can check that each node has the package installed and that the versions match.
To help with package skew, I recommend that, after first building the cluster and installing a parallel shell, you start examining key components of the installation. For example, check the following:
- glibc version
- GCC version
- GFortran version
- SSH version
- Kernel version
- IP address
- MPI libraries
- NIC MTU – Use ifconfig
- BogoMips – Although this number is meaningless, it should be the same across nodes if you are using the same hardware. To check this number, enter:
cat /proc/cpuinfo | grep bogomips
- Nodes have the same amount of memory (if they are identical):
cat /proc/meminfo | grep MemTotal
Many more package versions or system information can be checked, which you can store in a spreadsheet for future reference. The point is that doing this at the very beginning and then developing a process or procedure for periodically checking the information is important. You can quickly find package skew problems as they occur and correct them.
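Part of this periodic check can be scripted. The sketch below is one possible approach, not a standard tool: pdsh prefixes each line of output with "node: ", so a small filter (here called find_skew, a made-up name) can treat the most common value as the expected one and list the nodes that differ. The node names and the use of rpm are assumptions.

```shell
#!/bin/sh
# Sketch: flag nodes whose reported value differs from the majority.
# Input: lines in the "node: value" form that pdsh prints, read from stdin.
find_skew() {
    awk -F': ' '
        {
            node = $1
            value = substr($0, length($1) + 3)   # everything after "node: "
            seen[node] = value
            count[value]++
        }
        END {
            # The most common value is treated as the expected one
            # (on a tie, the choice is arbitrary).
            for (v in count)
                if (count[v] > best) { best = count[v]; expected = v }
            for (n in seen)
                if (seen[n] != expected)
                    print n ": " seen[n] " (expected " expected ")"
        }
    ' | sort
}

# Usage on a cluster (hypothetical node names, RPM-based distribution assumed):
#   pdsh -w 'node[01-04]' 'uname -r' | find_skew
#   pdsh -w 'node[01-04]' 'rpm -q glibc' | find_skew
#   pdsh -w 'node[01-04]' 'grep MemTotal /proc/meminfo' | find_skew
```

No output means every node that answered reported the same value. A node that is down prints nothing at all, so compare the list of responding nodes against your log as well.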
I also recommend keeping a good log so that if a node is down when you install or update packages, you can come back to it when the node is back up. Otherwise, you will start getting package skew in your nodes and subsequent problems.
The third issue to overcome is keeping the same time on each node. The Network Time Protocol synchronizes system clocks. Most distributions install ntp by default and enable it, but be sure you check for it in each node in the cluster – and check the version of ntpd as well.
Use chkconfig, if the distribution has this package, to check that ntp is enabled to start at boot. To confirm that it is actually running, look at the processes on the nodes to see whether ntpd is listed (hint – use your parallel shell). Configuring NTP can be a little tricky, because you have to pay attention to the architecture of the cluster.
On the master node, make sure that the NTP configuration file points to external servers (outside the cluster) and that the master node can resolve their hostnames (try using either ping to ping each server or nslookup). Also be sure the ntpd daemon is running.
For nodes that are on a private network that doesn't have access to the Internet, you should configure NTP to use the master node as the timekeeper. This can be done by editing /etc/ntp.conf and changing the NTP servers to point to the master node's IP address. Roughly, it should look something like Listing 1. The IP address of the master node is 10.1.0.250. Be sure to check that the compute nodes can ping this address. Also be sure that ntp starts when the nodes are booted.
Listing 1
/etc/ntp.conf
[root@test1 etc]# more ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
#driftfile /var/lib/ntp/drift
restrict default ignore
restrict 127.0.0.1
server 10.1.0.250
restrict 10.1.0.250 nomodify
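Once the configuration is in place, it is worth confirming that a compute node is really following the master. The following is a rough sketch: ntpq -pn lists the configured peers and marks the one currently selected for synchronization with a leading asterisk, so a one-line filter (is_synced, a made-up name) can test for it. The node names are hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if "ntpq -pn" output (read from stdin) shows a
# selected peer. ntpq marks the peer in use with a leading "*".
is_synced() {
    grep -q '^\*'
}

# Usage (hypothetical node names):
#   ssh node01 ntpq -pn | is_synced && echo "node01 is synchronized"
#   pdsh -w 'node[01-04]' 'pgrep -x ntpd >/dev/null && echo running || echo NOT running'
```

After ntpd starts, it can take several minutes before a peer is selected, so a failed check right after boot is not necessarily a problem.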
The last issue to address is the job scheduler (also called a resource manager). This is a key element of HPC and can be used even for small clusters. Roughly speaking, a job scheduler will run jobs (applications) on your behalf when the resources are available on the cluster, so you don't have to sit around and wait for the cluster to be free before you run applications. Rather, you can write a few lines of script and submit it to the job scheduler. When the resources are available, it will run your job on your behalf. (Resource managers allow HPC researchers to actually get some sleep.)
In the script, you specify the resources you need, such as the number of nodes or number of cores, and you give the job scheduler the command that runs your application, such as:
mpirun -np 4 <executable>
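As an illustration, here is what such a script could look like for Slurm, one of several possible schedulers. This is a sketch under assumptions: the job name, resource numbers, and the executable ./my_app are placeholders, and other resource managers use different directive syntax. The helper write_job_script is a made-up name that simply writes the job script to a file.

```shell
#!/bin/sh
# Sketch: write a minimal Slurm batch script to the file named in $1.
# Assumptions: Slurm is the scheduler; ./my_app is a hypothetical MPI program.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/sh
# Name shown in the queue
#SBATCH --job-name=test_job
# Number of nodes requested
#SBATCH --nodes=2
# Total number of MPI processes
#SBATCH --ntasks=4
# Wall-clock limit (hh:mm:ss)
#SBATCH --time=00:10:00
# Output file; %j expands to the job ID
#SBATCH --output=test_job.%j.out

mpirun -np 4 ./my_app
EOF
}

# Usage:
#   write_job_script job.sh
#   sbatch job.sh      # submit the job
#   squeue             # see whether it is pending or running
```

The lines beginning with #SBATCH are comments to the shell but directives to Slurm, which is how the same file serves as both the resource request and the commands to run.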
Among the resource managers available, many are open source, and they usually aren't too difficult to install and configure; however, be sure you read the installation guide closely. Examples of resource managers include:
- OpenLava [6]
- Slurm [7]
- Torque [8]
- SGE [9]
- OGE [10]
With these issues addressed, you now have a pretty reasonable cluster with some administrative tools. Although it's not perfect, it's most definitely workable. However, you can go to another level of tools, which I refer to as the third layer, to really make your HPC cluster sing.
Layer 3: Deep Administration
The third level of tools gets you deeper into HPC administration and begins to gather more information about the cluster, so you can find problems before they happen. The tools I will discuss briefly are:
- Cluster management tools
- Monitoring tools (how are the nodes doing)
- Environment Modules
- Multiple networks
A cluster management tool is really a toolkit to automate the configuration, launching, and management of compute nodes from the master node (or a node designated a master). In some cases, the toolkit will even install the master node for you. A number of open source cluster management tools are available, including:
- Warewulf [11]
- xCAT [12]
- Oscar [13]
- oneSIS [14]
Some very nice commercial tools exist as well.
The tools vary in their approach, but they typically allow you to create compute nodes that are part of the cluster. This can be done via images, in which a complete image is pushed to the compute node, or via packages, in which specific packages are installed on the compute nodes. How this is accomplished varies from tool to tool, so be sure you read about them before installing them.
The coolest thing about these tools is that they remove the drudgery of installing and managing compute nodes. Even with four-node clusters, you don't have to log in to each node and fiddle with it. The ability to run a single command and reinstall identical compute nodes can eliminate so many problems when managing your cluster.
Many of the cluster management tools also include tools for monitoring the cluster. For example, being able to tell which compute nodes are up or down or which compute nodes are using a great deal of CPU (and which aren't) is important information for HPC administrators.
Monitoring the various aspects of your nodes, including gathering statistics on the utilization of your cluster, pays off when it's time to ask the funding authorities for additional hardware, whether that is the household CFO, a university, or an agency such as the National Science Foundation. Regardless of who it is, they will want to see statistics about how heavily the cluster is being used.
Several monitoring tools are appropriate for HPC clusters, but a universal tool is Ganglia [15]. Some of the cluster tools come preconfigured with Ganglia, and some don't, requiring a separate installation [16]. By default, Ganglia comes with some predefined metrics, but the tool is very flexible and allows you to write simple code to collect specific metrics from your nodes.
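One way to add your own metric is Ganglia's gmetric command-line tool, which hands a single value to the local gmond daemon. The wrapper below is only a sketch: publish_metric is a made-up name, the nfs_mounts metric is an invented example, and it assumes gmond is already running on the node.

```shell
#!/bin/sh
# Sketch: publish one custom value to Ganglia through gmetric.
# Assumption: gmond is running on this node with its default configuration.
publish_metric() {
    # $1 = metric name, $2 = numeric value, $3 = units label
    gmetric --name="$1" --value="$2" --type=float --units="$3"
}

# Usage (run periodically, e.g. from cron; the metric is hypothetical):
#   publish_metric nfs_mounts "$(grep -c ' nfs' /proc/mounts)" mounts
```

Each call sends one sample, so the metric only shows up as a graph if the command is repeated at regular intervals.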
Up to this point, you have the same development tools, the same compilers, the same MPI libraries, and the same application libraries installed on all of your nodes. However, what if you want to install and use a different MPI library? Or what if you want to try a different version of a particular library?
At this moment, you would have to stop all jobs on the cluster, install the libraries or tools you want, make sure they are in the default path, and then start the jobs again. This process sounds like an accident waiting to happen. The way to prevent it is called environment modules.
Originally, environment modules [17] were developed to address the problem of having applications that need different libraries or compilers by allowing you to modify your user environment dynamically with module files. You can load a module file that specifies a specific MPI library or makes a specific compiler version the default.
After you build your application using these tools and libraries, if you run an application that uses a different set of tools, you can "unload" the first module file and load a new module file that specifies a new set of tools. It's all very easy to do with a job script and is extraordinarily helpful on multiuser systems.
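In a job script, the unload-and-load sequence might look like the sketch below. The module names are hypothetical (run module avail to see what your site provides), and switch_mpi is a made-up helper name.

```shell
#!/bin/sh
# Sketch: replace one loaded module with another, then show the result.
# Assumption: Environment Modules is initialized in this shell, so the
# "module" command is available.
switch_mpi() {
    # $1 = module to unload, $2 = module to load in its place
    module unload "$1"
    module load "$2"
    module list          # print what is loaded now
}

# Usage inside a job script (hypothetical module names and executable):
#   switch_mpi openmpi/1.8 mpich/3.1
#   mpirun -np 4 ./my_app
```

Because the change affects only the shell that runs these commands, other users and other jobs keep whatever environment they loaded.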
Lmod [18] is a somewhat new version of environment modules that addresses the need for module hierarchies (in essence, module dependencies) so that a single module "load" command can load a whole series of modules. Lmod currently is under very active development.
Up to now, I have assumed that all traffic in the cluster, including administration, storage, and computation, uses the same network. For improved computational performance or improved storage performance, though, you might want to contemplate separating the traffic into specific networks.
For example, you might consider a separate network just for administration and storage traffic, so that each node has two private networks: one for computation and one for administration and storage. In this case, the master node might have three network interfaces.
Separating the traffic is pretty easy: give each network interface (NIC) in the node an IP address in a different address range. For example, eth0 might be on a 10.0.1.x network, and eth1 on a 10.0.2.x network. Although, theoretically, you could give all interfaces an address in the same IP range, different IP ranges just make administration easier. Now when you run MPI applications, you use addresses in 10.0.1.x. For NFS and any administration traffic, you would use addresses in 10.0.2.x. In this way, you isolate computational traffic from all other traffic.
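To make the MPI side concrete, the sketch below builds a hostfile that contains only compute-network addresses. It assumes the compute nodes are numbered 1 through N on the 10.0.1.x network from the example above; make_hostfile and ./my_app are made-up names, and the mpirun options shown in the usage comment are specific to Open MPI.

```shell
#!/bin/sh
# Sketch: print one compute-network address per line for use as a hostfile.
# Assumption: compute nodes are numbered 1..N within the given network.
make_hostfile() (
    prefix=$1   # first three octets, e.g. 10.0.1
    count=$2    # number of compute nodes
    i=1
    while [ "$i" -le "$count" ]; do
        echo "$prefix.$i"
        i=$((i + 1))
    done
)

# Usage with Open MPI (other MPI libraries use different options):
#   make_hostfile 10.0.1 4 > hosts.compute
#   mpirun --hostfile hosts.compute --mca btl_tcp_if_include eth0 -np 4 ./my_app
```

Restricting Open MPI's TCP transport to eth0 keeps it from also using the administration and storage interface, which it might otherwise do when a node has more than one network.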
The upside to isolating traffic is additional bandwidth in the networks. The downside is twice as many ports, twice as many cables, and a little more cost. However, if the cost and complexity aren't great, using two networks while you are learning cluster administration, or even writing applications, is recommended.
Summary
Stepping back to review the basics is a valuable exercise. In this article, I wanted to illustrate how someone could get started creating their own HPC system. If you have any comments, post to the Beowulf mailing list [19]. I'll be there, as will a number of other people who can help.
Infos
[1] Fat tree topology: https://en.wikipedia.org/wiki/Fat_tree
[2] Open MPI: http://www.open-mpi.org/
[3] MPICH: https://www.mpich.org/
[4] Parallel shell tools: http://www.linuxpromagazine.com/Issues/2014/166/Parallel-Shells
[5] pdsh: https://code.google.com/p/pdsh/
[6] OpenLava: http://www.openlava.org/
[7] Slurm: http://slurm.schedmd.com/
[8] Torque: http://www.adaptivecomputing.com/products/open-source/torque/
[9] SGE: https://arc.liv.ac.uk/trac/SGE
[10] OGE: http://gridscheduler.sourceforge.net
[11] Warewulf: http://warewulf.lbl.gov/trac
[12] xCAT: http://sourceforge.net/projects/xcat/
[13] Oscar: http://svn.oscar.openclustergroup.org/trac/oscar
[14] oneSIS: http://onesis.org/
[15] Ganglia: http://ganglia.sourceforge.net/
[16] Installing Ganglia: http://www.admin-magazine.com/HPC/Articles/Monitoring-HPC-Systems
[17] Environment modules: http://modules.sourceforge.net/
[18] Lmod: http://www.admin-magazine.com/HPC/Articles/Lmod-Alternative-Environment-Modules
[19] Beowulf mailing list: http://www.beowulf.org/mailman/listinfo/beowulf