Application development for the Cell processor

Cell Culture

© Dmitry Sunagatov, Fotolia

Article from Issue 99/2009
Author(s):

The Cell architecture is finding its way into a vast range of computer systems, from huge supercomputers to inconspicuous PlayStation game consoles. We'll show you around the Cell and take a look at a sample Cell application.

Sony Computer Entertainment, Toshiba, and IBM started developing the innovative Cell Broadband Engine Architecture (CBEA) around 2001. The Cell architecture specializes in efficient processing of large data streams, such as the streams that occur in multimedia applications or computer games. The first implementation of the Cell architecture is the Cell Broadband Engine, also known as the Cell processor, which dates back to 2005 (Figure 1). Since it was introduced as the processor for the Sony PlayStation 3, the Cell CPU has attracted much attention. Although the PlayStation (Figure 2) is certainly the most widespread application of the Cell architecture, the most spectacular application has to be the Roadrunner supercomputer (Figure 3), which uses more than 12,000 Cell processors [1].

© IBM

Cell blades are available from both IBM and Mercury Computer Systems. Mercury has even built a PCI Express card that carries a complete Cell-based computer. Toshiba uses a variant of the Cell processor in its Qosmio notebooks.

In addition to its power and flexibility, the Cell is also known for energy efficiency. Cell-based systems currently hold the top seven spots in the Green 500 List [2] of the most energy-efficient supercomputers. In this article, I explore the Cell architecture and describe an example application that will help you get started with programming for the Cell.

The Cell processor is best suited to problems that need a large amount of computing power but split easily into separate tasks. The individual Cell processor cores then process these tasks in parallel.

The Cell processor consists of a conventional processor core (Power Processing Element, PPE) with 64-bit IBM Power Architecture and eight Synergistic Processing Elements (SPE; see Figure 4). Each of the eight SPEs has 256KB of local memory and a DMA controller (Memory Flow Controller, MFC). All nine processors are linked by a data bus (Element Interconnect Bus, EIB) to each other, the main memory, and the peripheral devices.

While the operating system on the PPE manages system resources, the SPEs handle the arithmetic. Their 128-bit registers hold either four 32-bit values (integers or single-precision floating-point numbers) or two 64-bit values (long integers or double-precision floating-point numbers), and a single instruction operates on all of these values at once. This SIMD (Single Instruction, Multiple Data) architecture is comparable to the SIMD extensions of PC processors, such as MMX and SSE.
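To make the idea of several values per instruction concrete, here is a minimal sketch in C. It does not use the SPU intrinsics from the Cell SDK; instead, it assumes a GCC or Clang compiler and relies on their generic vector extension (vector_size), so it can be tried on an ordinary PC. The type and function names (vec4f, vec2d, add4, mul2, lane) are made up for this illustration.

```c
#include <assert.h>

/* 128-bit vector types built with the GCC/Clang vector extension.
   Hypothetical names; the Cell SDK provides its own vector types. */
typedef float  vec4f __attribute__((vector_size(16)));  /* 4 x 32-bit float  */
typedef double vec2d __attribute__((vector_size(16)));  /* 2 x 64-bit double */

/* A single '+' adds all four float lanes. */
vec4f add4(vec4f a, vec4f b)
{
    return a + b;
}

/* A single '*' multiplies both double lanes. */
vec2d mul2(vec2d a, vec2d b)
{
    return a * b;
}

/* Read one lane of a four-float vector, with a bounds check. */
float lane(vec4f v, int i)
{
    assert(i >= 0 && i < 4);
    return v[i];
}
```

Calling add4() on the vectors {1, 2, 3, 4} and {10, 20, 30, 40} yields {11, 22, 33, 44}. Whether the compiler really emits one hardware instruction per operation depends on the target; on an SPE, the equivalent work would be expressed through the SDK's intrinsics.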

One peculiarity of the SPEs is that they only work with code and data stored in their local memory; they cannot directly address main memory or peripherals. Applications must ensure that the right code and data are available locally. Transfers between main memory and local SPE memory are handled by the SPE's DMA controller and do not burden the SPE's compute core, the Synergistic Processor Unit (SPU).

Linux on the Playstation

In contrast to other console manufacturers, Sony officially supports the installation of Linux on the PlayStation, and you will find many how-tos on the web [6]. There are two things to note about running Linux on the PlayStation 3. First, direct access to the hardware is not supported; to protect its proprietary firmware, Sony added a virtualization layer. Second, only six of the Cell processor's eight SPEs are available to Linux programs.

Developing for the Cell

Of course, developing applications for the Cell processor is more appealing for those who have access to a Cell-based machine. If you work on a Cell blade server, you will probably develop your applications directly on the Cell platform. If you have a PlayStation 3, it makes more sense to use a Linux PC as your development platform. The PlayStation has only 256MB of RAM, and the shortage becomes fairly obvious when you work with an X11 interface.

IBM provides a free Software Development Kit (SDK) for the Cell architecture [3]. The Cell SDK will run on the x86, x86_64, and PowerPC platforms, as well as on Cell-based Linux machines. The latest version of the Cell SDK (3.1) supports Fedora 9 and Red Hat Enterprise Linux 5.2. The kit includes the Developer and Extras CD images and an RPM package with the installation script. Up to version 3.0, the Cell SDK for Fedora included a system simulator, which would let programmers test and optimize applications without physical Cell hardware. As of Version 3.1, the simulator is available separately from the IBM website [4]. The new Version 3.1 is still in beta, but it works perfectly on Fedora 9.

According to IBM, the minimum hardware requirement is an Intel Pentium 4 with 2GHz clock speed or an AMD Socket F Opteron. On top of this, the SDK needs 1GB RAM and 5GB free disk space. To install the Cell SDK on Linux, you also need the rsync, sed, Tcl, and wget packages. Because the installation script downloads various packages from the Barcelona Supercomputing Center [5], you will need continuous Internet access throughout the installation.

The cell-install-3.1.0-0.0.noarch.rpm RPM creates an /opt/cell directory for the developer environment and documentation. The installation script expects the path to the CD images as an option:

/opt/cell/cellsdk --iso path install

This variant has the advantage that you can install the content of both images in a single process. If you have the Cell SDK CD images on separate CDs, you need to insert the Developer and Extras CDs one after the other and launch the installation separately for each by typing /opt/cell/cellsdk install. If you have installed the system simulator, you can initialize it using the /opt/cell/cellsdk_sync_simulator script, which installs some required SDK elements. The ISOs contain several libraries that are not open source. For installation documentation, check out /opt/cell/sdk/docs/install.

PI on the Cell

Applications for the Cell processor consist of at least two parts: a program that runs on the PowerPC core (PPE program), and at least one program that keeps the SPEs busy (SPE program). To allow the PPE program to control the SPE software, the PPE source code must include the libspe2.h header file from the Libspe2 library. The SPE program contains the actual calculating routines. An SPE program must include the spu_intrinsics.h and spu_mfcio.h header files for SIMD calculation functions and for communication with the PPE and the other SPEs.

The example program described in this article provides an approximation of PI using the Shotgun algorithm (see the box titled "The Mathematical Shotgun"). The program expects two command-line parameters: the number of random number pairs to generate and the number of SPEs to use. After the main() routine in pi_libspe_ppe.c has parsed the command line for this information, the program dynamically allocates three arrays. The first array holds one structure per SPE with the parameters that the PPE and that SPE exchange. The spe_par_t structure type is declared in the pi_libspe.h header (Listing 1). The second array holds one structure per SPE with the SPE context, which contains everything the PPE needs to know about a program running on an SPE. The data type for the context is declared in libspe2.h.

Listing 1

Header File pi_libspe.h

 

The Mathematical Shotgun

Several methods exist for calculating an approximate value for PI. The Shotgun algorithm involves the computer calculating pairs of random numbers between 0 and 1 (Figure 5). Each pair represents a point in a square with an edge length of 1, where the bottom left corner has the coordinates (0,0) and the top right corner the coordinates (1,1).

Assuming that the dots are spread evenly across the square, the ratio between the number of dots that lie inside the quarter circle of radius 1 around the origin (that is, dots with x² + y² ≤ 1) and the total number of dots is approximately equal to the ratio between the area of a quarter circle with a radius of 1 and the area of a square with an edge length of 1, which is exactly PI/4. Multiplying the measured ratio by 4 therefore yields an approximation of PI.
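The algorithm in the box can be sketched in a few lines of plain C. This is a scalar illustration for an ordinary PC, not the article's Cell program: the function names (next_unit, count_hits, estimate_pi), the xorshift64 random number generator, and the seeding scheme are all assumptions made for this sketch. The loop over workers runs sequentially here and only mimics how independent partial counts, for example one per SPE, could be combined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: xorshift64 generator returning a double in [0, 1). */
double next_unit(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    /* Keep the top 53 bits and scale by 2^-53. */
    return (double)(x >> 11) * (1.0 / 9007199254740992.0);
}

/* Fire n random shots at the unit square and count those
   that land inside the quarter circle x*x + y*y <= 1. */
uint64_t count_hits(uint64_t n, uint64_t seed)
{
    assert(seed != 0);  /* xorshift must not start from an all-zero state */
    uint64_t state = seed;
    uint64_t hits = 0;
    for (uint64_t i = 0; i < n; i++) {
        double x = next_unit(&state);
        double y = next_unit(&state);
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

/* Split the shots evenly across 'workers' independent streams,
   add up the partial hit counts, and scale the ratio by 4. */
double estimate_pi(uint64_t shots, unsigned workers)
{
    assert(workers > 0 && shots >= workers);
    uint64_t per_worker = shots / workers;
    uint64_t hits = 0;
    for (unsigned w = 0; w < workers; w++)
        hits += count_hits(per_worker, 0x9E3779B97F4A7C15ULL * (w + 1));
    return 4.0 * (double)hits / (double)(per_worker * workers);
}
```

A call such as estimate_pi(4000000, 4) simulates four workers with one million shots each. Because each worker only returns a hit count, the partial results are trivial to combine, which is what makes the algorithm a good fit for several cores working in parallel.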

Addressing

The start addresses for variables that the PPE and SPE need to exchange later must be integer multiples of 16 bytes, or better still integer multiples of 128 bytes for the best possible data transfer performance. Programmers can achieve this by using the posix_memalign() function instead of the conventional malloc(). The size of the individual blocks exchanged by the PPE and SPE also must be a multiple of 16 bytes. If inexplicable bus errors occur when you test the application, they are often the result of an incorrectly aligned start address or an illegal size in the data blocks transferred. The third array is only used internally by the PPE program and does not have to fulfill any special requirements with regard to start addresses or sizes.
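The two rules can be illustrated with a short C sketch that runs on any POSIX system. The structure demo_par_t and the helpers round_up_16() and alloc_params() are hypothetical names invented for this example; they are not the article's spe_par_t or part of the Cell SDK. Only posix_memalign() is a real library call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign() visible */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round a byte count up to the next multiple of 16. */
size_t round_up_16(size_t bytes)
{
    return (bytes + 15u) & ~(size_t)15u;
}

/* Hypothetical parameter block, padded by hand so that
   its size is a multiple of 16 bytes. */
typedef struct {
    uint64_t shots;    /* input: number of random pairs            */
    uint64_t hits;     /* output: pairs inside the quarter circle  */
    uint32_t seed;     /* input: random seed                       */
    uint8_t  pad[12];  /* 8 + 8 + 4 + 12 = 32 bytes                */
} demo_par_t;

/* Allocate 'count' parameter blocks that start on a 128-byte
   boundary. Returns NULL on failure; release with free(). */
demo_par_t *alloc_params(size_t count)
{
    void *mem = NULL;
    size_t bytes = round_up_16(count * sizeof(demo_par_t));
    if (bytes == 0 || posix_memalign(&mem, 128, bytes) != 0)
        return NULL;
    assert((uintptr_t)mem % 128 == 0);
    return mem;
}
```

With alloc_params(6), for example, the PPE side would get six 32-byte blocks in one buffer whose start address is a multiple of 128. Because each block is 32 bytes long, every element in the array also starts on a 16-byte boundary.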
