This paper compares the performance of workstation clusters from DEC (Alpha Farm), HP, and IBM (SP2) for scientific computing, using a collection of test suites designed to evaluate both serial and parallel performance.
It is possible to enhance the power of workstations for scientific
computations by interconnecting them via a high-speed communication network so
that they can be used to execute not only serial but also parallel programs.
Computers that use this mode of operation include
the IBM SP2,
the DEC Alpha Farm and clusters of HP workstations.
This study compares the performance of these computers
on a collection of test suites designed to evaluate serial
and parallel performance for scientific computing.
Parallelism is expressed in the parallel test suites by using
PVM (Parallel Virtual Machine) from
Oak Ridge National Laboratory.
IBM also provides an optimized version of PVM, called PVMe.
Performance results for both PVM and PVMe are reported for the IBM SP2.
Often a significant portion of the total execution time of large scientific applications is due to extensive I/O to/from temporary storage. An advantage of the workstation clusters considered in this report is that each node has its own local disk for fast storage of temporary data. The performance of I/O to the local disk is measured for reads and writes of files of various sizes.
Whenever feasible, test suites are designed to evaluate performance for small, medium and large problems so the dependence of performance on problem size can be seen. All performance results are obtained using 64-bit real arithmetic. The Fortran and C compilers were set for high optimization, see Table 2. To ensure accurate timings, short tests are looped enough times to accumulate a total time of at least one second. Wall-clock timers were used to measure elapsed time for all parallel and I/O test suites; CPU timers were used for measuring single node performance. No effort is made to hand-optimize any of the codes. For a given vendor, the same compiler option(s) were used for all tests.
A few of the factors that influence performance are the node hardware and the interconnection network:
The HP workstations used for this study are
interconnected via an FDDI ring.
A multistage communication
network is used to interconnect the SP2 nodes.
The IBM SP2 can be configured with thin and/or wide nodes, both of which are
based on the 66.5 MHz RS6000 microprocessor;
only wide nodes are used for this study.
The DEC workstations are interconnected via an
FDDI crossbar switch.
Single node performance is compared by measuring the performance of a
collection of scientific kernels and application codes.
For the vendor-coded serial matrix multiply, the performance was nearly constant across the three problem sizes. However, an HP node was about 40% slower than an SP2 node and a DEC node was about 30% slower, see Table 3.
Tables 4 and 5
illustrate the performance of
matrix multiplication for non-unit stride memory accesses.
Notice that the SP2 node outperformed the other vendors for
problem sizes 50 and 300; however, for problem size 1000, the
performance degraded sharply and the SP2 did not perform as well as
the other vendors.
Tables 4 and 5
also show that the performance of these Fortran variants is
significantly below that of the vendor-optimized routine, see Table 3.
Clearly, these compilers are not generating code that can efficiently
utilize the underlying hardware.
The SP2 node performed best on the meteorology code, the HP node performed
best on the astrophysics code, and the DEC node performed best on
MDG, FLO52Q and QCD codes, see Table 7.
The I/O tests for this study initialize a specified amount of data, write it to a local disk file and then read it back. Table 8 shows I/O transfer rates in MB/sec for files of sizes 5, 25, 100, and 200 MB. Initialization time is not included. The I/O performance results are mixed. For file sizes of 5 and 25 MB, the SP2 node performed well on the READs but not on the WRITEs compared with the other vendors. For the 200 MB file, the SP2 node's I/O performance degraded significantly for both the READ and WRITE operations.
Notice the large transfer rates
for the READ operation for file sizes up to 100MB on
the SP2 node. This is probably due to buffering being done during the
WRITE process thereby eliminating the need for a
READ from the local disk.
Evaluating the performance of the inter-node communication network is an important part of evaluating the performance of a parallel computer. There are so many different ways communication can occur among nodes that it is not feasible to measure the performance of all of them. Tests are designed to evaluate the performance of some of the communication patterns, under heavy and light loads, that we feel are likely to occur during the execution of parallel scientific application codes. Thus, communication performance is measured for each of the following scenarios for 2, 4 and 8 nodes (except item 1, which applies to only two nodes) using PVM with messages of size 8 bytes, 1 KB, 100 KB, and 10 MB. Performance results for both PVM and PVMe are reported. Test routines are written with one node designated as the PVM master and all other nodes designated as PVM slaves. Communication tests are divided into two categories: (1) node-to-node communication, and (2) concurrent communication.
Ideally, communication performance results for tests 1.a-1.c would be the same for a given machine. However, Table 20 shows that this is not the case. Communication rates for test 1.a are not available for HP since the call to pvmfsetopt was accidentally commented out for this one test; the problem was not discovered until after the HP cluster was no longer available for dedicated usage. Notice that the communication rate drops when going from the 100 KB message to the 10 MB message for each vendor and for each of the three tests (except for the SP2 PVM results for test 1.b). This drop is probably a result of network saturation. In all cases, the SP2 with PVMe significantly outperforms the others.
To better evaluate the performance of the broadcast operation, we define a Normalized Broadcast Rate as $R_N/(N-1)$, where the total data rate $R_N$ is measured in KB/sec and $N$ is the total number of nodes involved in the communication. Let $r$ be the data rate, in KB/sec, when a message is sent from node 1 to node 2. Let $R_N$ be the total data rate for broadcasting the same message from node 1 to the other $N-1$ nodes. If the broadcast operation and communication network were able to concurrently transmit messages to all other nodes, then $R_N = (N-1)\,r$. In this case, the Normalized Broadcast Rate would remain constant as $N$ increases; hence the rate at which the Normalized Broadcast Rate decreases as $N$ increases indicates how far from optimum the broadcast operation is actually performing.
Tables 9 through 12 summarize the Normalized Broadcast Rate performance results (items 2.a - 2.d above) using 2, 4 and 8 nodes. Tests 2.a and 2.c are designed to determine whether broadcasting from the master node performs differently than broadcasting from a slave node; Tables 9 and 11 show that it does not. The same holds for tests 2.b and 2.d, see Tables 10 and 12. For each vendor, the normalized communication rate drops as the number of nodes increases. DEC outperforms HP for most of the 2.a-2.d tests. However, in all these cases, the concurrent communication rate is significantly higher for the SP2 with PVMe.
As above, let $N$ be the number of nodes, numbered from 1 to $N$. With this numbering, tests 2.e - 2.g are designed to measure the performance of communication between neighboring nodes, where nodes 1 and $N$ are considered neighbors. Test 2.g, a variation of test 2.e, is chosen to determine the impact of node ordering on performance. Also observe that the aggregate data rate for these tests should increase proportionally with the number of nodes being utilized, since the communication can be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, for these tests we define a Normalized Data Rate as $R_N/N$, where the aggregate data rate $R_N$ is measured in KB/sec. In an ideal communication network, the Normalized Data Rate would remain constant as $N$ increases; hence the degree to which the rate is not constant indicates how far from ideal the given communication network is actually performing. Tables 13 through 15 show that the Normalized Data Rate for the SP2-PVMe remains nearly constant as the number of processors increases, whereas this is not the case for the others.
Ideally, the normalized data rates for tests 2.e and 2.g should be
the same for a given vendor.
This is in fact nearly true for each vendor (except for some of the
SP2-PVM results), see Tables 13 and 15.
This shows that the communication rate is independent of node ordering, at
least for the 2.e and 2.g tests for 2, 4 and 8 processors.
For all communication tests, PVMe on the SP2 significantly outperformed the others.
In this section, performance results of a parallel matrix multiply code and three application codes on various workstation clusters are presented.
The parallel matrix-times-matrix multiplication, $C \leftarrow C + A B$, is evaluated for square matrices of sizes 10, 100, 500 and 1000 on 1, 2, 4 and 8 nodes. For these tests, matrix multiplication is parallelized as follows. Let $n$ be the size of each square matrix and let $p$ be the number of nodes being utilized. For ease of illustration, assume $p$ divides $n$ and let $m = n/p$. First, from node 1, broadcast all of $A$ to each of the other nodes and send the second $m$ columns of $B$ and $C$ to node 2, the next $m$ columns of $B$ and $C$ to node 3, ..., the last $m$ columns of $B$ and $C$ to node $p$. Each node then computes $A$ times the appropriate column block of $B$ and adds these results to the appropriate column block of $C$. All updated column blocks of $C$ are then sent back to node 1. The same PVM code is used for all these tests; thus, for a single node, both the master and slave programs execute on the same node. Table 16 presents the performance in Mflops based on wall-clock timings. Notice that the fast communication rates of PVMe allow the IBM SP2 to perform very well compared with the other vendors.
The following scenario will typically occur when measuring parallel performance of application codes. For small problems the ratio of communication to computation time will usually be large, thus making performance results for small problems highly dependent on the performance of the communication network. In contrast, for large problems, the ratio of communication to computation time will usually be small, thus making performance results for large problems highly dependent on the performance of each node. For these reasons, the performance of the parallel application codes is measured for small, medium and large problem sizes. All three of the application codes considered in this study assume that the number of nodes is small compared with the problem size.
The first parallel application code considered was obtained from Peter Michielse. It is written in Fortran and is based on a two-dimensional oil reservoir simulation that uses multigrid and domain decomposition techniques. The master program distributes the initial domain decomposition, after which each processor handles part of the computational domain. Communication takes place at various stages of the program: during the computation of residuals, during the actual smoothing process (a variant of block Gauss-Seidel), and during the restriction to coarser multigrid levels. The coarsest levels are handled by applying a stepwise agglomeration/de-agglomeration technique. The results are summarized in Table 17. Notice that the performance results are mixed, with no machine outperforming the others in all cases. The HP cluster performs the worst on all tests. For a single node, the DEC Alpha cluster performs best. For two and four nodes, the SP2 outperforms the Alpha cluster for low multigrid levels and vice versa for high multigrid levels.
The second parallel application code was obtained from Ruud van der Paas. It is written in C and is a generalized red/black Poisson solver. This application applies a general domain decomposition technique to a two-dimensional computational domain. Communication is needed across the internal boundaries between the subdomains and consists of exchanging data in overlap regions. Within each subdomain, a generalized red/black Poisson solver is applied, which has the flexibility to adjust the number of so-called inner iterations to the number of data exchange sweeps. The results are summarized in Table 18. Notice that the SP2 with PVMe outperforms the other machines in most cases.
The third parallel application code is written in Fortran and was obtained
from Jean Castel-Branco from the Universite Catholique de Louvain, Belgium.
This code uses a finite difference method and domain decomposition to solve a
two-dimensional diffusion equation for hydrodynamic simulations. The PVM
master performs the domain decomposition by breaking the 512x512 two-dimensional
domain into subdomains of size 512x(512/$p$), where $p$ is the number of nodes
used. The PVM slaves solve the diffusion equation on their subdomains and
pass messages to contiguous neighboring subdomains.
Table 19 summarizes the performance results for this code.
For this application code, the comparative performance results are mixed,
with the DEC Alpha Farm outperforming the others in two of the three cases.
The performance data contained in this study is from a limited set of
scientific kernels and application codes and extrapolating this data to other
applications may lead to incorrect conclusions.
The performance of a collection of workstations interconnected via
a communication network for execution of parallel application codes
will depend on the performance of each workstation node, on the
communication network, and on how the application code is parallelized.
Single node performance results were mixed, with different vendors
outperforming the others depending on the test chosen. The I/O
performance results were also mixed, with no single vendor outperforming
the others.
Notice that IBM's optimization of PVM for the SP2 (that is, PVMe) provides
significant improvement in performance over the non-optimized PVM.
PVMe on the SP2 significantly outperformed the others
on the communication tests.
The fast communication rate achieved by PVMe helped the SP2 to
outperform the other vendors for many of the test cases for
the parallel application codes;
however, DEC and HP did outperform the SP2 on several of these tests.
For all of the machines evaluated, broadcast rates did not scale
well as the number of processors increased from two to eight. This is
also true for the other concurrent communication tests with
the exception of SP2-PVMe results.
The authors would like to thank the Cornell Theory Center, the Pittsburgh
Supercomputing Center and the Maui High Performance Computing Center for
allowing us to use their machines for this study. The authors would also
like to thank Jean Castel-Branco from the Universite Catholique de Louvain
for allowing us to use his hydrodynamic code,
Ruud van der Paas for the generalized red/black Poisson solver
and Peter Michielse for the oil reservoir simulation code for this study.
We also thank Bill Celmaster from DEC for providing performance results
for the Alpha Farm,
and Dan Nordhues from HP for providing us
with performance results on HP workstations.