2011 Second International Conference on Intelligent Systems, Modelling and Simulation
Performance Evaluation of Intel and Portland Compilers Using
Intel Westmere Processor
Muhammed Al-Mulhem
Department of Computer Science, KFUPM
mulhem@kfupm.edu.sa

Raed Al-Shaikh
EXPEC Computer Center, Saudi Aramco
raed.shaikh@aramco.com

Abstract - In recent years, we have witnessed a growing interest in optimizing parallel and distributed computing solutions using scaled-out hardware designs and scalable parallel programming paradigms. This interest is driven by the fact that microchip technology is gradually reaching its physical limitations in terms of heat dissipation and power consumption. Therefore, and as an extension to Moore’s law, recent trends in high-performance and grid computing have shown that future increases in performance can only be reached through increases in system scale, using a larger number of components supported by scalable parallel programming models. In this paper, we evaluate the performance of two commonly used parallel compilers, Intel and Portland’s PGI, on a state-of-the-art Intel Westmere-based HPC cluster. The performance evaluation is based on two sets of experiments, one evaluating the compilers’ performance using an MPI-based code and the other using
OpenMP. Our results show that, for matrix-dependent scientific applications, the MPI and OpenMP features of the Intel compiler outperform those of PGI on the defined HPC cluster.

Current multi-core processor designs from Intel and AMD include the Westmere and Phenom II CPUs, respectively [7].
On the HPC interconnects side, several network interconnects provide ultra-low latency (less than 1 microsecond) and high bandwidth (several gigabytes per second). Some of these interconnects also provide flexibility by permitting user-level access to the network interface cards for performing communication and by supporting access to remote processes’ memory address spaces [1]. Examples of these interconnects are Myrinet from Myricom, Quadrics and Infiniband [1]. The experiments in this paper are conducted on the Infiniband architecture, which is one of the latest industry standards, offering low latency and high bandwidth as well as many advanced features such as Remote Direct Memory Access (RDMA), atomic operations, multicast and QoS [2].
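As an application-level illustration of such remote-memory access, the following is a minimal sketch (our example, not taken from the paper) using MPI one-sided operations, which MPI libraries typically implement on top of RDMA; the window size and variable names are illustrative.

/* Minimal MPI one-sided (RMA) sketch: rank 0 writes directly into a
 * memory window exposed by rank 1, illustrating RDMA-style access to a
 * remote process's address space.  Sizes and names are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[N], win_mem[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes win_mem; the interconnect can then service
     * remote puts/gets without involving the target process's CPU */
    MPI_Win_create(win_mem, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open access epoch */
    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = (double)i;
        /* write buf into rank 1's window, starting at displacement 0 */
        MPI_Put(buf, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                  /* close epoch; data is visible */

    if (rank == 1)
        printf("win_mem[N-1] = %f\n", win_mem[N - 1]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The sketch must be run with at least two ranks; rank 0 writes into rank 1's window without a matching receive on the target side.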
Currently, available Infiniband products can achieve a latency of 200 nanoseconds for small messages and a bandwidth of up to 3-4 GB/s [1]. As a result, Infiniband is becoming increasingly popular as a high-speed interconnect option for building high-performance clusters.
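As a rough illustration of how such point-to-point figures are typically obtained, the following is a minimal MPI ping-pong sketch (not part of the paper's benchmark suite); the message size and iteration count are illustrative assumptions.

/* Minimal MPI ping-pong sketch for estimating point-to-point performance
 * between two ranks.  With a message of a few bytes the one-way time
 * approximates latency; with large messages (as below) the derived rate
 * approximates bandwidth.  Sizes and iteration counts are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int bytes = 1 << 20;                /* 1 MiB per message */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* start both ranks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                      /* ping */
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* pong */
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * iters);   /* seconds per transfer */
        printf("one-way time: %.2f us\n", one_way * 1e6);
        printf("bandwidth:    %.2f GB/s\n", bytes / one_way / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Production benchmarks additionally run warm-up iterations and sweep over a range of message sizes.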
On the parallel programming level, MPI and OpenMP have become the de facto standards for expressing parallelism in a program. OpenMP provides a fork-and-join execution model, in which a program begins execution as a single process or thread. This thread executes sequentially until a parallelization directive for a parallel region is encountered. At that point, the thread creates a team of threads and becomes the master thread of the new team. All threads execute the statements of the parallel region until its end, and work-sharing directives are provided to divide the execution of the enclosed code region among the threads. The advantage of OpenMP is that an existing code can be easily parallelized by placing OpenMP directives around time-consuming loops that contain no data dependencies, leaving the rest of the source code unchanged, as in the sketch below. The disadvantage is that scaling OpenMP codes to tens or hundreds of processors remains a major challenge; one difficulty results from the limited parallelism that can be exploited at a single level of a loop nest.
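The following is a minimal sketch of this pattern (our example, not the paper's); the array size and loop body are illustrative.

/* Minimal OpenMP fork-and-join sketch: the program runs serially until the
 * parallel work-sharing directive, which splits the loop iterations among a
 * team of threads; execution joins back to the master thread afterwards.
 * Array size and loop body are illustrative. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* serial region: executed by the initial (master) thread only */
    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* parallel region + work-sharing: iterations are divided among the
     * threads; safe because iterations carry no data dependencies */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    /* back to serial execution after the implicit join */
    printf("a[N-1] = %f (computed with up to %d threads)\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}

Built with the compiler's OpenMP flag, the size of the thread team is controlled at run time, for example through the OMP_NUM_THREADS environment variable.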
Alternatively, program parallelization can be achieved through the message-passing programming paradigm, which can be employed both within and across nodes. The Message Passing Interface (MPI) [4] is a widely accepted standard for writing message-passing programs. MPI provides the
user