One main focus of the IPACS-Project is the development of methods for modeling and predicting
the performance of commercial codes. The need for and the benefit of performance prediction
techniques can be roughly divided into two areas. On the one hand, there is software
development, where understanding the performance of an implemented algorithm is important for
the developer to find and, where possible, remove the bottlenecks of the code. For this aim,
very precise modeling techniques are necessary, and detailed information and data from
hardware counters and/or source code analysis are mandatory. In many situations and on many
systems, however, this kind of information is hard or even impossible (e.g. for commercial
software) to obtain. On the other hand, performance prediction is also important for the user
who wants to find out which hardware upgrade would improve the run-time of their application
most effectively. In this case, the data on which the modeling is based must be easily
obtainable, while a prediction accuracy of about 10% might be sufficient. A reasonable estimate
of the performance of an architecture that cannot be accessed, or has not even been built yet,
should nevertheless be possible. This is the area addressed by the modeling methods in the
IPACS-Project.
With this in mind, a rather simple model has been developed. A detailed description and a
comparison with experimental data can be found here.
Here we mention only the main characteristics. As a simplified approach, the model depends on
some basic assumptions and is thereby restricted to a special class of applications. It is
assumed that the main part of the run-time is spent in a repeated loop over a large number of
small elementary building blocks such as points or cells, where 'large' and 'small' are meant
with respect to the fastest available memory layer. This can be expected to hold for general
evolution problem applications and especially for CFD applications, to which the results given
here are restricted.
In the model, the application and the case under consideration are described by a set of
characteristic numbers. For a single-processor run, these are the numbers of cache loads, cache
stores, main memory accesses and flops. For a parallel run, the size of the boundary partitions
per process, the amount of data to be communicated and the number of communication steps have
to be added. These numbers are combined with the results of the low-level benchmarks and with
performance metrics: the theoretical peak performance, the Cachebench read and write bandwidths
on the different memory levels, and the network bandwidth and latency from the PMB benchmarks.
These results are taken directly from the repository, so that the performance prediction
automatically depends on the measured benchmark data.
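
The way such characteristic numbers might be combined with benchmark results can be
illustrated by a minimal sketch. It is not the IPACS model itself: the text above does not
state the formula, so the purely additive form (no overlap between computation, memory traffic
and communication) is an assumption, and all class, field and function names are hypothetical.
For simplicity, loads, stores and memory accesses are expressed in bytes, a single cache level
is used, and the boundary partition size is assumed to be already folded into the communicated
data volume.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Characteristic numbers of one application case (hypothetical layout)."""
    flops: float               # floating-point operations per run
    cache_load_bytes: float    # bytes loaded from cache
    cache_store_bytes: float   # bytes stored to cache
    memory_bytes: float        # bytes moved to/from main memory
    comm_bytes: float = 0.0    # bytes communicated per process (parallel run)
    comm_steps: int = 0        # number of communication steps (parallel run)


@dataclass
class MachineProfile:
    """Low-level benchmark results for one system (hypothetical layout)."""
    peak_flops: float          # theoretical peak performance, flop/s
    cache_read_bw: float       # cache read bandwidth, bytes/s
    cache_write_bw: float      # cache write bandwidth, bytes/s
    memory_bw: float           # main memory bandwidth, bytes/s
    net_bw: float = 1.0        # network bandwidth, bytes/s
    net_latency: float = 0.0   # network latency, seconds per message


def predict_runtime(app: AppProfile, machine: MachineProfile) -> float:
    """Estimate the run-time in seconds as a sum of independent cost terms.

    Assumption: the terms do not overlap, so they are simply added.
    """
    compute = app.flops / machine.peak_flops
    cache = (app.cache_load_bytes / machine.cache_read_bw
             + app.cache_store_bytes / machine.cache_write_bw)
    memory = app.memory_bytes / machine.memory_bw
    communication = (app.comm_steps * machine.net_latency
                     + app.comm_bytes / machine.net_bw)
    return compute + cache + memory + communication


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    app = AppProfile(flops=1e9, cache_load_bytes=2e9, cache_store_bytes=1e9,
                     memory_bytes=4e9, comm_bytes=1e8, comm_steps=100)
    machine = MachineProfile(peak_flops=1e9, cache_read_bw=2e9,
                             cache_write_bw=1e9, memory_bw=4e9,
                             net_bw=1e9, net_latency=1e-3)
    print(f"predicted run-time: {predict_runtime(app, machine):.2f} s")
```

Because the machine profile is a separate input, the same application profile can be evaluated
against the benchmark data of several systems, which matches the use case of comparing possible
hardware upgrades.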