Your Guide to Understanding System Performance

Meet Intel® VTune™ Amplifier’s Platform Profiler

Bhanu Shankar, Performance Tools Architect, and Munara Tolubaeva, Software Technical Consulting Engineer, Intel Corporation

Have you ever wondered how well your system is being utilized throughout a long stretch of application runs? Or whether your system was misconfigured, leading to a performance degradation? Or, most importantly, how to reconfigure it to get the best performance out of your code? State-of-the-art performance analysis tools, which allow users to collect performance data for longer runs, don’t always give detailed performance metrics. On the other hand, performance analysis tools suitable for shorter application runs can overwhelm you with a huge amount of data.

This article introduces you to Intel® VTune™ Amplifier’s Platform Profiler, which provides data to learn whether there are problems with your system configuration that can lead to low performance, or if there’s pressure on specific system components that can cause performance bottlenecks. It analyzes performance from either the system or hardware point of view, and helps you identify under- or over-utilized resources. Platform Profiler uses a progressive disclosure method, so you’re not overwhelmed with information. That means it can run for multiple hours, giving you the freedom to monitor and analyze long-running or always-running workloads in either development or production environments.

You can use Platform Profiler to:

Identify common system configuration problems
Analyze the performance of the underlying platform and find performance bottlenecks

First, the platform configuration charts Platform Profiler provides can help you easily see how the system is configured and identify potential problems with the configuration. Second, you get system performance metrics including:

CPU and memory utilization
Memory and socket interconnect bandwidth
Cycles per instruction
Cache miss rates
Type of instructions executed
Storage device access metrics

These metrics provide system-wide data to help you identify if the system―or a specific platform component such as CPU, memory, storage, or network―is under- or over-utilized, and whether you need to upgrade or reconfigure any of these components to improve overall performance.

Platform Profiler in Action

To see it in action, let’s look at some analysis results collected during a run of the open-source HPC Challenge (HPCC) benchmark suite and see how it uses our test system. HPCC consists of seven tests to measure performance of:

Floating-point (FP) execution
Memory access
Network communication operations

Figure 1 shows the system configuration view of the machine where we ran our tests. The two-socket machine contained Intel® Xeon® Platinum 8168 processors, with two memory controllers and six memory channels per socket, and two storage devices connected to Socket 0.

Figure 2 shows CPU utilization metrics and the cycles per instruction (CPI) metric. CPI is the average number of clock cycles spent per executed instruction, so a lower value generally means the CPUs are getting more work done per cycle. Figure 3 shows memory, socket interconnect, and I/O bandwidth metrics. Figure 4 shows the ratio of load, store, branch, and FP instructions being used per core. Figures 5 and 6 show the memory bandwidth and latency charts for each memory channel. Figure 7 shows the rate of branch and FP instructions over all instructions. Figure 8 shows the L1 and L2 cache miss rates per instruction. Figure 9 shows the memory consumption chart. On average, only 51% of memory was consumed throughout the run. A larger test case can be run to increase memory consumption.
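To make the ratio metrics above concrete, here is a minimal Python sketch of how CPI and an instruction mix can be derived from raw event counts. The function names and the sample counts are hypothetical and chosen only for illustration; they are not Platform Profiler output, and this is not how the tool itself is implemented.

```python
def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """Average number of clock cycles spent per executed instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def instruction_mix(counts: dict[str, int], total_instructions: int) -> dict[str, float]:
    """Fraction of all instructions that falls into each category."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return {kind: n / total_instructions for kind, n in counts.items()}


# Hypothetical counts for one sampling interval (made-up numbers).
sample_cycles = 3_000
sample_instructions = 2_000
sample_counts = {"load": 500, "store": 200, "branch": 300, "fp": 400}

cpi = cycles_per_instruction(sample_cycles, sample_instructions)
mix = instruction_mix(sample_counts, sample_instructions)
print(f"CPI = {cpi:.2f}")
for kind, share in mix.items():
    print(f"{kind:>6}: {share:.1%}")
```

The same division by the instruction count gives a misses-per-instruction figure when the numerator is a cache miss count instead of a cycle count.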

In Figures 5 and 6, we see that only two of the six memory channels are being used. This clearly shows that there’s a problem with the memory DIMM configuration on our test system that’s preventing us from making full use of the memory channel capacity, which degrades HPCC performance.
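The same check can be done on exported numbers rather than by eye. The sketch below assumes you have an average bandwidth value per memory channel in some form; the channel names, the bandwidth values, and the 5% cutoff are all hypothetical and are not taken from the figures.

```python
def idle_channels(avg_bandwidth_gbps: dict[str, float], idle_fraction: float = 0.05) -> list[str]:
    """Return channels whose average bandwidth is below a fraction of the busiest channel.

    idle_fraction is an arbitrary cutoff chosen for this sketch.
    """
    peak = max(avg_bandwidth_gbps.values(), default=0.0)
    if peak == 0.0:
        # Nothing moved any data, so every channel counts as idle.
        return sorted(avg_bandwidth_gbps)
    cutoff = idle_fraction * peak
    return sorted(ch for ch, bw in avg_bandwidth_gbps.items() if bw < cutoff)


# Hypothetical per-channel averages for one socket (made-up values).
socket0 = {"ch0": 9.8, "ch1": 0.0, "ch2": 0.0, "ch3": 10.1, "ch4": 0.0, "ch5": 0.0}

unused = idle_channels(socket0)
print(f"{len(unused)} of {len(socket0)} channels look idle: {unused}")
```

A result that lists most channels as idle while a memory-heavy workload is running is a hint to check how the DIMMs are populated.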

The CPI (Figure 2), DDR memory bandwidth utilization, and instruction mix metrics in the figures show which type of test, compute-based (FP operations) or memory-based, is being executed at a specific time during the HPCC run. For example, we can see that from 80 to 130 seconds and from 200 to 260 seconds into the run, both the memory bandwidth utilization and the CPI increase, confirming that a memory-based test inside HPCC was executed during those periods. Moreover, the Instruction Mix chart in Figure 7 shows that between 280 and 410 seconds, threads execute FP instructions, in addition to some memory access operations between 275 and 360 seconds (Figure 3). This observation suggests that a test with a mixture of both compute and memory operations is executed during this period. Another observation is that we may be able to improve the performance of the compute part of this test by optimizing the execution of FP operations using code vectorization.
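The reasoning above, where memory bandwidth and CPI rising together point to a memory-based phase, can be sketched as a simple threshold scan over time-series samples. Everything here is an assumption made for illustration: the (time, bandwidth, CPI) tuple layout, the thresholds, and the sample values are hypothetical and are not read from the charts.

```python
def memory_bound_intervals(samples, bw_threshold, cpi_threshold):
    """Find time intervals where bandwidth and CPI are both at or above their thresholds.

    samples: iterable of (time_s, bandwidth, cpi) tuples in time order.
    Returns a list of (start_time, end_time) pairs, inclusive of the last
    sample that met both thresholds.
    """
    intervals = []
    start = None   # start time of the interval currently being tracked
    last_t = None  # time of the previous sample
    for t, bw, cpi in samples:
        hot = bw >= bw_threshold and cpi >= cpi_threshold
        if hot and start is None:
            start = t
        elif not hot and start is not None:
            intervals.append((start, last_t))
            start = None
        last_t = t
    if start is not None:
        intervals.append((start, last_t))
    return intervals


# Hypothetical samples: (seconds, bandwidth in GB/s, CPI). Made-up values.
trace = [
    (0, 5.0, 0.8),
    (10, 40.0, 2.1),
    (20, 42.0, 2.3),
    (30, 6.0, 0.9),
    (40, 41.0, 2.0),
    (50, 4.0, 0.7),
]

print(memory_bound_intervals(trace, bw_threshold=30.0, cpi_threshold=1.5))
```

Fixed thresholds are the simplest choice; in practice they would need to be picked per system, since peak bandwidth and typical CPI differ between platforms.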

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig1-1024x261.png

Figure 1 – System Configuration View

CPU Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig2-1024x680.jpg

Figure 2 – CPU utilization metrics

Throughput Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig3-1024x544.jpg

Figure 3 – Throughput metrics for memory, UPI and I/O

Operations Metrics

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig4-1024x441.jpg

Figure 4 – Types of instructions used throughout program execution

Memory Throughput

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig5-1024x378.jpg

Figure 5 – Memory bandwidth chart at a memory channel level

Memory Latency

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig6-1024x372.jpg

Figure 6 – Memory latency chart at a memory channel level

Instruction Mix

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig7-1024x388.jpg

Figure 7 – Rate of branch and floating point instructions over all instructions

L1 and L2 Miss per Instruction

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig8-1024x391.jpg

Figure 8 – L1 and L2 miss rate per instruction

Memory Utilization

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig9-1024x100.jpg

Figure 9 – Memory consumption

HPCC doesn’t perform any tests that include I/O, so we’ll show Platform Profiler results specifically on disk access from a second test case, LS-Dyna*, a proprietary multiphysics simulation software developed by LSTC. Figure 10 shows disk I/O throughput for LS-Dyna. Figure 11 shows I/O operations per second (IOPS) and latency metrics for the LS-Dyna application. The LS-Dyna implicit model periodically flushes data to disk, so we see periodic spikes in the I/O throughput chart (see read/write throughput in Figure 10). Since the amount of data to be written isn’t large, the I/O latency remains consistent during the whole run (see read/write latency in Figure 11).
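For readers less familiar with the storage metrics named above, the following sketch shows one common way to derive throughput, IOPS, and average latency from per-interval counters. The parameter names and the sample values are hypothetical, and the latency definition used here (total time spent in operations divided by the number of operations) is an assumption for this example rather than a statement of how Platform Profiler computes it.

```python
def io_metrics(bytes_transferred: int, operations: int,
               total_op_time_s: float, interval_s: float) -> dict[str, float]:
    """Derive throughput, IOPS, and average latency for one sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    throughput_mb_s = bytes_transferred / interval_s / 1e6
    iops = operations / interval_s
    # With no operations there is no latency to report.
    avg_latency_ms = (total_op_time_s / operations) * 1e3 if operations else 0.0
    return {
        "throughput_mb_s": throughput_mb_s,
        "iops": iops,
        "avg_latency_ms": avg_latency_ms,
    }


# Hypothetical one-second interval during a periodic flush (made-up values).
flush = io_metrics(bytes_transferred=50_000_000, operations=1_000,
                   total_op_time_s=0.5, interval_s=1.0)
for name, value in flush.items():
    print(f"{name}: {value:.2f}")
```

Throughput and IOPS both rise during a flush, while the average latency only rises if individual operations start taking longer, which matches the pattern of spiky throughput with steady latency described above.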

Read/Write Throughput

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig10-1-1024x249.jpg

Read/Write Operation Mix

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig10-2-1024x220.jpg

Figure 10 – Disk I/O throughput for LS-Dyna

Read/Write Latency

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig11-1-1024x233.jpg

IOPS

https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/ygusp-fig11-2-1024x215.jpg

Figure 11 – IOPS and latency metrics for LS-Dyna

Understanding System Performance

In this article, we presented Platform Profiler, a tool that analyzes performance from the system or hardware point of view. It provides insights into where the system is bottlenecked and identifies whether there are any over- or under-utilized subsystems and platform-level imbalances. We also showed its usage and the results collected from the HPCC benchmark suite and the LS-Dyna application. Using the tool, we found that poor memory DIMM placement was limiting memory bandwidth. We also found that part of the test executed a high share of FP instructions, which we could optimize for better performance using code vectorization. Overall, we found that these specific test cases for HPCC and LS-Dyna don’t put any pressure on our test system and leave system resources to spare, meaning we can run an even larger test case next time.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Education Ecosystem Blog

About author

Dr. Michael J. Garbade