In High Performance Computing (HPC), where speed and efficiency are paramount, understanding system architecture and optimization techniques is essential. This guide covers concepts and methodologies integral to HPC, ranging from hardware performance counters to C++ optimizations.
Section-A (Each question of 1 mark)
1. Define Hotspots.
A common profiling strategy is to find out how much time is spent in the different functions, and maybe even lines, of a code in order to identify hot spots.
Hotspots refer to sections of code or areas within a program where the majority of execution time is spent during runtime. These are critical parts that significantly impact overall performance.
2. Give an example of SIMD.
SIMD (Single Instruction, Multiple Data) is a parallel computing technique where multiple processing elements perform the same operation on multiple data points simultaneously.
For example, in the x86 (Intel/AMD) case, the “packed” SSE load and store instructions exist in aligned and unaligned flavors [V107, O54]. If an aligned load or store is used on a memory address that is not a multiple of 16, an exception occurs.
An example of SIMD is the use of SSE (Streaming SIMD Extensions) instructions in Intel processors to perform operations like adding or multiplying multiple pairs of floating-point numbers at once.
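As a concrete sketch (assuming an x86 machine with SSE2 and a compiler that provides `<emmintrin.h>`; the function name is illustrative), four 32-bit integer additions can be issued with a single instruction:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86)

// Add four 32-bit integers with one SIMD instruction:
// out[i] = a[i] + b[i] for i = 0..3.
void simd_add4(const int* a, const int* b, int* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi32(va, vb);  // one instruction, four additions
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vc);
}
```

The unaligned (`_mm_loadu_si128`) variants are used here so the pointers need not be 16-byte aligned; with guaranteed alignment, the aligned flavors mentioned above would apply.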
3. Simple Measures, Large Impact.
Simple measures like algorithmic improvements, code refactoring, and efficient memory usage can have a significant impact on the overall performance of a program. For instance, changing a quadratic algorithm to a linear one or reducing unnecessary memory allocations can greatly improve performance.
Different Techniques:
- Elimination of Common Subexpressions:
Common subexpressions are portions of code that are computed multiple times unnecessarily. By identifying and eliminating them, we can improve performance.
Example:
```cpp
// Without elimination of common subexpressions
int result = (a + b) * (a + b) + (a + b);

// With elimination of common subexpressions
int temp = a + b;
int result = temp * temp + temp;
```
In the first version, the expression `(a + b)` is computed three times, leading to redundant calculations. Storing the result in a temporary variable `temp` eliminates the redundant computations, enhancing efficiency.
- Avoiding Branches:
Branches, especially in tight loops, can hinder performance due to branch mispredictions and pipeline stalls. Techniques like loop unrolling or restructuring can reduce branching.
Example:
```cpp
// Original code with a branch in every iteration
for (int i = 0; i < N; ++i) {
    if (condition) {
        doSomethingA();   // branch A
    } else {
        doSomethingB();   // branch B
    }
}

// Restructured code: if `condition` is loop invariant, hoist the
// branch out of the loop ("loop unswitching")
if (condition) {
    for (int i = 0; i < N; ++i) doSomethingA();
} else {
    for (int i = 0; i < N; ++i) doSomethingB();
}
```
In the original code, the conditional branch is executed in every iteration and may cause pipeline stalls on mispredictions. If the condition does not change inside the loop, the branch can be hoisted out of the loop so it is evaluated once instead of N times.
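Loop unrolling, also mentioned above, likewise reduces branch overhead: fewer loop-back branches are executed per processed element. A minimal sketch (4-way unrolling with a remainder loop; the function name is illustrative):

```cpp
#include <cstddef>

// Sum an array with 4-way manual unrolling: one loop-back branch per
// four elements instead of one per element. The four partial sums also
// break the addition dependency chain.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // unrolled body
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += a[i];  // remainder loop for n not divisible by 4
    return s;
}
```

Note that compilers apply this transformation automatically at higher optimization levels; manual unrolling is mainly useful when the compiler's heuristics fail.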
- Using SIMD Instruction Sets:
SIMD (Single Instruction, Multiple Data) instruction sets allow parallel processing of data elements, improving performance for certain operations.
Example:
```cpp
// Without SIMD
for (int i = 0; i < N; ++i) {
    result[i] = array1[i] + array2[i];
}

// With SIMD (SSE intrinsics from <emmintrin.h>; assumes N is a multiple of 4)
for (int i = 0; i < N; i += 4) {
    __m128i vec1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&array1[i]));
    __m128i vec2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&array2[i]));
    __m128i sum  = _mm_add_epi32(vec1, vec2);   // four additions at once
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&result[i]), sum);
}
```
In the SIMD version, SSE intrinsics perform parallel addition of four 32-bit integers at a time. Note that `_mm_loadu_si128` expects a `__m128i*` argument, hence the casts, and a scalar remainder loop is needed if N is not a multiple of 4. This approach leverages the parallel processing capabilities of SIMD instruction sets, resulting in faster execution compared to scalar operations.
4. Advantage of using Inlining.
Inlining is a compiler optimization technique where a function call is replaced with the body of the function itself. This reduces the overhead of function calls and enables other optimizations like constant propagation and dead code elimination, leading to potentially faster execution.
Some major advantages of using inlining:
Reduction of Call Overhead: Inlining eliminates the overhead associated with function calls, such as passing arguments and managing the function's execution context. This reduction in overhead leads to more efficient code execution.
Efficient Resource Utilization: By inserting the complete code of a function or subroutine at the place where it is called, inlining eliminates the need to push arguments onto the stack. It also allows the compiler to use registers more flexibly, reducing register pressure. This efficient resource utilization contributes to improved performance.
Enablement of Compiler Optimizations: Inlining enables the compiler to view a larger portion of code and potentially employ optimizations that would not be possible otherwise. With access to the complete code of inlined functions, the compiler can perform more aggressive optimizations, leading to better overall performance.
Speedup for Frequently Called Small Functions: Small functions that are frequently called bear the highest potential for speedup if inlined. In many cases, inlining is essential for achieving good performance, particularly for frequently used small functions like overloaded operators for simple types in C++.
Control and Heuristics: Compilers offer various options to control the extent of automatic inlining, allowing users to specify parameters such as the maximum size of a subroutine eligible for inlining. However, it's important to note that the inline keyword in C++ is only a hint to the compiler, and the actual decision of whether to inline a function depends on the compiler's heuristics and optimization settings.
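As an illustration, a small, frequently called function in a hot loop is a prime inlining candidate (a sketch; `inline` is only a hint, and compilers at `-O2` typically inline such functions on their own):

```cpp
// A tiny function called in a hot loop: a typical inlining candidate.
inline double square(double x) { return x * x; }

double sum_of_squares(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += square(a[i]);   // after inlining: s += a[i] * a[i]; no call overhead
    return s;
}
```

With the call inlined, the compiler sees the whole loop body and can apply further optimizations such as vectorization.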
5. Role of Storage order.
Storage order refers to how data elements are arranged in memory. The role of storage order is crucial for performance, especially in applications dealing with multidimensional arrays or structures. Optimizing storage order can enhance cache locality, reducing memory access times and improving overall performance.
The role of storage order, particularly in multidimensional arrays, is crucial for optimizing data access patterns and enhancing performance in scientific computing. This involves aligning the memory layout of multidimensional data structures with the cache line-based, one-dimensional memory layout of standard computers to leverage spatial and temporal locality effectively.
In the context of optimizing data access patterns, storage order determines how elements in a multidimensional array are stored and accessed in memory. For instance, the difference between row-major order (as used in C programming) and column-major order (as used in Fortran programming) significantly impacts memory access patterns and cache utilization.
Row-major Order (C Programming):
- In row-major order, matrix rows are stored consecutively in memory.
- When the inner loop runs over a row (the last index in C), consecutive accesses hit adjacent memory addresses, i.e., the stride is sizeof(element).
- This stride-one access is optimal, making row-wise traversal efficient in C programs.
Column-major Order (Fortran Programming):
- In column-major order, matrix columns are stored consecutively in memory.
- When the inner loop runs over a column (the first index in Fortran), consecutive accesses again hit adjacent addresses, giving stride-one access.
- Traversing a Fortran array along a row instead would produce a stride of N elements, which is suboptimal for cache utilization.
Therefore, when optimizing for data access, it's essential to ensure that the inner loop variable used as an index to a multidimensional array facilitates stride-one access. In Fortran, this means the first index, while in C, it corresponds to the last index. Optimizing data access patterns based on storage order helps maximize spatial and temporal locality, ultimately enhancing program performance in scientific computing applications.
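The stride-one rule can be sketched with a C-style row-major array: the inner loop should run over the last index. Both variants below compute the same sum, but the first touches memory contiguously (function names and sizes are illustrative):

```cpp
#include <vector>

// Row-major (C) layout: element (i,j) lives at offset i*N + j.
// Good: inner loop over j -> stride-one access, consecutive addresses.
double sum_row_major(const std::vector<double>& a, int M, int N) {
    double s = 0.0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i * N + j];
    return s;
}

// Bad: inner loop over i -> the address jumps N elements each step,
// wasting most of every cache line for large N.
double sum_wrong_order(const std::vector<double>& a, int M, int N) {
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            s += a[i * N + j];
    return s;
}
```

Both functions return the same result; only the memory access pattern, and hence the performance for arrays larger than the cache, differs.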
Section-B (Each question of 2 marks)
6. Define Hardware Performance Counters?
Hardware Performance Counters are special registers built into modern processors that monitor various hardware events such as instructions executed, cache hits/misses, branch predictions, and other performance-related metrics. They are used for performance analysis and profiling to identify bottlenecks and optimize code.
Hardware performance counters are special on-chip registers present in modern processors, designed to track specific events or metrics related to program execution. These counters are crucial for performance profiling, as they provide detailed insights into the behavior of the code at runtime, beyond just clock ticks.
Performance counters monitor various events that occur during program execution, including:
- Number of bus transactions, indicating cache line transfers.
- Number of loads and stores, offering insights into cache line utilization.
- Number of floating-point operations, though its importance may be overrated.
- Mispredicted branches, indicating incorrect conditional branch predictions.
- Pipeline stalls, revealing dependencies between processor pipeline stages.
- Number of instructions executed, assessing hardware utilization efficiency.
Hardware performance counters play a pivotal role in performance analysis by enabling developers to identify performance bottlenecks, understand resource utilization patterns, and make informed optimization decisions. They allow for detailed examination of code behavior at the hardware level, facilitating the identification of hot spots and areas for improvement in software performance.
7. Discuss Elimination of common sub-expressions.
Elimination of common sub-expressions is a compiler optimization technique where redundant computations are identified and removed from the code. By recognizing expressions that have been computed previously and stored in memory, the compiler can replace subsequent occurrences with references to the stored result, reducing computational overhead and improving efficiency.
Here's how it works:
- Identification of Common Sub-Expressions: During the compilation process, the compiler analyzes the code to identify expressions that are computed more than once within the same scope.
- Replacement with Temporary Variables: Once a common sub-expression is identified, the compiler replaces subsequent occurrences of that expression with a temporary variable or register assignment. This temporary storage holds the result of the computation, eliminating the need to recompute the expression multiple times.
- Code Transformation: By replacing redundant computations with temporary variables, the compiler transforms the code to perform the computation only once and reuse the result wherever necessary.
Common subexpression elimination is an optimization that is often considered a task for compilers. Basically one tries to save time by precalculating parts of complex expressions and assigning them to temporary variables before a code construct starts that uses those parts multiple times. In case of loops, this optimization is also called loop invariant code motion:
```fortran
! inefficient
do i=1,N
  A(i) = A(i) + s + r*sin(x)
enddo
```
becomes
```fortran
tmp = s + r*sin(x)
do i=1,N
  A(i) = A(i) + tmp
enddo
```
A lot of compute time can be saved by this optimization, especially where “strong” operations (like sin()) are involved. Although it may happen that subexpressions are obstructed by other code and not easily recognizable, compilers are in principle able to detect this situation. They will, however, often refrain from pulling the subexpression out of the loop if this required employing associativity rules (see Section 2.4.4 for more information about compiler optimizations and reordering of arithmetic expressions). In practice, a good strategy is to help the compiler by eliminating common subexpressions by hand.
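The same hand-applied loop invariant code motion in C++ (a sketch; `s`, `r`, and `x` are assumed loop invariant):

```cpp
#include <cmath>

// Hoist the loop-invariant subexpression s + r*sin(x) out of the loop
// by hand: the "strong" operation sin() is evaluated once, not N times.
void update(double* A, int N, double s, double r, double x) {
    const double tmp = s + r * std::sin(x);  // computed once
    for (int i = 0; i < N; ++i)
        A[i] += tmp;
}
```

Marking `tmp` as `const` also documents to the reader (and the compiler) that the value cannot change inside the loop.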
8. Role of Compiler?
The compiler plays a crucial role in translating high-level programming languages into machine code that can be executed by the computer's hardware. It performs various optimizations such as code restructuring, instruction scheduling, and register allocation to improve program performance and efficiency.
The role of compilers in software development and optimization is multifaceted and crucial for achieving high-performance code. Here are the key points regarding the role of compilers based on the provided content:
- Mapping High-Level Code to Machine Code: The primary function of compilers is to translate source code written in high-level languages like C/C++ into machine code that the processor can execute. This process involves utilizing the processor's internal resources efficiently.
- Compiler-Based Optimizations: Modern compilers come equipped with a variety of optimization techniques aimed at improving code performance. These optimizations range from simple code rearrangements to complex transformations that exploit hardware features for enhanced efficiency.
- Fine-Grained Tuning: Compilers offer command-line switches that allow developers to fine-tune optimization options according to their specific requirements. Experimenting with different optimization levels and compiler options can sometimes reveal additional performance potential.
- Automatic Optimization Strategies: Compilers employ automatic optimization strategies to enhance code performance. These strategies include common optimizations such as inlining, loop transformations, and register allocation, among others.
- Awareness of Potential Limitations: While compilers can perform sophisticated optimizations, there are limitations to their capabilities. Programmers need to be aware of potential stumbling blocks that may prevent certain optimizations from being applied effectively.
- Compiler Complexity: Compilers operate in a complex environment where they must balance various optimization goals while ensuring code correctness and adherence to language standards. Understanding the intricacies of compiler behavior can help developers write more efficient code.
- Compiler Logs and Annotations: Some compilers provide detailed logs or annotated source code listings that describe the optimizations performed during compilation. These logs can offer insights into the compiler's decision-making process and help developers evaluate the quality of the generated machine code.
Section-C (Each question of 4 marks)
9. Discuss C++ Optimizations with examples?
C++ Optimizations include techniques like loop unrolling, function inlining, and memory access optimization. For example, loop unrolling involves replicating loop bodies to reduce loop overhead. Function inlining replaces function calls with the actual function body to eliminate call overhead. Memory access optimization aims to enhance cache usage and reduce memory latency by optimizing data layout and access patterns.
C++ optimizations are essential for achieving efficient code performance, especially in high-performance computing scenarios. Let's discuss some key optimization techniques and examples highlighted in the provided content:
- Temporaries Optimization:
C++ allows for implicit programming styles, often leading to the creation of temporary objects, especially in overloaded operator expressions.
Example: In a `vec3d` class representing 3D vectors, an expression like `a = x * b + y * c` involves the creation of temporary objects for the intermediate products and the sum. Optimizations involve using compound assignment operators (e.g., `+=`) to minimize the creation of temporaries.
- Dynamic Memory Management Optimization:
Frequent allocation and deallocation of memory can introduce performance bottlenecks.
Strategies like lazy construction and static construction can help mitigate these issues.
- Lazy construction defers object creation until necessary, reducing unnecessary allocations.
- Static construction minimizes overhead by instantiating objects outside loops or blocks.
- Loop Kernels and Iterators Optimization:
Compiler optimizations for loop kernels are critical for performance.
Operator overloading, while convenient, may hinder loop optimization.
Use of iterators instead of overloaded operators can facilitate better optimization.
Iterators should be preferred for low-level loops, especially when interfacing with C code or optimizing for SIMD operations.
Examples:

Optimizing expressions (compound assignments instead of `a = x * b + y * c`):
```cpp
a = y * c;
a += x * b;
```
Lazy construction (the vector is created, sorted, and processed only when actually needed):
```cpp
if (rand() > threshold * RAND_MAX) {
    std::vector<double> v(obtain_data(length));
    std::sort(v.begin(), v.end());
    process_data(v);
}
```
Static construction (the vector is instantiated once, not on every entry to the block):
```cpp
const int length = 1000;
static std::vector<double> v(length);
```
Iterator-style loop kernel (a form compilers can optimize well):
```cpp
for (int i = 0; i < s; ++i)
    result += ia[i] * ib[i];
```
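The temporaries point can be made concrete with a sketch of a hypothetical `vec3d` (the class layout and member names are illustrative assumptions, not from the original text): naive overloaded operators in `a = x * b + y * c` create temporaries for `x * b`, `y * c`, and their sum, whereas compound-assignment-style updates modify `a` in place:

```cpp
struct vec3d {
    double v[3];
    vec3d& operator+=(const vec3d& o) {            // in-place, no temporary
        for (int i = 0; i < 3; ++i) v[i] += o.v[i];
        return *this;
    }
    vec3d& add_scaled(double s, const vec3d& o) {  // fused "a += s*b"
        for (int i = 0; i < 3; ++i) v[i] += s * o.v[i];
        return *this;
    }
};

// Computes a = x*b + y*c without creating any temporary vec3d objects.
void axpy2(vec3d& a, double x, const vec3d& b, double y, const vec3d& c) {
    a = vec3d{{0.0, 0.0, 0.0}};
    a.add_scaled(x, b).add_scaled(y, c);
}
```

The fused `add_scaled` member trades the readability of operator syntax for the elimination of temporaries, which is often worthwhile in hot loops.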
These optimizations help streamline C++ code for better performance, especially in compute-intensive applications where efficiency is crucial. Understanding these techniques allows developers to write code that maximizes computational throughput and minimizes resource overhead.
10. Explain balance analysis and the lightspeed estimate.
Balance analysis and lightspeed estimates are methodologies used in performance modeling to assess the efficiency of code, particularly loop-based code constrained by memory bandwidth limitations. This analysis helps programmers determine if efforts to optimize code efficiency are worthwhile or if the code already utilizes resources optimally.
Bandwidth-based Performance Modeling
The central concept in balance analysis is the notion of balance, which is quantified by the machine balance \( B_m \). It represents the ratio of memory bandwidth to peak performance. Machine balance is calculated as:
\[ B_m = \frac{\text{memory bandwidth [GWords/sec]}}{\text{peak performance [GFlops/sec]}} = \frac{b_{\max}}{P_{\max}} \]
Here, "memory bandwidth" refers to the bandwidth to caches or network bandwidth, and "peak performance" signifies the maximum achievable computational performance.
Code Balance
The code balance \( B_c \) of a loop is defined as the ratio of data traffic to floating-point operations:
\[ B_c = \frac{\text{data traffic [Words]}}{\text{floating point ops [Flops]}} \]
The reciprocal of code balance is termed computational intensity.
Lightspeed of a Loop
The expected maximum fraction of peak performance of a code with balance \( B_c \) on a machine with balance \( B_m \) is given by:
\[ l = \min\left(1, \frac{B_m}{B_c}\right) \]
This fraction, termed the lightspeed of a loop, indicates the efficiency of the loop in utilizing available resources.
Performance Estimate
Performance in GFlops/sec is estimated as:
\[ P = l \cdot P_{\max} = \min\left(P_{\max}, \frac{b_{\max}}{B_c}\right) \]
If \( l \approx 1 \), performance is not limited by bandwidth but by other factors.
Assumptions and Considerations
- The model assumes optimal usage of arithmetic units and double-precision floating-point arithmetic.
- It assumes perfect overlap of data transfer and arithmetic.
- Latency effects are considered negligible.
- The system operates in "throughput mode," and memory bandwidth is fully utilized.
Balance analysis provides insights into the potential performance of loop-based code and guides optimization efforts. However, it's essential to acknowledge the model's assumptions and limitations when applying it to real-world scenarios.
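A worked example using the formulas above (the machine numbers are illustrative assumptions): a machine with \( b_{\max} = 10 \) GWords/sec and \( P_{\max} = 40 \) GFlops/sec has \( B_m = 0.25 \) W/F. The vector triad A(:) = B(:) + C(:)\*D(:) performs 2 flops per iteration and moves 4 words (3 loads, 1 store, ignoring write-allocate), so \( B_c = 2 \) W/F, giving \( l = \min(1, 0.25/2) = 0.125 \) and \( P = 5 \) GFlops/sec:

```cpp
#include <algorithm>

// Lightspeed estimate l = min(1, B_m / B_c); all inputs are illustrative.
double lightspeed(double Bm, double Bc) { return std::min(1.0, Bm / Bc); }

// Performance estimate P = l * P_max = min(P_max, b_max / B_c) in GFlops/sec.
double perf_estimate(double bmax_gwords, double pmax_gflops, double Bc) {
    const double Bm = bmax_gwords / pmax_gflops;  // machine balance [W/F]
    return lightspeed(Bm, Bc) * pmax_gflops;
}
```

The estimate says this loop can reach at most 12.5% of peak on such a machine, no matter how well the arithmetic is optimized, because it is bandwidth bound.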
Section-D (6-mark question)
11. Explain Scalar Profiling in detail with suitable examples and graphs.
Scalar profiling involves analyzing the performance characteristics of individual scalar operations within a program. It aims to identify which scalar operations consume the most computational resources and contribute significantly to overall execution time. This information helps in pinpointing optimization opportunities and improving code efficiency. Graphs depicting execution times or profiling data can visually illustrate the impact of scalar operations on program performance, aiding in optimization decisions.
Scalar profiling is a fundamental technique used in high-performance computing to analyze a program's behavior and resource utilization, particularly focusing on runtime. By identifying hot spots, which are sections of code consuming the most runtime, developers can target these areas for optimization to enhance overall performance.
i. Function- and Line-based Runtime Profiling:
Two primary technologies are employed for function- and line-based profiling: code instrumentation and sampling.
- Code Instrumentation: This technique involves the compiler modifying each function call by inserting code that logs the call, its caller, and possibly its execution time. However, this method incurs overhead, especially in functions with short runtimes.
- Sampling: Sampling interrupts the program at periodic intervals and records the program counter. It's less invasive but statistical in nature. Sampling can provide execution time information down to the source line and machine code level.
a. Function Profiling: One commonly used profiling tool is gprof, which utilizes both instrumentation and sampling to collect a flat function profile and a call graph profile. The flat profile displays information about execution times and call frequencies for all functions in the program. The call graph profile illustrates the relationship between functions and their callers/callees, providing insights into inclusive and exclusive runtime contributions.
The flat profile contains information about execution times of all the program’s functions and how often they were called:
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
|---|---|---|---|---|---|---|
| 70.45 | 5.14 | 5.14 | 26,074,562 | 0.00 | 0.00 | intersect |
| 26.01 | 7.03 | 1.90 | 4,000,000 | 0.00 | 0.00 | shade |
| 3.72 | 7.30 | 0.27 | 100 | 2.71 | 73.03 | calc_tile |
Explanation:
- % time: Percentage of total execution time spent exclusively in this function.
- cumulative seconds: Running sum of the self times of this function and all functions above it in the table.
- self seconds: Time spent in the function itself, excluding time spent in functions called by it.
- calls: Number of times the function was called.
- self / total ms/call: Average exclusive / inclusive time per call in milliseconds.
- name: Name of the function being profiled.
Interpretation:
- **intersect**: This function consumes the largest share of runtime (70.45%, 5.14 seconds of self time). It was called 26,074,562 times, so its average time per call is far below the timer resolution.
- **shade**: This function accounts for 26.01% of runtime with 1.90 seconds of self time across 4,000,000 calls, again with negligible average time per call.
- **calc_tile**: This function contributes only 3.72% of self time (0.27 seconds) over 100 calls, but its inclusive time of 73.03 ms per call shows that most of the program's work happens in its callees.
A flat profile already contains a lot of information, however it does not reveal how the runtime contribution of a certain function is composed of several different callers, which other functions (callees) are called from it, and which contribution to runtime they in turn incur. This data is provided by the butterfly graph, or callgraph profile:
Callgraph Profile Data

```
index  % time    self  children     called           name
                 0.27     7.03       100/100             main [2]
[1]      99.9    0.27     7.03       100             calc_tile [1]
                 1.90     5.14   4000000/4000000        shade [3]
------------------------------------------------------------------
                                                    <spontaneous>
[2]      99.9    0.00     7.30                      main [2]
                 0.27     7.03       100/100            calc_tile [1]
------------------------------------------------------------------
                 1.90     5.14   4000000/4000000        calc_tile [1]
[3]      96.2    1.90     5.14   4000000+5517592    shade [3]
                 5.14     0.00  26074562/26074562       intersect [4]
------------------------------------------------------------------
                 5.14     0.00  26074562/26074562       shade [3]
[4]      70.2    5.14     0.00  26074562            intersect [4]
------------------------------------------------------------------
```
Explanation:
- % Time: The percentage of overall runtime spent in this function, including its callees (inclusive time).
- Self: Exclusive execution time for each indexed function. For callers, it denotes the inclusive time contributed by each callee to the caller.
- Children: Inclusive minus exclusive runtime, i.e., the contribution of all callees to inclusive time.
- Called: Number of times the function was called.
- Name: Name of the function.
b. Line-based Profiling: For programs with large functions contributing significantly to runtime, line-based profiling becomes essential. Tools like OProfile can analyze code at the line level, providing insights into hot spots within functions.
| % time | cumulative seconds | self seconds | calls | self s/call | total s/call | name |
|---|---|---|---|---|---|---|
| 73.21 | 13.47 | 13.47 | 1 | 13.47 | 18.40 | MAIN__ |
| 6.47 | 14.66 | 1.19 | 21,993,788 | 0.00 | 0.00 | mvteil_ |
| 6.36 | 15.83 | 1.17 | 51,827,551 | 0.00 | 0.00 | ran1_ |
| 6.25 | 16.98 | 1.15 | 35,996,244 | 0.00 | 0.00 | gzahl_ |
Here the MAIN function in a Fortran program requires over 73% of overall runtime but has about 1700 lines of code. If the hot spot in such functions cannot be found by simple common sense, tools for line-based profiling should be used.
A line-based profile annotates the source with a sample count and percentage per line:

```
              :        DO 215 M=1,3
 4292  0.9317 :          bremsdir(M) = bremsdir(M) + FH(M)*Z12
 1462  0.3174 : 215    CONTINUE
  682  0.1481 :        U12 = U12 + GCL12 * Upot
              :        DO 230 M=1,3
 3348  0.7268 :          F(M,I) = F(M,I) + FH(M)*Z12
 1497  0.3250 : 230    CONTINUE
  501  0.1088 :        Fion(M) = Fion(M) + FH(M)*Z12
```
This kind of data has to be taken with a grain of salt, though. The compiler-generated symbol tables must be consistent so that a machine instruction’s address in memory can be properly matched to the correct source line. Modern compilers can reorganize code heavily if high optimization levels are enabled.
ii. Hardware Performance Counters: Hardware performance counters are specialized registers on modern processors that increment each time a specific event occurs.
These counters offer insights into various performance metrics such as bus transactions, cache misses, loads, stores, floating-point operations, mispredicted branches, and pipeline stalls. Analyzing these metrics helps identify performance bottlenecks and resource limitations within the code.
iii. Manual Instrumentation: If the overhead of standard compiler-based instrumentation is too high, manual instrumentation can be considered. Programmers can insert calls to timing routines or profiling libraries into the code to gather performance data selectively.
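A minimal manual-instrumentation sketch using the standard `<chrono>` clock (the timer class, the timed region, and its name are placeholders, not part of any particular profiling library):

```cpp
#include <chrono>
#include <cstdio>

// Wall-clock timer that reports the lifetime of a scoped region,
// for selectively instrumenting suspected hot spots by hand.
class ScopedTimer {
    const char* name_;
    std::chrono::steady_clock::time_point start_;
public:
    explicit ScopedTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto end = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start_).count();
        std::fprintf(stderr, "%s: %.3f ms\n", name_, ms);
    }
};

// Placeholder for a suspected hot spot: partial harmonic sum.
double hot_loop(int n) {
    ScopedTimer t("hot_loop");  // timing is reported when t goes out of scope
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += 1.0 / (i + 1);
    return s;
}
```

Because the programmer chooses exactly which regions to time, the overhead stays negligible compared to whole-program compiler instrumentation.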
In summary, scalar profiling encompasses a range of techniques aimed at understanding program behavior and resource utilization, enabling developers to optimize critical sections of code for improved performance. Through a combination of function profiling, line-based profiling, hardware performance counters, and manual instrumentation, developers gain insights into the runtime behavior of their applications, facilitating targeted optimizations.
