Section-A (Each question of 1 mark)
1. Write the taxonomy of parallel computing paradigms.
Flynn's taxonomy is a classification system for parallel computers and programs. It was proposed by Michael J. Flynn in 1966 and extended in 1972. The taxonomy is based on two factors: the number of instruction streams and the number of data streams.
The four categories in Flynn's taxonomy are:
- SIMD (Single Instruction, Multiple Data):
- Description: Single instruction stream operates on multiple data streams concurrently.
- Examples: Vector processors, SIMD capabilities of modern superscalar microprocessors, Graphics Processing Units (GPUs).
- Historical Example: Thinking Machines’ Connection Machine supercomputer.
- MIMD (Multiple Instruction, Multiple Data):
- Description: Multiple instruction streams on multiple processors operate on different data items concurrently.
- Examples: Shared-memory and distributed-memory parallel computers.
- SISD (Single Instruction, Single Data):
- Description: Conventional, non-parallel, single-processor execution.
- Example: Traditional sequential execution.
- MISD (Multiple Instruction, Single Data):
- Description: Not regarded as a useful paradigm in practice.
- Example: Not commonly used.
2. Full form of UMA and ccNUMA.
UMA: Uniform Memory Access. UMA systems exhibit a “flat” memory model: every processor sees the same access latency to every memory location.
ccNUMA: Cache-coherent Nonuniform Memory Access. On ccNUMA machines, memory is physically distributed but logically shared, so access latency depends on which node holds the data.
3. What is Bi-section bandwidth?
In high performance computing (HPC), bisection bandwidth is the total bandwidth across an imaginary cut that divides a fabric evenly in half.
For a two-dimensional m × n torus network (assumed here as the example these formulas refer to), the bisection bandwidth, in units of the per-link bandwidth, is:
- Horizontal network division only: bisection bandwidth = 2n
- Vertical or horizontal network division possible: bisection bandwidth = 2·min(m, n)
- Neither horizontal nor vertical division possible (irregular networks): no closed formula; the minimal cut must be determined directly.
Bisection bandwidth refers to a crucial metric used to quantify the maximum aggregated communication capacity across an entire network. It represents the sum of the bandwidths of the minimal number of connections that are severed when the system is divided into two equal-sized parts, often represented by a dashed line in network diagrams.
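The torus formula above can be turned into a tiny calculator. This is an illustrative sketch under the assumptions stated above (m × n 2D torus, identical links); the function name and the default per-link bandwidth are made up for the example.

```python
# Hypothetical sketch: bisection bandwidth of an m x n 2D torus,
# assuming every link has the same bandwidth `link_bw`.
def bisection_bandwidth_torus(m, n, link_bw=1.0):
    # Cutting the torus in half across its shorter dimension severs
    # 2 * min(m, n) links (the factor 2 counts the wraparound links).
    return 2 * min(m, n) * link_bw

print(bisection_bandwidth_torus(4, 8))         # cut severs 2*4 links
print(bisection_bandwidth_torus(16, 16, 5.0))  # 2*16 links at 5 units each
```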
4. Choosing the right scaling for parallelism.
Choosing the right scaling for parallelism involves understanding the hierarchical structure of parallel systems and selecting appropriate baselines for performance evaluation.
Today's high-performance computers feature massively parallel architectures with multicore chips, multisocket shared-memory nodes, and multilevel networks connecting them.
Therefore, parallel systems inherently involve multiple hierarchy levels.
5. What is refined performance model?
A refined performance model integrates communication overhead, load imbalance, and parallel startup overhead, which are not addressed by traditional laws like Amdahl's and Gustafson's.
This model includes a correction term for communication overhead in parallel runtime calculations, enhancing accuracy in parallel application performance evaluation.
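A minimal sketch of such a refined model follows, assuming the communication overhead is modeled as a term that grows with the worker count N (the constant `kappa` and the linear form of the overhead are illustrative choices, not measured values). With `kappa = 0` the model reduces to Amdahl's Law; with `kappa > 0` the speedup peaks and then decreases as N grows, which plain Amdahl's Law cannot predict.

```python
# Refined speedup model: serial fraction s, plus a communication
# correction term kappa * N added to the parallel runtime.
def refined_speedup(N, s, kappa):
    """s: serial fraction, kappa: per-worker communication overhead."""
    return 1.0 / (s + (1.0 - s) / N + kappa * N)

# Speedup rises, peaks, then falls once communication dominates.
for N in (1, 4, 16, 64):
    print(N, round(refined_speedup(N, s=0.05, kappa=0.001), 2))
```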
Section-B (Each question of 2 mark)
6. Explain Parallel Scalability along with its metrics.
Parallel scalability refers to the ability of a parallel system to efficiently handle an increasing workload as the number of processors or workers is increased. It measures how effectively the performance of a parallel application improves as more resources are added.
Metrics for evaluating parallel scalability include:
- Speedup: It measures how much faster a given problem can be solved with \( N \) workers compared to one worker. Mathematically, speedup (\( S \)) is defined as the ratio of the execution time on a single processor to the execution time on \( N \) processors. Ideally, speedup should be linear (\( N \)), indicating perfect scalability.
- Work Increase: It quantifies how much more work can be accomplished with \( N \) workers compared to one worker. This metric evaluates the overall efficiency and productivity of parallel execution.
- Communication Impact: It assesses the influence of communication requirements on the performance and scalability of parallel applications. Communication overhead can significantly affect scalability, especially in distributed computing environments.
- Resource Utilization: It determines the fraction of resources that are effectively utilized for solving the problem. Efficient parallel scalability requires optimal resource allocation and utilization across all processing units.
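The first and last metrics above can be computed directly from measured runtimes. A minimal sketch (the timing values are invented for illustration):

```python
# Speedup and parallel efficiency from measured runtimes.
def speedup(t1, tN):
    return t1 / tN                   # how much faster with N workers

def efficiency(t1, tN, N):
    return speedup(t1, tN) / N       # fraction of ideal linear speedup

t1, t8 = 100.0, 16.0                 # serial vs. 8-worker runtime (seconds)
print(speedup(t1, t8))               # 6.25
print(efficiency(t1, t8, 8))         # 0.78125, i.e. ~78% resource utilization
```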
7. Explain Distributed Memory/NORMA computer architecture.
Distributed Memory/NORMA (No Remote Memory Access) computer architecture is characterized by the absence of direct access from one CPU to another CPU's memory. Each processor in the system is connected to its exclusive local memory, which means that no other CPU can directly access it. In this architecture, communication between processors is achieved solely through message passing over a communication network.
Key features of Distributed Memory/NORMA architecture:
1. Exclusive Local Memory: Each processor has its own dedicated local memory, and there is no shared memory accessible by all processors.
2. Message Passing Communication: Processors communicate with each other by sending messages over a communication network. There is no direct remote memory access available.
3. Programming Model: The programming model for distributed-memory systems involves message passing interfaces like MPI (Message Passing Interface). Programs written for distributed-memory architectures must explicitly manage communication between processes.
4. Hybrid Systems: While pure distributed-memory architectures are less common in modern parallel computing, hybrid systems that combine distributed-memory nodes with shared-memory nodes are prevalent. Shared-memory nodes typically consist of multiple CPUs connected via a high-speed interconnect.
5. Interconnect Options: Distributed-memory systems utilize various interconnect technologies for communication between nodes. These technologies range from standard switched Ethernet to more advanced high-performance interconnects designed for parallel computing.
6. Impact of Network Performance: The performance of the communication network significantly impacts application performance in distributed-memory systems. Nonblocking, high-speed networks are preferred, but the cost and scalability of such networks may pose challenges for large-scale installations.
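The message-passing model described above can be sketched in a few lines. This is not real MPI: threads with private lists stand in for processors with exclusive local memories, and queues stand in for the interconnect, purely to illustrate that the only way data moves is via explicit send/receive.

```python
# Toy NORMA sketch: two "processors" with private state that communicate
# only by exchanging messages (in practice, an MPI library does this).
import threading, queue

def worker(my_mem, inbox, outbox):
    outbox.put(sum(my_mem))          # send my partial result as a message
    other = inbox.get()              # receive the peer's partial result
    my_mem.append(other)             # only *local* memory is ever touched

a_to_b, b_to_a = queue.Queue(), queue.Queue()
mem_a, mem_b = [1, 2, 3], [10, 20]   # exclusive local memories
ta = threading.Thread(target=worker, args=(mem_a, b_to_a, a_to_b))
tb = threading.Thread(target=worker, args=(mem_b, a_to_b, b_to_a))
ta.start(); tb.start(); ta.join(); tb.join()
print(mem_a[-1], mem_b[-1])          # each side learned the other's sum
```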
8. Serial vs. Strong scalability.
Serial scalability and strong scalability are two different ways of assessing the performance of parallel applications as the number of processors or cores is increased.
1. Serial Scalability:
• Serial scalability refers to the performance improvement achieved by optimizing the sequential, non-parallelizable part of the application.
• It concerns the portion of the code that does not benefit from additional processors, and which therefore bounds the overall speedup (cf. Amdahl's Law).
• The goal is to optimize this non-parallelizable part, for example through scalar optimizations, to enhance overall application performance.
• Serial scalability is relevant whenever the sequential portion contributes significantly to total execution time, even in parallel environments.
2. Strong Scalability:
• Strong scalability assesses the ability of a parallel application to efficiently utilize additional processors or cores to solve a fixed-size problem.
• It measures the performance improvement as the number of processors increases while keeping the problem size constant.
• Strong scalability is crucial for parallel applications that aim to solve larger problems in less time by leveraging additional computational resources.
• The goal in strong scalability is to achieve linear or near-linear speedup as the number of processors increases, indicating efficient parallelization.
In practice, optimizing for strong scalability often involves parallelizing computationally intensive parts of the code and minimizing communication overhead between processors. On the other hand, optimizing for serial scalability focuses on improving the performance of the sequential portion of the code, which may involve applying scalar optimizations and reducing memory access latency.
The crossover point between optimizing the serial and parallel parts of an application depends on factors such as the relative performance improvement achievable in each part and the scalability characteristics of the application. Amdahl's Law provides a guideline for determining the optimal balance between serial and parallel optimization efforts based on the application's scalability characteristics.
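Amdahl's Law, mentioned above, makes the strong-scaling limit concrete: for a fixed-size problem with serial fraction s, speedup can never exceed 1/s. A short sketch (the serial fraction of 10% is an illustrative assumption):

```python
# Amdahl's Law: strong-scaling speedup for serial fraction s.
def amdahl_speedup(N, s):
    return 1.0 / (s + (1.0 - s) / N)

s = 0.10                              # assume 10% of the runtime is serial
for N in (2, 8, 64, 1024):
    print(N, round(amdahl_speedup(N, s), 2))
# No matter how large N grows, the speedup stays below 1/s = 10.
```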
Section-C (Each question of 4 mark)
9. Discuss the MESI protocol with steps and diagrams.
The MESI protocol is a cache coherence protocol used in multi-processor systems to maintain cache consistency between multiple cache copies of the same memory block. MESI stands for Modified, Exclusive, Shared, and Invalid, which are the four states a cache line can be in.
Here's a brief explanation of each state:
- Modified (M): The cache line has been modified, so main memory is out of date; this cache holds the only valid copy of the data.
- Exclusive (E): The cache line is up-to-date with the main memory, and no other cache has a copy of it.
- Shared (S): The cache line is unmodified and is shared with one or more other caches.
- Invalid (I): The cache line is invalid or not present in the cache.
Now, let's discuss the MESI protocol steps and states transition using a simple example and diagrams:
Suppose we have two caches, A and B, both initially containing copies of a particular memory block. Here's how the MESI protocol works:
Read Operation:
- When a processor wants to read a memory location, it first checks its cache.
- If the cache contains the data in the Exclusive or Shared state, it can be read directly from the cache.
- If the cache contains the data in the Modified state, it can be read directly from the cache, and no other cache can have a copy.
- If the cache contains the data in the Invalid state, the processor must fetch it from the main memory.
Write Operation:
- When a processor wants to write to a memory location, it must first acquire exclusive access to the cache line.
- If the cache line is in the Modified state, the processor can write to it directly.
- If the cache line is in the Shared state, the processor must first invalidate all other caches holding copies of the line and then change its own cache line to Modified.
- If the cache line is in the Exclusive state, the processor can write to it directly and change its state to Modified.
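The read and write steps above can be summarized as a state-transition table. The following is a simplified toy sketch, not a hardware model: it omits bus transactions, write-backs, and the E-on-read-miss optimization (a real read miss loads the line in Exclusive state when no other cache holds a copy; here it is simplified to Shared).

```python
# Toy MESI transition table for one cache line, driven by local
# accesses and remote (bus-snooped) events.
TRANSITIONS = {
    # (current state, event) -> next state
    ("I", "local_read"):   "S",   # fetch; simplified (real MESI may load as E)
    ("I", "local_write"):  "M",   # fetch and invalidate other copies
    ("S", "local_read"):   "S",
    ("S", "local_write"):  "M",   # invalidate other sharers first
    ("E", "local_read"):   "E",
    ("E", "local_write"):  "M",   # silent upgrade, no bus traffic needed
    ("M", "local_read"):   "M",
    ("M", "local_write"):  "M",
    ("M", "remote_read"):  "S",   # write back, then share
    ("E", "remote_read"):  "S",
    ("S", "remote_write"): "I",   # another cache claims exclusivity
    ("E", "remote_write"): "I",
    ("M", "remote_write"): "I",   # write back, then invalidate
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)

state = "I"
for ev in ("local_read", "local_write", "remote_read"):
    state = step(state, ev)
    print(ev, "->", state)        # I -> S -> M -> S
```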
(State-transition diagram: see the standard MESI figures in the textbook, which show the transitions between M, E, S, and I triggered by local and remote reads and writes.)
10. Explain Network Connections (Topologies) with suitable diagrams and examples.
Network connections, also known as topologies, are crucial in high-performance computing environments as they determine the efficiency and performance of data communication between various nodes or processors. In this context, various network technologies and topologies are utilized, each with its own set of characteristics and suitability for different applications.
Let's explore some of the basic network topologies along with suitable diagrams and examples.
1. Point-to-Point Connections:
Introduction: Point-to-point connections involve direct communication between two nodes or devices without any intermediate devices.
Features: Direct and dedicated communication path between two endpoints. Low latency and high bandwidth potential. Simple to implement and understand.
Limitations: Scalability issues as the number of connections increases. Limited to connecting only two devices at a time.
Drawbacks: Inefficient for large-scale systems requiring multiple simultaneous connections. Complex to manage as the number of connections grows.
Example: Direct connections between nodes in a peer-to-peer network.
2. Buses:
Introduction: A bus topology consists of a shared communication medium where all devices are connected.
Features: Simple and cost-effective to implement. Easy to understand and troubleshoot. Suitable for small-scale systems with low data transfer requirements.
Limitations: Limited bandwidth due to the shared nature of the medium. Susceptible to congestion and collisions, especially as the number of connected devices increases.
Drawbacks: Performance degradation as more devices are added to the bus. Single point of failure for the entire network.
Example: PCI (Peripheral Component Interconnect) bus used in desktop computers.
3. Switched and Fat-Tree Networks:
Introduction: Switched networks use switches to connect devices in a hierarchical manner, while fat-tree networks consist of multiple layers of switches.
Features: Hierarchical structure offers scalability and flexibility. Can support multiple simultaneous connections without congestion. Redundancy and fault tolerance can be built into the network.
Limitations: Costlier to implement compared to simpler topologies like buses. Complex configuration and management.
Drawbacks: Latency may increase with the number of switches and layers. Higher susceptibility to network failures due to the presence of multiple components.
Example: Data center networks often use fat-tree topologies for high-performance communication.
4. Mesh Networks:
Introduction: Mesh networks use multidimensional structures where each node is connected to its neighbors.
Features: Scalable and adaptable to different system sizes. Redundancy and fault tolerance due to multiple communication paths. Can handle high volumes of data traffic efficiently.
Limitations: Increased complexity with larger network sizes. Higher implementation and maintenance costs.
Drawbacks: Potential for network congestion if not properly managed. Higher latency compared to simpler topologies like buses.
Example: IBM Blue Gene and Cray XT systems utilize mesh networks for parallel communication.
5. Hybrids:
Introduction: Hybrid networks combine multiple topologies to achieve desired performance and scalability.
Features: Flexibility to tailor the network architecture to specific requirements. Can leverage the strengths of different topologies to optimize performance. Redundancy and fault tolerance can be enhanced by integrating diverse network elements.
Limitations: Increased complexity in design and implementation. Higher cost and resource requirements compared to single-topology networks.
Drawbacks: Challenges in managing and troubleshooting hybrid architectures. Compatibility issues between different network components.
Example: Clusters of shared-memory nodes with fat-tree interconnects for inter-node communication.
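The topologies above trade link count against network diameter. A small comparison sketch using standard textbook formulas (assumptions: p nodes, bidirectional links, the bus counted as a single shared link, a square 2D torus with even side length, and a hypercube with p a power of two):

```python
# Link count and diameter for some common topologies with p nodes.
import math

def props(topology, p):
    if topology == "bus":       return (1, 1)              # shared medium
    if topology == "ring":      return (p, p // 2)
    if topology == "2d-torus":                              # assumes square,
        n = int(math.isqrt(p)); return (2 * p, n)           # even side n
    if topology == "hypercube":                             # assumes p = 2^d
        d = int(math.log2(p));  return (p * d // 2, d)

for t in ("bus", "ring", "2d-torus", "hypercube"):
    links, diameter = props(t, 64)
    print(f"{t:10s} links={links:4d} diameter={diameter}")
```

The hypercube buys its small diameter with many links per node, which is one reason fat-tree and torus networks dominate in practice.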
Section-D (6-mark question)
11. Discuss Data and Function parallelism along with scalability and other factors that limit parallel execution in detail.
Before diving into parallel programming, it's crucial to understand some basic rules. Many people think that adding more hardware automatically makes programs run faster, but this isn't always true. In reality, billions of CPU hours are wasted each year because people don't understand the limits of parallel execution.
Why Parallelize?
Parallelization, or using multiple processors to perform tasks simultaneously, is common in today's computers. But not everyone needs to write parallel programs. If a single processor can handle the job, there's no need to complicate things.
People usually turn to parallelization when:
- A task takes too long to complete on a single processor.
- The task requires more memory than what's available on one system.
As computers move towards using multiple cores, the first problem becomes more common. In the past, people used techniques like storing data on disk and loading it as needed to solve the second problem. But as computers get faster, this technique is becoming less effective.
Parallelism:
Parallel programming starts with finding parts of a task that can be done simultaneously. Different tasks require different approaches to parallelization. While we'll cover some basic methods here, there's a lot more to learn if you're interested.
Understanding parallelism is essential because it helps choose the right approach for a given task. It's like figuring out the best way to divide up a group project so everyone can work efficiently.
Data Parallelism:
Data parallelism involves distributing data across multiple computing resources to perform the same operation simultaneously on different data sets. It aims to exploit parallelism by dividing the data and processing it in parallel.
Features:
Allows simultaneous processing of multiple data elements using the same operations.
Well-suited for tasks that can be divided into independent units of work.
Effective for parallel processing on distributed systems and multicore architectures.
Limitations:
Data dependencies between elements restrict how finely the work can be divided.
Scalability is bounded by the amount of data available to distribute.
Drawbacks:
Increased communication overhead for coordinating tasks and sharing data.
Limited by the amount of available data to parallelize effectively.
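A minimal data-parallel sketch: the same operation applied concurrently to disjoint pieces of the data. Threads stand in for parallel workers here to keep the example portable; real data-parallel codes would use processes, MPI ranks, or GPU lanes.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):                  # identical operation for every element
    return x * x

data = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(square, data))   # data split across workers
print(result)                   # [0, 1, 4, 9, 16, 25, 36, 49]
```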
Function Parallelism
Function parallelism involves executing different functions or tasks concurrently. Each processor works on a distinct part of the overall task. This approach is beneficial when different parts of the task can be performed independently and do not require significant communication between processors.
Features:
Allows different functions or tasks to be executed simultaneously.
Can exploit diverse computational resources for different types of processing.
Provides flexibility in task allocation and resource utilization.
Limitations:
Coordination and synchronization overhead between parallel functions.
Complexity in managing dependencies and ensuring correct execution order.
Drawbacks:
Potential for resource contention and inefficient use of computing resources.
Limited by the availability of independent functions suitable for parallel execution.
Example:
Web server handling multiple requests concurrently through parallel processing.
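In contrast to the data-parallel sketch, a function-parallel sketch runs *different* tasks concurrently on the same input. The three task functions are invented for illustration; again threads stand in for parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor

def total(xs):   return sum(xs)            # task 1
def largest(xs): return max(xs)            # task 2
def mean(xs):    return sum(xs) / len(xs)  # task 3

data = [4, 8, 15, 16, 23, 42]
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f, data) for f in (total, largest, mean)]
    results = [f.result() for f in futures]
print(results)                             # [108, 42, 18.0]
```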
Scalability and Limitations:
Scalability refers to the ability of a parallel system to efficiently handle increasing workloads or resources. However, several factors can limit parallel execution scalability:
Choosing the Right Scaling Baseline:
Parallel systems comprise multiple hierarchy levels, requiring careful consideration when scaling parallel codes.
Scalability analysis should be reported in relation to relevant scaling baselines, considering factors like intra-node and inter-node scalability.
Ignoring hierarchical structure can lead to inaccurate scalability assessments.
Case Study: Can Slower Processors Compute Faster?
The concept of using slower processors in parallel systems to improve scalability depends on various factors such as communication overhead and work distribution.
A performance model for slow computers considers factors like communication overhead and problem size to determine scalability and performance gains.
Load Imbalance:
Load imbalance occurs when some parallel workers idle while others perform useful work, leading to underutilization of resources.
Reasons for load imbalance include algorithmic issues, optimization problems, and coarse granularity of problems.
Mitigating load imbalance requires strategies for optimized work distribution and resource utilization.
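Load imbalance can be quantified from per-worker timings. One simple (assumed) metric is the ratio of the slowest worker's time to the mean; 1.0 means perfect balance, and anything above it measures how much the stragglers stretch the runtime:

```python
# Simple load-imbalance metric: slowest worker relative to the mean.
def load_imbalance(times):
    return max(times) / (sum(times) / len(times))

balanced   = [10.0, 10.0, 10.0, 10.0]
imbalanced = [10.0, 10.0, 10.0, 20.0]   # one straggler idles the rest
print(load_imbalance(balanced))    # 1.0
print(load_imbalance(imbalanced))  # 1.6 -> runtime set by the straggler
```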
OS Jitter:
OS jitter refers to delays caused by operating system activities that impact parallel program performance.
As the number of parallel workers increases, OS noise can lead to increased load imbalance and performance variability.
Strategies to reduce OS jitter include minimizing OS activity and synchronizing periodic activities across all workers.




