CSCI-UA.0480-051: Parallel Computing

Final Exam – Fall 2023

Total: 100 points

Problem 1

a. [6 points] Suppose we have a single core. Based on Flynn's classification, which category does it fit (SISD, SIMD, MISD, or MIMD)? Justify your answer, stating any assumptions you think are needed. Assume there are no caches at all in that system.

b. [6] Repeat the above problem, but assume the core has an L1 cache of size 32 KB.

c. [12] We know that cache coherence has a negative effect on performance. State three reasons why.

Problem 2

Answer the following questions about MPI.

a. [6] Suppose we have an MPI program with four processes, and MPI_Comm_split() is called once. How many ranks can each process have after that call? If there are several answers, specify them all and justify your answer.

b. For the following piece of code (assume a very large number of cores):

int main(){
    int globalnum = 0;
    int numprocs, rank;
    int i = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for reduction(+:globalnum)
    for (i = 0; i < 2 + rank; i++)
    {
        globalnum++;
        …rest of loop body
    }

    MPI_Finalize();
}

We execute the above code with: mpirun -n 5 ./progname

1. [5] What is the maximum number of threads we will end up having in the whole system? Explain.

2. [5] Just before executing MPI_Finalize(), how many instances of globalnum do we have in the system? Justify.

3. [5] Just after executing MPI_Finalize() and before exiting main(), how many instances of globalnum do we have in the system? Justify.

c. [9] Suppose we have three processes. Each process has an array p of 3 integers, as follows: process 0 has [1, 2, 3], process 1 has [4, 5, 7], and process 2 has [7, 8, 9]. Suppose all the processes execute the following call:

MPI_Reduce(p, q, 3, MPI_INT, MPI_BAND, 1, MPI_COMM_WORLD);

where p is a pointer to the array and q is a pointer to a receiving array. Assume the receiving array q is initially [0, 0, 0] for all processes. After executing the above function, what will be the content of q for each process?
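As a rough illustration of how an MPI_BAND reduction combines its inputs, the following plain-C sketch folds several send buffers together with element-wise bitwise AND. The helper name band_reduce and the flat row-major layout of the buffers are assumptions made for this sketch; neither is part of MPI, and the sketch models only the combine step, not communication.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MPI): combines nprocs send buffers of
 * `count` ints with element-wise bitwise AND, which is how MPI_BAND
 * combines contributions. Assumed layout: send[r * count + i] is
 * element i contributed by rank r. */
void band_reduce(const int *send, size_t nprocs, size_t count, int *recv)
{
    assert(nprocs > 0);
    for (size_t i = 0; i < count; i++) {
        int acc = send[i];               /* start from rank 0's element */
        for (size_t r = 1; r < nprocs; r++)
            acc &= send[r * count + i];  /* fold in rank r's element */
        recv[i] = acc;
    }
}
```

For example, with two hypothetical ranks contributing [12, 10, 7] and [6, 3, 5], the helper writes [4, 2, 5] into recv. In a real MPI_Reduce call, the combined result is delivered to the receive buffer of the rank named by the root argument.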

Problem 3

For the following code snippet (questions a to d are independent of each other):

1.   #pragma omp parallel num_threads(16)
2.   {
3.       #pragma omp for schedule(static,2) nowait
4.       for(i=0; i<N; i++){
5.           a[i] = ....
6.       }
7.
8.       #pragma omp for schedule(static,2)
9.       for(i=0; i<N; i++){
10.          ... = a[i]
11.      }
12.  }

a. [6 points] For each of the following situations, indicate whether a race condition is possible.

| Scenario | Race condition (Y/N) |
|---|---|
| The code as it is above | |
| We remove "nowait" from line 3 | |
| We keep the "nowait" in line 3 but change the schedule in line 3 to (dynamic, 2) | |
| We keep the "nowait" in line 3 but change the schedule in line 8 to (dynamic, 2) | |
| We remove the "nowait" from line 3 and put it in line 8 | |
| We remove the whole of line 1, use the directive "#pragma omp parallel for" in both lines 3 and 8, and remove the "nowait" of line 3 | |

b. [7 points] How many threads exist at line 7 (the empty line between the two for-loops)? Explain how you reached that number in no more than two lines.

c. [7 points] How many threads exist just after line 12? Explain how you reached that number in no more than two lines.

d. [5 points] Suppose the above code executes on a multicore that has 16 cores, where each core is superscalar but not hyperthreaded. If N = 16, how many threads can execute in parallel on this multicore, assuming no other processes are running on it? Justify your answer.
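When reasoning about the schedule(static,2) clauses in the snippet above, it can help to see which thread owns which iteration. The sketch below assumes the round-robin chunk assignment that OpenMP describes for a static schedule with an explicit chunk size, and a loop that starts at 0 with step 1 as in the snippet. The helper name static_owner is hypothetical and is not an OpenMP API.

```c
#include <assert.h>

/* Hypothetical helper (not an OpenMP API): returns the number of the
 * thread that executes iteration `i` under schedule(static, chunk) with
 * `nthreads` threads, assuming chunks are handed out round-robin in
 * thread-number order and the loop runs i = 0, 1, 2, ... */
int static_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;  /* chunk index, wrapped over the team */
}
```

With chunk = 2 and 16 threads, iterations 0 and 1 go to thread 0, iterations 2 and 3 to thread 1, and iteration 32 wraps back to thread 0. A dynamic schedule has no such fixed mapping: chunks go to whichever thread asks for work next.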

Problem 4

Answer the following questions related to GPUs and CUDA.

a. [7] Suppose we have two kernels, kernelA and kernelB, in the same program. The program launches kernelA. When kernelA finishes, it generates data to be used by kernelB. What is the best way for kernelA to send data to kernelB? Justify.

b. [5] Can we make two kernels, from the same program, execute on the same GPU at the same time? If yes, explain how. If not, explain why not.

c. [9] If the warp concept did not exist, what would be the implications on GPU programming? State three implications and, for each one, explain in 1-2 lines.



