Consider the following code that adds two matrices A and B and stores the result in a matrix C:
for (i = 0 to 15) {
    for (j = 0 to 31) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
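For reference, the sequential loop above can be sketched in runnable form. This is a minimal illustration only: the names `ROWS`, `COLS`, and `add_matrices`, and the sample contents of `A` and `B`, are assumptions made here and do not come from the problem statement.

```python
ROWS, COLS = 16, 32  # loop bounds 0..15 and 0..31 from the pseudocode


def add_matrices(A, B):
    """Return C with C[i][j] = A[i][j] + B[i][j], visiting elements row by row."""
    C = [[0] * COLS for _ in range(ROWS)]
    for i in range(ROWS):        # i = 0 to 15, inclusive
        for j in range(COLS):    # j = 0 to 31, inclusive
            C[i][j] = A[i][j] + B[i][j]
    return C


# Hypothetical sample data, chosen only so the result is easy to check.
A = [[i + j for j in range(COLS)] for i in range(ROWS)]
B = [[1] * COLS for _ in range(ROWS)]
C = add_matrices(A, B)
```

The nested loops perform one addition per element, so 16 x 32 = 512 additions in total.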
Two possible ways to parallelize this loop are illustrated below:
(a) For each Pk in {0, 1, 2, 3}:
    for (i = 0 to 15) {
        // Inner Loop Parallelization
        for (j = Pk*7 + Pk to (Pk+1)*7 + Pk) {
            C[i][j] = A[i][j] + B[i][j];
        }
    }
(b) For each Pk in {0, 1, 2, 3}:
    // Outer Loop Parallelization
    for (i = Pk*3 + Pk to (Pk+1)*3 + Pk) {
        for (j = 0 to 31) {
            C[i][j] = A[i][j] + B[i][j];
        }
    }
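The bound expressions in (a) and (b) are easier to read once evaluated for each core. The sketch below tabulates the inclusive index range each Pk receives and checks that the ranges cover every column (or row) exactly once. The helper names `inner_bounds`, `outer_bounds`, and `covered` are hypothetical and introduced only for this illustration.

```python
CORES = range(4)  # Pk in {0, 1, 2, 3}


def inner_bounds(pk):
    """Inclusive column range (lo, hi) for core pk under scheme (a)."""
    return pk * 7 + pk, (pk + 1) * 7 + pk


def outer_bounds(pk):
    """Inclusive row range (lo, hi) for core pk under scheme (b)."""
    return pk * 3 + pk, (pk + 1) * 3 + pk


def covered(bounds, n):
    """True if the per-core inclusive ranges cover 0..n-1 exactly once."""
    seen = []
    for pk in CORES:
        lo, hi = bounds(pk)
        seen.extend(range(lo, hi + 1))
    return sorted(seen) == list(range(n))


for pk in CORES:
    print(pk, inner_bounds(pk), outer_bounds(pk))
# 0 (0, 7) (0, 3)
# 1 (8, 15) (4, 7)
# 2 (16, 23) (8, 11)
# 3 (24, 31) (12, 15)
```

In other words, scheme (a) gives core Pk columns 8*Pk through 8*Pk + 7 of every row, while scheme (b) gives it rows 4*Pk through 4*Pk + 3 in full.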
Assuming a quad-core multiprocessor and that the elements of matrices A, B, and C are stored in row-major order, answer the following questions.
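As a reminder of what row-major storage implies, element [i][j] of a 16 x 32 matrix sits at element offset i*32 + j from the start of the matrix. The sketch below lists the offsets one core touches under each scheme, in loop order, and reports whether they form a single contiguous run. The function names are hypothetical, and the model ignores caches, element size, and alignment.

```python
ROWS, COLS = 16, 32  # matrix shape implied by the loop bounds


def offset(i, j):
    """Element offset of [i][j] from the start of a row-major matrix."""
    return i * COLS + j


def offsets_a(pk):
    """Offsets touched by core pk under scheme (a): columns 8*pk..8*pk+7 of every row."""
    return [offset(i, j) for i in range(ROWS) for j in range(8 * pk, 8 * pk + 8)]


def offsets_b(pk):
    """Offsets touched by core pk under scheme (b): rows 4*pk..4*pk+3 in full."""
    return [offset(i, j) for i in range(4 * pk, 4 * pk + 4) for j in range(COLS)]


def is_contiguous(offs):
    """True if consecutive accesses are always exactly one element apart."""
    return all(b - a == 1 for a, b in zip(offs, offs[1:]))


print(offsets_a(0)[:9])             # [0, 1, 2, 3, 4, 5, 6, 7, 32]
print(is_contiguous(offsets_a(0)))  # False
print(is_contiguous(offsets_b(0)))  # True
```

Under this simplified model, each core in scheme (a) accesses 8 adjacent elements and then jumps ahead by a full row, whereas each core in scheme (b) walks one unbroken block of 128 elements.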
(1) Using the tables below, show how parallelizations (a) and (b) would work, and determine how many cycles each would take to execute on a quad-core multiprocessor, assuming an addition takes only one cycle.
Table for (a):
Cycle | Pk = 0 | Pk = 1 | Pk = 2 | Pk = 3

Table for (b):
Cycle | Pk = 0 | Pk = 1 | Pk = 2 | Pk = 3
(2) Which parallelization is better and why?