Matrix Multiplication: C++ Multithreading Without Thread Synchronization

In my previous blog, I mentioned that “we may be better off writing multithreaded programs for our problems without explicitly using these (thread synchronization) mechanisms, if and whenever possible.” We saw a C++ multithreaded code in action using Boost.Asio Strand. In this blog, I am going to continue this topic and show you another working C++ multithreaded program for matrix multiplication without thread synchronization.

1. Multithreading Without Thread Synchronization: What Kind of Problems Can We Solve With This Approach?

The problems at hand must be able to be decomposed in such a way that its subtasks can be completed independently. From the perspective of working threads, this means that each thread can work on a subtask from start to the end without waiting for the results of other threads. In the eyes of data, it implies that each thread would only manipulate its own portion of the shared data. Typical problems in this category can be image processing (pixel manipulation), vector computation and matrix processing.

2. A C++ Multithreaded Program for Matrix Multiplication Without Thread Synchronization

The following code is a C++ class “Matrix” multiplying two matrices, each of SIZE by SIZE with elements of data type T. It stores matrices A, B, C as its private data members and compute C = A * B in its multiply() member function. As we shall see in the main(), this member function is also our thread function. The number of threads multiplying matrices is stored in its “num_thread” data member and initialized with “num” passed from its constructor. The output() member function is for us to check out the result. The code using this class is shown in the main().

#include<iostream>
#include<boost/thread/thread.hpp>
#include<boost/bind.hpp>
using std::cout;
using std::endl;
using std::cin;
// multiply matrice of SIZE by SIZE of type T
template<typename T, int SIZE = 10>
class Matrix
{
public:
  Matrix(int num)
  {
   init(A); 
   init(B);  
   //init(C); // initialize matrix C in multiply() instead
   num_thread = num;
  }
 

 void multiply(const int& slice)   
 {         
   // each thread works on its own separate slice only
   // so there is no need for synchronization among threads
   // note that this 'slicing' works fine
   // even if SIZE is not divisible by num_thread
   const int from = (slice * SIZE)/num_thread; 
   const int to = ((slice+1) * SIZE)/num_thread; 

   for (int i = from; i < to; i++)
   {
    for (int j = 0; j < SIZE; j++)
    {
     C[i][j] = 0; // initialize matrix C
     for (int k = 0; k < SIZE; k++)
      C[i][j] += A[i][k] * B[k][j];
    }   
   }
 }

 void output()
 {
   print(A);
   cout << endl<< "   * " << endl;
   print(B);
   cout << endl << "   = " << endl;
   print(C);
   cout << endl;
 }
private:
  T A[SIZE][SIZE];  // matrix multiplication
  T B[SIZE][SIZE];  // C = A * B 
  T C[SIZE][SIZE];
  int num_thread;   // number of threads
 
  void init(T M[SIZE][SIZE]) // initialize a matrix
  {  
   int value = 0;
   for (int i = 0; i < SIZE; i++)
   {
    for (int j = 0; j < SIZE; j++)
     M[i][j] = value++;
   }
  } 

  void print(const T M[SIZE][SIZE]) const // print out result
  {
    for (int i = 0; i < SIZE; i++)
    {
     std::cout << endl << "     |";
     for (int j = 0 ; j < SIZE; j++)
     {
      cout << M[i][j] << "  ";
     }
     cout << "|";
   }
  }
};

int main() {
   cout << "There are " << boost::thread::hardware_concurrency()
        << " cores/processors on your computer." << endl;
   cout << "How many threads do you want to use?" << endl;

   int thrd_num;
   do {
     cout << "Your answer: ";
     cin >> thrd_num;
   } while (thrd_num <= 0);

   // default matrix of 10 by 10 integers for multiplication
   // concurrently computed by thrd_mum of working threads
   Matrix<int> m(thrd_num);
 
   boost::thread_group threads;
   for (int i = 0; i < thrd_num; i++)
   {
     threads.create_thread(
     boost::bind(
         &Matrix<int>::multiply,
         boost::ref(m),
         i));
   }

   // main thread can do other things here concurrently. We just
   // simply let it wait for working threads to complete

   threads.join_all();

   m.output(); 
   return 0;
}

In its private section, the init() private member function initializes matrices A and B and the print() private member function prints out a matrix. As these functions are used multiple times in the public member functions, they are factored out and placed into its private section. It is also a very good practice to put implementation-specific stuffs into a class’s private part. Advantage: if you want to change your implementation after your code is published, you can change it without changing the interface of your class, thus no impact on how your users would use your code. Here, for example, you can change your mind and re-write the init() function so that each element in the matrix would have a random value of integer (when T is int).

3. Thead Function: multiply() in the Class

This thread function is where the task of matrix multiplication is decomposed into multiple subtasks that each thread can work on concurrently. It shows a solid example of the rationale of “Concurrency Without Thread Synchronization”. Can you ask yourself this: how many C++ books have you read that this topic was talked about in one page after another? I read too many of them. However, a solid, working example is worthy of a thousands of words. (Oh, your remind me: detoured!) Each thread is given a “slice” of data to work on that is passed in from its argument. Notice the “const” key words I used in the function: we do not want the thread to cross the border of its portion of data stepping on the data assigned to other threads. That would destroy the program.

4. What Is Going On Under Hood?

In the following code several lines for debugging prompts were added and the code should tell you more than I can with words. As mentioned in my previous blog, std::cout is not thread-safe, protecting it with a mutex is a must to see correct prompt information from each thread. Now the code is no long “Without Thread Synchronization”. I am taking the risk of ruining my reputation for you to understand what is going on under the hood. The good news is that you may just run and play with it. I am sure you will have more fun with this version.

#include<iostream>
#include<boost/thread/thread.hpp>
#include<boost/bind.hpp>
using std::cout;
using std::endl;
using std::cin;
// multiply matrice of SIZE by SIZE of type T
template<typename T, int SIZE = 10>
class Matrix
{
public:
   Matrix(int num)
   {
     init(A); 
     init(B);  
     //init(C); // initialize matrix C in multiply() instead
     num_thread = num;
   }
   void multiply(const int& slice)   
   {         
      // each thread works on its own separate slice only
      // so there is no need for synchronization among threads
      // note that this 'slicing' works fine
      // even if SIZE is not divisible by num_thread
      const int from = (slice * SIZE)/num_thread; 
      const int to = ((slice+1) * SIZE)/num_thread; 
      // the following 8 lines are added for debugging purpose
      io_mutex.lock();
          std::cout << "Thread (Id="
          <<boost::this_thread::get_id()
          << ") is computing slice "
          << slice << " from row "
          << from << " to colum "
           << to-1 << std::endl;
      io_mutex.unlock();
      for (int i = from; i < to; i++)
     {
       for (int j = 0; j < SIZE; j++)
       {
         C[i][j] = 0; // initialize matrix C
         for (int k = 0; k < SIZE; k++)
         C[i][j] += A[i][k] * B[k][j];
       }   
     }
     // the following 3 lines are added for debugging purpose
     io_mutex.lock();
       std::cout << "Finished slice " << slice << std::endl; 
     io_mutex.unlock();
   }
   void output()
   {
      print(A);
      cout << endl<< "   * " << endl;
      print(B);
      cout << endl << "   = " << endl;
      print(C);
     cout << endl;
   }
private:
   T A[SIZE][SIZE];  // matrix multiplication
   T B[SIZE][SIZE];  // C = A * B 
   T C[SIZE][SIZE];
   int num_thread;   // number of threads
   // the following mutex is used for debugging purpose
   boost::mutex io_mutex; 
 
   void init(T M[SIZE][SIZE]) // initialize a matrix
   {  
      int value = 0;
      for (int i = 0; i < SIZE; i++)
      {
         for (int j = 0; j < SIZE; j++)
         M[i][j] = value++;
      }
    } 
    void print(const T M[SIZE][SIZE]) const // print out result
    {
       for (int i = 0; i < SIZE; i++)
       {
         std::cout << endl << "     |";
         for (int j = 0 ; j < SIZE; j++)
        {
           cout << M[i][j] << "  ";
        }
        cout << "|";
    }
  }
};
int main()
{
   cout << "There are " << boost::thread::hardware_concurrency()
        << " cores/processors on your computer." << endl;
   cout << "How many threads do you want to use?" << endl;
   int thrd_num;
   do {
      cout << "Your answer: ";
      cin >> thrd_num;
   } while (thrd_num <= 0);
   // this line is added for debugging purpose
   cout << endl << "Main() thread (Id=" << boost::this_thread::get_id()
       << ")" << endl;
    // default matrix of 10 by 10 integers for multiplication
    // concurrently computed by thrd_mum of working threads
    Matrix<int> m(thrd_num);
 
    boost::thread_group threads;
    for (int i = 0; i < thrd_num; i++)
    {
      threads.create_thread(
      boost::bind(
         &Matrix<int>::multiply,
         boost::ref(m),
         i));
   }
   // main thread can do other things here concurrently. We just
   // simply let it wait for working threads to complete
   threads.join_all();
   m.output(); 
   return 0;
}

The following part of the output after I ran it with 4 threads:

There are 2 cores/processors on your computer.
How many threads do you want to use?
Your answer: 4

Main() thread (Id=00163E30)
Thread (Id=00163938) is computing slice 0 from row 0 to colum 1
Finished slice 0
Thread (Id=001640D0) is computing slice 2 from row 5 to colum 6
Finished slice 2
Thread (Id=00163F78) is computing slice 1 from row 2 to colum 4
Finished slice 1
Thread (Id=00164228) is computing slice 3 from row 7 to colum 9
Finished slice 3
// ... other output

We can clearly see that there are four working threads performing matrix multiplication concurrently with the idling main thread (Id=00163E30). As an exercise, I ask you to modify the code to make the main() working on slice 0, therefore keeping everybody equally busy.

5. Play With Matrices of Different Dimension (SIZE)

To do it, simply change the main() of above program with the following code to see the changes I made in two locations (lines 14 and 21). However, if SIZE is very large, the result shown in the command line window may be difficult for you to read. Anyway, have fun with it.

int main() {
   cout << "There are " << boost::thread::hardware_concurrency()
           << " cores/processors on your computer." << endl;
   cout << "How many threads do you want to use?" << endl;

   int thrd_num;
   do {
      cout << "Your answer: ";
      cin >> thrd_num;
   } while (thrd_num <= 0);

   // default matrix of 15 by 15 integers for multiplication
   // concurrently computed by thrd_mum of working threads
   Matrix<int,15> m(thrd_num);
 
   boost::thread_group threads;
   for (int i = 0; i < thrd_num; i++)
   {
      threads.create_thread(
          boost::bind(
         &Matrix<int,15>::multiply,
         boost::ref(m),
         i));
   }

   // main thread can do other things here concurrently. We just
   // simply let it wait for working threads to complete

   threads.join_all();

   m.output(); 
   return 0;
}

1 Response to Matrix Multiplication: C++ Multithreading Without Thread Synchronization

Lakshmana Maddineni says:

May 27, 2012 at 5:05 am

Can you ask yourself this: how many C++ books have you read that this topic was talked about in one page after another? I read too many of them. However, a solid, working example is worthy of a thousands of words.

You are right. None of the books and materials cover simple examples about multithreading. Few years back I have worked one critical bank project where one of the requirement is matrix multiplication C[10,000][10,0000] = A[10,000][10,0000] * B[10,000][10,0000]
and it is developed in ACE C++ Multithreads. After working that project, I am got to know the real use of threads and the benefits. Unfortunately none one the books cover these examples.

Very thankful to you for covering very good example on threads and implementing it in different ways.