
CS3214 Fall 2024

Project 2 - “A Fork-Join Framework”

Due: See website for due date. (Late days may be used.)

What to submit: Upload a tar ball using the p2 identifier that includes the following files:

- partner.json with the SLO IDs in the format described for Project 1.

- threadpool.c with your code.

- threadpool.pdf with your project description. Use a suitable word processing program to produce the PDF file.

We will be using the provided fjdriver.py file to test your code. Please see Section 5.3 for more information.

1 Background

The last two decades have seen the widespread use of processors that support multiple cores that can act as independent CPUs. Today, even processors used in smartphones contain 4 or more cores. Software has been slow to catch up, despite calls for programming models that make it easy to write scalable programs for multicore systems [1].

As a case study, consider the std::async function that is part of the C++11 standard. The reference documentation on cppreference.com provides the example shown in Figure 1.

This toy example sums up the elements of a vector, which here are initialized to 1, using a recursive divide-and-conquer approach. At each level of recursion, the array is subdivided into two equal parts, one of which is passed to std::async to be executed in a separate thread, whereas the other part is recursively performed by the calling thread. std::async returns a handle of type std::future, which represents a reference to a result of a computation that is executed asynchronously. When the computation’s result is needed, a thread may invoke the future’s get() method. get() will return the result, arranging for—or waiting for—its computation as necessary.

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <future>

template <typename RAIter>
int parallel_sum(RAIter beg, RAIter end)
{
    auto len = std::distance(beg, end);
    if (len < 1000)
        return std::accumulate(beg, end, 0);

    RAIter mid = beg + len/2;
    auto handle = std::async(std::launch::async,
                             parallel_sum<RAIter>, mid, end);
    int sum = parallel_sum(beg, mid);
    return sum + handle.get();
}

int main()
{
    std::vector<int> v(100000000, 1);
    std::cout << "The sum is " << parallel_sum(v.begin(), v.end())
              << '\n';
}

Figure 1:  A parallel sum implementation in C++11. This is a slightly modified version of the example published at http://en.cppreference.com/w/cpp/thread/async. Instead of 10,000, this program is summing up a vector with 100,000,000 elements.

Compiling and running this program under g++ 11.5.0, one obtains the following output:

$ g++ -pthread -O2 cppasyncpsum.cc -o cppasyncpsum
$ ./cppasyncpsum
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Aborted (core dumped)

The reason for this failure is that C++11’s std::async is implemented by blindly spawning kernel-level threads (roughly 10^5 of them), without any regard to the amount of resources used by those threads.

This small example motivates the need for frameworks that do better than spawning one thread for each parallel task.

In this project, you will create a small fork/join framework that allows the parallel execution of divide-and-conquer algorithms such as the one shown in the example in Figure 1 in a resource-efficient manner. To that end, you will create a thread pool implementation for dynamic task parallelism, focusing on the execution of so-called fork/join tasks. Your implementation should avoid excessive resource use in order to avoid crashes like the one seen in this example.

Figure 2: A work stealing thread pool. Worker threads execute tasks by popping them from the bottom of their deques. If they run out of work, they first attempt to dequeue tasks from a global submission queue. Failing that, they attempt to steal tasks from the top of other workers’ deques. New tasks may be submitted externally to the global queue, but tasks spawned during the execution of a task are pushed onto the bottom of the executing worker’s deque.

2 Thread Pools and Futures

Your fork-join thread pool should implement the following API:

/**
 * threadpool.h
 *
 * A work-stealing, fork-join thread pool.
 */

/*
 * Opaque forward declarations. The actual definitions of these
 * types will be local to your threadpool.c implementation.
 */
struct thread_pool;
struct future;

/* Create a new thread pool with no more than n threads.
 * If any of the threads cannot be created, print
 * an error message and return NULL. */
struct thread_pool * thread_pool_new(int nthreads);

/*
 * Shutdown this thread pool in an orderly fashion.
 * Tasks that have been submitted but not executed may or
 * may not be executed.
 *
 * Deallocate the thread pool object before returning.
 */
void thread_pool_shutdown_and_destroy(struct thread_pool *);

/* A function pointer representing a 'fork/join' task.
 * Tasks are represented as a function pointer to a
 * function.
 * 'pool' - the thread pool instance in which this task
 *          executes
 * 'data' - a pointer to the data provided in thread_pool_submit
 *
 * Returns the result of its computation.
 */
typedef void * (* fork_join_task_t) (struct thread_pool *pool, void *data);

/*
 * Submit a fork join task to the thread pool and return a
 * future.  The returned future can be used in future_get()
 * to obtain the result.
 * 'pool' - the pool to which to submit
 * 'task' - the task to be submitted.
 * 'data' - data to be passed to the task's function
 *
 * Returns a future representing this computation.
 */
struct future * thread_pool_submit(
        struct thread_pool *pool,
        fork_join_task_t task,
        void *data);

/* Make sure that the thread pool has completed the execution
 * of the fork join task this future represents.
 *
 * Returns the value returned by this task.
 */
void * future_get(struct future *);

/* Deallocate this future.  Must be called after future_get() */
void future_free(struct future *);
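To make the API concrete, the following is a minimal client-side sketch of the parallel sum from Figure 1 expressed on top of this interface. The sum_task function, the sum_args struct, the pool size of 4, and the cutoff of 1000 are illustrative choices, not part of the required API.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "threadpool.h"

struct sum_args { int *begin; size_t n; };

static void *sum_task(struct thread_pool *pool, void *data)
{
    struct sum_args *a = data;
    if (a->n < 1000) {                      /* base case: sum sequentially */
        intptr_t sum = 0;
        for (size_t i = 0; i < a->n; i++)
            sum += a->begin[i];
        return (void *) sum;
    }
    /* fork the right half, recurse on the left half, then join */
    struct sum_args right = { a->begin + a->n / 2, a->n - a->n / 2 };
    struct future *f = thread_pool_submit(pool, sum_task, &right);
    struct sum_args left = { a->begin, a->n / 2 };
    intptr_t sum = (intptr_t) sum_task(pool, &left) + (intptr_t) future_get(f);
    future_free(f);
    return (void *) sum;
}

int main(void)
{
    size_t n = 100000000;
    int *v = malloc(n * sizeof *v);
    for (size_t i = 0; i < n; i++)
        v[i] = 1;

    struct thread_pool *pool = thread_pool_new(4);
    struct sum_args all = { v, n };
    struct future *f = thread_pool_submit(pool, sum_task, &all);
    printf("The sum is %ld\n", (long) (intptr_t) future_get(f));
    future_free(f);
    thread_pool_shutdown_and_destroy(pool);
    free(v);
}

Note that every call to thread_pool_submit() is joined via future_get() within the same function invocation, so this computation is fully strict in the sense of Section 2.2.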

2.1 Work Stealing

There are at least two common ways in which multiple threads can share the execution of dynamically created tasks: work sharing and work stealing. In a work sharing approach, tasks are submitted to a single, central queue from which all threads remove tasks.  The drawback of this approach is that this central queue can become a point of contention, particularly for applications that create many small tasks.

Instead, a work stealing approach is recommended [2] which has been shown to lead to better load balancing and lower synchronization requirements.  In a work stealing pool, each worker thread maintains its own local queue of tasks, as shown in Figure 2. Each queue is a double-ended queue (deque) which allows insertion and removal from both the top and the bottom.  When a task run by a worker spawns a new task, it is added (pushed) to the bottom of that worker’s queue. Workers execute tasks by popping them from the bottom, thus following a LIFO order.  If a worker runs out of tasks, it checks a global submission queue for tasks. If a task can be found in it, it is executed. Otherwise, the worker attempts to steal tasks to work on from the top of other threads’ queues.
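The lookup order just described might translate into a worker routine along the following lines. This is only a sketch: deque_pop_bottom, global_dequeue, deque_steal_top, and the worker/pool fields are hypothetical names standing in for whatever data structures and locking you choose.

/* Sketch: where a worker looks for its next task (hypothetical helpers). */
static struct future *next_task(struct worker *self, struct thread_pool *pool)
{
    struct future *f;

    /* 1. Newest task spawned by this worker, from the bottom of its own deque (LIFO). */
    if ((f = deque_pop_bottom(self)) != NULL)
        return f;

    /* 2. An externally submitted task from the global submission queue. */
    if ((f = global_dequeue(pool)) != NULL)
        return f;

    /* 3. Steal the oldest task from the top of some other worker's deque. */
    for (int i = 0; i < pool->nthreads; i++)
        if ((f = deque_steal_top(&pool->workers[i])) != NULL)
            return f;

    return NULL;   /* nothing found; the caller should block until work is signaled */
}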

In this assignment, you are asked to implement a work stealing thread pool. Since work stealing is purely a performance optimization, you may for reduced credit (corresponding to a B letter grade) implement a work sharing approach.

2.2 Helping

A naive attempt at implementing future_get() would have the calling thread block if the task associated with that future has not yet been computed. “Blocking” here means to wait on a synchronization device such as a condition variable or semaphore until it is signaled by the thread computing the future. However, this approach risks thread starvation: if a worker thread blocks while attempting to call future_get(), it is easily possible for all worker threads to be blocked on futures, leading to a deadlock because no worker threads are available to compute the tasks on which the workers are blocked!

Instead, worker threads that attempt to resolve a future that has not yet been computed must help in its execution.  If the future’s task has not yet started executing, the worker should steal it and execute it itself. If it has started executing, the worker has two choices: it could wait for it to finish, or it could help by executing tasks spawned by the task being joined, hoping to speed up its completion.

Note that we have used the word “help” in two contexts now: first, helping when a not-yet-started task is being joined. This mode of helping, also referred to as inline execution, is mandatory in order to avoid deadlock. Here, a thread “helps” itself because no one else is around to take on the task it has spawned.

Second, we have said that a worker that is trying to join a task that’s already in progress  because it was stolen by another thread may help by taking on tasks that the “thief” needs to complete in order to complete the stolen task. This mode of “helping” the thief  represents a policy decision that can improve performance, but it is not mandatory for a  functioning thread pool.

For the purposes of this assignment, we assume a fully-strict model as defined in [2]. A fully-strict model requires that tasks join all tasks they spawn — in other words, every call to submit a task has a matching call to future_get() within the same function invocation. In this sense, all tasks spawned by a task can be considered subtasks that need to complete before the task completes (even though your thread pool implementation will not need to keep track of which task is a subtask of which other task). All our tests will be fully strict computations, which encompass a wide range of parallel computations.

Restricting ourselves to fully-strict computations for this project simplifies helping thieves because it is always safe for workers intending to help to steal any task, as long as they steal from the top of any other worker’s queue. Safety here refers to the absence of execution deadlock.

Note that in a fully-strict model, in combination with helping, worker threads that have just finished a task will never be in a situation where they are looking for tasks on their own queue: this is because any subtask spawned from a task they were working on will be joined before that task returns. In this situation, the worker will either directly execute that task via helping, or the task will have been stolen by some other worker. In no case will a worker return from executing a task and find other, unfinished tasks on its queue. (Recall that all tasks added to a worker thread’s queue are subtasks of a task the worker was executing at the time.)

External threads must not help. Only worker threads should help when they encounter not-yet-finished tasks. Do not let external threads help with tasks - these threads should wait for a worker to complete a submitted task. This way, it becomes easier to tune the pool’s concurrency level to match the number of physical processors or cores dedicated to the thread pool.

Thread pools should be good citizens. When no tasks have been submitted, your thread pool should not consume significant CPU resources — rather, all worker threads should  be in the BLOCKED state.

3 Implementation

Except for constraints imposed by the API and resource availability, you have complete freedom in how to implement your thread pool. Numerous strategies for stealing, helping, blocking, and signaling are possible, each with different trade-offs.

You will need to design a synchronization strategy to protect the data structures you use, such as flags representing the execution state of each task, the local queues, and the global submission queue, and possibly others. You will need a signaling strategy so that worker threads learn about the availability of tasks in the global queue or in other threads’ queues.

3.1 Basic Strategy

A basic strategy would be to use locks, condition variables, and the provided list implementation (known to you from prior projects), which allows constant-time insertion and removal of list elements and which can be used to implement a deque.

You will have to define private structures struct future and struct thread_pool in threadpool.c. A future should store a pointer to the function to be called, any data to be passed to that function, as well as the result (when available). You will have to define appropriate variables to record the state of a future, such as whether its execution has started, is in progress, or has completed, as well as which queue the future is in, to keep track of stealing.
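As a concrete starting point (assuming pthread.h and the provided list.h), a future might look like the sketch below; the field and state names are illustrative, not required.

enum future_state { F_NEW, F_IN_PROGRESS, F_COMPLETED };

struct future {
    fork_join_task_t task;      /* function to be called */
    void *data;                 /* argument to pass to that function */
    void *result;               /* return value, valid once state == F_COMPLETED */
    enum future_state state;    /* not started / in progress / completed */
    struct list_elem elem;      /* links this future into a global or per-worker queue */
    struct worker *owner;       /* which worker's queue holds it, to keep track of stealing */
    struct thread_pool *pool;   /* pool to which this future was submitted */
    pthread_cond_t done;        /* signaled when the task completes (see Section 3.2) */
};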

A thread pool should keep track of a global submission queue, as well as of the worker threads it has started. In the work stealing approach, each worker thread requires its own queue. You will also need a flag to denote when the thread pool is shutting down.
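Similarly, the pool and its per-worker records might be laid out as follows (assuming stdbool.h; field names are again illustrative, and the single lock and condition variable reflect the strategy recommended in Section 3.2):

struct worker {
    pthread_t thread;            /* the worker's thread, created in thread_pool_new() */
    struct list queue;           /* this worker's deque of futures */
    struct thread_pool *pool;    /* back pointer to the pool */
};

struct thread_pool {
    pthread_mutex_t lock;            /* single lock protecting all queues (to start with) */
    pthread_cond_t work_available;   /* signaled when a task is submitted or becomes stealable */
    struct list global_queue;        /* externally submitted tasks */
    struct worker *workers;          /* array of nthreads per-worker records */
    int nthreads;
    bool shutting_down;              /* set by thread_pool_shutdown_and_destroy() */
};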

You will need to create a static function that performs the core work of each worker thread. You will pass this function to pthread_create(), along with an argument (such as a pointer to a preallocated struct) allowing the thread to identify its position in the worker pool and to obtain a reference to the pool itself.
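A sketch of this, reusing the illustrative struct worker above: each worker receives a pointer to its own preallocated record, from which it can reach both its queue and the pool.

static void *worker_main(void *arg)
{
    struct worker *self = arg;                  /* identifies this worker and its queue */
    struct thread_pool *pool = self->pool;
    /* ... loop: find a task (own deque, global queue, or steal), execute it,
     *     or block on pool->work_available until work arrives or shutdown ... */
    return NULL;
}

struct thread_pool *thread_pool_new(int nthreads)
{
    struct thread_pool *pool = calloc(1, sizeof *pool);
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->work_available, NULL);
    list_init(&pool->global_queue);
    pool->nthreads = nthreads;
    pool->workers = calloc(nthreads, sizeof *pool->workers);
    for (int i = 0; i < nthreads; i++) {
        struct worker *w = &pool->workers[i];
        w->pool = pool;
        list_init(&w->queue);
        if (pthread_create(&w->thread, NULL, worker_main, w) != 0) {
            perror("pthread_create");           /* per the spec: report and return NULL */
            return NULL;
        }
    }
    return pool;
}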

thread_pool_submit(). You should allocate a new future in this function and submit it to the pool. Since the same API is used for external submissions (from threads that are not part of the pool) and internal submissions (from threads that are part of the pool), you will need to use a thread-local variable to distinguish those cases. The thread-local variable could be used to quickly look up the information pertaining to the submitting worker thread for internal submissions.
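For example (continuing the sketch above), a thread-local pointer that each worker sets at startup and that stays NULL in external threads lets thread_pool_submit() pick the right queue:

static __thread struct worker *current_worker;   /* NULL in threads outside the pool */

struct future *thread_pool_submit(struct thread_pool *pool,
                                  fork_join_task_t task, void *data)
{
    struct future *fut = calloc(1, sizeof *fut);
    fut->task = task;
    fut->data = data;
    fut->state = F_NEW;
    fut->pool = pool;
    fut->owner = current_worker;
    pthread_cond_init(&fut->done, NULL);

    pthread_mutex_lock(&pool->lock);
    if (current_worker != NULL)                               /* internal submission */
        list_push_back(&current_worker->queue, &fut->elem);   /* bottom of own deque */
    else                                                      /* external submission */
        list_push_back(&pool->global_queue, &fut->elem);
    pthread_cond_signal(&pool->work_available);
    pthread_mutex_unlock(&pool->lock);
    return fut;
}

Worker threads would set current_worker to their own record at the top of worker_main().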

future_get(). The calling thread may have to help in completing the future being joined, as described in Section 2.2. Helping is required for both work sharing and work stealing approaches.
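Under the single-lock basic strategy, the mandatory part of this helping logic might look roughly like the sketch below (again using the illustrative fields from above). It covers only inline execution of a not-yet-started task and otherwise waits; helping the thief with its subtasks, as discussed in Section 2.2, would replace the waiting branch.

void *future_get(struct future *fut)
{
    struct thread_pool *pool = fut->pool;
    pthread_mutex_lock(&pool->lock);
    if (fut->state == F_NEW && current_worker != NULL) {
        /* Not yet started and we are a worker: take it out of its queue and run it
         * ourselves (inline execution).  Removal and state change happen atomically
         * under the queue's lock (see Section 3.2). */
        list_remove(&fut->elem);
        fut->state = F_IN_PROGRESS;
        pthread_mutex_unlock(&pool->lock);

        void *r = fut->task(pool, fut->data);

        pthread_mutex_lock(&pool->lock);
        fut->result = r;
        fut->state = F_COMPLETED;
        pthread_cond_broadcast(&fut->done);
    } else {
        /* Already started, or we are an external thread: wait for completion. */
        while (fut->state != F_COMPLETED)
            pthread_cond_wait(&fut->done, &pool->lock);
    }
    void *result = fut->result;
    pthread_mutex_unlock(&pool->lock);
    return result;
}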

thread_pool_shutdown_and_destroy(). This function will shut down the thread pool. Already executing futures must complete; queued futures may or may not complete.

The calling thread must join all worker threads before returning. Do not use pthread_cancel() because this function does not ensure that currently executing futures run to completion; instead, use a flag and an appropriate signaling strategy.
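A sketch of that sequence, assuming the shutting_down flag and the fields sketched above:

void thread_pool_shutdown_and_destroy(struct thread_pool *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->shutting_down = true;
    pthread_cond_broadcast(&pool->work_available);   /* wake every idle worker */
    pthread_mutex_unlock(&pool->lock);

    for (int i = 0; i < pool->nthreads; i++)         /* wait for all workers to exit */
        pthread_join(pool->workers[i].thread, NULL);

    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->work_available);
    free(pool->workers);
    free(pool);
}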

Upon destruction, a threadpool should deallocate all memory that was allocated on behalf of the worker threads.

future_free(). Frees the memory for a future instance allocated in thread_pool_submit(). This function is called by the client. Do not call it in your thread pool implementation.

3.2 Implementation/Debugging Guide

Debugging Strategies. There are many pitfalls in multithreaded programming that can result in bugs in your program, so it is important to know some debugging strategies.

•  Do not run the test driver fjdriver.py every time to test your program. You can build your code in the ‘tests’ directory and directly run the tests until they work.

•  Use the available data race detection tools: Helgrind and DRD. To use Helgrind, run your code with valgrind --tool=helgrind . . . followed by the name of the executable you would normally run. If you are not properly waiting for work on your condition variable, it is possible that your program does not make progress under Helgrind. You can identify this case by changing Helgrind’s internal scheduling policy: add the --fair-sched=yes switch to valgrind’s options. If your program makes progress under these settings, look for a bug that causes it to busy-wait.

•  If your program appears “stuck” (does not make progress), then this can be the case for multiple reasons. A common reason is deadlock, where all threads are in the BLOCKED state because they are waiting to either acquire a lock or waiting to be signaled on a condition variable or semaphore. You can identify this case by observing zero CPU utilization and using the ‘ps‘ command to check the (Linux) process state of your threads. If your program has deadlocked, then attach gdb (or start the program under gdb) and examine the backtraces of all threads. Linux also allows you to identify which thread holds a lock by examining its internals with gdb’s print command. If your program has not deadlocked (or only some threads have), you can still use the same strategy to obtain information about each thread’s progress.

Locking Strategy. Make sure your design includes a clear strategy for mutual exclusion: decide on which lock protects which instance of which data structure. Recall that one lock can protect more than one piece of data. We recommend starting with a single global lock (perhaps as part of the thread pool), which allows you to implement your logic; once your entire pool works, implement strategies to break that lock into multiple locks.

Signaling Strategy. You need to solve multiple signaling problems in this project: idle worker threads must be signaled when either a global task is submitted or when a task becomes available for stealing.  To that end, you should use a single condition variable. You can’t use more than one because a worker can wait on only one condition variable at a time, but an idle worker needs to be woken up in both cases.
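Inside the worker loop, the idle case then reduces to waiting on that one condition variable; has_work() below is an illustrative stand-in for “the global queue or some worker’s deque is non-empty”:

    pthread_mutex_lock(&pool->lock);
    while (!pool->shutting_down && !has_work(pool))
        pthread_cond_wait(&pool->work_available, &pool->lock);
    /* here: either a task is available somewhere, or the pool is shutting down */
    pthread_mutex_unlock(&pool->lock);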

An additional signaling problem arises when a thread cannot (or chooses not to) help the thief and instead blocks on an in-progress task. This thread will need to be signaled when said task has completed, which can be accomplished using a condition variable or semaphore associated with this task/future.

Avoiding Atomicity Violations. Once you break the lock into multiple locks, a key challenge will be to ensure that updates to the state of a future are done atomically with respect to the presence (or absence) of this future in its respective queue (global or per-worker, depending on approach). You must avoid a situation in which a worker thread scanning the global queue or a peer worker’s local queue “sees” a future in said queue and is about to steal it, while the worker executing that task’s parent task attempts to join it and execute it via the helping path. Only one thread must succeed in executing the task – if the thread stealing the task executes it, the helping thread must either wait or engage in helping the executor. If the helping thread executes it, the thread attempting to steal must act as if the task had not been in the queue. In particular, if the helping thread wins this race, the future may be completed and immediately afterwards deallocated via future_free(), so any pointers obtained by and stored in local variables of the worker thread may no longer be valid if you allowed this situation to occur.

A recommended approach is to maintain the invariant that only tasks that are available for execution are contained in any queue/list. Moving a task from the “new” to the “being executed” state should be atomic with respect to the removal of this task from the queue in which it is contained. Recall that list_remove() modifies a linked list and thus needs to be protected by the same mechanism as other operations on that list.
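Concretely, a stealing worker should perform the removal and the state change in one critical section under the queue’s lock, so that no other thread can ever observe a future that is both still in a queue and already being executed. A minimal sketch, using the illustrative fields above and a hypothetical victim worker:

    struct future *fut = NULL;
    pthread_mutex_lock(&pool->lock);                 /* lock protecting the victim's queue */
    if (!list_empty(&victim->queue)) {
        /* remove from the top and mark as executing in the same critical section */
        fut = list_entry(list_pop_front(&victim->queue), struct future, elem);
        fut->state = F_IN_PROGRESS;
    }
    pthread_mutex_unlock(&pool->lock);

    if (fut != NULL) {
        /* only this thread may now execute fut; a helper joining it must wait instead */
    }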

4 Optimization Guide

4.1 Per-future locks vs per-workqueue locks

A design that seems natural at first is to have locks for each future which protect a future’s fields such as its state, and a lock for each worker that protects its fields such as the queue of tasks this worker currently maintains. The problem with this design, however, is that it makes it more difficult to avoid the atomicity violation described above. A worker thread attempting to steal a task would lock the queue first and then lock the future it sees there. Concurrently, a thread calling future_get() would need to lock the future first, but it cannot commit to running the future’s task unless it also has the lock that protects the queue. It would need to acquire both, and moreover, after acquiring the queue lock it would need to make sure that no worker has started the task. We do not recommend this design because

•  it is complex and error prone

•  it requires dealing with the difference in locking order to avoid deadlock (for instance, by making the second lock acquisition a trylock with a suitable strategy on failure).

•  it does not yield the highest performance because it involves two lock acquisitions on the hot path that involves future_get().

Instead, we recommend a design where at least a future’s state field is protected not by a lock associated with the future, but by the lock that protects the queue into which the future is inserted upon creation. Note that the lock protecting a future must be constant and cannot change throughout the life of a future, so this lock must continue to be used even if the future is at some point stolen.

4.2 Semaphores vs Condition Variables

We do not recommend that you use semaphores to signal your worker threads regarding the availability of new tasks, because semaphores do not perform well in the presence of large numbers of signals. Recall that signaling a semaphore entails incrementing its internal count, which requires an atomic read-modify-write operation on its internal state. In the context of cache-coherent multiprocessors, this causes a transition into the “modified” state in the accessing core’s cache, which causes a frequently updated semaphore to ping-pong between caches. By contrast, condition variables are implemented in a way that handles the common case of signaling with no waiter present using the “shared” state - the condition variable records if any waiters are present and does not require updating state when a call to pthread_cond_signal() does not actually signal (unblock) any threads. In addition, we recommend that you intentionally break the rule of signaling with the lock held (which Helgrind otherwise warns you about) for this case, while still making sure that your threadpool makes progress eventually.
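Applied to the submission path sketched in Section 3.1, moving the signal after the unlock would look like the following fragment; Helgrind will warn about it, which is expected here.

    pthread_mutex_lock(&pool->lock);
    list_push_back(&pool->global_queue, &fut->elem);   /* make the task visible */
    pthread_mutex_unlock(&pool->lock);
    pthread_cond_signal(&pool->work_available);        /* intentionally signaled without the lock */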

You may still find semaphores useful, potentially, to implement waiting on individual tasks.

4.3 Avoiding False Sharing

As you tune the performance of your implementation, be on the lookout for false sharing. False sharing occurs when per-thread data structures are allocated within the same cache line, which may happen, for instance, for neighboring array elements. A potential mistake is to have a contiguous array of per-thread (per-worker) structures which are not meant to be shared but are allocated closely together. To avoid this, add padding to ensure they are allocated in different cache lines. A cache line is 64 bytes long on our machines.
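For instance, with GCC the per-worker structure (whatever its actual fields; the name below is illustrative) can be aligned to a cache-line boundary so that neighboring array elements never share a line; C11 _Alignas(64) works as well.

struct per_worker_state {
    struct list queue;               /* this worker's deque */
    /* ... other per-worker fields ... */
} __attribute__((aligned(64)));      /* sizeof becomes a multiple of 64, so array
                                        elements start on distinct cache lines */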

Additional information is posted on the class website; look for the performance optimization guide.


