讲解 Operating Systems 操作系统国外、辅导MapReduce 程序
- 首页 >> OS编程1. Project Introduction
MapReduce is a new paradigm for large-scale parallel data processing introduced by Google
engineers in 2004. It is generally used in the domain of databases. In this project, your task is
to understand Map Reduce and build a simplified version of it. Furthermore, you will have to
correctly understand and implement concurrency for this project. Construct functions that
efficiently treat data and work concurrently for the given application wordcount.c. By
accomplishing this project, you will be able to understand what MapReduce is and use threads
and the related functions to realize concurrency.
2. Implementations
2.0 Environment
The project should be performed in Ubuntu 16.04 environment.
You can install Ubuntu or use a virtual machine to carry out your project.
Evaluation will be carried out in Ubuntu environment with AMD Ryzen Threadripper 1950X
16-Core processor.
Three files and three folders will be given.
- Makefile: Use this file to compile wordcount.c with your functions.
- mapreduce.h: The header file for implementing your functions.
(It is different from the file mapreduce.h provided in the link given in 2.1 Concept.)
- wordcount.c: The C file to test your functions.
- input_simple: A simple test set.
- input_long_balanced: A complex test set that is balanced.
- input_long_unbalanced: A complex test set that is unbalanced.
2.1 Concept
The general idea of MapReduce is described at the following link.
https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/concurrencymapreduce
Use the pthread APIs for this project.
(See 2.5 Tips)
You are expected to describe the concept of MapReduce in your report.
2.2 Things to Do
Your task is to complete the following files which will be compiled with the test file
wordcount.c.
SWE3004-41: Operating Systems (Spring 2018)
OS Term Project 2
- single_mapreduce.c: Contains your functions for the single-threaded MapReduce
- multi_mapreduce.c: Contains your functions for the multi-threaded MapReduce
You need to complete the functions of MR_Emit() and MR_Run().
When implementing the multi-threaded version, you should use the pthread APIs for
concurrency.
2.3 Compilation
We provide a Makefile which you can use to compile your program.
Commands for compiling
$ make all
- To compile wordcount.c for both single- and multi-threaded versions.
$ make single_mapreduce
- To compile wordcount.c for the single-threaded version.
$ make multi_mapreduce
- To compile wordcount.c for the multi-threaded version.
The compiling outputs are as follows:
single_mapreduce, multi_mapreduce
If you run the programs, the result and execution time will be shown.
2.4 Output
The following arguments will be passed as input:
$ ./single_mapreduce {folder name} {# of partitions}
folder name: the folder we will use to test
the number of partitions: the number of partitions we will use.
The user applications, single_mapreduce and multi_mapreduce, will print the following:
Key: words separated by blank characters.
Count: the number of words corresponding to the given key.
Execution time: the time for execution.
Output Example (single-threaded version)
1) At the start of the project.
2) After you have made your own files.
3) Compile all versions.
4) Files after compilation.
SWE3004-41: Operating Systems (Spring 2018)
OS Term Project 2
5) Print (key, count) pairs and the execution time.
2.5 Tips
The difference between this project and the original project described in 2.1 Concept:
- When executing the executable files, we will pass only a folder name including all the
target files instead of passing all the file names.
- You need to give the number of partitions as shown in 2.4 Output.
$ ./single_mapreduce folder/file1 folder/file2 folder/file3 // original
$ ./single_mapreduce folder 1000 // project #2
You do NOT need to implement the MR_DefaultHashPartition() function.
- It is already included in wordcount.c.
- The test program must pass the function pointer of hash function as an argument in
MR_Run().
The number of threads must not exceed the value specified as input in the multi-threaded
version.
Do NOT modify any files that we have provided including Map() and Reduce().
In output, keys should be sorted in ascending key order (ASCII code value).
You can refer to the following chapters of your textbook.
- Chapter 26. Concurrency: An Introduction
- Chapter 27. Interlude: Thread API
- Chapter 28. Locks
- Chapter 29. Lock-based Concurrent Data Structures
- Chapter 30. Condition Variables
You can also refer to the following pthread API instructions.
- https://www.joinc.co.kr/w/Site/Thread/Beginning/PthreadApiReference
- http://www.cs.wm.edu/wmpthreads.html
SWE3004-41: Operating Systems (Spring 2018)
OS Term Project 2
Each student will be ranked according to the processing performance of the test file, and
your ranking will affect the final evaluation score.
3. Evaluation Policy
Implementation (70 points)
- Single-threaded version of MapReduce (20 points)
- Multi-threaded version MapReduce (30 points)
- Ranking based on performance (20 points)
Report (30 points)
- Investigation on MapReduce (10 points)
- Detailed explanation of your implementations (20 points)
Task Singlethreaded
Multithreaded
Performance
Report
Total
Investigation Implementation
Points 20 30 20 10 20 100
4. Submission Guidelines
Submit a compressed zip file named [your_student_ID]_proj2.zip.
The zip file should contain three files: [your_student_ID].pdf, single_mapreduce.c,
and multi_mapreduce.c.
If the format of your work is wrong, -10 points will be applied.
Please describe the key parts of your code explicitly in your report.
Do NOT attach the whole source code in your report, just the key parts.
Submit by uploading your zip file on i-Campus.
If you have trouble uploading it on i-Campus, send it to tk1star2@nate.com. (The title should
be [SWE3004-41] Term project 2, [your_student_id], [your_name])
5. Notice
Your term project must be your original work.
If you have any questions, describe your problem along with a screenshot and send it to
tk1star2@nate.com, or visit 85465 (Research & Business Center).
(Please include [SWE3004-41] in the title.)
In the case of cheating, you will FAIL this course.
(If two similar source codes or reports are found, both students will fail this course.)
There are 15 points delay penalty per day in case of late submission.
Have fun!