Module 3 – Apache Hadoop MapReduce
Assignment – Working with AWS S3 and Hadoop (32 total points)
1 Purpose
This assignment is designed to provide you with experience using three related big data technologies. We will first explore the AWS Simple Storage Service, a popular implementation of cloud object storage capability. Then we will configure and start a Hadoop (AWS EMR) cluster to gain some familiarity with two of the core services of this environment. The first Hadoop service we will investigate is the Hadoop Distributed File System (HDFS), a file-based alternative to cloud object storage. Then we will learn to apply the Hadoop MapReduce parallel execution engine and write some programs to process data we will copy to and from our big data stores.
2 Cautions
This assignment assumes you are capable of coding in the Python programming language. There are some links to Python tutorials in the reading for Module 1 – Big Data Concepts, and numerous additional tutorials can be found on YouTube and other sites on the web.
Further, this assignment assumes you have successfully completed all the steps described in Module 1 – Assignment for setting up an AWS account for use during this course.
3 Assignment Submission
This assignment will be graded and is worth a maximum of 32 points in total. Your solutions to the following exercises must be contained in a single MS Word, PDF, or Google Docs format file and will include screenshots, code, or other results as described below. Upload your file to our Coursera site.
Your solutions must be readable (using a reasonably sized font, even for screenshots), explained as needed, and clearly indicate the exercise with which they are associated. Also, please include in your submission our course name and number, your name, and other information, such as a student identifier, to ensure you receive proper credit for your work.
4 Exercise #1 (2 points)
4.1 Background
Amazon Simple Storage Service (Amazon S3) is storage for the Internet. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web. Amazon S3 stores data as objects within buckets. An object is data (a sequence of bytes) and optional metadata that describes the data. To store data in Amazon S3, you could upload a Linux, MacOS, or Windows file to a bucket where it will be saved as an object.
Buckets are containers for objects. You can have one or more buckets in existence in your AWS account. You can control access for each bucket, deciding who can create, delete, and list objects in it. You can also choose the geographical Region where Amazon S3 will store the bucket and its contents and view access logs for the bucket and its objects.
This exercise will step you through the process of creating and then working with some S3 buckets. One bucket will be used only during this exercise, and then you will empty and delete it. A second bucket will remain in existence throughout the rest of this course and be used to hold files related to this and upcoming assignments.
You pay for storing objects in S3 buckets, not for the mere existence of the buckets themselves. And standard object storage costs only about $0.023 per GB per month.
4.2 Bucket Naming Conventions
Since each bucket name must be unique across all names assigned to buckets in the AWS cloud, our suggestion is that you apply a unique prefix to each bucket name. For example, a prefix could be the first three letters of your first name, followed by the first three letters of your last or family name followed by your four-digit birth year or something like this. Of course, some people might have very short names, or only a single name, so choose a prefix that makes sense to you.
We will refer to buckets by some generic name such as “userprefixwork.” But when you create the bucket, we expect that you will substitute a unique prefix for the string “userprefix”. For example, if you are asked to create a bucket named “userprefixwork” you would instead create a bucket named something like “josros1954work” or similar.
Some of the AWS services we will use impose some further constraints on bucket names, as listed below, so take this into account when you design your unique prefix:
· Names can consist of lowercase letters, numbers, periods (.), and hyphens (-).
· Names cannot end in numbers.
4.3 Creating a Temporary S3 Bucket
Here you will create a bucket which will be used only for the duration of this exercise. The generic name of this bucket will be “userprefixtemp”.
1. Sign in to the AWS Management Console (if you have previously signed out).
2. Enter “S3” into the search box at the top of the page and then select S3 from the Services list
3. At this point you might get a “marketing page” about S3 with S3 page menu panel minimized and accessible by clicking on three horizontal bars towards the upper left of the screen. If so, click on those bars to expose the S3 menu panel.
4. With the S3 menu panel displayed, select “Buckets”
5. At this point the right-hand Buckets panel should appear. This panel contains a section, General purpose buckets, which lists all the buckets in your account created directly by you or by AWS services like EMR on your behalf.
6. Going forward, to return to the Buckets panel, get to an S3 service page (for example, by following steps 2 and 3 above), or if the Amazon S3 menu panel is visible, just select “Buckets”
7. Choose the “Create bucket” button towards the top right of the General purpose buckets section of this panel to open the Create bucket panel.
8. In the General configuration section of this panel, under Bucket type accept the default General purpose type.
9. Now under Bucket name enter “userprefixtemp” as the name for your bucket. Recall that when you actually create the bucket you are going to replace “userprefix” with your own prefix to the name “temp” to ensure the bucket name is unique. So, you will enter something like “josros1954temp”
10. Scroll down to the next section, Object Ownership and accept the default ACL Disabled (recommended)
11. In the next section, Block Public Access settings for this bucket accept the default Block all public access.
12. Keep scrolling down until you see the Create bucket button. Choose Create bucket. You've created a bucket in Amazon S3
13. Upon bucket creation, you will be taken to the Buckets panel, and see a list including the name of the bucket you created.
4.4 Working with a Temporary S3 Bucket
4.4.1 Creating an Object
Now that you've created a bucket, you're ready to upload a file from your PC or Mac (or any other computer) and create an object. An object can hold any kind of file: a text file, a photo, a video, and so on.
1. In the General purpose buckets section of the Buckets panel, click on the name of the bucket to which you want to upload your file.
2. The right-hand panel should now display information about the selected bucket. You can always reach this information panel by clicking on the name of a bucket on the Buckets panel
3. On the right-hand side of the Objects section of this panel, click on the Upload button.
4. On the following Upload panel, in the Files and folders section click on the Add files button. A file dialog box should appear.
5. Choose a file to upload from your PC or Mac, and then select Open. Just use any file you have handy (even this one).
6. Scroll down to the bottom of the Upload panel and click on the Upload button.
7. You should now see the Upload: status panel.
8. Click on the Close button on the right side of the panel.
The file you uploaded should now exist as an object in the bucket you created.
4.4.2 Verifying an Object was Created (for Assignment Credit)
To receive credit for this question, include a screenshot in your submission document, labeling it “Exercise #1,” showing the bucket information panel listing some named object in the bucket you created. Note, this is the panel that appears after you choose Close from the Upload: status panel. Your results should appear similar to this:
You can also return to this panel whenever you like by doing the following:
1. Enter “S3” into the search box at the top of the page and then select S3 from the Services list
2. From the menu panel on the left side of the page select the entry Buckets
3. This results in the display of a panel on the right side of the screen, which lists the buckets in your account. Click on the name of the bucket you just created, and the panel for which you need to take a screenshot should appear.
Not now, but for future reference, to delete any individual object from a bucket:
1. In the General purpose buckets section of the Buckets panel, click on the name of the bucket from which you want to delete a file.
2. The right-hand panel should now display information about the selected bucket. You can always reach this information panel by clicking on the name of a bucket on the Buckets panel
3. In the Objects panel, choose the object that you want to delete (by clicking on the little box to the left of the object name), and then choose Delete.
4. To confirm that you want to delete the object, in the Delete objects panel, enter the words “permanently delete.”
5. Choose the Delete objects button
4.4.3 Emptying and Deleting the Temporary Bucket
We strongly recommend that you delete your “userprefixtemp” bucket so that charges do not accrue. Before you delete your bucket, you must empty the bucket (delete the objects in the bucket). After you delete your objects and bucket, they are no longer available.
1. In the Buckets panel, choose the bucket that you want to empty (by clicking on the little circle to the left of the bucket name), and then choose Empty.
2. To confirm that you want to empty the bucket and delete all the objects in it, in Empty bucket, enter the words “permanently delete.”
3. Now click on the button “Empty”
4. Important: emptying the bucket cannot be undone, and objects added to the bucket while the empty bucket action is in progress will be deleted.
5. From the left-hand panel menu select “Buckets” and the list of buckets should appear
6. To delete a bucket, in the Buckets list, select the bucket (by clicking on the little circle to the left of the bucket name).
7. Choose Delete.
8. To confirm deletion, in Delete bucket, enter the name of the bucket.
9. Now select the “Delete bucket” button
4.5 Creating a Permanent Bucket
Even though we will create clusters that support local Linux storage as well as the Hadoop Distributed File System, you will keep assignment files in a second bucket. The reason is that when a cluster terminates, all its resources are decommissioned and anything stored within the cluster is lost. And you will be terminating your clusters between assignments or when you pause working on a specific assignment.
The generic name of the bucket we will use going forward is “userprefixwork”. Recall, when you create the bucket, add your own prefix to the name “work”. To create this bucket, just follow the steps outlined in section 4.3 of this document. The existence of an empty bucket costs nothing.
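If you prefer, you can also create this bucket from a terminal using the AWS CLI rather than the console. A minimal sketch, assuming the AWS CLI is installed and configured with your credentials (substitute your own prefixed bucket name, exactly as you would in the console):
aws s3 mb s3://userprefixwork
Either way, the result is the same empty “userprefixwork” bucket.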
5 Exercise #2 (10 points)
5.1 Background
In this exercise we will create a Hadoop cluster in the Amazon cloud (AWS) and explore the use of the Hadoop Distributed file system that it provides.
5.2 Technical References
The operating system supported on the EMR cluster primary node, which you will connect to using a terminal via SSH, and with which you will interact, is Linux. If you are unfamiliar with how to work with Linux, do not worry, you can get by with knowing just a few general details and a handful of console commands. And you will never be examined about any aspect of the Linux environment. If you need some background, here are a few references, and more can be found on the web:
Linux Fundamentals: A Training Manual
https://ww3.ticaret.edu.tr/aboyaci/files/2016/09/a_unix_primer.pdf
Linux Command Line Cheat Sheets
https://www.stationx.net/linux-command-line-cheat-sheet/
https://www.guru99.com/linux-commands-cheat-sheet.html
5.3 Creating an EMR (Hadoop) Cluster
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
To create an EMR cluster, follow the companion document to this one on our Coursera site, as listed below. As you follow the instructions in that document to set up a Hadoop (EMR) cluster, make sure to choose the application bundle “Core Hadoop”. This request about choosing an application bundle will become clear as you follow the setup instructions.
“Module 3 - Getting Started with Amazon EMR”
Once you have launched a cluster, and connected to the EMR primary node via your terminal software, you can proceed to the next part of this exercise.
5.4 Additional Setup
5.4.1 Copy Files from Coursera
Download the following files from our Coursera site to your personal computer:
· The “Data” zip file
· The “Programs” zip file
Now “unzip” (extract) these two files on your personal computer. Together they should expand to the following files:
· From the “Data” zip file
a. w.data
b. x.data
c. z.data
d. Salaries.tsv
· From the “Programs” zip file
e. WordCount.py
f. Salaries.py
5.4.2 Move the Downloaded Files to an AWS S3 Bucket
Upload the unzipped (extracted) files you copied to your PC or Mac to the “userprefixwork” bucket you created previously.
5.4.3 Move Objects (Files) Between an AWS Bucket and Your EMR Primary Node
Moving files between S3 buckets and the Linux (local) file system on your EMR primary node requires that you use certain AWS-specific commands. These commands are simple and work as follows:
Assume you have an S3 bucket named “userprefixwork” holding an object named “myid.txt”:
To copy the object to the Linux directory “/home/hadoop” on the EMR primary node just enter the following on the terminal you have connected (via SSH) to that primary node:
aws s3 cp s3://userprefixwork/myid.txt /home/hadoop/myid.txt
Now assume you again have an S3 bucket named “userprefixwork”, but this time you have a file called “myname.txt” on the EMR primary node in the Linux directory “/home/hadoop”:
To copy the file from the Linux directory “/home/hadoop” on the EMR primary node to the bucket “userprefixwork” just enter the following on the terminal you have connected (via SSH) to that primary node:
aws s3 cp /home/hadoop/myname.txt s3://userprefixwork/myname.txt
Note, this is the way you can copy any files between your “userprefixwork” bucket and the Linux (local) file system of the primary node.
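For example, a sketch of copying one of the assignment data files, assuming you have already uploaded w.data to your bucket as described above:
aws s3 cp s3://userprefixwork/w.data /home/hadoop/w.data
If you would rather copy everything in the bucket in one step, the AWS CLI also supports a recursive copy (shown here only as a convenience):
aws s3 cp s3://userprefixwork/ /home/hadoop/ --recursive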
Now, using a terminal connected via SSH to the EMR primary node, copy the following files from your “userprefixwork” bucket to the Linux (local) file system of the EMR primary node, into the directory “/home/hadoop”:
· w.data
· x.data
· z.data
· Salaries.tsv
· WordCount.py
· Salaries.py
You might wonder how to exchange files or objects between S3 buckets and HDFS. Recall that you can do so directly using arguments to the “hadoop fs” command from the terminal connected via SSH to the EMR primary node.
5.5 Working with HDFS
All the following interactions with HDFS should occur using a terminal connected via SSH to the EMR primary node.
For convenient reference, here is the HDFS command reference:
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/FileSystemShell.html
To prevent confusion: the default directory of your Linux account on the Hadoop EMR primary node is “/home/hadoop”. But when we want to copy something to HDFS, we will sometimes copy it to an HDFS directory beginning with “/user/hadoop”. Be aware that the Linux and HDFS file system path names have nothing to do with one another; any similarity in naming (such as the use of the directory name “hadoop”) is just coincidental.
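To see the distinction in practice, you can compare the two file systems side by side from the same SSH terminal; a quick sketch using the paths from this assignment:
ls /home/hadoop                # the Linux (local) home directory of the hadoop user
hadoop fs -ls /user/hadoop     # the HDFS home directory of the hadoop user
The two listings come from entirely separate file systems and will generally show different contents.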
5.5.1 Exercise 2a (1 point)
Execute some HDFS command (you need to figure out which one) to list the files and directories under the HDFS directory listed below:
/user
Write down the command you executed and also take a screen snapshot of the names of the files or directories that are listed and include it in your assignment submission. Label this “Exercise 2a”.
5.5.2 Exercise 2b (1 point)
Execute a command to create the following HDFS directory:
/user/hadoop/<userprefixtemp>
where <userprefixtemp> is the name you previously assigned to your “userprefixtemp” bucket, which should be something like josros1954temp. So, in this case you would create an HDFS directory called /user/hadoop/josros1954temp
Write down the command you executed and include it in your assignment submission. Label this “Exercise 2b.”
5.5.3 Exercise 2c (1 point)
Execute a command to create the following HDFS directory:
/user/hadoop/<userprefixtemp>V2
where <userprefixtemp> is again the name you assigned to your “userprefixtemp” bucket, which should be something like josros1954temp. So, in this case you would create an HDFS directory called /user/hadoop/josros1954tempV2
Record the command you executed and include it in your assignment submission. Label this “Exercise 2c.”
5.5.4 Exercise 2d (1 point)
Execute a command that copies a given local file (that is a file in the primary node’s Linux file system) to the given HDFS directory:
· Source local file: /home/hadoop/x.data
· Destination HDFS directory: /user/hadoop/<userprefixtemp>
where <userprefixtemp> is the name you assigned to your “userprefixtemp” bucket, which should be something like josros1954temp.
Now execute the following command:
hadoop fs -ls /user/hadoop/<userprefixtemp>
Record the command you executed to copy the local file into HDFS, and also take a screen snapshot of the files or directories listed when you executed the above “hadoop fs -ls …” command and include these in your assignment submission. Label this “Exercise 2d.”
5.5.5 Exercise 2e (3 points)
Note, Amazon EMR and Hadoop provide a variety of file systems that you can use with EMR. You specify which file system to use with a file system prefix. For example, s3://myawsbucket references an Amazon S3 bucket using EMRFS (EMR file system). See:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
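As a quick illustration of these prefixes, the same “hadoop fs” command on the EMR primary node can address either store depending on the prefix you supply; a sketch, assuming your “userprefixwork” bucket already exists:
hadoop fs -ls hdfs:///user/hadoop
hadoop fs -ls s3://userprefixwork/
The first listing is served by HDFS, the second by EMRFS reading directly from your S3 bucket.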
Execute a command that copies a given S3 object to the given HDFS directory. Hint: neither the source nor the destination of these files/objects is in the local file system, so the “-get” and “-put” commands will not work.
· Source S3 object: s3://<userprefixwork>/z.data
· Destination HDFS directory: /user/hadoop/<userprefixtemp>
where <userprefixtemp> is the name you assigned to your “userprefixtemp” bucket, which should be something like josros1954temp.
Now execute the following command:
hadoop fs -ls /user/hadoop/<userprefixtemp>
Record the command you executed to copy the S3 object into HDFS, and also take a screen snapshot of the files or directories listed when you executed the above “hadoop fs -ls …” command and include these in your assignment submission. Label this “Exercise 2e.”
5.5.6 Exercise 2f (2 points)
Execute a command that copies a file from one HDFS directory to another HDFS directory:
· Source HDFS file: /user/hadoop/<userprefixtemp>/x.data
· Destination HDFS directory: /user/hadoop/<userprefixtemp>V2
where <userprefixtemp> is the name you assigned to your “userprefixtemp” bucket, which should be something like josros1954temp.
Now execute the following command:
hadoop fs -ls /user/hadoop/<userprefixtemp>V2
Record the command you executed to copy the file from one HDFS directory to another, and also take a screen snapshot of the files or directories listed when you executed the above “hadoop fs -ls …” command and include these in your assignment submission. Label this “Exercise 2f.”
5.5.7 Exercise 2g (1 point)
Execute a command that removes a file from an HDFS directory:
· HDFS file to remove: /user/hadoop/<userprefixtemp>/x.data
Now execute the following command:
hadoop fs -ls /user/hadoop/<userprefixtemp>
Record the command you executed to remove the file, and also take a screen snapshot of the files or directories listed when you executed the above “hadoop fs -ls …” command and include these in your assignment submission. Label this “Exercise 2g.”
5.6 Completing the Exercise
At this point you have two choices.
· Terminate your EMR cluster, if you do not plan on immediately working on the next exercise. This will ensure you are not charged unnecessarily for use of your inactive cluster.
· Leave your EMR cluster active and continue on to the next exercise, which also requires an active EMR cluster.
6 Exercise #3 (20 points)
6.1 Background
In this exercise we will create a Hadoop cluster in the Amazon cloud (AWS) and explore the use of the MapReduce execution engine it provides.
Note this exercise assumes you have completed the previous exercise.
6.2 Creating an EMR (Hadoop) Cluster
If you have an Amazon EMR cluster active from the previous exercise, use that. Otherwise, continue as described below.
To create an EMR cluster, follow the companion document to this one on our Coursera site, as listed below. As you follow the instructions in that document to set up a Hadoop (EMR) cluster, make sure to choose the application bundle “Core Hadoop”. This request about choosing an application bundle will become clear as you follow the setup instructions.
“Module 3 - Getting Started with Amazon EMR”
Once you have launched a cluster, and connected to the EMR primary node via your terminal software, you can proceed to the next part of this exercise.
6.3 Additional Setup
6.3.1 Files and Directories
If you have created a new EMR cluster, then follow the setup instructions in Section 5.4 of this document. If it does not exist, create the following HDFS directory: /user/hadoop/<userprefixtemp>
Make sure to copy the given local file (that is a file in the primary node’s Linux file system) to the given HDFS directory:
· Source local file: /home/hadoop/w.data
· Destination HDFS directory: /user/hadoop/<userprefixtemp>
If you are using the EMR cluster from exercise #2, you may have some of the HDFS directories described below already created and also some of the files listed below copied to those directories. If so, only create any additional HDFS directories and copy any additional files mentioned.
6.3.2 MrJob
Install the mrjob library on your EMR primary node:
1. If you have not done so, ssh to the EMR primary node
2. Enter the command listed below and follow any displayed instructions
sudo /usr/bin/pip3 install mrjob[aws]
Please review the information on MRJob in the Module 3 Readings, Lesson 5 – MapReduce Programming. In particular, reference the MRJob documentation site and read the sections Fundamentals and Writing jobs. But not every detail is important; I provide you with the exact commands needed to execute mrjob programs below.
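For orientation, the provided WordCount.py most likely follows the canonical mrjob word-count pattern shown in that documentation. The sketch below is an illustrative example of that pattern, not necessarily the exact file you downloaded:
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every whitespace-separated word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the per-word counts produced by the mappers
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
A job structured this way is what the execution commands shown below assume.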
6.4 Check Step
Here we ensure that your EMR cluster is configured appropriately to execute MapReduce jobs using MRJob.
Execute the following:
python WordCount.py -r hadoop hdfs:///user/hadoop/<userprefixtemp>/w.data
Note there must be three slashes in “hdfs:///”, as “hdfs://” indicates that the file you are reading from is in HDFS and “/user” is the first part of the path to that file. Also note that sometimes copying and pasting commands from the assignment document does not work, and such commands may need to be entered manually.
Upon completing execution, check that your MapReduce job produces some reasonable output. If all is well, you should see information in the output somewhat similar to (but not exactly like) this when the program finishes correctly:
"well" 1
"when" 1
"will" 1
"within" 1
"writing" 2
"your" 5
Note, the above command will not preserve its output files in HDFS once it finishes. If you want to keep the output, use the following command instead:
python WordCount.py -r hadoop hdfs:///user/hadoop/<userprefixtemp>/w.data --output-dir /user/hadoop/words
Note, there are two hyphens (dashes) preceding the “output-dir”
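If you used --output-dir, you can inspect the saved results afterwards from the same terminal; a sketch, assuming the default Hadoop part-file naming:
hadoop fs -ls /user/hadoop/words
hadoop fs -cat /user/hadoop/words/part-*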
If you do not see something like the above output, then carefully recheck your setup, possibly starting from scratch with a new EMR cluster, or if this fails to resolve your issues, reach out via Coursera for help.
6.5 Working with MapReduce
6.5.1 Editing Python Files
Our MapReduce jobs will be coded as programs written in the Python language, and there are two options for creating and editing them:
· You could create and update your files on a personal computer, using a text editor (and not a word processing program). Then you could copy these files to your “userprefixwork” bucket, and from there to your EMR primary node (/home/hadoop)
· Or, you could create or edit a Python file on the EMR primary node itself. The editor that is available by default on your primary node is called “vim.” If you are unfamiliar with its use, some tutorial material is suggested below (and more is available on the web):
Vim Beginners Guide
https://www.freecodecamp.org/news/vim-beginners-guide/
Getting Started with Vim: The Basics
https://opensource.com/article/19/3/getting-started-vim
Remember, unless you copy your files from the primary node to your userprefixwork bucket, they will be lost after you terminate your EMR cluster.
6.5.2 Exercise 3a (3 points)
Slightly modify the WordCount.py program. Call the new program WordCount2.py.
Instead of counting how many words there are in the input documents (w.data), modify the program to count how many words begin with the lower-case letters a-n (a through n, inclusive) and how many begin with anything else.
When you execute this program the output file should look similar to (but not exactly like):
a_to_n, 12
other, 21
So, your task is to write a MrJob MapReduce program which again accepts the following file as input
hdfs:///user/hadoop/<userprefixtemp>/w.data
and outputs just two key value pairs, one with key “a_to_n” and an integer value of how many words begin with these lower-case letters, and another key-value pair with key “other” and value how many words begin with some character other than lower-case a-n.
Provide a listing of the program you wrote, the command you used to execute it, and a screen snapshot of the output the program generated and include these in your assignment submission. Label this “Exercise 3a.”
6.5.3 Exercise 3b (5 points)
Modify the WordCount.py program again. Call the new program WordCount3.py.
Instead of counting words, calculate the count of words having the same number of letters. For example, if we have a file consisting of one record of the form:
hello there joe
our job should output key value pairs similar to (but not exactly like) the following:
3, 1
5, 2
Hint, the key in a key-value pair can be an integer just as well as a string.
So, your task is to write a MrJob MapReduce program which again accepts the following file as input
hdfs:///user/hadoop/<userprefixtemp>/w.data
and outputs key value pairs where each one has a key which is some number of characters, and the value is a count of words having that many characters.
Provide a listing of the program you wrote, the command you used to execute it, and a screen snapshot of the output the program generated and include these in your assignment submission. Label this “Exercise 3b.”
6.5.4 Exercise 3c (7 points)
Modify the WordCount.py program. Call the new program WordCount4.py.
Now we will write an MRJob MapReduce job to calculate the count of each unique per-record word bigram. A word bigram is a two-word sequence. For example, if we have a file consisting of records of the form:
hello there joe
hi there
there joe there
joe
Bigrams for these records are created by sliding a two-word “window” across the words of the record. For example, each record above has the following word bigrams:
hello there joe => “hello there”, “there joe”
hi there => “hi there”
there joe there => “there joe”, “joe there”
joe => Note, this record has no word bigrams
Notice, in the above example, there are 2 instances of the word bigram “there joe”.
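If the sliding-window idea is unclear, the small, plain-Python sketch below (independent of MapReduce) shows one way to produce the bigrams of a single record; how you fold this idea into your mapper is up to you:
words = "hello there joe".lower().split()
# pair each word with the word that follows it
bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
print(bigrams)   # ['hello there', 'there joe']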
So, your task is to write a MrJob MapReduce program which accepts the following file as input
hdfs:///user/hadoop/<userprefixtemp>/w.data
and outputs key value pairs where each one has a key which is some word bigram string, and the value is a count of the number of occurrences of that word bigram. Note, please convert all words to lower case on input, so Hello and hello become the same word.
Our job should output key value pairs similar to (but not exactly like) the following:
“hello there”, 1
“hi there”, 1
“joe there”, 1
“there joe”, 2
Provide a listing of the program you wrote, the command you used to execute it, and a screen snapshot of the output the program generated and include these in your assignment submission. Label this “Exercise 3c.”
6.5.5 Exercise 3d (5 points)
Now do the same as the above for the files Salaries.py and Salaries.tsv.
The “.tsv” file holds department and salary information for Baltimore municipal workers. Have a look at Salaries.py for the layout of the “.tsv” file and how to read it into our MapReduce program.
Copy the Salaries.tsv file to the HDFS directory /user/hadoop/<userprefixtemp>.
Execute the Salaries.py program to make sure it works. It should print out how many workers share each job title. To do so execute the following:
python Salaries.py -r hadoop hdfs:///user/hadoop/<userprefixtemp>/Salaries.tsv
Now modify the Salaries.py program. Call it Salaries2.py
Instead of counting the number of workers per department, change the program to provide the number of workers having High, Medium or Low annual salaries. This is defined as follows:
High: 100,000.00 and above
Medium: 50,000.00 to 99,999.99
Low: 0.00 to 49,999.99
The output of the program Salaries2.py should be something like (but not exactly like) the following (in any order):
High 20
Medium 30
Low 10
Some important hints (a small sketch of the salary categorization logic follows this list):
· The annual salary is a string that will need to be converted to a float.
· The mapper should output tuples with one of three keys depending on the annual salary: High, Medium and Low
· The value part of the tuple is not a salary. (What should it be?)
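To make the thresholds concrete, here is a small, plain-Python sketch of the categorization logic only (the surrounding mapper and reducer are still yours to write); it assumes the annual salary field parses directly as a float:
def salary_category(annual_salary_str):
    # convert the salary string to a float, then bucket it by the thresholds above
    salary = float(annual_salary_str)
    if salary >= 100000.00:
        return "High"
    elif salary >= 50000.00:
        return "Medium"
    else:
        return "Low"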
Provide a listing of the program you wrote, the command you used to execute it, and a screen snapshot of the output the program generated and include these in your assignment submission. Label this “Exercise 3d.”
7 Conclusion
Remember to…
· If you updated any of your Python files in place on your primary node and wish to save them, make sure to copy them back to your “userprefixwork” bucket.
· Terminate your EMR cluster (as described in “Getting Started with Amazon EMR”)
· Submit the assignment document with your exercise solutions.