
CSci415 Assignment 3

Due: Nov 7th, 11:59pm. You will also have to give a presentation/demo in class.

In this programming assignment, you will practice parallel programming. You are encouraged to form a group of three, although individual work is allowed. Late submissions will not be accepted.

Objectives

The goal of this programming assignment is to enable you to gain experience in:

1. Basic features of the Hadoop Distributed File System (HDFS) and MapReduce

2. Writing a program for finding common friends between all pairs of nodes in a large network.

Language:

You are required to use Java as the implementation language.

Final Submission and Demo:

3. Submit a report + code to Blackboard.

You should hand in (1) a report that presents your implementation, (2) a README file explaining how to run your program, and (3) your source code and executable files.

4. You need to demonstrate a working implementation to the class on Demo Day.

The grading will be based on all these parts.

Total Marks: 100

---------------------------------------------------------------------------

Description

In large-scale social networks, there are normally tens of millions of users. The task of this assignment is to implement a MapReduce program that identifies the common friends of all pairs of users. Let U = {U1, U2, ..., Un} be the set of all users. The goal is to find the common friends of every pair (Ui, Uj) where i ≠ j.

The input files have the following format:

A B C D E

B A C D

C A B

D A B E

E A D

where the first token in the line is the user and the remaining tokens are that user's friends. So, for line 1, A has four friends: B, C, D, and E. For example, A and B have C and D as their common friends, and A and E have only D as their common friend.

Use the social network profiles attached to this assignment as input to your program. The network has been partitioned into three files: file01, file02, file03. The output will be stored in a file in the output directory.
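Before moving to the tutorial below, here is a minimal sketch of one common reduce-side approach to this problem. It is an illustration, not the required solution; the names CommonFriends, PairMapper, and IntersectReducer are hypothetical, and the sketch assumes the friendship relation in the input is symmetric. For each input line, the mapper emits the owner's complete friend list once per friend, keyed by the canonically ordered user pair; the reducer then intersects the two lists that arrive for each pair.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommonFriends {

  // For a line "U F1 F2 ... Fk", emit ((U,Fi), "F1,...,Fk") for every friend Fi.
  public static class PairMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      if (tokens.length < 2) return;  // skip empty or friendless lines
      String user = tokens[0];
      String friends = String.join(",", Arrays.copyOfRange(tokens, 1, tokens.length));
      for (int i = 1; i < tokens.length; i++) {
        String friend = tokens[i];
        // Canonical ordering so (A,B) and (B,A) reach the same reducer.
        String pair = user.compareTo(friend) < 0 ? user + "," + friend
                                                 : friend + "," + user;
        context.write(new Text(pair), new Text(friends));
      }
    }
  }

  // Each pair key receives two friend lists (one per user, assuming a
  // symmetric network); their intersection is the set of common friends.
  public static class IntersectReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> common = null;
      for (Text v : values) {
        Set<String> list = new HashSet<>(Arrays.asList(v.toString().split(",")));
        if (common == null) {
          common = list;            // first list seen for this pair
        } else {
          common.retainAll(list);   // intersect with the other user's list
        }
      }
      if (common != null && !common.isEmpty()) {
        context.write(key, new Text(String.join(",", common)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "common friends");
    job.setJarByClass(CommonFriends.class);
    job.setMapperClass(PairMapper.class);
    job.setReducerClass(IntersectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

On the sample input above, the pair key "A,B" would receive the lists B,C,D,E and A,C,D, and the reducer would output C,D for that pair. Ordering each pair key lexicographically is what ensures the lists emitted from both users' lines meet at the same reducer.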

Below is a tutorial to help you get started with the Hadoop installation on the department cluster.

Getting Started with Hadoop

Logging In

First, make sure you can log in to the head node with SSH, currently at zoidberg.cs.ndsu.nodak.edu.

You can log in to this server with your CS Domain password or your Blackboard password.

Example: WordCount

Before we jump into the details, let's walk through an example MapReduce application to get a feel for how such programs work.

Source Code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner to pre-aggregate counts map-side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Usage

Set environment variables:

export JAVA_HOME=/path/to/your/jdk
(use the path where the Java JDK is installed, e.g., export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Compile WordCount.java and create a jar:

$ bin/hadoop com.sun.tools.javac.Main WordCount.java

(or hadoop com.sun.tools.javac.Main WordCount.java)

$ jar cf wc.jar WordCount*.class

Setting Up Input Files

This program can use the Hadoop Distributed File System (HDFS) that is set up in the CS department. This file system spans all the Linux lab machines and provides distributed storage for use specifically with Hadoop.

You can work with HDFS using UNIX-like file commands. The list of file system shell commands can be found in the Hadoop documentation.

First, make a directory to store the input for the program (use your username).

yourName@zoidberg:~$ hadoop fs -mkdir /user/yourName/wordcount

yourName@zoidberg:~$ hadoop fs -mkdir /user/yourName/wordcount/input

To set up input for the WordCount program, create two files as follows:

file01:

Hello World Bye World

file02:

Hello Hadoop Goodbye Hadoop

Save these to your home folder on the head node. To move them into HDFS, use the following commands:

yourName@zoidberg:~$ hadoop fs -copyFromLocal /home/yourName/file01 /user/yourName/wordcount/input/file01

yourName@zoidberg:~$ hadoop fs -copyFromLocal /home/yourName/file02 /user/yourName/wordcount/input/file02

Again, use your username where applicable.

The syntax here is “hadoop fs -copyFromLocal <LOCAL_FILE> <HDFS_FILE>”. In this case, we copy file01 from the local file system into the wordcount/input/ directory under our HDFS user directory.

Running the WordCount Program

You can now run the WordCount program using the following command:

hadoop jar wc.jar WordCount /user/yourName/wordcount/input /user/yourName/wordcount/output

The command syntax is: “hadoop jar <JARFILE> <CLASS> <PARAMETERS…>”

In this case, we use the wc.jar JAR file, running the class 'WordCount' with two parameters: an input directory and an output directory. The output directory must not already exist in HDFS; it will be created by the program.
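If you need to re-run the job, remove the old output directory first:

hadoop fs -rm -r /user/yourName/wordcount/output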

View Output

You can check the output directory with:

hadoop fs -ls /user/yourName/wordcount/output/

You should then see something similar to:

yourName@zoidberg:~$ hadoop fs -ls /user/yourName/wordcount/output

Found 2 items

-rw-r--r-- 3 yourName nogroup 41 2011-11-08 11:23 /user/yourName/wordcount/output/part-r-00000

The 'part-r-00000' file contains the results of the word counting. You can look at the file using the 'cat' command.

yourName@zoidberg:~$ hadoop fs -cat /user/yourName/wordcount/output/part-r-00000

Bye 1

Goodbye 1

Hadoop 2

Hello 2

World 2

Another Example: Bigram

The following example may help you write your assignment program.

Bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.

A bigram counting application can be composed as a two-stage MapReduce job:

o The first stage counts bigrams.

o The second-stage MapReduce job takes the output of the first stage (the bigram counts) as its input.

You can read and run the attached Bigram source code to learn from it. (To run the program, you need to create input file(s).)
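As a rough illustration of the first stage, here is a minimal mapper sketch, assuming whitespace-separated text. It reuses the imports and the IntSumReducer from the WordCount example above; the name BigramMapper is hypothetical and need not match the attached source code.

// Emits ("w1 w2", 1) for every pair of consecutive words within a line.
// Bigrams that span line boundaries are not captured, because the default
// TextInputFormat delivers one line per map() call.
public static class BigramMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text bigram = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().trim().split("\\s+");
    for (int i = 0; i + 1 < words.length; i++) {
      bigram.set(words[i] + " " + words[i + 1]);
      context.write(bigram, one);
    }
  }
}

The driver would be set up exactly like the WordCount driver, with BigramMapper as the mapper class and IntSumReducer as the combiner and reducer.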

