
CST4070 Applied Data Analytics Tools, Practical Big Data Handling, Cloud Distribution - Middlesex University

Big Data

General information - You are required to submit your work via the dedicated Unihub assignment link by the specified deadline. This link will time out at the submission deadline, and your work may not be accepted as an email attachment if you miss it. You are therefore strongly advised to allow plenty of time to upload your work prior to the deadline.

You are required to solve the Tasks illustrated below. Each Task should be accompanied by:

a. A short introduction where you describe the problem and your high level solution.

b. Your step-by-step process supported by screenshots. Each screenshot needs to be accompanied by a short explanatory text.

c. If necessary, conclude each task with a brief summary of what you have done.

Tasks - Follow the lab instructions to install Apache Hadoop into a virtual server running on Linux Ubuntu Server. Once you have Apache Hadoop installed and running, execute the following tasks.

Task 1 - Implement one executable Hadoop MapReduce job that counts the total number of words having an even and an odd number of characters. As an example, if the input text is "Hello world", the output should be even: 0, odd: 2, because both Hello and world contain an odd number of characters. Whereas, if the input is "My name is Alice", the output should be even: 3, odd: 1.

The job needs to be executed by a mapper and a reducer. Both mapper and reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
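As a starting point, the core logic might be sketched as below. The function names are illustrative only: for Hadoop Streaming you would split the two halves into separate mapper.py and reducer.py scripts that read stdin and write tab-separated "key\tvalue" lines, which is also how you would test them locally with a shell pipeline.

```python
# Illustrative sketch of the Task 1 logic; function names are hypothetical.
# On Hadoop Streaming, map_words would become mapper.py (stdin -> "parity\t1"
# lines) and reduce_counts would become reducer.py (summing per key).

def map_words(line):
    """Mapper logic: emit ('even', 1) or ('odd', 1) for every word."""
    return [("even" if len(word) % 2 == 0 else "odd", 1)
            for word in line.split()]

def reduce_counts(pairs):
    """Reducer logic: sum the counts per parity key."""
    totals = {"even": 0, "odd": 0}
    for key, count in pairs:
        totals[key] += count
    return totals
```

Note that punctuation handling (e.g. whether "world," counts as 6 characters) is not specified in the brief; the sketch above simply splits on whitespace.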

Task 2 - Implement one executable Hadoop MapReduce job that receives as input a .csv table with the structure 'StudentId, Module, Grade' and returns as output the minimum and maximum grade of each student, along with the total number of modules the student has passed.

Therefore, if your input is:

StudentId  Module            Grade
S001       Statistic         75
S002       Statistic         72
S001       Big Data          78
S003       Big Data          66
S001       Programming       70
S002       Programming       55
S001       Machine Learning  65
S002       Machine Learning  61

Your output needs to be:

StudentId  MinGrade  MaxGrade  Modules
S001       65        78        4
S002       55        72        3
S003       66        66        1

The job needs to be executed by a mapper and a reducer. Both mapper and reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
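One way to sketch the Task 2 logic is shown below. The names and the pass mark of 40 are assumptions, not part of the brief (in the sample data every grade is above 40, so the module count equals the row count per student). On Hadoop Streaming, the mapper would emit "StudentId\tGrade" lines and the reducer, receiving lines already grouped by key, would fold them into min/max/passed totals.

```python
# Illustrative sketch of the Task 2 logic; the pass mark is an assumption.
import csv
from itertools import groupby

PASS_MARK = 40  # assumed pass threshold; adjust to the actual pass mark

def map_rows(lines):
    """Mapper logic: parse csv rows, skip the header, emit (StudentId, Grade)."""
    pairs = []
    for row in csv.reader(lines):
        if not row or row[0] == "StudentId":
            continue  # skip blank lines and the header row
        pairs.append((row[0].strip(), int(row[2])))
    return pairs

def reduce_grades(pairs):
    """Reducer logic: per student, compute (MinGrade, MaxGrade, ModulesPassed)."""
    results = {}
    for student, group in groupby(sorted(pairs), key=lambda p: p[0]):
        grades = [g for _, g in group]
        results[student] = (min(grades), max(grades),
                            sum(1 for g in grades if g >= PASS_MARK))
    return results
```

The `sorted` + `groupby` step mimics Hadoop's shuffle-and-sort phase, which guarantees the reducer sees all records for one key together.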

Task 3 - Implement one executable Hadoop MapReduce job that receives as input two .csv tables with the structure:

User: UserId, Name, DOB

Follows: UserIdFollower, UserIdFollowing

The MapReduce job needs to perform the following SQL query:

select F.UserIdFollower as UserId, UF.Name as NameFollower, UG.Name as NameFollowing

from Follows as F

join User as UF on UF.UserId = F.UserIdFollower

join User as UG on UG.UserId = F.UserIdFollowing

where UG.DOB <= '2002-03-01'

Therefore, if the two original tables are:

UserId  Name   DOB
U001    Alice  2005-01-05
U002    Tom    2001-02-07
U003    John   1998-06-02
U004    Alex   2006-02-01

UserIdFollower  UserIdFollowing
U001            U002
U001            U003
U002            U001
U002            U004
U003            U001
U004            U001

The final table needs to be:

UserId  NameFollower  NameFollowing
U001    Alice         Tom
U001    Alice         John

The job needs to be executed by a mapper and a reducer. Both mapper and reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
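The join logic might be sketched as below. The names are hypothetical, and the sketch assumes the User table is small enough to replicate to every mapper (for example via Hadoop's distributed cache), giving a map-side join; a reduce-side repartition join keyed on the user id would equally satisfy the brief. Since the dates use ISO format (YYYY-MM-DD), they compare correctly as plain strings.

```python
# Illustrative sketch of the Task 3 join; assumes a replicated User table.
import csv

DOB_CUTOFF = "2002-03-01"  # ISO dates compare correctly as strings

def load_users(user_lines):
    """Build a UserId -> (Name, DOB) lookup from the User .csv."""
    users = {}
    for row in csv.reader(user_lines):
        if not row or row[0] == "UserId":
            continue  # skip blank lines and the header row
        users[row[0].strip()] = (row[1].strip(), row[2].strip())
    return users

def join_follows(follow_lines, users):
    """For each Follows row, look up both names and keep only rows whose
    followed user was born on or before the cutoff."""
    out = []
    for row in csv.reader(follow_lines):
        if not row or row[0] == "UserIdFollower":
            continue
        follower, following = row[0].strip(), row[1].strip()
        if follower in users and following in users:
            name_following, dob = users[following]
            if dob <= DOB_CUTOFF:
                out.append((follower, users[follower][0], name_following))
    return out
```

Applied to the sample tables, only U001's two rows survive the filter, because Tom (2001) and John (1998) were born on or before the cutoff while Alice (2005) and Alex (2006) were not.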
