CST4070 Applied Data Analytics Tools, Practical Big Data Handling, Cloud Distribution - Middlesex University
Big Data
General information - You are required to submit your work via the dedicated Unihub assignment link by the specified deadline. This link will 'timeout' at the submission deadline. Your work may not be accepted as an email attachment if you miss this deadline. Therefore, you are strongly advised to allow plenty of time to upload your work prior to the deadline.
You are required to solve the Tasks illustrated below. Each Task should be accompanied by:
a. A short introduction where you describe the problem and your high-level solution.
b. Your step-by-step process supported by screenshots. Each screenshot needs to be accompanied by a short explanatory text.
c. Finally, if appropriate, conclude each task with a brief summary of what you have done.
Tasks - Follow the lab instructions to install Apache Hadoop into a virtual server running Linux Ubuntu Server. Once you have Apache Hadoop installed and running, complete the following tasks.
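For reference, once your scripts are ready, a Hadoop Streaming job is typically launched with a command of the following shape (the input and output paths are illustrative, and the exact jar name depends on your Hadoop version):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input /user/ubuntu/input \
    -output /user/ubuntu/output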
Task 1 - Implement one executable Hadoop MapReduce job that counts the total number of words having an even number of characters and the total number of words having an odd number of characters. For example, if the input text is 'Hello world', the output should be even: 0, odd: 2, because both 'Hello' and 'world' contain an odd number of characters. Whereas, if the input is 'My name is Alice', the output should be even: 3, odd: 1.
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
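As a starting point, a minimal sketch of the two scripts is shown below. The file names mapper.py and reducer.py are illustrative, and the reducer assumes a single reduce task so that both totals appear in one output file:

mapper.py

#!/usr/bin/env python3
# Emits one key-value pair per word: "even" or "odd" depending on word length.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        key = "even" if len(word) % 2 == 0 else "odd"
        print("%s\t1" % key)

reducer.py

#!/usr/bin/env python3
# Sums the counts per key. Both counters start at zero so a missing key
# (e.g. no even-length words in the input) is still reported, matching
# the required output format.
import sys

counts = {"even": 0, "odd": 0}
for line in sys.stdin:
    key, value = line.strip().split("\t")
    counts[key] += int(value)
print("even: %d, odd: %d" % (counts["even"], counts["odd"]))

Before submitting the job to Hadoop, the pipeline can be tested in the Ubuntu shell with:

cat input.txt | python3 mapper.py | sort | python3 reducer.py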
Task 2 - Implement one executable Hadoop MapReduce job that receives as input a .csv table with the structure 'StudentId, Module, Grade' and outputs, for each student, the minimum grade, the maximum grade, and the total number of modules the student has passed.
Therefore, if your input is:
StudentId | Module | Grade
S001 | Statistic | 75
S002 | Statistic | 72
S001 | Big Data | 78
S003 | Big Data | 66
S001 | Programming | 70
S002 | Programming | 55
S001 | Machine Learning | 65
S002 | Machine Learning | 61
Your output needs to be:
StudentId | MinGrade | MaxGrade | Modules
S001 | 65 | 78 | 4
S002 | 55 | 72 | 3
S003 | 66 | 66 | 1
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
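A possible sketch follows, again with illustrative file names. The pass mark of 40 is an assumption; every grade in the sample data meets it, so Modules equals the number of rows per student:

mapper.py

#!/usr/bin/env python3
# Emits StudentId -> Grade, skipping the CSV header row.
import sys

for line in sys.stdin:
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != 3 or fields[0] == "StudentId":
        continue
    print("%s\t%s" % (fields[0], fields[2]))

reducer.py

#!/usr/bin/env python3
# Input arrives grouped by StudentId, so we aggregate per student and
# emit the result whenever the key changes.
import sys

PASS_MARK = 40  # assumption: threshold for a passed module

def emit(student, min_g, max_g, passed):
    print("%s\t%d\t%d\t%d" % (student, min_g, max_g, passed))

current = None
min_g = max_g = passed = 0
for line in sys.stdin:
    student, grade = line.strip().split("\t")
    grade = int(grade)
    if student != current:
        if current is not None:
            emit(current, min_g, max_g, passed)
        current, min_g, max_g, passed = student, grade, grade, 0
    else:
        min_g = min(min_g, grade)
        max_g = max(max_g, grade)
    if grade >= PASS_MARK:
        passed += 1
if current is not None:
    emit(current, min_g, max_g, passed)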
Task 3 - Implement one executable Hadoop MapReduce job that receives as input two .csv tables with the following structures:
User: UserId, Name, DOB
Follows: UserIdFollower, UserIdFollowing
The MapReduce job needs to perform the following SQL query:
select F.UserIdFollower as UserId, U1.Name as NameFollower, U2.Name as NameFollowing
from Follows as F
join User as U1 on U1.UserId = F.UserIdFollower
join User as U2 on U2.UserId = F.UserIdFollowing
where U2.DOB <= '2002-03-01'
Therefore, if the two original tables are:
UserId | Name | DOB
U001 | Alice | 2005-01-05
U002 | Tom | 2001-02-07
U003 | John | 1998-06-02
U004 | Alex | 2006-02-01
UserIdFollower | UserIdFollowing
U001 | U002
U001 | U003
U002 | U001
U002 | U004
U003 | U001
U004 | U001
The final table needs to be:
UserId | NameFollower | NameFollowing
U001 | Alice | Tom
U001 | Alice | John
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested in Linux Ubuntu before running them on Hadoop MapReduce.
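One way to approach this in Hadoop Streaming is a reduce-side join. The sketch below distinguishes the two tables by their column count (User rows have three fields, Follows rows have two) and, for simplicity, assumes a single reduce task and that the User table fits in the reducer's memory; file names are illustrative:

mapper.py

#!/usr/bin/env python3
# Tags each record with its source table so the reducer can join them.
# User rows have 3 columns, Follows rows have 2; header rows are skipped.
import sys

for line in sys.stdin:
    fields = [f.strip() for f in line.strip().split(",")]
    if not fields or fields[0].startswith("UserId"):
        continue
    if len(fields) == 3:
        # User: UserId, Name, DOB -> keyed by UserId
        print("%s\tU\t%s\t%s" % (fields[0], fields[1], fields[2]))
    elif len(fields) == 2:
        # Follows: follower, following -> keyed by the followed user
        print("%s\tF\t%s" % (fields[1], fields[0]))

reducer.py

#!/usr/bin/env python3
# Buffers the User table, then resolves each Follows edge and applies the
# DOB filter on the followed user. ISO dates compare correctly as strings,
# so a plain string comparison implements DOB <= cutoff.
import sys

CUTOFF = "2002-03-01"
users = {}   # UserId -> (Name, DOB)
edges = []   # (follower, following)

for line in sys.stdin:
    parts = line.strip().split("\t")
    if parts[1] == "U":
        users[parts[0]] = (parts[2], parts[3])
    else:
        edges.append((parts[2], parts[0]))

for follower, following in edges:
    if follower in users and following in users:
        name_follower = users[follower][0]
        name_following, dob_following = users[following]
        if dob_following <= CUTOFF:
            print("%s\t%s\t%s" % (follower, name_follower, name_following))

Run against the sample tables, this produces the two expected rows (U001/Alice/Tom and U001/Alice/John). Buffering the User table in one reducer is only a pragmatic simplification for small inputs; a fully distributed solution would need a second MapReduce pass to join in the follower names.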