CST4070 Applied Data Analytics - Tools, Practical Big Data Handling, Cloud Distribution - Middlesex University
Assignment - Big Data
You are required to submit your work via the dedicated Unihub assignment link by the specified deadline. This link will 'time out' at the submission deadline. Your work may not be accepted as an email attachment if you miss this deadline. You are therefore strongly advised to allow plenty of time to upload your work before the deadline.
You are required to solve the tasks illustrated below. Each task should be accompanied by:
A short introduction in which you describe the problem and your high-level solution.
Your step-by-step process, supported by screenshots. Each screenshot needs to be accompanied by a short explanatory text.
Finally, if necessary, conclude each task with a brief summary of what you have done.
Your submission needs to be unique
When solving your tasks, you are required to name your files using your first name (e.g., if your name is Alice, the name of your Task 1 file should include Alice) so as to make your submission unique. Your explanatory text also needs to be unique.
Tasks
Follow the lab instructions to install Apache Hadoop on a virtual server running Ubuntu Server Linux. Once you have Apache Hadoop installed and running, complete the following tasks.
Task 1
Implement one executable Hadoop MapReduce job that counts the total number of words having an even number of characters and the total number of words having an odd number of characters. As an example, if the input text is "Hello world", the output should report 0 even-length words and 2 odd-length words, because both Hello and world contain an odd number of characters. If instead the input is "My name is Alice", the output should report 3 even-length words and 1 odd-length word.
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested on Ubuntu Linux before being run on Hadoop MapReduce.
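The following is a minimal sketch of how a mapper and a reducer for this task could look when run through Hadoop Streaming. It is not the expected solution. It assumes that a word is any whitespace-separated token (punctuation is not stripped), that a single reducer is used, and that the output is written as tab-separated `even` and `odd` lines; the output layout and the function names `mapper` and `reducer` are hypothetical choices made for illustration. In practice the two functions would normally live in two separate files.

```python
#!/usr/bin/env python3
"""Sketch: count words with an even or odd number of characters."""
import sys


def mapper(lines):
    """Emit 'even<TAB>1' or 'odd<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            key = "even" if len(word) % 2 == 0 else "odd"
            yield f"{key}\t1"


def reducer(lines):
    """Sum the counts emitted by the mapper for each of the two keys."""
    # ASSUMPTION: a single reducer sees all records, so both totals are complete.
    totals = {"even": 0, "odd": 0}
    for line in lines:
        key, _, count = line.rstrip("\n").partition("\t")
        if key in totals:
            totals[key] += int(count)
    for key in ("even", "odd"):
        yield f"{key}\t{totals[key]}"


STAGES = {"map": mapper, "reduce": reducer}

# Run as "script map" or "script reduce"; records are read from standard input.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STAGES:
    for record in STAGES[sys.argv[1]](sys.stdin):
        print(record)
```

To test the sketch on Ubuntu before submitting it to Hadoop, the shuffle phase can be imitated with `sort`. Assuming the block is saved under the hypothetical name `wordparity.py`, a local run could look like `cat input.txt | python3 wordparity.py map | sort | python3 wordparity.py reduce`.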
Task 2
Implement one executable Hadoop MapReduce job that receives as input a .csv table having the structure 'StudentId, Module, Grade' and returns as output the minimum and maximum grade of each student, along with the total number of modules that student has passed.
Therefore, if your input is:
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested on Ubuntu Linux before being run on Hadoop MapReduce.
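One possible shape for such a job is sketched below, again for Hadoop Streaming and again only as an illustration. Several things are assumptions, because the brief does not state them: the pass mark (`PASS_MARK = 40`), the convention that a higher grade is better, the tab-separated output layout, and the rule that any row whose third field is not numeric (for example a header row) is skipped. The reducer relies on its input being sorted by StudentId, which is what the Hadoop shuffle provides.

```python
#!/usr/bin/env python3
"""Sketch: per-student minimum grade, maximum grade and number of passed modules."""
import sys
from itertools import groupby

PASS_MARK = 40.0  # ASSUMPTION: the brief does not state the pass mark


def mapper(lines):
    """Emit 'StudentId<TAB>Grade' for every well-formed CSV row."""
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # blank or malformed row
        student_id, _module, grade = fields
        try:
            float(grade)
        except ValueError:
            continue  # header row or non-numeric grade
        yield f"{student_id}\t{grade}"


def reducer(lines):
    """Emit 'StudentId<TAB>min<TAB>max<TAB>passed' for each student."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges adjacent records, so the input must be sorted by key.
    for student_id, group in groupby(pairs, key=lambda pair: pair[0]):
        grades = [float(grade) for _, grade in group]
        passed = sum(1 for grade in grades if grade >= PASS_MARK)
        yield f"{student_id}\t{min(grades):g}\t{max(grades):g}\t{passed}"


STAGES = {"map": mapper, "reduce": reducer}

# Run as "script map" or "script reduce"; records are read from standard input.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STAGES:
    for record in STAGES[sys.argv[1]](sys.stdin):
        print(record)
```

Keeping the pass mark in a single named constant makes it easy to change once the actual threshold and grading direction are confirmed.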
Task 3
Implement one executable Hadoop MapReduce job that receives as input two .csv tables having the structure:
User: UserId, Name, DOB
Follows: UserIdFollower, UserIdFollowing
The MapReduce job needs to perform the following SQL query:
Therefore, if the two original tables are:
The final table needs to be
The job needs to be executed by a mapper and a reducer. Both the mapper and the reducer need to be written in Python and tested on Ubuntu Linux before being run on Hadoop MapReduce.
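A query over two tables is commonly implemented in MapReduce as a reduce-side join, and the sketch below illustrates only that general pattern. It does not implement the specific query of this task. The join condition used here (Follows.UserIdFollower = User.UserId) is a hypothetical stand-in, as are the `U` and `F` tags, the output layout, and the rule that a row with three fields is a User row and a row with two fields is a Follows row. Header rows are not treated specially; they simply fail to join because their keys differ.

```python
#!/usr/bin/env python3
"""Sketch: generic reduce-side join of the User and Follows tables."""
import sys
from itertools import groupby


def mapper(lines):
    """Tag each row with its source table and key it by the join column."""
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 3:  # ASSUMPTION: User row (UserId, Name, DOB)
            user_id, name, dob = fields
            yield f"{user_id}\tU\t{name}\t{dob}"
        elif len(fields) == 2:  # ASSUMPTION: Follows row (follower, following)
            follower, following = fields
            yield f"{follower}\tF\t{following}"


def reducer(lines):
    """Pair every User row with every Follows row that shares the same key."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges adjacent records, so the input must be sorted by key.
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        users, follows = [], []
        for row in group:
            (users if row[1] == "U" else follows).append(row[2:])
        for name, dob in users:
            for (following,) in follows:
                yield f"{user_id}\t{name}\t{dob}\t{following}"


STAGES = {"map": mapper, "reduce": reducer}

# Run as "script map" or "script reduce"; records are read from standard input.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STAGES:
    for record in STAGES[sys.argv[1]](sys.stdin):
        print(record)
```

The join column, the selected output columns and any filtering would need to be adapted to the actual query. The reducer buffers all rows for one key in memory, which is acceptable for small tables but would need rethinking for heavily skewed keys.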
Attachment: Applied Data Analytics.rar