Design and implementation of a big data solution
Learning Outcomes:
1. Critically evaluate modern big data processing paradigms.
2. Develop and implement a big data solution for a provided dataset.
3. Analyse use cases, visualise and report the results of a big data solution.
4. Assess how ethics govern the design choices in devising a big-data-enabled solution.
Rationale:
Hadoop can be thought of as a set of open-source programs and procedures that anyone can use as the backbone of a big data operation. Today it is the most widely used system for providing data storage and processing across commodity hardware, that is, off-the-shelf machines linked together, as opposed to expensive, bespoke systems built for the job at hand. Indeed, it is claimed that more than half of the Fortune 500 companies make use of it.
Description:
To gain insight into the state of the art in big data management, each student (individually) is tasked to:
• Design, install and configure a 3-node Apache Hadoop cluster on top of Lubuntu OS. Each student will be provided with access to three virtual machines on which to deploy the cluster. (Sketch 1 below shows a basic health check for the finished cluster.)
• Install and configure MongoDB to work as an interface to the aforementioned cluster. (Sketch 2 below illustrates one way of loading analysis output into MongoDB.)
• Install and configure Apache Spark to work as an analytics engine on top of the aforementioned cluster. (Sketch 3 below is a minimal smoke test.)
• Download the latest Wikimedia dump dataset and put it onto the cluster's HDFS. (Sketch 4 below illustrates the download-and-put step.)
• Identify a unique and suitably challenging data analysis problem that can yield a factual insight from the Wikimedia dataset. The student will choose an insight to look for in the dataset and identify an appropriate method for the analysis. (Sketch 5 below shows the shape of one possible Spark job.)
• Utilise MongoDB for better-performing data operations, for example through indexing and server-side aggregation. (See Sketch 6 below.)
• Visualise and explain the resulting insights. (Sketch 7 below gives a minimal example.)
• Write a detailed 3,000-word report covering all of the previous steps, showing evidence (such as screenshots or lab work) for each one. The report must be written in an excellent academic style.
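Illustrative sketches:
The Python sketches below are illustrative only, not prescriptive implementations. Hostnames, HDFS paths, database and collection names are assumptions made for the sake of example, and students should adapt them to their own cluster.

Sketch 1: a minimal way to confirm that all three nodes have joined the cluster is to query Hadoop's own reporting tools. This sketch simply wraps two standard CLI commands and assumes the Hadoop binaries are on the PATH of the node it runs on.

    import subprocess

    # Hedged sketch: both commands are standard Hadoop CLI tools.
    # dfsadmin -report lists live/dead DataNodes; yarn node -list
    # lists the NodeManagers registered with the ResourceManager.
    for cmd in (["hdfs", "dfsadmin", "-report"],
                ["yarn", "node", "-list"]):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(result.stdout)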
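Sketch 2: one simple way to interface MongoDB with the pipeline is to write analysis results into a collection via pymongo. The connection string, the "wiki" database, the "insights" collection and the placeholder documents are all assumptions for illustration.

    from pymongo import MongoClient

    # Hedged sketch: assumed local MongoDB instance and names.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["wiki"]["insights"]

    # Dummy placeholder documents, shaped as results from a Spark job might be.
    coll.insert_many([
        {"title": "Main_Page", "views": 123456},
        {"title": "Python_(programming_language)", "views": 65432},
    ])
    print(coll.count_documents({}))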
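Sketch 3: a minimal smoke test for Spark on the cluster, assuming Spark has been configured to submit to the cluster's YARN resource manager (i.e. HADOOP_CONF_DIR points at the cluster configuration).

    from pyspark.sql import SparkSession

    # Hedged sketch: assumes HADOOP_CONF_DIR/YARN_CONF_DIR are set so
    # that Spark can submit jobs to the 3-node cluster via YARN.
    spark = (SparkSession.builder
             .appName("cluster-smoke-test")
             .master("yarn")
             .getOrCreate())

    # Trivial distributed job to confirm executors run on the cluster.
    print(spark.range(10_000_000).count())
    spark.stop()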
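Sketch 4: the download-and-put step might look like the following. The exact dump file is an assumption; check https://dumps.wikimedia.org/ for the current listings. The HDFS target path /user/student/wikimedia is likewise illustrative.

    import subprocess
    import requests

    # Hedged sketch: the "latest" directory layout is a dumps.wikimedia.org
    # convention; the specific file below is an assumed example.
    url = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")
    local = "enwiki-latest-pages-articles.xml.bz2"

    # Stream the download to disk in 1 MiB chunks.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

    # Standard HDFS shell commands; the target directory is an assumed path.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/student/wikimedia"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local, "/user/student/wikimedia/"], check=True)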
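Sketch 5: as one example of the shape such an analysis could take, this sketch computes the most-viewed pages from a pageviews-style dump. It assumes space-separated records of the form (project, title, views, bytes); the actual record layout must be verified against the chosen dump.

    from pyspark.sql import SparkSession

    # Hedged sketch: HDFS path and record layout are assumptions.
    spark = SparkSession.builder.appName("wiki-top-pages").getOrCreate()
    lines = spark.sparkContext.textFile(
        "hdfs:///user/student/wikimedia/pageviews")

    def parse(line):
        # Keep only well-formed records; yield (title, view count) pairs.
        parts = line.split(" ")
        if len(parts) == 4 and parts[2].isdigit():
            yield (parts[1], int(parts[2]))

    top20 = (lines.flatMap(parse)
                  .reduceByKey(lambda a, b: a + b)
                  .takeOrdered(20, key=lambda kv: -kv[1]))
    for title, views in top20:
        print(title, views)
    spark.stop()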
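Sketch 6: better-performing data operations in MongoDB typically come from indexing the queried fields and pushing work into the server-side aggregation pipeline, as in this sketch (same assumed database and collection as Sketch 2).

    from pymongo import MongoClient, DESCENDING

    # Hedged sketch: assumed local instance, database and collection.
    coll = MongoClient("mongodb://localhost:27017")["wiki"]["insights"]

    # An index on the queried field lets MongoDB answer sorted queries
    # without scanning the whole collection.
    coll.create_index([("views", DESCENDING)])

    # Server-side pipeline: sort, take the top 10, project only the
    # fields needed for reporting.
    pipeline = [
        {"$sort": {"views": -1}},
        {"$limit": 10},
        {"$project": {"_id": 0, "title": 1, "views": 1}},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc)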
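Sketch 7: a minimal visualisation, reading the assumed wiki.insights collection and saving a bar chart with matplotlib. The chart is one example; the brief leaves the choice of visualisation to the student.

    import matplotlib.pyplot as plt
    from pymongo import MongoClient

    # Hedged sketch: same assumed collection as Sketches 2 and 6.
    coll = MongoClient("mongodb://localhost:27017")["wiki"]["insights"]
    docs = list(coll.find({}, {"_id": 0}).sort("views", -1).limit(10))

    titles = [d["title"] for d in docs]
    views = [d["views"] for d in docs]

    # Reverse so the largest bar appears at the top of the chart.
    plt.barh(titles[::-1], views[::-1])
    plt.xlabel("Total page views")
    plt.title("Top 10 most viewed pages (illustrative)")
    plt.tight_layout()
    plt.savefig("top_pages.png")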
Attachment:- Assignment Brief.rar