Big Data Analytics: Coursework
Marking Scheme for Report
1. Log Data Conversion using PySpark to DataFrame:
2. Advanced Spark SQL queries for data analysis
3. Spark RDD queries for data analysis (3 queries per student, at least two of which should be advanced): 50 marks (10 for the basic query + 20 for each of the two advanced queries)
4. Discussion on LSEP considerations:
5. Overall clarity, organization, and quality of the report
Big Data Analytics using Hadoop and Spark
(1) Big Data Analysis using Spark DF
1. A large web log file is provided, with the following data description. Each line is structured as shown below, and the table describes each field of a line.
2. Download data from here. You need your UEL ID for permission.
3. Using Spark DF, convert the unstructured web.log file into a DataFrame. The converted data should follow the format of the table above. Working out the conversion is part of the self-study you need to complete.
4. Each student should individually write two advanced SQL queries on the DataFrame to extract specific insights. The queries should showcase your understanding of Spark SQL functionalities and demonstrate your ability to handle real-world data analysis tasks. Ensure that the queries provide meaningful insights and go beyond basic operations.
5. Each student should provide the working solution for each query in the HTML report.
6. You can use Python libraries such as matplotlib or seaborn for data visualization. Doing so gives you the opportunity to achieve the maximum mark.
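The conversion in step 3 can be sketched as follows. This is only a minimal sketch under a loud assumption: the actual layout of web.log is defined by the coursework table, so the pattern below assumes a hypothetical Common Log Format-style line such as `127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326`. The names `LOG_PATTERN`, `parse_line`, `load_web_log`, and all field names are illustrative and must be adapted to the real data description.

```python
import re
from datetime import datetime

# ASSUMPTION: hypothetical Common Log Format-style layout. Adapt the pattern
# and the field names to the table given in the coursework brief.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    # A "-" size means no body was sent; treat it as zero bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


def load_web_log(spark, path):
    """Read the raw file and build a DataFrame from the lines that parse."""
    # Imported lazily so parse_line can be tried out without Spark installed.
    from pyspark.sql import Row

    lines = spark.sparkContext.textFile(path)
    parsed = lines.map(parse_line).filter(lambda f: f is not None)
    return spark.createDataFrame(parsed.map(lambda f: Row(**f)))
```

Given an existing SparkSession named `spark`, `load_web_log(spark, "web.log")` would return the DataFrame. Lines that do not match are dropped silently here; counting them separately is a sensible check on data quality before running any analysis.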
Basic Queries: Basic queries demonstrate a basic understanding of SQL syntax and perform simple operations on the data. These queries typically involve basic SELECT, WHERE, and GROUP BY clauses without complex joins or subqueries. Basic queries will be awarded the minimum mark for this section.
Advanced Queries: Advanced queries demonstrate a deeper understanding of SQL and involve more complex operations and techniques. These queries may include the use of advanced SQL features such as window functions, subqueries, joins, and aggregations. Advanced queries demonstrate creativity and the ability to extract meaningful insights from the data. These queries will be awarded higher marks based on the complexity, efficiency, and effectiveness of the analysis.
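As an illustration of what an advanced query might look like, the sketch below combines an aggregation, a subquery, and a window function to find the most requested URLs for each client. It assumes a DataFrame with hypothetical columns `ip` and `url`; the names `TOP_URLS_PER_IP_SQL`, `top_urls_per_ip`, and the view name `web_log` are illustrative and are not part of the coursework data description.

```python
# ASSUMPTION: the DataFrame has hypothetical columns `ip` and `url`.
TOP_URLS_PER_IP_SQL = """
WITH url_hits AS (
    SELECT ip, url, COUNT(*) AS hits
    FROM {view}
    GROUP BY ip, url
)
SELECT ip, url, hits
FROM (
    SELECT ip, url, hits,
           DENSE_RANK() OVER (PARTITION BY ip ORDER BY hits DESC) AS hit_rank
    FROM url_hits
) ranked
WHERE hit_rank <= 2
ORDER BY ip, hits DESC, url
"""


def top_urls_per_ip(spark, df, view_name="web_log"):
    """Register df as a temporary view and run the ranking query on it."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(TOP_URLS_PER_IP_SQL.format(view=view_name))
```

A query that only counted rows per URL with GROUP BY would fall under the basic category; it is the per-client ranking through the window function that moves this example toward the advanced one.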
(2) Big Data Analysis using Spark RDD
1. Use Spark RDD for reading the same unstructured data, web.log data.
2. Each student should write 3 RDD queries using Spark RDD transformation and action operators. At least two queries should be advanced for each student.
3. The queries should be different from the ones written in Task 1 and should showcase Spark RDD capabilities. Use RDD-specific operations rather than SQL queries.
4. Each student should provide the working solution for each RDD query in the HTML report.
5. You can use Python libraries such as matplotlib or seaborn for data visualization.
Marking Scheme for Spark RDD Queries:
Basic Spark RDD Queries: Basic queries demonstrate a basic understanding of Spark RDD transformations and actions. These queries typically involve simple operations such as filtering, mapping, and basic aggregations using RDD functions. Basic queries may not fully leverage the power and capabilities of Spark RDDs and may not showcase advanced techniques. Basic queries will be awarded the minimum mark for this section.
Advanced Spark RDD Queries: Advanced queries showcase a deeper understanding of Spark RDD transformations and actions. These queries involve more complex operations and techniques, leveraging the full capabilities of Spark RDDs. Advanced queries may include operations like joins, aggregations, sorting, and complex data manipulations using RDD functions. Advanced queries demonstrate creativity and the ability to extract meaningful insights from the data using Spark RDDs. These queries will be awarded higher marks based on the complexity, efficiency, and effectiveness of the analysis.
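One possible shape for an advanced RDD query is sketched below: it computes the average response size per URL with `aggregateByKey` and returns the largest averages with `takeOrdered`. It assumes an RDD of dicts carrying hypothetical `url` and `size` keys; the function names are illustrative only.

```python
def add_size(acc, size):
    """Fold one response size into a (total_bytes, request_count) pair."""
    return (acc[0] + size, acc[1] + 1)


def merge_partials(a, b):
    """Combine two (total_bytes, request_count) pairs from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


def top_urls_by_avg_size(records, n=5):
    """Return the n URLs with the largest average response size.

    ASSUMPTION: `records` is an RDD of dicts with hypothetical
    `url` and `size` keys.
    """
    return (
        records
        .map(lambda r: (r["url"], r["size"]))
        .aggregateByKey((0, 0), add_size, merge_partials)
        .mapValues(lambda total_count: total_count[0] / total_count[1])
        .takeOrdered(n, key=lambda kv: -kv[1])
    )
```

Carrying a (sum, count) pair through `aggregateByKey` avoids collecting every size for a key in one place, which is what `groupByKey` followed by an average would do.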
(3) LSEP considerations
For all analyses performed, critically analyze the legal, social, ethical, and professional implications associated with the data and the analysis. Consider factors such as data privacy, data protection, bias, fairness, transparency, and the potential impact of the analysis on individuals or society as a whole.
Each student should choose one of these factors and contribute a discussion of it.
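For the data privacy factor, one concrete measure that could be discussed is pseudonymizing client identifiers before analysis. The sketch below assumes the log contains IP addresses (an assumption, since the field list comes from the coursework table) and replaces each one with a keyed hash. The function name `pseudonymize_ip` and the 16-character truncation are illustrative choices.

```python
import hashlib
import hmac


def pseudonymize_ip(ip, secret_key):
    """Replace an IP address with a stable keyed-hash token.

    ASSUMPTION: `secret_key` is a bytes value kept separate from the data.
    The same IP always maps to the same token, so per-client counts still work.
    """
    digest = hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]
```

Pseudonymization of this kind reduces the risk of re-identification but does not remove it, so the tokens should still be handled as sensitive data in the discussion.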
Attachment: Big Data Analytics.rar