CN7022 - Big Data Analytics - University of East London


Big Data Analytics using Hadoop and Spark

Tasks:

(1) Understanding Dataset:

The raw network packets of the UNSW-NB15 dataset were created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump tool was used to capture 100 GB of raw traffic (e.g., Pcap files). The dataset contains nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.

a) The features are described here (attached).

b) The number of records per traffic type is described here (attached).

c) In this coursework, we use a total of 2,540,044 records stored in a CSV file (download). The total size is 560 MB, which is big enough to warrant big data methodologies for analysis. As a big data specialist, first read and understand the features, then apply modelling techniques. If you want to see a few records of this dataset, you can import it into Hadoop HDFS and then run a Hive query to print the first 5-10 records.
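Before loading the file into HDFS, a quick local peek at the first few records can help you understand the columns. The sketch below uses only the Python standard library; the file name `UNSW-NB15.csv` is an assumption, so substitute the name of your downloaded copy.

```python
import csv
from itertools import islice

def peek_csv(path, n=5):
    """Print and return the first n records of a (possibly large) CSV
    file without loading the whole file into memory."""
    with open(path, newline="") as f:
        rows = list(islice(csv.reader(f), n))
    for row in rows:
        print(row)
    return rows

# Hypothetical file name -- replace with your downloaded copy:
# peek_csv("UNSW-NB15.csv", n=10)
```

Because `islice` stops after `n` rows, this works even on the full 560 MB file; it is a convenient sanity check before writing the equivalent `SELECT ... LIMIT 10` in Hive.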

(2) Big Data Query & Analysis by Apache Hive
This task uses Apache Hive to convert big raw data into useful information for end users. First, study the dataset carefully. Then write at least four Hive queries that extract information from this big dataset. Apply appropriate visualization tools to present your findings numerically and graphically, and briefly interpret them. Finally, include screenshots of your scripts/code in the report.

Tip: the mark for this section depends on the complexity of your Hive queries; for instance, a simple SELECT query will not earn full marks.
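To illustrate the kind of query complexity expected, the sketch below uses an in-memory SQLite database as a stand-in for Hive: the GROUP BY / aggregate / ORDER BY pattern translates directly to HiveQL. The column names (`attack_cat`, `dur`, `sbytes`) follow the UNSW-NB15 feature list; the sample rows are invented toy data.

```python
import sqlite3

# In-memory SQLite stands in for a Hive table here; the SQL below is
# valid HiveQL apart from table setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flows (attack_cat TEXT, dur REAL, sbytes INTEGER)")
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?)",
    [("Normal", 0.12, 500), ("Normal", 0.30, 700),
     ("DoS", 0.05, 1500), ("DoS", 0.07, 1600),
     ("Exploits", 0.90, 300)],
)

# Per-category traffic profile: record count, mean duration, total source bytes
query = """
SELECT attack_cat,
       COUNT(*)    AS n_records,
       AVG(dur)    AS avg_duration,
       SUM(sbytes) AS total_src_bytes
FROM flows
GROUP BY attack_cat
ORDER BY n_records DESC
"""
for row in conn.execute(query):
    print(row)
```

Queries of this shape (aggregations per attack category, joins, subqueries, windowed ranking) demonstrate more than a plain SELECT and produce numbers that feed directly into charts.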

(3) Advanced Analytics using PySpark
In this section, you will conduct advanced analytics using PySpark.

Analyze and Interpret Big Data
a) We need to learn about and understand the data through 3-4 descriptive analysis methods. Present your work numerically and graphically, applying tooltip text, legends, titles, X-Y labels, etc. to help end users gain insights.

b) Apply 3-4 advanced statistical analysis methods (e.g., correlation, hypothesis testing, density estimation and so on) to interpret the data precisely. Write up your methods and their configurations, and interpret your findings.
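On Spark you would typically use `df.describe()` and `df.stat.corr()` for these steps; the stdlib sketch below shows the underlying computations (a descriptive summary and the Pearson correlation coefficient) on toy stand-ins for two numeric features, so the feature names and values are assumptions for illustration only.

```python
import math
from statistics import mean, median, stdev

def describe(xs):
    """Basic descriptive summary for one numeric feature."""
    return {"n": len(xs), "mean": mean(xs), "median": median(xs),
            "stdev": stdev(xs), "min": min(xs), "max": max(xs)}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Toy values standing in for two numeric features (e.g., sbytes vs. dbytes)
sbytes = [500, 700, 1500, 1600, 300]
dbytes = [520, 760, 1400, 1700, 250]
print(describe(sbytes))
print(pearson(sbytes, dbytes))
```

A correlation close to +1 or -1 between two features suggests redundancy worth noting in the report; pairing the number with a scatter plot satisfies the "numerically and graphically" requirement.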

Design and Build a Classifier
a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration, and present your findings in both numerical and graphical form.

b) How do you evaluate the performance of the model?

c) How do you verify the accuracy and the effectiveness of your model?

d) Apply a multi-class classifier to classify the data into ten classes: one normal and nine attack categories (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms). Briefly explain your model with supporting statements on its parameters, accuracy and effectiveness.
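In PySpark these steps would use MLlib estimators (e.g., a `LogisticRegression` inside a `Pipeline` with an evaluator). As a minimal stand-alone sketch of the ideas, the code below fits a nearest-centroid classifier, which works unchanged for binary or multi-class labels, and computes the standard evaluation metrics from the confusion-matrix counts. The features and labels are toy assumptions, not real UNSW-NB15 records.

```python
from statistics import mean

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class mean vector.
    Works for binary labels and for multi-class labels alike."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], x))

def evaluate(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Toy features, e.g. scaled [dur, sbytes]; labels: 0 = normal, 1 = attack
X = [[0.1, 0.5], [0.2, 0.6], [0.9, 1.5], [0.8, 1.6]]
y = [0, 0, 1, 1]
model = fit_centroids(X, y)
preds = [predict(model, x) for x in X]
print(evaluate(y, preds, positive=1))
```

For the real coursework, evaluate on a held-out test split rather than the training data, and report per-class precision/recall for the ten-class case, since overall accuracy hides weak performance on rare categories such as Worms.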

(4) Individual Assessment
Discuss (1) what you learned from this coursework, (2) what alternative technologies are available for tasks 2 and 3 and how they differ (use academic references), and (3) what new thinking was surprisingly evoked and/or neglected on your part.
Tip: include an individual assessment for each member in the same report.

(5) Documentation
Document all your work. Your final report must follow the five sections detailed in the "format of final submission" section (see next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.

Attachment: Big Data Analytics.rar
