Machine Learning on Big Data
Big Data Analytics using ML and Streaming methods
Big Data Analytics using PySpark
Develop one multi-class classifier and one clustering model.
Explain the features and configurations you wish to apply.
For each method you apply, evaluate and visualize its accuracy/performance and show the working solution.
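As a hedged illustration of the evaluation step for the multi-class classifier, the sketch below computes accuracy and macro-averaged F1 from paired true and predicted labels. The function names and the toy label lists are hypothetical; in practice the pairs would be collected from the label and prediction columns of whatever PySpark model you train.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a three-class problem (illustration only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally, which makes weak performance on a rare class visible; plain accuracy can hide it when the classes are imbalanced.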
Data Streaming analytics using PySpark
Complete two tasks for data streaming analytics. Include a screenshot of the working solution in the report.
Documentation: Write a scientific report.
Implementation Project
Task 1
Find a data set involving an interesting sequence of symbols: perhaps text, color sequences in images, or event logs from some device. Use word2vec to construct symbol embeddings from it, and explore the embeddings through nearest-neighbor analysis.
What interesting structures do the embeddings capture?
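The nearest-neighbor analysis in Task 1 can be sketched as follows, assuming the embeddings have already been trained and exported as a plain mapping from symbol to vector. The helper names and the toy vectors are hypothetical and hand-written for illustration; they are not the output of a real word2vec run.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, symbol, k=3):
    """Return the k symbols most similar to `symbol`, most similar first."""
    query = embeddings[symbol]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != symbol
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings (illustration only)
toy = {
    "red": [0.9, 0.1, 0.0],
    "crimson": [0.8, 0.2, 0.1],
    "blue": [0.0, 0.1, 0.9],
    "navy": [0.1, 0.0, 0.8],
}

for name, score in nearest_neighbors(toy, "red", k=2):
    print(name, round(score, 3))
```

Inspecting the neighbor lists of a handful of symbols is one simple way to see which groupings the embeddings capture.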
Task 2
Experiment with different discounting methods for estimating the frequency of words in English. In particular, evaluate the degree to which frequencies in short text files (1000 words, 10,000 words, 100,000 words, and 1,000,000 words) reflect the frequencies in a large text corpus of, say, 10,000,000 words.
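One possible way to set up the comparison in Task 2 is sketched below, using two common discounting schemes (additive smoothing and absolute discounting) and total variation distance as the measure of agreement with the reference corpus. The choice of schemes, the distance measure, the parameter values `k` and `d`, and the tiny token lists are all assumptions made for illustration; the task itself does not prescribe them.

```python
from collections import Counter


def add_k_probs(counts, vocab, k=1.0):
    """Additive (Lidstone) discounting: every vocabulary word gets k pseudo-counts."""
    total = sum(counts.values()) + k * len(vocab)
    return {w: (counts.get(w, 0) + k) / total for w in vocab}


def absolute_discount_probs(counts, vocab, d=0.5):
    """Absolute discounting: subtract d from each seen count and share the
    freed probability mass equally among the unseen vocabulary words.
    Assumes 0 < d < 1 and that vocab contains every word in counts."""
    total = sum(counts.values())
    seen = [w for w in vocab if counts.get(w, 0) > 0]
    unseen = [w for w in vocab if counts.get(w, 0) == 0]
    freed = d * len(seen) / total
    probs = {w: (counts[w] - d) / total for w in seen}
    if unseen:
        for w in unseen:
            probs[w] = freed / len(unseen)
    else:
        # Nothing is unseen, so return the freed mass uniformly to seen words
        for w in seen:
            probs[w] += freed / len(seen)
    return probs


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)


# Hypothetical toy data; real runs would use the 1,000 to 10,000,000 word files
reference_tokens = "the cat sat on the mat the dog sat".split()
sample_tokens = "the cat sat".split()

vocab = sorted(set(reference_tokens) | set(sample_tokens))
ref_counts = Counter(reference_tokens)
sample_counts = Counter(sample_tokens)

# Reference distribution: plain relative frequencies in the large corpus
ref_total = sum(ref_counts.values())
reference = {w: ref_counts[w] / ref_total for w in vocab}

smoothed_add_k = add_k_probs(sample_counts, vocab, k=1.0)
smoothed_abs = absolute_discount_probs(sample_counts, vocab, d=0.5)

print("add-k distance:   ", total_variation(smoothed_add_k, reference))
print("absolute distance:", total_variation(smoothed_abs, reference))
```

Repeating the comparison for each sample size gives one distance per method and size, which can then be tabulated or plotted in the report.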
Tip: You can use the YouTube video "Mining Big Data with Apache Spark" from Week 2 as an example of how to implement these types of ML modelling.
Implementation Presentation
The presentation should be based on the report you produce. Follow the marking scheme so that you know how the presentation should be delivered.