Industrial Programming
Data Analytics
Data Analysis of a Document Tracker
The aim of this coursework is to develop a simple, data-intensive application in Python.
This is a pair project, and you will have to submit your own, original solution for this coursework specification, consisting of a report, the source code and an executable.
The learning objective of this coursework is for students to develop proficiency in advanced programming concepts, stemming from both object-oriented and functional programming paradigms, and to apply these programming skills to a concrete application of moderate size. Design choices regarding languages, tools, and libraries chosen for the implementation need to be justified in the accompanying report.
This coursework will develop personal abilities in using modern scripting languages as "glueware" to build, configure and maintain a moderately complex application, and will deepen the understanding of integrating components on a Linux system.
In a dedicated section, the report needs to critically reflect on the software used for implementing this application, and discuss the advantages and disadvantages of this choice. The report should also contain a discussion contrasting software development on Windows and Linux systems, and comparing software development in scripting vs. systems languages (based on the experience from the two pieces of coursework).
Lab Environment
Software environment: You should use Python 3 as installed on the Linux lab machines or on the Linux MACS VM for the implementation. This installation also provides the pandas, tkinter, and matplotlib libraries. The Linux lab machines are available remotely using the x2go client for remote desktops, running on jove; from there, use ssh to log into the lab machines. For technical HOWTOs about accessing software of relevance for this course, see the resources section of the Canvas course page.
Data Analysis of a Document Tracker
In this assignment, you are required to develop a simple Python-based application, that analyses and displays document tracking data from a major web site.
The issuu.com platform is a web site for publishing documents. It is widely used by many on-line publishers and currently hosts about 15 million documents. The web site tracks usage of the site and makes the resulting, anonymised data available to a wider audience. For example, it records who views a certain document, the browser used for viewing it, how the user arrived at this page, etc. In this exercise, we use one of these data sets to perform data processing and analysis in Python.
The data format uses JSON and is described on this local page, describing the data spec. Note that the data files contain a sequence of entries in JSON format, rather than one huge JSON construct, in order to aid scalability. Familiarise yourself with the details of the data representation before you start implementation. As the assignment opens, two data sets are available: a small data set (10k lines) and a tiny sample data set for testing. At a later stage, larger data sets, in the range of 100k-5M lines, will be posted, and your final implementation should be able to cope with these sizes of input data.
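Because each line of a data file is its own JSON entry, the file can be processed one entry at a time instead of being loaded in one piece. The following is a minimal sketch of that idea, not a prescribed design; the function name read_events and the choice to skip malformed lines silently are assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator


def read_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line of line-delimited JSON input.

    `lines` can be an open text file or any other iterable of strings.
    Blank lines and lines that are not valid JSON are skipped
    (an assumption; a real solution may prefer to report them).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry
```

A file would typically be passed in as `with open(file_name, encoding="utf-8") as f: events = list(read_events(f))`. Keeping the reader a generator means that tasks which only need one pass over the data do not have to hold all 5M entries in memory.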
The application needs to run on an up-to-date Linux platform (Ubuntu 22.04 or equivalent). The application should be developed in Python 3.10, using appropriate libraries for input, data processing and visualisation. Possible choices are the json library for parsing, the pandas library for processing the input data (optional), the tkinter library for GUI functionality and the matplotlib library for visualising the results. You need to identify the advantages of your choice of libraries.
The application must provide the following functionality:
1. Python: The core logic of the application should be implemented in Python 3.10.
2. Views by country/continent: We want to analyse, for a given document, from which countries and continents the document has been viewed. The data should be displayed as a histogram of countries, i.e. counting the number of occurrences for each country in the input file.
(a) The application should take a string as input, which uniquely specifies a document (a document UUID), and return a histogram of countries of the viewers. The histogram can be displayed using matplotlib.
(b) Use the data you have collected in the previous task, group the countries by continent, and generate a histogram of the continents of the viewers. The histogram can be displayed using matplotlib.
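The two histograms of task 2 can be sketched as a counting step followed by a regrouping step. The field names subject_doc_id and visitor_country are assumptions and must be checked against the data spec; the CONTINENT_OF table is a deliberately tiny, hypothetical stand-in for a complete country-to-continent mapping.

```python
from collections import Counter
from typing import Iterable, Mapping

# Assumed field names; verify them against the data spec.
DOC_FIELD = "subject_doc_id"
COUNTRY_FIELD = "visitor_country"

# Hypothetical, incomplete mapping from country code to continent.
CONTINENT_OF = {
    "GB": "Europe",
    "DE": "Europe",
    "US": "North America",
    "BR": "South America",
}


def views_by_country(events: Iterable[dict], doc_uuid: str) -> Counter:
    """Count entries per country for the given document."""
    return Counter(
        e[COUNTRY_FIELD]
        for e in events
        if e.get(DOC_FIELD) == doc_uuid and COUNTRY_FIELD in e
    )


def views_by_continent(country_counts: Mapping[str, int],
                       continent_of: Mapping[str, str] = CONTINENT_OF) -> Counter:
    """Regroup per-country counts by continent; unmapped codes go to 'Unknown'."""
    totals: Counter = Counter()
    for country, count in country_counts.items():
        totals[continent_of.get(country, "Unknown")] += count
    return totals


def show_histogram(counts: Mapping[str, int], title: str) -> None:
    """Display the counts as a bar chart."""
    # Imported here so that the counting functions work without a display.
    import matplotlib.pyplot as plt

    labels = list(counts)
    plt.bar(labels, [counts[label] for label in labels])
    plt.title(title)
    plt.ylabel("Views")
    plt.show()
```

Task 2(b) reuses the result of 2(a), so the input file is only scanned once: `show_histogram(views_by_continent(views_by_country(events, doc_uuid)), "Views by continent")`.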
3. Views by browser: In this task we want to identify the most popular browser. To this end, the application has to examine the visitor useragent field and count the number of occurrences for each value in the input file.
(a) The application should return and display a histogram of all browser identifiers of the viewers.
(b) In the previous task, you will see that the browser strings are very verbose, distinguishing browsers by e.g. version and OS used. Process the input of the above task, so that only the main browser name is used to distinguish them (e.g. Mozilla), and again display the result as a histogram.
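One way to reduce a verbose browser string to its main name is to keep only the leading product token, which yields "Mozilla" for a string starting with "Mozilla/5.0". The sketch below takes that approach; the field name visitor_useragent, the regular expression, and the "Unknown" fallback are assumptions, and other definitions of "main browser name" are equally defensible.

```python
import re
from collections import Counter
from typing import Iterable

USERAGENT_FIELD = "visitor_useragent"  # assumed field name; check the data spec

# Leading run of characters up to the first '/', whitespace or '('.
_LEADING_TOKEN = re.compile(r"\s*([^/\s(]+)")


def browser_name(useragent: str) -> str:
    """Return the leading product token of a user-agent string."""
    match = _LEADING_TOKEN.match(useragent)
    return match.group(1) if match else "Unknown"


def views_by_browser(events: Iterable[dict], short: bool = False) -> Counter:
    """Count entries per browser string (task 3a) or per main name (task 3b)."""
    counts: Counter = Counter()
    for event in events:
        useragent = event.get(USERAGENT_FIELD)
        if useragent is None:
            continue
        counts[browser_name(useragent) if short else useragent] += 1
    return counts
```

Passing the shortening step as a flag keeps tasks 3(a) and 3(b) on the same counting code, so the two histograms cannot drift apart.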
4. Reader profiles: In order to develop a readership profile for the site, we want to identify the most avid readers. We want to determine, for each user, the total time spent reading documents. The top 10 readers, based on this analysis, should be printed.
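The reader profile amounts to summing a reading time per visitor and taking the ten largest totals. The following sketch assumes the field names visitor_uuid and event_readtime, which must be checked against the data spec, and ignores entries that carry no numeric reading time.

```python
from collections import Counter
from typing import Iterable

# Assumed field names; verify them against the data spec.
VISITOR_FIELD = "visitor_uuid"
READTIME_FIELD = "event_readtime"


def top_readers(events: Iterable[dict], n: int = 10) -> list:
    """Return the n (visitor, total reading time) pairs with the largest totals."""
    totals: Counter = Counter()
    for event in events:
        visitor = event.get(VISITOR_FIELD)
        readtime = event.get(READTIME_FIELD)
        # Skip entries without a visitor or without a numeric reading time.
        if visitor is not None and isinstance(readtime, (int, float)):
            totals[visitor] += readtime
    return totals.most_common(n)
```

The result can be printed with a simple loop, e.g. `for visitor, total in top_readers(events): print(visitor, total)`.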
5. "Also likes" functionality: Popular document-hosting web sites, such as Amazon, provide information about related documents based on document tracking information. One such feature is the "also likes" functionality: for a given document, identify which other documents have been read by this document's readers. The idea is that, without examining the detail of either document, the information that both documents have been read by the same reader relates the two documents to each other. Figure 1 gives an example of this functionality. In this task, you should write a function that generates such an "other readers of this document also like" list, which is parametrised over the function that determines the order of the list of documents. Display the top 10 documents which are "liked" by other readers.
To achieve this task you will need to do the following:
(a) Implement a function that takes a document UUID and returns all visitor UUIDs of readers of that document.
(b) Implement a function that takes a visitor UUID and returns all document UUIDs that have been read by this visitor.
(c) Using the two functions above, implement a function to implement the "also like" functionality, which takes as parameters the above document UUID and (optionally) visitor UUID, and additionally a sorting function on documents. The function should return a list of "liked" documents, sorted by the sorting function parameter. Note: the implementation of this function must not fix how documents are sorted; it must use the sorting function parameter instead.
(d) Use this function to produce an "also like" list of documents, using a sorting function based on the number of readers of the same document. Provide a document UUID and visitor UUID as input and produce a list of top 10 document UUIDs as a result.
Figure 1: Example of identifying also-likes documents. Starting from the current reader and document (green), all readers who have also read the input document are identified (blue). From the other documents read by these readers, the top 10 documents, counted by number of readers, are identified and displayed. In this example the red document is top of this list, and the two pink documents are also on the result list. The automatically generated graph should display all three result documents, but doesn't have to distinguish between "best" and "others" by shading. The unused, black users and documents shouldn't be shown in that graph.
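Steps (a) to (d) can be sketched as two lookup functions, a combining function that receives the ordering as a parameter, and one concrete ordering. This is an illustration under stated assumptions rather than a reference solution: the field names subject_doc_id and visitor_uuid, the shape of the sorting function (it receives a mapping from document to the set of its readers), and the decision to exclude the input visitor and the input document from the result are all choices made here.

```python
from typing import Callable, Optional

# Assumed field names; verify them against the data spec.
DOC_FIELD = "subject_doc_id"
VISITOR_FIELD = "visitor_uuid"

# A sorting function maps {document: readers} to an ordered list of documents.
SortFn = Callable[[dict], list]


def readers_of(events: list, doc_uuid: str) -> set:
    """(a) All visitor UUIDs that appear with the given document."""
    return {e[VISITOR_FIELD] for e in events
            if e.get(DOC_FIELD) == doc_uuid and VISITOR_FIELD in e}


def documents_read_by(events: list, visitor_uuid: str) -> set:
    """(b) All document UUIDs that appear with the given visitor."""
    return {e[DOC_FIELD] for e in events
            if e.get(VISITOR_FIELD) == visitor_uuid and DOC_FIELD in e}


def also_likes(events: list, doc_uuid: str, sort_fn: SortFn,
               visitor_uuid: Optional[str] = None, limit: int = 10) -> list:
    """(c) Documents read by other readers of doc_uuid, ordered by sort_fn."""
    readers = readers_of(events, doc_uuid)
    readers.discard(visitor_uuid)  # no effect when visitor_uuid is None
    liked: dict = {}
    for reader in readers:
        for other in documents_read_by(events, reader):
            if other != doc_uuid:
                liked.setdefault(other, set()).add(reader)
    return sort_fn(liked)[:limit]


def by_reader_count(liked: dict) -> list:
    """(d) Most-read documents first; ties are broken by document UUID."""
    return sorted(liked, key=lambda doc: (-len(liked[doc]), doc))
```

A call such as `also_likes(events, doc_uuid, by_reader_count, visitor_uuid)` then yields the top 10 list of step (d). Note that events must be a list here, since it is scanned once per reader; for the larger data sets, building a visitor-to-documents index once would avoid the repeated scans.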
6. "Also likes" graph: For the above "also like" functionality, generate a graph that displays the relationship between the input document and all documents that have been found as "also like" documents (and only these documents). Highlight the input document and user by shading in that graph, and use arrows to capture the "has-read" relationship (i.e. arrow from reader to document). In the graph shorten all visitor UUIDs and document UUIDs to the last 4 hex-digits. As an example, the graph below uses document b4fe and reader 6771 as input (shaded green) and displays 7 "also like" documents, together with the readers that relate these documents with the input document. For added clarity, shade the documents according to how many other readers also read them:
Hint: Use the .dot format as graph representation. Use the graphviz package with the dot tool to translate the .dot into a .ps format (and then optionally into .pdf format). For a detailed description see this dot User Manual. You can install the graphviz package on an Ubuntu machine by typing (in a terminal window): sudo apt-get install graphviz
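The .dot text for such a graph can be produced with plain string formatting. The sketch below is one possible layout, not the source file mentioned in this specification: the function name also_likes_dot, the readers_by_doc input (a mapping from each displayed document, including the input document, to the set of its readers), and the node shapes are assumptions.

```python
from typing import Mapping, Optional


def short(uuid: str) -> str:
    """Shorten a UUID to its last 4 hex digits, as required for the graph."""
    return uuid[-4:]


def also_likes_dot(readers_by_doc: Mapping[str, set], doc_uuid: str,
                   visitor_uuid: Optional[str] = None) -> str:
    """Return DOT source with one reader -> document arrow per 'has-read' pair."""
    highlight = ", style=filled, fillcolor=green"
    lines = ["digraph also_likes {"]
    readers = sorted({r for rs in readers_by_doc.values() for r in rs})
    for reader in readers:
        attrs = "shape=box" + (highlight if reader == visitor_uuid else "")
        lines.append(f'  "{short(reader)}" [{attrs}];')
    for doc in sorted(readers_by_doc):
        attrs = "shape=circle" + (highlight if doc == doc_uuid else "")
        lines.append(f'  "{short(doc)}" [{attrs}];')
    for doc in sorted(readers_by_doc):
        for reader in sorted(readers_by_doc[doc]):
            lines.append(f'  "{short(reader)}" -> "{short(doc)}";')
    lines.append("}")
    return "\n".join(lines) + "\n"
```

After writing the returned text to a file such as also_likes.dot, the dot tool can translate it with `dot -Tps also_likes.dot -o also_likes.ps`. Two different UUIDs that share their last 4 hex digits would collapse into one node in this sketch, which a full solution should guard against.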
As an example of graphviz/dot usage, the source file for the above graph is available in Canvas. You can generate the resulting graph as follows:
7. GUI usage: To read the required data and to display the statistical data, develop a simple GUI based on tkinter or another package of your choice that reads the user inputs described above, and with buttons to process the data as required per task. In case you are using a package other than tkinter, document its requirements in detail in the report.
8. Command-line usage: The application shall provide a command-line interface to test its functionality in an automated way, like this:
to check the results of implementing task task_id using inputs user_uuid for the user UUID and doc_uuid for the document UUID; file_name is the name of the JSON file with the input data. The task ids should be: 2a, 2b, 3a, 3b, 4, 5d, 6, 7, matching the tasks above (task id 7 should run Task 6 and automatically launch a GUI with fields to input the document and (optionally) user ids, and show the resulting also-likes graph).
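A command-line front end of this kind can be built with the standard argparse module. In the sketch below, the option letters -u, -d, -t and -f are hypothetical placeholders chosen for illustration and are not taken from this specification; only the four input names and the list of task ids come from the text above.

```python
import argparse

# Task ids as listed in the specification.
TASK_IDS = ["2a", "2b", "3a", "3b", "4", "5d", "6", "7"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four inputs; the option letters are hypothetical."""
    parser = argparse.ArgumentParser(description="Document tracker analysis")
    parser.add_argument("-u", dest="user_uuid", default=None,
                        help="visitor UUID (optional)")
    parser.add_argument("-d", dest="doc_uuid", default=None,
                        help="document UUID")
    parser.add_argument("-t", dest="task_id", required=True, choices=TASK_IDS,
                        help="id of the task to run")
    parser.add_argument("-f", dest="file_name", required=True,
                        help="JSON file with the input data")
    return parser
```

The parsed task_id can then select the function to run, for example through a dictionary from task id to function. Restricting task_id with choices makes argparse reject unknown ids with a usage message before any data is read.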
The report should be between 8 and 12 pages long and use the following format (if you need space for additional screenshots, put them into an appendix, which does not count against the page limit, but don't rely on the screenshots in your discussion):
1. Introduction: State the purpose of the report, your remit and any assumptions you have made during the development process.
2. Requirements' checklist: Here you should clearly show which requirements you have delivered and which you haven't.
3. Design Considerations: Here you should clearly state what you have done to your application to make it more usable and accessible.
4. User Guide: Use screen shots of the running application along with text descriptions to help you describe how to operate the application.
5. Developer Guide: Describe your application design and main areas of code in order to help another developer understand your work and how they might develop it. You may find it useful to supplement the text with code fragments.
6. Testing: Show the results for testing all cases and prove that the outputs are what are expected. If certain conditions cause erroneous results or the application to crash then report these honestly.
7. Reflections on programming language and implementation: Based on your experience in implementing this application, reflect on which language features and technologies have been most helpful, identify limitations of your application and suggest ways to overcome these limitations. Also reflect on the usability of the (kind of) language (either system or scripting language) for this application domain, and on its wider applicability.
8. What did I learn from CW1? A short discussion on lessons learnt from the feedback given on CW1 and a discussion how you integrated this feedback into CW2. Cover both coding and report writing, possibly more (project management, preparing for interview style questions etc).
9. Conclusions: Reflect on what you are most proud of in the application and what you'd have liked to have done differently.