Information Retrieval Assignment - Indexing for Web Search
The assignment involves building a processing pipeline that turns a Web site into structured knowledge. All the instructions and questions for the task are given in the PDF file attached below. Detailed explanations are needed. Specific requirements:
The Task - Your task is to apply your IR skills to build a processing pipeline that turns a Web site into structured knowledge (thus enhancing your chances of getting the job outlined above). Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index of terms identified in the documents.
This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:
- Engineering a Complete System - The system you develop must be able to read Web pages from a specified set of URLs and produce appropriately formatted output. The Web pages should be processed one at a time using the steps outlined below. The final system should have control over all the individual components, so that a single call performs all the steps outlined below.
- HTML Parsing - Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. Note that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Use an appropriate tool to perform this task.
- Pre-processing: Sentence Splitting, Tokenization and Normalization (10%) - The next step should be to transform the input text into a normal form of your choice. This should include the identification of sentences, bullet points and cells in tables.
- Part-of-Speech Tagging - The input should be tagged with a suitable part-of-speech tagger, so that the result can then be processed in the next steps.
- Selecting Keywords - One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stop words, and identify phrases suitable as index terms. Apply tf.idf as part of your selection and weighting step.
- Stemming or Morphological Analysis - Writing word stems to the database rather than words allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
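To make the stages above more concrete, here is a minimal sketch in Python of how some of them could fit together behind a single call. It is only an illustration under several assumptions, not the solution the assignment expects. It uses only the standard library, and the names `TextExtractor`, `parse_html`, `tokenize`, `tf_idf`, `build_index` and the tiny `STOP_WORDS` list are all hypothetical. Fetching the URLs, sentence splitting, part-of-speech tagging, phrase detection and stemming are left out, and the tf.idf variant shown (relative term frequency times the natural log of N/df) is one common choice among several.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text and meta keywords, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # pieces of visible text
        self.keywords = []    # values from <meta name="keywords">
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "keywords":
                content = attr.get("content") or ""
                self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def parse_html(html):
    """Return (plain text, meta keywords) for one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.keywords


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """docs: dict of doc id -> token list. Returns doc id -> {term: weight}."""
    n = len(docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
    return weights


def build_index(pages):
    """pages: dict of URL -> HTML string. Single entry point for the sketch."""
    docs = {}
    for url, html in pages.items():
        text, keywords = parse_html(html)
        docs[url] = tokenize(text + " " + " ".join(keywords))
    return tf_idf(docs)
```

Calling `build_index` with a dictionary of URL to HTML string returns, for each page, a dictionary of terms and their weights. With only two pages, any term that occurs in both gets a weight of zero under this idf formula, which is worth keeping in mind when discussing the output for the two assignment pages.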
The report for the Assignment should be written in Microsoft Word format with the following information:
a. Description of the implementation
b. Output produced when the system is applied to the 2 web pages given in the assignment.
c. Output produced by each stage of the processing pipeline for each of the two files.
d. Discussion of your solution focusing on functionality implemented and possible improvements/extensions.
All this information has been listed out in the assignment instructions.
Attachment:- Assignment File.rar