Data Mining
Coursework
Suppose that the following table of instances (cases) was recorded for an insurance company's promotions of its life assurance product. The attributes are self-explanatory. The values of the two promotion attributes should be read as follows: Yes means that the individual was offered that promotion on condition that he or she took out the insurance; No means that the individual was not offered the promotion.
ID | Income Range | Gender | Age Range | Holiday Promotion | Wine Promotion | Life Insurance Take Up
1 | 40-50K | Male | 30-40 | No | Yes | Yes
2 | 30-40K | Female | 30-40 | No | Yes | No
3 | 40-50K | Male | 30-40 | No | No | No
4 | 30-40K | Male | 30-40 | Yes | Yes | Yes
5 | 50-60K | Female | 20-30 | No | No | No
6 | 20-30K | Female | 40-50 | No | No | No
7 | 30-40K | Male | 20-30 | Yes | No | No
8 | 20-30K | Male | 20-30 | No | Yes | Yes
9 | 30-40K | Male | 30-40 | No | Yes | Yes
10 | 30-40K | Female | 30-40 | No | No | Yes
11 | 40-50K | Female | 30-40 | No | No | No
12 | 20-30K | Male | 20-30 | No | Yes | Yes
13 | 50-60K | Female | 20-30 | No | No | No
14 | 40-50K | Male | 40-50 | No | Yes | No
15 | 20-30K | Female | 20-30 | Yes | Yes | No
16 | 40-50K | Female | 30-40 | No | No | No
17 | 50-60K | Male | 40-50 | Yes | Yes | Yes
18 | 20-30K | Female | 30-40 | No | Yes | No
19 | 20-30K | Male | 40-50 | Yes | Yes | Yes
20 | 30-40K | Female | 20-30 | Yes | Yes | No
Questions:
Use the ID3 decision tree induction method available in the Weka package (with the default settings) to derive a classifier (decision tree) from this data set. The class attribute is Life Insurance Take Up.
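ID3 chooses, at each node, the attribute with the highest information gain. The following is a minimal Python sketch of that criterion applied to the table above. It is not Weka's implementation, it only scores candidate attributes for the root node, and the names `ROWS`, `entropy` and `information_gain` are chosen here for illustration. The ID column is left out because it identifies a case rather than describing it.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Income Range", "Gender", "Age Range",
              "Holiday Promotion", "Wine Promotion"]

# Rows copied from the table (ID omitted); the last field is the class value.
ROWS = """
40-50K Male 30-40 No Yes Yes
30-40K Female 30-40 No Yes No
40-50K Male 30-40 No No No
30-40K Male 30-40 Yes Yes Yes
50-60K Female 20-30 No No No
20-30K Female 40-50 No No No
30-40K Male 20-30 Yes No No
20-30K Male 20-30 No Yes Yes
30-40K Male 30-40 No Yes Yes
30-40K Female 30-40 No No Yes
40-50K Female 30-40 No No No
20-30K Male 20-30 No Yes Yes
50-60K Female 20-30 No No No
40-50K Male 40-50 No Yes No
20-30K Female 20-30 Yes Yes No
40-50K Female 30-40 No No No
50-60K Male 40-50 Yes Yes Yes
20-30K Female 30-40 No Yes No
20-30K Male 40-50 Yes Yes Yes
30-40K Female 20-30 Yes Yes No
"""
DATA = [line.split() for line in ROWS.strip().splitlines()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on column."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[column] for r in rows):
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {gain:.3f}")
```

On this table the sketch ranks Gender highest, with a gain of about 0.296. Applying the same scoring recursively to each resulting subset would give the lower levels of the tree.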
What should be the class value for the following unseen case based on the derived tree? Justify your answer.
Income Range | Gender | Age Range | Holiday Promotion | Wine Promotion | Life Insurance Take Up
40-50K | Male | 20-30 | No | Yes | ?
How would you deal with such cases in general? Outline your solution algorithmically using the structure given below:
algorithm DT-based Classification
    # traversing the tree to reach a leaf node N
    if N's class value is null then
        :
        : write your pseudo code to implement your solution here
        :
    else
        return the class value
end
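One common way to handle a leaf with no class value is to fall back on the majority class of the training cases that reached the nearest ancestor node. The Python sketch below illustrates that idea only; it is one possible approach among several, not the expected answer. The `Node` class, its `majority` field and the small hand-built tree are hypothetical and are not the tree that Weka derives from the coursework data.

```python
class Node:
    """A decision-tree node; a leaf has attribute=None."""

    def __init__(self, attribute=None, label=None, majority=None):
        self.attribute = attribute  # attribute tested at an internal node
        self.label = label          # class value at a leaf (None if empty)
        self.majority = majority    # majority class of training cases seen here
        self.children = {}          # attribute value -> child Node


def classify(root, case):
    """Walk the tree; fall back to the nearest ancestor's majority class."""
    node = root
    fallback = root.majority
    while node.attribute is not None:
        if node.majority is not None:
            fallback = node.majority
        child = node.children.get(case[node.attribute])
        if child is None:           # no branch for this attribute value
            return fallback
        node = child
    return node.label if node.label is not None else fallback


# Hypothetical tree, for illustration only.
root = Node(attribute="Gender", majority="No")
root.children["Female"] = Node(label="No")
age = Node(attribute="Age Range", majority="Yes")
root.children["Male"] = age
age.children["30-40"] = Node(label="Yes")
age.children["40-50"] = Node(label="No")
age.children["20-30"] = Node(label=None)  # empty leaf: no training cases

print(classify(root, {"Gender": "Male", "Age Range": "20-30"}))  # Yes (fallback)
print(classify(root, {"Gender": "Male", "Age Range": "40-50"}))  # No (leaf label)
```

Storing the majority class at every internal node while the tree is built keeps the fallback cheap at classification time. Other choices, such as returning the overall majority class or reporting the case as unclassified, fit the same structure.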
A decision tree derived from data can be used not only to predict class values for unseen cases, but also to summarize data for analysis. Based on the tree derived in 1), comment on whether the company has conducted its promotion effectively.
Among Weka's test options, the default is "Cross-validation" with "Folds" set to 10. Briefly explain how cross-validation tests a model derived from training data and why we use it for testing.
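As a rough illustration of the mechanics, the Python sketch below splits the cases into k folds, trains on k-1 of them, tests on the held-out fold, and pools the results over all folds. It is a simplified sketch and not Weka's procedure: the folds are not stratified, the function names are invented here, and a majority-class predictor stands in for ID3 as a placeholder learner. The toy rows only reproduce the 8 Yes / 12 No class split of the table.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=1):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k, train, predict, seed=1):
    """Return pooled accuracy: every case is tested once, by a model that never saw it."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        training = [r for i, r in enumerate(rows) if i not in held_out]
        model = train(training)
        correct += sum(predict(model, rows[i]) == rows[i][-1] for i in test_idx)
    return correct / len(rows)


# Placeholder learner (an assumption for this sketch): always predict the majority class.
def train_majority(rows):
    return Counter(r[-1] for r in rows).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Toy rows with a dummy attribute; the last field is the class value.
rows = [("x", "Yes")] * 8 + [("x", "No")] * 12
print(cross_validate(rows, 10, train_majority, predict_majority))  # 0.6
```

Because each case is scored by a model trained without it, the pooled accuracy estimates performance on unseen cases, whereas testing on the training set scores cases the model has already seen.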
Now perform the following tests. Vary "Folds" from 2 to 10, run ID3 for each setting, and record the classification accuracy. Then change the test option to "Use training set", run ID3 again, and record the classification accuracy. Present these results as a table or a bar chart. Comment on your results: which method (cross-validation or using the training set) is better for testing your derived tree, and why?
Use the JRip rule induction method available in the Weka package (with the default setting) to derive a classifier (classification rules) from this set of data.
What observations do you have on the two classifiers you have obtained in terms of using them for business analysis (as in 3) and for classification of an unseen case (as in 2)?