An explanation of why Storing-all-rules BruteDL gives the ``surprising'' results can be provided by the examination of the decision lists generated by the two different versions of BruteDL. Figure 5 and Figure 6 show the rules that appeared in the two different versions' outputs and their performance on the testing data in one of the 25 trials on the 10% sized training data set on cancer domain.
Figure 5: The Rules and Their Performance in the Fifth Trial of BruteDL on
Cancer Data Using 10% Sized Training Set
Figure 6: The Rules and Their Performance in the Fifth Trial of
Storing-All-Rules BruteDL on Cancer Data Using 10% Sized
Training Set
As displayed in the figures, there are more rules in the decision list
generated
by the storing-all-rules BruteDL, which is what we expected. However, it is
surprising that some of the ``best'' rules found by BruteDL are not even
included in the list of ``all'' the rules stored by our variant. For example,
the first and seventh rules in the original
BruteDL's output do not exist in Figure 6.
This is because
the storing-all-rules BruteDL does not really store all the rules. It
actually stores only all the rules that have passed the homogeneity and
minimality test. In the test, a rule is checked against all its specializations
to see if its Laplace accuracy is not significantly different from those of all
its specializations. By storing only the ``best'' rules, the number of the
specializations of a single rule may be less than that when
storing all rules.
Therefore, the first rule in Figure 5 may pass the
-test against all its specializations among the ``best'' rules. However,
among all the rules, there may be one specialization that is not the best
for any training example, (i.e., the one that would not be stored and
thereby not be checked against in the original BruteDL.) which gives a
Laplace accuracy that is
statistically different from rule 1 and rule 1 has to be
discarded. Thus, some of the ``best'' rules
are eliminated from the all-rule-list. This may lead to a decreased
system
prediction accuracy if the eliminated ``best'' rule covers many future examples.
Because of the elimination of some ``best'' rules, the frequency that the
default rule is used in classifying future observations may not
decrease as we might think it should. In the case of our example
in Figure 5, rule 1 provides correct predictions
on 62.9% of the testing data. These observations are likely to be
misclassified and the
accuracy of the whole system may be effected when rule 1 is removed in the
storing-all-rule version. This explains why the default rule is used on
some data sets (e.g., 50% and 70% sized iris training sets) although it
was not used by the original BruteDL -- the only best-rule matching
some test observations disappeared in the all-rule list and those
observations have to be classified by the default rule.
Furthermore, the position of the new rules in the output of storing-all-rules BruteDL also affects the system's performance. For instance, the third rule in Figure 6 is not included in Figure 5 because it is not a best rule for any training example. However, it has a relatively high Laplace accuracy over the training examples. When BruteDL sorts rules in the decision list according to their Laplace accuracies, this not-best rule is placed at a position towards the beginning of the list. Then, during the later classification phase of the testing data, some of the testing observations may be misclassified by this not-best rule although it would be correctly classified by some ``best'' rules that are placed after the not-best rule. This is what happened in our example. In Figure 6, rule 3 provides incorrect predictions for 4.3% of the testing observations before other rules get a chance to correctly classify them. Rule 7 in Figure 6 is equivalent to the rule 5 in Figure 5. It correctly classifies 2.4% of the training data while none is covered by it in the storing-all-rules BruteDL. Those 2.4% testing data may be misclassified by not-best rules like rule 3, 5 or 6 in Figure 6.
Jing Lin