Hunt, Martin, and Stone (1966) were among the first to study machine concept learning from examples. Their Concept Learning System (CLS) nonincrementally builds decision trees that discriminate observations of different classes. CLS first divides the observations by their values along the `best' descriptive attribute; it uses a primitive frequency measure to determine the attribute whose values are most uniquely associated with different classes. The values of this divisive attribute are used to label arcs from the decision tree root and segregate observations into disjoint subsets. Each subset is treated as a child of the root, and CLS recursively builds a subtree for each. Decision tree expansion terminates when all observations at a (sub)node are members of the same class.
A new observation is classified by following the labeled arcs that correspond to the observation's values. When a leaf is reached, the class of the observations residing there is asserted as the class of the new observation. If this prediction is correct, the new observation is saved (along with the original set). If erroneous, the tree is reconstructed using previous observations and the misclassified one. This exemplifies a revolutionary approach to learning - a revolution occurs with each misclassification.
There is a sense in which the revolutionary procedure appears to be incremental, since observations are processed serially. However, each misclassification incurs subsequently larger amounts of processing. Hunt et al. explored variants of CLS that restricted the memory of past observations to a constant by randomly replacing prior observations with new ones. Limiting the number of saved observations also limits tree reconstruction costs, but experiments showed that it slowed learning rates as well; that is, more observations were required to form a decision tree that perfectly discriminated the observations.
In a similar vein, Michalski and Larson (1978) investigated the utility of restricting the observations used during learning. The basis of their strategy is AQ (Michalski, 1973), a nonincremental learning from examples system that does not build decision trees, but a `flat' set of logical (i.e., DNF) concept descriptions. Michalski and Larson adapt AQ to behave incrementally by limiting the number of observations used to reconstruct a faulty knowledge base. Unlike CLS, retained observations are not selected randomly, but on the basis of a Euclidean distance-like measure that identifies `good' concept representatives. In addition, incremental AQ limits reconstruction to those portions of the knowledge base (i.e., individual concept descriptions) that led to a misclassification. This `locality' constraint reduces computational costs and sets it further apart from CLS, which constructs the entire knowledge base (i.e., decision tree) after each incorrect prediction.
Reinke and Michalski (1986) carried the locality constraint a step further. Instead of recomputing a complete concept with each misclassification, their GEM system rederives only a small part of the concept. As in AQ, concepts are in DNF; if new observations are inconsistent with a concept, the faults are traced to individual conjunctive terms within the concept's description. Each faulty term is submitted to a generalization procedure along with the observations it currently covers and those that triggered inconsistency. However, unlike AQ, GEM saves and may reuse all previous observations.
Reinke and Michalski empirically compared their method with a nonincremental version of AQ in three domains. Each experiment measured concept description complexity, classification performance, and computational expense. The results tentatively indicate that: (a) the incremental method yields more complex concept descriptions than the nonincremental method, (b) the incrementally formed concept descriptions classify novel observations almost as well as the nonincremental ones, and (c) knowledge base update is less expensive using the incremental method than with the nonincremental method.
CLS, incremental AQ, and GEM differ in the extent to which they repair a faulty knowledge base: the entire knowledge base, entire concept descriptions, or partial concept descriptions, respectively. By reducing the scope of repair, these systems gain computational advantages over their nonincremental counterparts. In addition, each of these systems forms logical concept descriptions (CLS decision trees are tree-structured DNF concepts), and not coincidentally they insist on perfect consistency between the knowledge base and the environment. To maintain flexibility during incremental learning, they retain observations so that inconsistent portions of the knowledge base can be recomputed following each misclassification. Similarly, a system by Michalski (1985) retains observations that are misclassified by a knowledge base of production rules. In contrast, Winston's (1975) well-known system incrementally learns conjunctive concepts without retaining observations. However, Mitchell (1982) and Vere (1980) point out that Winston's system cannot insure consistency between a learned concept and observations.
Schlimmer's (1987a) STAGGER system (also Schlimmer and Granger, 1986a, b) is a recent addition to the line of incremental learning from examples systems. Like these previous systems (except CLS), STAGGER builds a knowledge base of `flat' concept descriptions and makes local knowledge base repairs. However, STAGGER departs in significant ways from these earlier systems. In part, these differences are motivated by the fact that real-world systems must be resistant to `noise' (i.e., incorrectly described observations due to faulty perception). As such, the system does not insist on perfect consistency between the knowledge base and the environment, nor does it make abrupt repairs following each misclassification. Rather, repair is triggered by a variable amount of inconsistency. To implement this strategy, STAGGER represents concepts as a probabilistic summary of important concept subcomponents. Like GEM, it is these components that are subject to repair, but repairs are made conservatively - only after a number of misclassifications indicate that revision is appropriate. Repairs are made by chunking primitive components, an ability that adds new terms to the concept language and improves learning (Schlimmer, 1987b). Even after making a repair though, the revised knowledge competes with the previous representation and is retracted if new observations prove the repair unwarranted. STAGGER does not reuse observations to effect repairs as do earlier systems. Rather, probabilistic information summarizes the training set and guides repair.
Computer experiments demonstrate that probabilistic representations and a conservative revision strategy enables STAGGER to deal effectively with noise. Furthermore, these characteristics enable the system to discern and respond to long-term environmental changes. STAGGER is relatively novel in addressing the problem of tracking environmental drift. Not coincidentally, other systems that address this problem (Hampson & Kibler, 1983; Holland, 1975; Langley, 1987a) also rely on the flexibility afforded by probabilistic representations.
A system that appears quite different than STAGGER at a cursory level, but that draws important principles from it, is Schlimmer and Fisher's (1986) ID4. ID4 descends from CLS by way of Quinlan's (1986) ID3, and like CLS, ID4 constructs decision trees. Its control structure is similar to CLS's, but it uses a more sophisticated evaluation measure for selecting the `best' divisive attribute: the `best' attribute maximizes the expected information gained from the attribute's values. Intuitively, this reflects how confidently one can predict an observation's class by knowing an attribute's value.
ID4 is incremental and updates a decision tree with each new observation. At the core of this incremental ability is the observation that the information-theoretic evaluation measure need not be computed directly from the set of observations, but a probabilistic summary of the observations is sufficient (i.e., co-occurrence counts for all attribute-value/class membership combinations). As with STAGGER, the use of a probabilistic representation frees ID4 from saving observations. For initial division at the root, the values of each new observation are used to update the counts necessary for computing the information measure. When a statistically significant comparison between the attributes can be made (based on the chi-square measure), a root attribute is chosen. The new subtrees are not constructed immediately because no observations have been saved. Instead, after each subsequent observation has updated counts at the root, it is routed to the appropriate subtree to update the counts there before being discarded. Subtrees are gradually grown in this manner. Over time a new attribute may come to be preferred at the root of a subtree. In this case the current root is supplanted by the attribute with the superior information gain, and the nodes under the original subtree are discarded and regrown. Subtree repair (versus full-tree reconstruction as in CLS) illustrates the same locality principles as GEM and STAGGER do for flat representations.
Schlimmer and Fisher used formal and empirical methods to characterize ID4 in terms of learning quality and cost. Their general findings are that ID4 converges on decision trees equivalent in quality to ID3's. The cost of updating a decision tree in ID4 is at worst logarithmic in the number of previous observations, while revolutionary application of ID3 (similar to full-memory CLS) requires polynomial time. Finally, ID4 may require more observations to converge on the same decision tree as the revolutionary application of ID3, but the `total work' (number of observations cost per observation) required to reach equivalent trees is considerably less for ID4.
Incremental learning systems exhibit great variety in the strategies they use for balancing the cost/quality tradeoff. CLS is a nonincremental system forced into frequently rebuilding a decision tree from scratch. Update costs are controlled by limiting the number of retained observations, but only at the expense of slowing learning speed. Incremental AQ and GEM lessen costs further by localizing knowledge base repairs, again based on saved observations. Work with AQ is novel in its use of `good' concept representatives in knowledge base update. Schlimmer's STAGGER localizes knowledge base revision as well, but drops the assumption that perfect consistency is desired or even possible. As a result, repairs are not triggered after each classification. ID4 is a descendant of CLS, but shares STAGGER's assumptions and strategies: probabilistic evidence and conservative knowledge base repair are the main sources of learning robustness.
Section 3.3 imposes further structure on these different strategies and suggests some directions for future work. However, we first turn to conceptual clustering, which drops the requirement of a `teacher' that preclassifies observations. Instead, the system must identify `useful' classes. This new specification has some important implications for performance, knowledge base structure, and learning strategy.