Research

Knowledge Discovery from Databases


ITERATE Conceptual Clustering System

ITERATE

The ITERATE system is designed to (i) avoid the data-ordering effects exhibited by a number of other incremental conceptual clustering systems, and (ii) achieve a better clustering by using an iterative redistribution operator to optimize the partition structure after the initial partitions have been generated by hierarchical classification. The overall control structure is summarized below:
  1. Generate a concept tree using a breadth-first control structure, which builds the tree level by level. A heuristic dissimilarity ordering scheme orders the data objects at each node before its child partition is created.
  2. Choose a representative set of concepts from the hierarchy, using the Category Utility measure, to create an initial class partition.
  3. Consider the objects one by one and, based on a category match measure, redistribute each object to its most similar class. Repeat this step until no object changes class.
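As a rough illustration of the measure used in Step 2, the sketch below computes Category Utility for a flat partition of nominal-valued objects. It assumes the common COBWEB-style definition, CU = (1/K) Σ_k P(C_k) [Σ_i Σ_j P(A_i = V_ij | C_k)^2 − Σ_i Σ_j P(A_i = V_ij)^2]; the exact variant used inside ITERATE may differ, and the function names are hypothetical.

```python
from collections import Counter


def sum_sq_probs(objects):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within `objects`."""
    n = len(objects)
    total = 0.0
    for i in range(len(objects[0])):
        counts = Counter(obj[i] for obj in objects)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(partition):
    """Category Utility of a partition (assumed COBWEB-style definition).

    `partition` is a list of classes; each class is a non-empty list of
    equal-length tuples of nominal attribute values.
    """
    all_objects = [obj for cls in partition for obj in cls]
    n = len(all_objects)
    base = sum_sq_probs(all_objects)  # predictability without class knowledge
    gain = sum(len(cls) / n * (sum_sq_probs(cls) - base) for cls in partition)
    return gain / len(partition)


red, blue = ("red", "round"), ("blue", "square")
good = [[red, red], [blue, blue]]   # classes match the attribute structure
bad = [[red, blue], [red, blue]]    # classes carry no information
print(category_utility(good), category_utility(bad))
```

A partition whose classes make attribute values more predictable scores higher, which is the property that makes the measure usable for picking an initial partition from the concept tree.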

The intuition behind the ITERATE control structure is to use the results of a hierarchical sorting scheme to choose a starting point for a partitional clustering scheme. Step 1 of the algorithm hierarchically sorts the pre-ordered objects. Steps 2 and 3 reflect the partitional clustering aspect of ITERATE. Step 2 chooses an initial set of classes from the concept tree and creates a flat partition as a good starting point for iterative redistribution in Step 3. Step 3 is the optimizing step: it uses a measure of similarity between an object and a class to determine the most appropriate class for the object. This step is repeated until a stable partition is derived, i.e., no objects move from one class to another. Iterative redistribution adopts a global perspective in trying to mitigate data-order dependency effects by allowing objects to redistribute anywhere in the partition.
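The redistribution loop of Step 3 can be sketched as follows. This is a minimal illustration, not the ITERATE implementation: the `match` score (class probability times the gain in squared value probabilities for the object's own attribute values) is an assumed stand-in for the category match measure, and `redistribute` and `max_passes` are hypothetical names.

```python
def match(obj, members, all_objects):
    """Assumed category-match score of `obj` against a class with `members`."""
    n_k, n = len(members), len(all_objects)
    score = 0.0
    for i, value in enumerate(obj):
        p_in_class = sum(1 for m in members if m[i] == value) / n_k
        p_overall = sum(1 for o in all_objects if o[i] == value) / n
        score += p_in_class ** 2 - p_overall ** 2
    return (n_k / n) * score


def redistribute(objects, labels, max_passes=100):
    """Move each object to its best-matching class until no object moves."""
    labels = list(labels)
    for _ in range(max_passes):  # guard against cycling; an assumption
        moved = False
        for idx, obj in enumerate(objects):
            # Rebuild class membership so earlier moves are visible at once.
            # The object is counted in its current class, which slightly
            # favors staying put.
            classes = {}
            for o, lab in zip(objects, labels):
                classes.setdefault(lab, []).append(o)
            scores = {lab: match(obj, members, objects)
                      for lab, members in classes.items()}
            best = max(scores, key=scores.get)
            if scores[best] > scores[labels[idx]]:  # ties keep current class
                labels[idx] = best
                moved = True
        if not moved:
            break  # stable partition reached
    return labels


objects = [("red", "round")] * 3 + [("blue", "square")] * 3
start = [0, 0, 1, 1, 1, 1]  # third object starts in the wrong class
print(redistribute(objects, start))
```

Because every object may move to any class in the partition, a placement made early in the hierarchical sort can be undone later, which is how redistribution counteracts order dependence.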

Discretization

The basic version of ITERATE works only with nominal-valued attributes. It can be adapted to numeric-valued data, or to data that mixes nominal and numeric attributes, by pre-discretizing the numeric-valued attributes. A discretization algorithm based on a thresholding mechanism adapted from image-processing techniques has been developed for this purpose. The approach retains more of the characteristics of the original continuous-valued attributes than other discretization methods.
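The page does not say which thresholding technique was adapted. Purely as an illustration of threshold-based pre-discretization, the sketch below uses Otsu's criterion, a standard image-thresholding method, as a stand-in: it picks the single cut point that maximizes the between-group variance of a numeric attribute. The names `otsu_cut` and `discretize` and the labels "low" and "high" are hypothetical.

```python
def otsu_cut(values):
    """Cut point maximizing between-group variance (Otsu's criterion).

    Returns None when all values are equal and no cut exists.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_score = None, -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # cannot place a cut between equal values
        w_left, w_right = i / n, (n - i) / n
        mean_left = left_sum / i
        mean_right = (total - left_sum) / (n - i)
        score = w_left * w_right * (mean_left - mean_right) ** 2
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2  # midpoint of the gap
    return best_cut


def discretize(values, cut):
    """Map numeric values to two nominal labels around `cut`."""
    return ["low" if v <= cut else "high" for v in values]


readings = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
cut = otsu_cut(readings)
print(cut, discretize(readings, cut))
```

The resulting nominal labels can be fed to a clusterer that accepts only nominal attributes. More than two intervals could be obtained by applying the cut recursively within each side, which is again an assumption rather than the documented method.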


Knowledge-based Equation Discovery from Geological Databases

Cen Li

In qualitative terms, good recoverable reserves have high hydrocarbon saturation, are trapped by highly porous sediments (reservoir porosity), and are surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves. The task of this project is to derive qualitative equation models for porosity as a function of a number of geological phenomena, such as pore geometries, permeability, rock types, and depositional setting.

The equation discovery process consists of two main steps: context definition and equation derivation. Context definition identifies and formulates homogeneous regions, each of which is likely to produce a unique and meaningful analytic formula for the response variable. Clustering techniques and a suite of visualization and interpretation routines make up a toolbox that assists the context definition task. Within each context, multivariable regression analysis is conducted to derive analytic equations relating the response variable to a set of relevant predictive variables, starting from one or more initial base models. Domain knowledge and a heuristic search technique based on component-plus-residual plots dynamically guide the equation refinement process.
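To make the component-plus-residual idea concrete, the sketch below fits a linear model by ordinary least squares and computes, for one predictor x_j, the values e + b_j x_j (the residual plus that predictor's fitted component). Plotting these against x_j shows whether the predictor's contribution looks linear; visible curvature suggests trying a transformed term. This is a generic pure-Python illustration with hypothetical function names and made-up numbers; how the discovery system uses such plots inside its search is not described here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bp]."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


def partial_residuals(rows, y, j):
    """Component-plus-residual values for predictor j: residual + b_j * x_j."""
    beta = fit_linear(rows, y)
    out = []
    for r, t in zip(rows, y):
        fitted = beta[0] + sum(b * x for b, x in zip(beta[1:], r))
        out.append((t - fitted) + beta[j + 1] * r[j])
    return out


# Made-up data generated from y = 2 + 3*x1 - x2 (exactly linear).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
print(fit_linear(rows, y))
print(partial_residuals(rows, y, 0))
```

For exactly linear data the partial residuals for x1 fall on a straight line with slope b1; with real measurements, systematic departures from that line are the cue for refining the base model.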

The discovery system architecture is illustrated below:




