Department of Computer Science, Vanderbilt University
Box 1679, Station B,
Nashville, TN 37235
Proc. The First Intl. Conf. on Knowledge Discovery from Databases, Montreal, Canada, pp. 204-209, Aug. 1995.
A framework for knowledge-based scientific discovery in geological databases has been developed. The discovery process consists of two main steps: context definition and equation derivation. Context definition identifies and formulates homogeneous regions, each of which is likely to yield a unique and meaningful analytic formula for the goal variable. Clustering techniques and a suite of visualization and interpretation routines form a toolbox that assists the context definition task. Within each context, multivariable regression analysis is conducted to derive analytic equations relating the goal variable to a set of relevant independent variables, starting with one or more initial base models. Domain knowledge and a heuristic search technique based on component plus residual plots dynamically guide the equation refinement process. The methodology has been applied to derive porosity equations for data collected from oil fields in the Alaska Basin. Preliminary results demonstrate the effectiveness of this methodology.
Keywords: knowledge discovery from databases, scientific discovery, clustering, regression analysis, component plus residual plots.
This research is supported by grants from Arco Research Labs, Plano, TX.
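The equation derivation step summarized in the abstract can be illustrated with a minimal sketch of component plus residual (partial residual) values for a linear base model. This is not the authors' implementation: the function names (solve, fit_linear, partial_residuals) and the synthetic data are assumptions made purely for illustration, and the sketch covers a single context only, without the clustering step. For a fitted model y = b0 + b1*x1 + ... + bp*xp with residual e, the quantity plotted against x_j is e + b_j*x_j; visible curvature in that plot suggests that x_j should enter the equation through a transformation.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows, y):
    """Ordinary least squares with an intercept, via the normal equations.

    Returns [b0, b1, ..., bp]. Hypothetical helper, adequate for small,
    well-conditioned illustrative data only.
    """
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def partial_residuals(rows, y, coef, j):
    """Return (x_j, e + b_j * x_j) pairs for the j-th independent variable."""
    pairs = []
    for r, t in zip(rows, y):
        predicted = coef[0] + sum(b * v for b, v in zip(coef[1:], r))
        residual = t - predicted
        pairs.append((r[j], residual + coef[j + 1] * r[j]))
    return pairs


if __name__ == "__main__":
    # Synthetic data (assumed, not from the paper): y depends on x2 quadratically,
    # so the partial residuals for x2 should show curvature under a linear base model.
    rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
    y = [2.0 + 3.0 * x1 + 0.5 * x2 ** 2 for x1, x2 in rows]
    coef = fit_linear(rows, y)
    for x2, value in sorted(partial_residuals(rows, y, coef, 1)):
        print(x2, round(value, 3))
```

In an interactive setting the returned pairs would be plotted rather than printed, and the analyst, guided by domain knowledge, would decide which transformation of the variable to try next before refitting.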