ITERATE: A Conceptual Clustering Method for Knowledge Discovery in Databases

Gautam Biswas, Jerry Weinberg, and Cen Li

Dept. of Computer Science
Box 1679, Station B
Vanderbilt University
Nashville, TN 37235.

Artificial Intelligence in the Petroleum Industry, Paris, France, pp. 111-139, 1995.


As the field of Artificial Intelligence (AI) matures, researchers are turning increasingly to real-world applications. With the widespread use of computers, it is estimated that the amount of information collected in the world doubles every 20 months. A substantial amount of this data comes from studying the operations of complex engineering systems (e.g., manufacturing lines, nuclear plants), geological operations in oil and mineral prospecting, data collected by satellites and space missions, and medical data collected from patients and laboratory experiments. It is becoming increasingly important to devise sophisticated schemes for finding interesting concepts, and relations between concepts, in this large amount of potentially useful data.

Frawley et al. cite examples of a number of forward-looking companies that are developing tools and techniques to analyze their databases for interesting and useful patterns. For example, American Airlines uses knowledge discovery techniques to periodically search its frequent flyer database to find profiles of its better customers and target them for specific promotions. General Motors uses classification methods to study its automotive troubleshooting databases and derive diagnostic expert systems for its different car models. A.C. Nielsen is working with packaged-goods manufacturers to study the effects of targeted promotions on sales at supermarkets.

Are there potential applications in the domain of hydrocarbon exploration and geology? It is common knowledge that drilling costs for new offshore prospects are in the range of $30-40 million and that the chances of a site being an economic success are 1 in 10. Advances in drilling technology and data collection methods have led oil companies and ancillary companies to collect large amounts of geophysical and geological data from exploration sites and production wells over the last 20-25 years. These data are now being organized into large company databases. The question is: can this vast amount of history from previously explored fields be systematically applied to evaluating new plays and prospects? A possible solution may lie in developing the capability to retrieve historical data in order to find analogs for current prospects. Statistical risk analysis techniques may then be employed to compute distributions of possible hydrocarbon volumes for these prospects, and this may form the basis for developing more formal and objective prospect evaluation and ranking schemes.

We use this scenario as motivation for studying discovery techniques applied to databases. Section 2 reviews important concepts that apply to knowledge discovery or database mining techniques. Section 3 briefly summarizes numeric and non-numeric clustering schemes. Section 4 discusses ITERATE, a conceptual clustering algorithm that forms an important component of a database mining system. Section 5 illustrates the effectiveness of ITERATE in concept formation, and Section 6 contains a summary and directions for future work in this area.

Partially supported by grants from Amoco Research Labs, Tulsa, OK, Arco Labs, Richardson, TX, and Ecopetrol, Colombia.
