Data Mining

Executive Summary

The Data Mining process currently suffers from several difficulties. Providing clean input can be an onerous and lengthy task, sometimes so onerous that data mining is never attempted. Another difficulty is making timely use of the results of the data mining. There are several possible modes of data mining output: one uses rules, another uses neural networks. Rules provide human-readable output, but implementation can be slow and difficult because each rule has only a narrow field of applicability. An artificial neural network can be useful when an output is a complex probabilistic function of several inputs, but it is hard to decide on the network's inputs, and the network itself is tedious to build, hard to verify, and impossible to alter or integrate with other systems.

Tupai has introduced a new Data Mining approach, based on eCognition technology, which integrates analytic and probabilistic methods seamlessly. The new approach supports the whole data-into-action process - the transformation of input, through data mining, to the implementation of output - and dramatically reduces the turnaround time from raw data input to output embedded in operational systems, allowing data mining to become an integral (and essential) part of an organization's response to change.

Tupai's eCognition technology can be used for online learning in systems responding to customers, complex data cleanup and transformation tasks, information compression and storage, and undirected knowledge discovery. The same technology is used in Tupai's Time Series Analysis, allowing close integration of what are usually two separate activities.

Introduction to Tupai's Approach to Data Mining

The application that drove the development of Tupai's technology is the mining of customer engagements on the web. Often there is only partial information, so the mining process must use probability; the data comes from a multiplicity of sources and needs to be transformed into concepts that relate to the purpose of the web site; and the overall speed of response of Internet systems is critical, requiring turnaround in hours.

The data cleanup phase that usually precedes data mining has some of the same characteristics as web mining - there is partial information, the data needs to be transformed into a different form, and the response time in detecting invalid data needs to be short. A similar cleanup task arises when new strategies are adopted, invalidating much of an organization's accumulated experience, which must then be transformed into new experience based on the new strategies.

Tupai's Data Mining Technology is implemented in the Trait Miner. It uses a knowledge modeling methodology to connect columns in the database to business concepts in a network structure, then transmits values from individual records through this structure. The structure consists of variables and operators. Some operators are analytic, like a simple plus operator that adds two variables together or an IF...THEN...ELSE that switches information paths, while others are probabilistic and store experience of how objects or concepts are related. The structure can carry complex messages, so one-to-many and many-to-many transformations can be used to unravel data that does not map directly into a single value in a data warehouse column.

The mining process itself can use probability distributions obtained from a preliminary mining stage to "massage" and validate the data coming from the database. The output of the process can be clean data loaded into a data warehouse, or active probability maps that can be immediately embedded in operational systems; alternatively, the system can search through the first layer of mined results to build more structure and mine more complex relationships in a process of knowledge discovery, guided by the organization's business knowledge.
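To make the structure concrete, here is a minimal Python sketch of such a network - variables carried in a record, analytic operators (a plus and a switch) and a probabilistic operator that accumulates a frequency table. All class and variable names are illustrative, not Tupai's actual API.

# A minimal sketch of the kind of structure described above: variables
# connected by analytic operators (plus, if/then/else) and a probabilistic
# operator that accumulates experience as a frequency table.
from collections import defaultdict

class Plus:
    """Analytic operator: c = a + b."""
    def apply(self, record):
        record["c"] = record["a"] + record["b"]

class Switch:
    """Analytic operator: route a value down one of two paths."""
    def apply(self, record):
        record["path"] = "high" if record["c"] > 10 else "low"

class Relation:
    """Probabilistic operator: store how often variable values co-occur."""
    def __init__(self, keys):
        self.keys = keys
        self.counts = defaultdict(int)
    def apply(self, record):
        self.counts[tuple(record[k] for k in self.keys)] += 1

# Push individual records through the structure, as the Trait Miner does.
structure = [Plus(), Switch(), Relation(["path"])]
for a, b in [(3, 4), (8, 9), (1, 1)]:
    record = {"a": a, "b": b}
    for op in structure:
        op.apply(record)

print(structure[2].counts)   # e.g. {('low',): 2, ('high',): 1}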

In the output of the data mining process, Tupai's Data Mining Technology maintains the flexibility and undirected nature of the relational database.

The Data Mining Process

The typical data mining process can be broken down into several phases:

Data Cleanup and Transformation
Mining the Data
Implementation of the results

Different applications of data mining have time scales for the cleanup-mine-implement process ranging from a few minutes to a few years. Through the rest of this paper, we will assume the Trait Miner is required to handle the whole spectrum of response times. The same network objects that store the results of the Trait Miner can also be used for short-term learning, with an update time scale of seconds.

Data Cleanup

Data mining relies on clean, valid data, so providing it becomes the first hurdle in setting up a data mining process. The usual problems are:

missing data
different granularity of data among systems
mismatch of data definitions among different legacy systems
invalid data

The effort involved in cleanup of the data can be so daunting that data mining is put off altogether. A better approach is to use Tupai's mining technology to reduce the magnitude of the problem. A structure is created through which the raw data is pushed, unraveling and transforming it in the process.

The result is mined to find significant relations. The contents of these relations can be examined, and invalid clusters of data removed. The combination of the transformation structure and the relations can be used to clean the data - suggesting consistent subsets for fields in bad records to an analyst, or learning from responses of an analyst on a sample set before continuing by itself, or using the relations to clean the records completely automatically.

If there are several nulls in the same record and each is filled independently, combinations of values that never occur in practice can emerge as new, false categories. Using probability distributions connected through analytic structure, the choice of a data value for one null constrains the range of possibilities for the other nulls in the same record, ensuring the data is consistent and no false categories are created.
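A minimal sketch of the idea, in Python: rather than filling each null independently from its own marginal distribution, the missing fields are drawn together from the joint distribution of records matching the known fields. The joint table below is a hypothetical stand-in for the probability maps a preliminary mining pass would produce.

# Consistent null filling: draw the whole tuple of missing values from the
# joint distribution of records that match the known fields, so the filled
# record never contains a combination the data never exhibited.
import random

# (region, product, channel) -> observed frequency, from a preliminary pass
joint = {
    ("west", "laptop", "web"):   40,
    ("west", "laptop", "store"): 10,
    ("east", "phone",  "web"):   30,
    ("east", "phone",  "store"): 20,
}

def fill_nulls(record):
    """record is a (region, product, channel) tuple with None for nulls."""
    candidates = [(k, w) for k, w in joint.items()
                  if all(r is None or r == v for r, v in zip(record, k))]
    keys, weights = zip(*candidates)
    return random.choices(keys, weights=weights)[0]

# Two nulls in one record: the draw is consistent, so we never produce a
# false category like ("west", "phone", ...) that the data never contained.
print(fill_nulls(("west", None, None)))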

When mining from several sources, there can be different granularity or different meanings among the data definitions. One database may hold information by state, another by geographic area - say the "West Coast". When mining the second database, the West Coast reference can be turned into a list of states. The list is propagated through the knowledge model, is pruned or filtered by any other factors in the model, and then one value is selected from the list at random, based on a distribution, relevant to the pruned subset of states, taken from the first database. If there are multiple instances of granularity or definition mismatch, the same consistency mechanism as for nulls ensures that the values selected are mutually consistent. This step doesn't require a layer of rules using some other technology; it just means that some parts of the knowledge handling structure are handling multiple values and interacting with each other.
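A sketch of that granularity step, with illustrative region membership and state distribution (nothing here reflects Tupai's actual structures):

# Expand a coarse "West Coast" reference into states, prune by other factors
# in the model, then resolve to one state drawn from the finer-grained
# database's distribution restricted to that subset.
import random

regions = {"west coast": ["CA", "OR", "WA"]}
state_dist = {"CA": 0.55, "OR": 0.15, "WA": 0.20, "NY": 0.10}  # from database 1

def resolve(region, allowed=None):
    states = regions[region]
    if allowed is not None:                        # pruning by other factors
        states = [s for s in states if s in allowed]
    weights = [state_dist[s] for s in states]
    return random.choices(states, weights=weights)[0]

print(resolve("west coast"))                       # one of CA/OR/WA
print(resolve("west coast", allowed={"CA", "WA"}))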

Databases have ways of detecting invalid data, but these tend to be edit checks on a single column, or at best functions of a few columns, without probability estimates. The Trait Miner can combine the probability of the data in every column of the record with the calculated probability of more complex concepts, constructed from the total database contents and other sources, to assess the validity of each record.
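One plausible reading of this in code: sum the log-probability of every column value under the mined distributions, so a record whose values are individually legal but jointly improbable still scores low. The distributions and threshold below are invented for illustration, and a full version would also fold in the probabilities of derived business concepts.

# Whole-record validation: score each record by the sum of per-column
# log-probabilities (i.e. a product of probabilities) and flag low scorers.
import math

column_dists = {
    "age_band": {"18-25": 0.2, "26-40": 0.5, "41-65": 0.3},
    "income":   {"low": 0.3, "mid": 0.5, "high": 0.2},
}

def record_log_prob(record):
    return sum(math.log(column_dists[col].get(val, 1e-9))
               for col, val in record.items())

good = {"age_band": "26-40", "income": "mid"}
odd  = {"age_band": "18-25", "income": "high"}
for r in (good, odd):
    score = record_log_prob(r)
    print(r, round(score, 2), "suspect" if score < -3.0 else "ok")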

See a more detailed Data Cleaning example.

Mining the Data

The Trait Miner first adaptively constructs distributions for each variable in its model, then uses these distributions as keys into multidimensional relations connecting variables. The variables in the model may refer directly to columns in tables, or refer to business concepts constructed out of references to column variables and other information sources. The relation operators use probability maps to store the correlation among variables that is extracted from the database. Some of the incoming data may itself be ranges or probability distributions - this information is easily spread into the relations.
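A simplified sketch of those two stages - build a distribution (here, crude equal-frequency bins) for each variable, then use the bin indices as keys into a 2-dimensional relation filled from the data. The binning strategy and data are illustrative only.

# Stage 1: construct a distribution (equal-frequency bins) per variable.
# Stage 2: use the bin indices as keys into a 2-D relation (count map).
import bisect
from collections import Counter

ages   = [22, 25, 31, 38, 44, 52, 60, 67]
spends = [10, 15, 40, 55, 60, 30, 20, 15]

def bin_edges(values, n_bins):
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

age_edges   = bin_edges(ages, 2)     # one cut point -> two bins
spend_edges = bin_edges(spends, 2)

relation = Counter()                 # 2-D probability map, held as counts
for a, s in zip(ages, spends):
    key = (bisect.bisect(age_edges, a), bisect.bisect(spend_edges, s))
    relation[key] += 1

print(relation)   # counts in each (age_bin, spend_bin) cell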

Once a layer of relations has been built, there is sufficient information in the relations to permit searching for more complex relations without going back to the database. As a simple example, there may be a relation between A, B and C of

A + B = C

If the relations

A - B
A - C
B - C

have been filled during mining, the system can find the relation A + B = C by combining the information in the three 2-dimensional relations to produce the 3-dimensional relation, and so on to higher-dimensional relations. The effect is that a layer is built, and then new layers are built on it, with a fast in-memory search for potential relations with better correlations. This search can be directed, by defining a concept variable like Profitability, or it can be undirected. The search will not re-identify relations that already exist in the model as business knowledge; instead, these existing relations guide the search.
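The following sketch illustrates the flavor of such a search. The pairwise maps are filled in one pass over the data; after that, a candidate like A + B = C can be screened entirely in memory, by checking that the c it implies for each occupied (a, b) cell is consistent with the A-C and B-C maps. The screening rule is an illustrative stand-in, not Tupai's actual criterion.

# Layered search: fill pairwise relations once, then screen candidate
# higher-dimensional relations in memory, without revisiting the database.
from collections import Counter

data = [(a, b, a + b) for a in range(4) for b in range(4)]  # A + B = C holds

pab, pac, pbc = Counter(), Counter(), Counter()
for a, b, c in data:                  # the only pass over the "database"
    pab[a, b] += 1
    pac[a, c] += 1
    pbc[b, c] += 1

def support(f):
    """Fraction of (a, b) mass whose implied c = f(a, b) is consistent
    with both the A-C and B-C pairwise maps."""
    ok = sum(w for (a, b), w in pab.items()
             if pac[a, f(a, b)] > 0 and pbc[b, f(a, b)] > 0)
    return ok / sum(pab.values())

print(support(lambda a, b: a + b))    # 1.0  -> promote to a 3-D relation
print(support(lambda a, b: a * b))    # lower -> weaker candidate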

The data to be mined is rarely monolithic. It may need to be mined into different relations in response to an automated Campaign Manager or other control element - one which controlled the generation of the data and must now stratify and segregate it according to external events or its own internally generated logic.

For many businesses, Time Series Analysis becomes an important part of the data mining process. Tupai's Time Series Analysis uses the same techniques as for data mining, where information about the whole is used to guide analysis of the particular, and the result of analysis can be directly embedded in operational systems - see Time Series Analysis.

Implementation of Output

A Little More About Other Technologies

Output As Rules

A typical rule output might be

IF A < 5 THEN B = 6 Probability 0.85 Confidence 0.30

IF B < 7 THEN C = 3 Probability 0.70 Confidence 0.60

A rule has a narrow field of applicability, forced on it by the fact that the condition must cut at a single value and the consequent of the rule must resolve to a single value. These requirements mean that specification of customer or other diverse behavior through rules is coarse and clumsy. Rules will rarely have multidimensional conditions, because the field of applicability would become far too small. It can be very difficult to tell what the combination of several rules, such as the two shown here, means.

In a real application, where natural variations in the data smear along every dimension, someone (usually the programmer) has to attempt to translate the data mining output into rules with sufficient business applicability to be useful. Some questions arise:

what is an appropriate probability cutoff?
should there be multiple rules describing parts of the same probability distribution?
the mined rules represent history - do they need to be changed for new business?

The determination of a useful rule set can be a time consuming task, requiring months of work. If the conditions that gave rise to the data change as the work is in progress, obsolescence before implementation is guaranteed. When the number of rules climbs to a few hundred, maintaining consistency among them becomes very difficult.

The rules are being inferred from probability distributions among variables. Tupai uses a different technology that directly stores the distributions in active multi-dimensional relations. This ensures consistency and precision by taking into account the changing ranges on the variables. The multiplicity of potentially inconsistent rules and the delays in implementation are eliminated.
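The difference is easy to see in a sketch: a rule freezes one cutoff and one consequent, while a stored relation answers any such query, consistently, as ranges change. The joint counts below are invented so that the first rule from the earlier example falls out as just one query among many.

# A stored relation answers arbitrary threshold queries; each mined rule
# is one frozen query against it.
from collections import Counter

# Joint counts of (A, B) as a mined relation would hold them.
joint = Counter({(3, 6): 40, (4, 6): 28, (4, 7): 12, (6, 8): 15, (7, 9): 5})

def p_b_given(a_pred, b_val):
    """P(B = b_val | A satisfies a_pred), read straight off the relation."""
    pop = {k: w for k, w in joint.items() if a_pred(k[0])}
    return sum(w for (a, b), w in pop.items() if b == b_val) / sum(pop.values())

# The rule "IF A < 5 THEN B = 6 Probability 0.85" is just one query:
print(round(p_b_given(lambda a: a < 5, 6), 2))   # 0.85
# ...and any other cutoff is equally available, with no extra rules:
print(round(p_b_given(lambda a: a < 7, 6), 2))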

Output As Neural Networks

Sometimes a multi-dimensional approach is unavoidable - a customer preference may be some complex function of age, buying power, gender and lifestyle. The Artificial Neural Network, which uses a set of weightings from each input variable, can "learn" the complex function through multiple levels of weighting and tuning of the individual weightings. The method lacks identifiability - no internal point of the network can be compared with a supposed internal point in the real process. Without identifiability, strategic modification of the behavior of the neural network is not possible. Turnaround is also slow, as it may take thousands of training cycles on a carefully selected training set, and experimentation with different techniques, for the network to learn a complex function. Worse, the analyst has to determine what the appropriate inputs to the neural network should be, as too many inputs will "overdetermine" the network and degrade its response.

Tupai's mining technology determines the appropriate inputs and builds an active multi-dimensional relation that can be integrated with other types of knowledge and used immediately in operational systems.
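One plausible way to read "determines the appropriate inputs" in code is to rank candidate inputs by their statistical dependence on the target - here, mutual information over a toy data set. This is an assumption about mechanism, offered only as illustration.

# Rank candidate inputs by mutual information with the target variable,
# keeping informative inputs and discarding unrelated ones.
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

target = [0, 0, 1, 1, 0, 1, 1, 0]
candidates = {
    "age_band":  [0, 0, 1, 1, 0, 1, 1, 0],   # tracks the target
    "shoe_size": [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated
}
for name, xs in candidates.items():
    print(name, round(mutual_information(xs, target), 3))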

Tupai's Technology

The Trait Miner provides several ways of making use of the output of the data mining:

Simulation

An immediate use is simulation of the unknown processes that generated the data in the database. All of the distributions and relations become active and interact with any business logic also present in the model. The analyst can study the behavior of the system by reducing ranges on variables, varying probabilities (the operators holding the distributions have a control on their probability - varying it causes the output distribution and range of the variable to change), and observing the interactions. The simulation may have been assembled from many tables and databases, the only connection being through business concepts. This is also a good way of studying how an operational system using the same relations will behave.
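A sketch of that interaction: take a mined relation, reduce the range on one variable as an analyst would in the tool, and watch the induced distribution on another variable shift. The relation contents are invented.

# Conditioning a mined relation: restricting the range of one variable
# changes the induced distribution on another.
from collections import Counter

relation = Counter({            # (age_band, spend_band) -> count
    ("young", "low"): 50, ("young", "high"): 10,
    ("old",   "low"): 20, ("old",  "high"): 40,
})

def spend_dist(age_range):
    pop = [(k[1], w) for k, w in relation.items() if k[0] in age_range]
    total = sum(w for _, w in pop)
    dist = Counter()
    for spend, w in pop:
        dist[spend] += w / total
    return dict(dist)

print(spend_dist({"young", "old"}))   # full population
print(spend_dist({"old"}))            # range reduced: distribution shifts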

Operational System

The distributions and relations filled in the mining process can be immediately embedded in an operational system. For web mining, using the relations directly is ideal, as the whole process of mining and implementation can be automated. The turnaround from mining to operational use collapses to a few minutes. As information becomes available about variables in a particular application or engagement and their ranges decrease, the populations shrink, leading to better prediction of future behavior.

If legacy applications are unable to make use of the relations directly, the relations can be replaced by piecewise linear approximations or by automatically generated analytic structure, using plus, multiply and all the other analytic operators.
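A sketch of one such export, under the assumption that "piecewise linear approximation" means interpolating the conditional mean of the output between knot points - something a legacy system can evaluate with plus and multiply alone. The relation is illustrative.

# Collapse a relation to a conditional mean per x value, then serve it as
# piecewise linear interpolation between those knot points.
# (x, y) -> count, as a mined relation might hold it
relation = {(1, 2): 5, (1, 4): 5, (3, 8): 10, (5, 9): 6, (5, 11): 6}

# Conditional mean of y at each x value: the knot points of the approximation.
knots = sorted({x for x, _ in relation})
means = [sum(y * w for (x, y), w in relation.items() if x == k) /
         sum(w for (x, _), w in relation.items() if x == k) for k in knots]

def approx(x):
    """Piecewise linear interpolation through the knot points."""
    if x <= knots[0]:
        return means[0]
    if x >= knots[-1]:
        return means[-1]
    for (x0, y0), (x1, y1) in zip(zip(knots, means), zip(knots[1:], means[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(approx(2), approx(4))   # values a legacy system can compute with + and *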

Knowledge Discovery

The distributions and relations hold the data in the database in greatly compressed form, making an ideal platform on which to automatically find and build more complex relations.

For many areas in organizations, there is no choice to be made - all three ways of benefiting from the data mining output will be used, the particular emphasis changing with changing business needs.

Tools

The Trait Miner provides a set of graphical and other tools for control of the mining process, for entry of business knowledge, and for visualization, modification and reporting of the mined information. Modification of the mined output may sound strange, but where the output of the miner is to be directly embedded in operational systems, some easy and effective method of override and control needs to be exercised by strategic planning and marketing departments. The probability maps or the analytic structure derived from them describe past behavior, and will need modification if they are to guide and alter the future behavior of customers.

Conclusion

Tupai's Data Mining based on eCognition technology dramatically changes the speed of response of data mining, and allows it to be part of a rapid and intelligent response by organizations to their changing environment. The mined output is especially useful for commercial systems that attempt to predict customer psychology or business systems that need to tolerate change in an electronic marketplace.
