DISTRIB - control range of variable

The format of the DISTRIB function is:

         DISTRIB( Control@, MinerState, Probability)          

The function is usually built automatically by the Data Miner. It gives a variable a range made up of the values held in a list that is constructed internally during the Learning phase of Web or Data Mining.

The parameters control the operation of the DISTRIB function:

Control@ - if TRUE, the function will respond to the state of the MinerState connection, either storing information or outputting it. If not TRUE, the operator does nothing.

MinerState - the MinerState can be Quiescent, Learning or Running, or an intermediate state. If Learning, the information arriving at the value pin is built into a probability distribution for later output in the Running state. That is, in the Running state the variable takes on the range found in the Learning state.

Probability - if no Probability is specified externally, the initial value of Probability will be set to 0..100 (indicating a probability of 100%), and the full range will be output at the value pin. Reducing the range of the Probability will reduce the range of alternatives at the value pin, based on frequency of occurrence.
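
As a rough illustration of how Control@ and MinerState interact, the following Python sketch models the behaviour described above. The names DistribSketch, step and the MinerState enumeration are hypothetical and are not part of the product. Pruning by Probability is left out here, and intermediate states are treated like Quiescent, which is an assumption.

```python
from collections import Counter
from enum import Enum


class MinerState(Enum):
    # Hypothetical names for the states mentioned in the text.
    QUIESCENT = "quiescent"
    LEARNING = "learning"
    RUNNING = "running"


class DistribSketch:
    """Toy model of the Control@ / MinerState behaviour (not the real operator)."""

    def __init__(self):
        # value -> number of occurrences seen while Learning
        self.counts = Counter()

    def step(self, control, state, value=None):
        # If Control@ is not TRUE, the operator does nothing.
        if control is not True:
            return None
        if state is MinerState.LEARNING:
            # Store the incoming value for later output.
            if value is not None:
                self.counts[value] += 1
            return None
        if state is MinerState.RUNNING:
            # Output every value seen while Learning, most frequent first.
            return [v for v, _ in self.counts.most_common()]
        # Quiescent (assumed: nothing to store, nothing to output).
        return None


op = DistribSketch()
for v in ["ABC", "DEF", "ABC"]:
    op.step(True, MinerState.LEARNING, v)
print(op.step(True, MinerState.RUNNING))
```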

The DISTRIB function handles logical, string and numeric distributions, and creates a distribution with no holes (nil occurrences) in it.

Logicals
The distribution contains only TRUE and FALSE, but can output a Bayesian value indicating the ratio of these occurrences. For a Bayesian value, the logical state is UKE with a value ranging from 0 (FALSE) to 1.0 (TRUE).
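
The text does not give a formula for the Bayesian value. One plausible reading, assumed here, is the fraction of learned occurrences that were TRUE. The function name bayesian_value is hypothetical.

```python
def bayesian_value(observations):
    """Return the fraction of TRUE occurrences: 0.0 if all FALSE, 1.0 if all TRUE.

    Assumption: the "ratio of these occurrences" is TRUE count / total count.
    """
    observations = list(observations)
    if not observations:
        raise ValueError("no observations were learned")
    true_count = sum(1 for x in observations if x)
    return true_count / len(observations)


# Three TRUE occurrences out of four learned values.
print(bayesian_value([True, True, True, False]))
```
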
Strings
For strings, the actual strings are stored, together with a count of occurrence. When in Running state, a list of alternative string values is output. The alternatives depend on the probability: a lower probability threshold prunes the less frequent alternatives. As an example, the following strings are read from the database, together with their frequency of occurrence:
ABC	25
DEF	12
GHJ	3
XYZ	1

At completion of the learning phase, the strings are ordered in decreasing frequency. A request for all possible alternatives (Probability of 0..100) would result in

    ABC, DEF, GHJ, XYZ

whereas a request for a probability of 0..90 would result in

    ABC, DEF

If the number of distinct strings would exceed 100, a catchall string of '*' is used.
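
The exact pruning rule is not stated. The Python sketch below uses one rule that reproduces both results in the example above: the strings are ranked by decreasing frequency, each string occupies a slice of the 0..100 scale proportional to its share of the total count, and a string is output when its slice overlaps the requested Probability range. The function name alternatives and this overlap rule are assumptions, and the '*' catchall is not modelled.

```python
def alternatives(counts, probability=(0, 100)):
    """Return the strings whose cumulative-frequency slice overlaps the range.

    counts: dict mapping string -> frequency of occurrence.
    probability: (low, high) on a 0..100 scale. Assumed overlap rule, see above.
    """
    low, high = probability
    total = sum(counts.values())
    # Order by decreasing frequency; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    result = []
    seen = 0
    for value, count in ranked:
        start, end = seen, seen + count
        # Integer comparison of start/total*100 < high and end/total*100 > low.
        if start * 100 < high * total and end * 100 > low * total:
            result.append(value)
        seen = end
    return result


learned = {"ABC": 25, "DEF": 12, "GHJ": 3, "XYZ": 1}
print(alternatives(learned))           # full range 0..100
print(alternatives(learned, (0, 90)))  # reduced range 0..90
```

With the counts above, ABC and DEF together account for 37 of the 41 occurrences, so GHJ starts just past the 90 mark and is pruned along with XYZ.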

Numbers
For integers and reals, the frequency of occurrence for individual numbers is stored. If the number of distinct numbers would exceed 100, numbers with a low frequency of occurrence are clumped into ranges. Combining ranges for low-frequency numbers with single values for high-frequency numbers keeps the overall number of objects around 100 while maintaining precision.
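
The clumping algorithm itself is not described. As a minimal sketch of the idea, assuming a simple greedy strategy, the Python code below repeatedly merges the two neighbouring objects with the smallest combined count until the object limit is met. The function name clump, the (low, high, count) representation and the greedy rule are all hypothetical.

```python
def clump(counts, max_objects=100):
    """Merge low-frequency neighbouring numbers into ranges.

    counts: dict mapping number -> frequency of occurrence.
    Returns a list of (low, high, count); a single value has low == high.
    """
    objects = [(v, v, c) for v, c in sorted(counts.items())]
    while len(objects) > max_objects and len(objects) > 1:
        # Pick the adjacent pair with the smallest combined count.
        i = min(range(len(objects) - 1),
                key=lambda k: objects[k][2] + objects[k + 1][2])
        low, _, c1 = objects[i]
        _, high, c2 = objects[i + 1]
        # Replace the pair with one range covering both.
        objects[i:i + 2] = [(low, high, c1 + c2)]
    return objects


# Small illustration with a limit of 4 objects instead of 100.
sample = {1: 50, 2: 1, 3: 1, 4: 40, 10: 2}
print(clump(sample, max_objects=4))
```

In this small case the rare values 2 and 3 are merged into one range, while the frequent values 1 and 4 stay as single values. The total count is preserved by every merge.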


The contents of the distribution operator can be inspected and edited using the PieceLine interface.
