DISTRIB Function

DISTRIB - control range of variable

The format of the DISTRIB function is:

DISTRIB(Control@, MinerState, Probability)

The function supplies a variable with a range of values. The range is drawn from a list that the function builds internally during the Learning phase.

The parameters control the operation of the DISTRIB function:

Control@ - if TRUE, the function responds to the state of the Miner, either storing information or outputting it. A single variable can have many DISTRIB functions, but only one of them can have its Control@ pin TRUE.

MinerState - the state of the Miner, which can be Quiescent, Learning or Running, or an intermediate state. In the Learning state, the values arriving at the value pin are built into a list for later output during the Running state.

Probability - if no Probability is specified externally, Probability is initially set to 0..100 (indicating a probability of 100%), and the full range is output at the value pin. Narrowing the Probability range reduces the range of alternatives at the value pin, based on frequency of occurrence.

The DISTRIB function handles both string and numeric distributions:

For strings, the actual strings are stored, each with a count of its occurrences. In the Running state, a list of alternative string values is output. Which alternatives appear depends on the Probability: a lower probability threshold prunes more of them. As an example, suppose the following strings are read from the database, together with their frequency of occurrence:

ABC	25
DEF	12
GHJ	3
XYZ	1

On completion of the Learning phase, the strings are ordered by decreasing frequency. A request for all possible alternatives (Probability of 0..100) would result in

ABC, DEF, GHJ, XYZ

whereas a request for a Probability of 0..90 would result in

ABC, DEF
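
The pruning in this example can be sketched in Python. The sketch is illustrative only and is not the DISTRIB implementation: the names learn and alternatives are hypothetical, only the upper bound of the Probability range is modelled, and the cutoff rule (take strings in decreasing frequency until their cumulative share of all occurrences reaches the upper bound) is an assumption chosen because it reproduces the results above. Here ABC and DEF account for 37 of the 41 occurrences, just over 90%.

```python
from collections import Counter


def learn(values):
    """Count occurrences and return (string, count) pairs, most frequent first."""
    return Counter(values).most_common()


def alternatives(ordered, upper_percent):
    """Return the leading strings whose cumulative share reaches upper_percent.

    Assumed rule (not confirmed by the documentation): strings are taken in
    decreasing frequency until the running share of all occurrences is at
    least upper_percent.
    """
    total = sum(count for _, count in ordered)
    kept, running = [], 0
    for value, count in ordered:
        # Integer comparison avoids floating-point rounding at the cutoff.
        if running * 100 >= upper_percent * total:
            break
        kept.append(value)
        running += count
    return kept


# The frequencies from the example above.
observed = ["ABC"] * 25 + ["DEF"] * 12 + ["GHJ"] * 3 + ["XYZ"] * 1
ordered = learn(observed)

print(alternatives(ordered, 100))  # ['ABC', 'DEF', 'GHJ', 'XYZ']
print(alternatives(ordered, 90))   # ['ABC', 'DEF']
```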

If the number of distinct strings would exceed 96, either a catchall string of '*' is used or the strings are merged using a HIERARCHY operator.
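
The catchall alternative might look like the following Python sketch. It is a guess at the behaviour rather than a description of it: the function name cap_with_catchall is hypothetical, and the sketch assumes that '*' occupies one of the 96 places, that it absorbs the least frequent strings, and that it is listed last regardless of its combined count. Merging through a HIERARCHY operator is not modelled.

```python
from collections import Counter


def cap_with_catchall(counts, limit=96, catchall="*"):
    """Return at most `limit` (string, count) pairs, most frequent first.

    Assumption: when there are too many distinct strings, the least
    frequent ones are folded into a single catchall entry placed last.
    """
    ordered = Counter(counts).most_common()
    if len(ordered) <= limit:
        return ordered
    kept = ordered[:limit - 1]                      # leave one place for '*'
    folded = sum(count for _, count in ordered[limit - 1:])
    return kept + [(catchall, folded)]


# 100 distinct strings with counts 200, 199, ..., 101.
sample = {f"S{i:03d}": 200 - i for i in range(100)}
capped = cap_with_catchall(sample)
print(len(capped))   # 96
print(capped[-1])    # ('*', 515)
```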

For integers, the frequency of occurrence of each individual integer is stored. If the number of different integers would exceed 100, integers with a low frequency of occurrence are clumped into ranges. Combining ranges for low-frequency values with single values for high-frequency values keeps the overall number of objects around 100 while maintaining precision.
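
One way to picture the clumping is the Python sketch below. The documentation does not say how ranges are chosen, so the merge rule here is an assumption: while there are too many objects, the two neighbouring objects with the smallest combined count are merged into one range. The name clump and the (low, high, count) representation are hypothetical.

```python
from collections import Counter


def clump(values, max_buckets=100):
    """Return sorted (low, high, count) buckets, at most max_buckets of them.

    A single value is a bucket with low == high. Assumed rule: repeatedly
    merge the adjacent pair of buckets with the smallest combined count.
    """
    if max_buckets < 1:
        raise ValueError("max_buckets must be at least 1")
    counts = Counter(values)
    buckets = [[value, value, count] for value, count in sorted(counts.items())]
    while len(buckets) > max_buckets:
        # Index of the adjacent pair with the lowest combined frequency.
        i = min(range(len(buckets) - 1),
                key=lambda k: buckets[k][2] + buckets[k + 1][2])
        low, _, count_a = buckets[i]
        _, high, count_b = buckets[i + 1]
        buckets[i:i + 2] = [[low, high, count_a + count_b]]
    return [tuple(bucket) for bucket in buckets]


# A small limit of 4 makes the effect visible with little data.
data = [1] * 50 + [2] * 40 + [10, 11, 12, 13]
print(clump(data, max_buckets=4))
# [(1, 1, 50), (2, 2, 40), (10, 11, 2), (12, 13, 2)]
```

The frequent values 1 and 2 remain single values, while the rare values 10 to 13 are clumped into two ranges. With the default limit of 100, the same data would be left unclumped because it contains only six different integers.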

The list of values is normally destroyed during the StartLearn phase, built from scratch during the Learn phase and sorted during the FinishLearn phase. If the list has been re-ordered, it becomes locked, and only the counts are initialised during the StartLearn phase. The list can be sorted and the lock removed using the facilities of Edit Stochastic Operators.
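
The life cycle of the list can be sketched as a small Python class. The class ValueList and its method names are hypothetical and only mirror the phase names used above. Two details are assumptions not stated in the documentation: a locked list is not re-sorted during FinishLearn, and a value first seen while the list is locked is appended at the end.

```python
class ValueList:
    """Illustrative model of the list kept by a DISTRIB function."""

    def __init__(self):
        self.counts = {}     # value -> count; dict order is the list order
        self.locked = False  # set once the list has been re-ordered

    def start_learn(self):
        if self.locked:
            # Locked: keep the entries and their order, reset only the counts.
            for value in self.counts:
                self.counts[value] = 0
        else:
            # Normal case: destroy the list so it is built from scratch.
            self.counts = {}

    def learn(self, value):
        # Assumption: unseen values are appended even when the list is locked.
        self.counts[value] = self.counts.get(value, 0) + 1

    def finish_learn(self):
        # Assumption: a locked list keeps its order and is not sorted.
        if not self.locked:
            self.counts = dict(sorted(self.counts.items(),
                                      key=lambda item: item[1],
                                      reverse=True))

    def reorder(self, new_order):
        """Re-order the existing entries, which locks the list."""
        if set(new_order) != set(self.counts):
            raise ValueError("new_order must contain exactly the stored values")
        self.counts = {value: self.counts[value] for value in new_order}
        self.locked = True


values = ValueList()
values.start_learn()
for item in ["DEF", "ABC", "ABC"]:
    values.learn(item)
values.finish_learn()
print(list(values.counts))   # ['ABC', 'DEF']

values.reorder(["DEF", "ABC"])
values.start_learn()         # counts reset, order kept
values.learn("ABC")
values.finish_learn()
print(values.counts)         # {'DEF': 0, 'ABC': 1}
```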

Related Operators

EXTRACT

HISTOGRAM

RELATION