Optimize by Generation (YAGGA2) (RapidMiner Studio Core)

Synopsis

This operator can select attributes from the original attribute set and can also generate new attributes from them. YAGGA2 (Yet Another Generating Genetic Algorithm 2) keeps the original number of attributes unless adding attributes, removing attributes, or both proves to have a better fitness. This algorithm is an improved version of YAGGA.

Description

Sometimes the selection of features alone is not sufficient, and other transformations of the feature space must be performed. Generating new attributes from the given attributes extends the feature space, and a hypothesis may be easier to find in the extended feature space. This operator is a blend of attribute selection and attribute generation: it may select some attributes from the original set and it may also generate new attributes from the original attributes. The (generating) mutation does one of the following things, each with its own probability:

Probability p/4: Add a newly generated attribute to the feature vector.
Probability p/4: Add a randomly chosen original attribute to the feature vector.
Probability p/2: Remove a randomly chosen attribute from the feature vector.

The two adding cases together have probability p/2, and the removing case also has probability p/2, so the feature vector can both grow and shrink. On average it keeps its original length, unless longer or shorter individuals prove to have a better fitness.
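The mutation step can be illustrated with a short Python sketch. This is not the operator's implementation: the function name generating_mutation, the make_new callback, and the assumption that one mutation is drawn per individual are all hypothetical and only serve to show how the three probabilities balance out.

```python
import random

def generating_mutation(features, originals, p, rng, make_new):
    """Apply one generating-mutation step to a feature vector (list of names).

    Hypothetical sketch: names and the per-individual draw are assumptions.
    """
    result = list(features)
    r = rng.random()
    if r < p / 4:
        # Probability p/4: add a newly generated attribute.
        result.append(make_new(result))
    elif r < p / 2:
        # Probability p/4: add a randomly chosen original attribute.
        # (Duplicates are not filtered in this sketch.)
        result.append(rng.choice(originals))
    elif r < p:
        # Probability p/2: remove a randomly chosen attribute, if any is left.
        if result:
            result.pop(rng.randrange(len(result)))
    # Otherwise (probability 1 - p) the vector is left unchanged.
    return result

if __name__ == "__main__":
    rng = random.Random(0)
    originals = ["a1", "a2", "a3", "a4", "a5"]
    mutated = generating_mutation(originals, originals, 0.8, rng,
                                  lambda current: "gensym_new")
    print(len(originals), "->", len(mutated))
```

Growth (+1) and shrinkage (-1) are each chosen with probability p/2, so the expected change in length per mutation is zero; only selection by fitness pushes the length in one direction.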

Compared with the original YAGGA operator, this operator allows more feature generators and provides several techniques for redundancy prevention. This leads to smaller ExampleSets containing fewer redundant features.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. For the basic algorithm of a genetic algorithm, please study the description of the Optimize Selection (Evolutionary) operator.
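As a rough orientation, the following Python sketch shows a generic genetic algorithm over attribute masks (true means the attribute is switched on). It is a textbook-style illustration, not the code of this operator. The argument names borrow from the parameters documented below, but their exact semantics here are assumptions: tournaments are fixed at two members, crossover is uniform, the mutation rate is 1/n, and the best individual is always kept.

```python
import random

def evolve(fitness, n_attributes, population_size=5,
           maximum_number_of_generations=30, generations_without_improval=2,
           p_initialize=0.5, p_crossover=0.5, seed=0):
    """Generic GA sketch: returns the best attribute mask found (list of bool)."""
    rng = random.Random(seed)
    p_mutation = 1.0 / n_attributes  # heuristic mutation probability 1/n
    population = [[rng.random() < p_initialize for _ in range(n_attributes)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    stale = 0
    for _ in range(maximum_number_of_generations):
        children = [list(best)]  # keep the best individual (elitism)
        while len(children) < population_size:
            # Selection: winner of a two-member tournament, drawn twice.
            a = max(rng.sample(population, 2), key=fitness)
            b = max(rng.sample(population, 2), key=fitness)
            # Crossover: take each bit from either parent (uniform crossover).
            if rng.random() < p_crossover:
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = list(a)
            # Mutation: flip each bit with probability 1/n.
            child = [(not bit) if rng.random() < p_mutation else bit
                     for bit in child]
            children.append(child)
        population = children
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best, stale = candidate, 0
        else:
            stale += 1
            if stale >= generations_without_improval:
                break  # early stopping
    return best

if __name__ == "__main__":
    # Toy fitness: the number of switched-on attributes.
    print(evolve(sum, n_attributes=8))
```

In the operator itself the fitness comes from the performance vector returned by the subprocess; the toy fitness above only keeps the sketch self-contained.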

This operator is a nested operator, i.e. it has a subprocess. The subprocess must return a performance vector. You need a basic understanding of subprocesses in order to apply this operator. Please study the documentation of the Subprocess operator for a basic understanding of subprocesses.

Differentiation

Optimize by Generation (YAGGA)

The YAGGA2 operator is an improved version of the original YAGGA operator. It allows more feature generators and provides several techniques for redundancy prevention. This leads to smaller ExampleSets containing fewer redundant features.

Input

example set in (Data Table)
This input port expects an ExampleSet. This ExampleSet is available at the first port of the nested chain (inside the subprocess) for processing in the subprocess.

Output

example set out (Data Table)
The genetic algorithm is applied to the input ExampleSet. The resultant ExampleSet is delivered through this port.
attribute weights out (Attribute Weights)
The attribute weights are delivered through this port.
performance out (Performance Vector)
This port delivers the Performance Vector for the selected attributes. A Performance Vector is a list of performance criteria values.

Parameters

limit_max_total_number_of_attributes: This parameter indicates if the total number of attributes in all generations should be limited. If set to true, the maximum number is specified by the max total number of attributes parameter. Range: boolean
max_total_number_of_attributes: This parameter is only available when the limit max total number of attributes parameter is set to true. This parameter specifies the maximum total number of attributes in all generations. Range: integer
use_local_random_seed: This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization. Range: boolean
local_random_seed: This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
show_stop_dialog: This parameter determines if a dialog with a stop button should be displayed which stops the search for the best feature space. If the search for the best feature space is stopped, the best individual found till then will be returned. Range: boolean
maximal_fitness: This parameter specifies the maximal fitness. The optimization will stop if the fitness reaches this value. Range: real
population_size: This parameter specifies the population size i.e. the number of individuals per generation. Range: integer
maximum_number_of_generations: This parameter specifies the number of generations after which the algorithm should be terminated. Range: integer
use_plus: This parameter indicates if the summation function should be used for generating new attributes. Range: boolean
use_diff: This parameter indicates if the difference function should be used for generating new attributes. Range: boolean
use_mult: This parameter indicates if the multiplication function should be used for generating new attributes. Range: boolean
use_div: This parameter indicates if the division function should be used for generating new attributes. Range: boolean
reciprocal_value: This parameter indicates if the reciprocal function should be used for generating new attributes. Range: boolean
use_early_stopping: This parameter enables early stopping. If it is not set to true, the maximum number of generations is always performed. Range: boolean
generations_without_improval: This parameter is only available when the use early stopping parameter is set to true. It specifies the stop criterion for early stopping: the optimization stops after n generations without improvement in the performance, where n is the value of this parameter. Range: integer
tournament_size: This parameter specifies the fraction of the current population which should be used as tournament members. Range: real
start_temperature: This parameter specifies the scaling temperature. Range: real
dynamic_selection_pressure: If this parameter is set to true, the selection pressure is increased to its maximum over the course of the complete optimization run. Range: boolean
keep_best_individual: If set to true, the best individual of each generation is guaranteed to be selected for the next generation. Range: boolean
p_initialize: The initial probability for an attribute to be switched on is specified by this parameter. Range: real
p_crossover: The probability for an individual to be selected for crossover is specified by this parameter. Range: real
crossover_type: The type of the crossover can be selected by this parameter. Range: selection
use_heuristic_mutation_probability: If this parameter is set to true, the probability for mutations will be chosen as 1/n where n is the number of attributes. Otherwise the probability for mutations should be specified through the p mutation parameter. Range: boolean
p_mutation: The probability for an attribute to be changed is specified by this parameter. If set to -1, the probability will be set to 1/n where n is the total number of attributes. Range: real
use_square_roots: This parameter indicates if the square root function should be used for generating new attributes. Range: boolean
use_power_functions: This parameter indicates if the power function (one attribute raised to the power of another attribute) should be used for generating new attributes. Range: boolean
use_sin: This parameter indicates if the sine function should be used for generating new attributes. Range: boolean
use_cos: This parameter indicates if the cosine function should be used for generating new attributes. Range: boolean
use_tan: This parameter indicates if the tangent function should be used for generating new attributes. Range: boolean
use_atan: This parameter indicates if the arc tangent function should be used for generating new attributes. Range: boolean
use_exp: This parameter indicates if the exponential function should be used for generating new attributes. Range: boolean
use_log: This parameter indicates if the logarithmic function should be used for generating new attributes. Range: boolean
use_absolute_values: This parameter indicates if the absolute value function should be used for generating new attributes. Range: boolean
use_min: This parameter indicates if the minimum function should be used for generating new attributes. Range: boolean
use_max: This parameter indicates if the maximum function should be used for generating new attributes. Range: boolean
use_sgn: This parameter indicates if the signum function should be used for generating new attributes. Range: boolean
use_floor_ceil_functions: This parameter indicates if the floor and ceiling functions should be used for generating new attributes. Range: boolean
restrictive_selection: This parameter indicates if the restrictive generator selection should be used. Execution is usually faster if this parameter is set to true. Range: boolean
remove_useless: This parameter indicates if useless attributes should be removed. Range: boolean
remove_equivalent: This parameter indicates if equivalent attributes should be removed. Range: boolean
equivalence_samples: This parameter specifies the number of samples that are checked to establish equivalence. Range: integer
equivalence_epsilon: Two attributes are considered equivalent if their difference is not bigger than epsilon. Range: real
equivalence_use_statistics: If this parameter is set to true, attribute statistics are recalculated before the equivalence check. Range: boolean
unused_functions: This parameter specifies the space separated list of functions which are not allowed in arguments for the attribute construction. Range: string
constant_generation_prob: This parameter specifies the probability of generating random constant attributes. Range: real
associative_attribute_merging: This parameter specifies if post processing should be performed after the crossover. It is only possible for runs with only one generator. Range: boolean
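The interplay of the equivalence samples and equivalence epsilon parameters can be pictured with a small Python sketch. It assumes that the samples are rows drawn at random from the ExampleSet and that two attributes count as equivalent when they differ by at most epsilon on every sampled row; the documentation above does not state how the operator draws its samples, so treat this as an illustration only. The function name is_equivalent is hypothetical.

```python
import random

def is_equivalent(f, g, rows, equivalence_samples, equivalence_epsilon, rng):
    """Compare two generated attributes on a few sampled rows (assumed scheme)."""
    sample = rng.sample(rows, min(equivalence_samples, len(rows)))
    # Equivalent if the values never differ by more than epsilon on the sample.
    return all(abs(f(row) - g(row)) <= equivalence_epsilon for row in sample)

if __name__ == "__main__":
    rows = [(1.0, 2.0), (3.0, 4.0), (0.5, -1.5), (2.0, 5.0), (-3.0, 1.0)]
    rng = random.Random(0)
    plus_ab = lambda row: row[0] + row[1]   # generated as a1 + a2
    plus_ba = lambda row: row[1] + row[0]   # generated as a2 + a1
    mult_ab = lambda row: row[0] * row[1]   # generated as a1 * a2
    print(is_equivalent(plus_ab, plus_ba, rows, 3, 0.05, rng))
    print(is_equivalent(plus_ab, mult_ab, rows, 3, 0.05, rng))
```

A check of this kind lets redundant constructions such as a1 + a2 and a2 + a1 be detected without comparing every example, at the cost of occasionally treating two attributes as equivalent when they only agree on the sampled rows.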

Tutorial Processes

Applying YAGGA2 on the Polynomial data set

The 'Polynomial' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the ExampleSet has 5 regular attributes in addition to the label attribute. The Optimize by Generation (YAGGA2) operator is applied on the ExampleSet. It is a nested operator, i.e. it has a subprocess. The subprocess must deliver a performance vector, which is used by the underlying genetic algorithm.

Have a look at the subprocess of this operator. The Split Validation operator is used there, which is itself a nested operator. Have a look at the subprocesses of the Split Validation operator. The Linear Regression operator is used in the 'Training' subprocess to train a model. The trained model is applied using the Apply Model operator in the 'Testing' subprocess. The performance is measured through the Performance (Regression) operator and the resultant performance vector is used by the underlying algorithm.

Run the process and switch to the Results Workspace. You can see that the ExampleSet that had 5 attributes now has 7 attributes. All attributes were selected from the original attribute set and the attributes 'gensym5' and 'gensym6' were generated. The number of resultant attributes is not less than the number of original attributes because YAGGA2 is not an attribute reduction operator. It may (or may not) increase or decrease the number of attributes depending on what proves to have a better fitness.
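The idea behind this tutorial, that a generated attribute can make a linear model fit much better, can be reproduced outside RapidMiner with a small Python sketch. Everything in it is an assumption made for illustration: the data is synthetic (the label is simply the product of two attributes, not the actual 'Polynomial' data), the 70/30 split stands in for Split Validation, a hand-written least-squares fit stands in for Linear Regression, and the root mean squared error stands in for the performance vector.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(xs, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def holdout_rmse(xs, ys, split_ratio=0.7):
    """Train on the first part of the data, return the RMSE on the rest."""
    cut = int(len(xs) * split_ratio)
    coef = fit_linear(xs[:cut], ys[:cut])
    errors = [coef[0] + sum(c * v for c, v in zip(coef[1:], x)) - y
              for x, y in zip(xs[cut:], ys[cut:])]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
    label = [a * b for a, b in data]                # synthetic label
    original = [[a, b] for a, b in data]            # original attributes only
    extended = [[a, b, a * b] for a, b in data]     # plus one generated attribute
    print("original:", holdout_rmse(original, label))
    print("extended:", holdout_rmse(extended, label))
```

With the generated product attribute the linear model can represent the label exactly, so the error on the held-out part drops to numerical noise. This is the kind of improvement in fitness that lets individuals with additional generated attributes survive selection.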