Optimize Weights (Evolutionary) (RapidMiner Studio Core)

Synopsis

This operator calculates the relevance of the attributes of the given ExampleSet by using an evolutionary approach. The weights of the attributes are calculated using a Genetic Algorithm.

Description

The Optimize Weights (Evolutionary) operator is a nested operator, i.e. it has a subprocess. This subprocess must always return a performance vector. For more information regarding subprocesses, please study the Subprocess operator. The operator calculates the weights of the attributes of the given ExampleSet by using a Genetic Algorithm. The higher the weight of an attribute, the more relevant it is considered.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.

In a genetic algorithm, 'mutation' means switching features on and off, and 'crossover' means interchanging used features. Selection is performed according to the scheme chosen with the selection scheme parameter. A genetic algorithm works as follows:

  1. Generate an initial population consisting of p individuals. The number p can be adjusted by the population size parameter.

  2. For all individuals in the population:

    • Perform mutation, i.e. set used attributes to unused with probability p_m and vice versa. The probability p_m can be adjusted by the corresponding parameters.

    • Choose two individuals from the population and perform crossover with probability p_c. The probability p_c can be adjusted by the p crossover parameter. The type of crossover can be selected by the crossover type parameter.

  3. Perform selection: map all individuals according to their fitness and draw p individuals at random according to their probability, where p is the population size, which can be adjusted by the population size parameter.

  4. As long as the fitness improves, go to step 2.
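
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the operator's implementation: the function name evolve, its arguments, the default mutation probability of 1/n, the use of uniform crossover and the fitness-proportional draw are all assumptions made for the example. Individuals are on/off masks over the attributes, as in the description of mutation above, and the fitness function is assumed to return a non-negative number.

```python
import random


def evolve(fitness, n_attributes, population_size=5, p_mutation=None,
           p_crossover=0.5, max_generations=30, seed=0):
    """Evolve on/off attribute masks; fitness must return a non-negative number."""
    rng = random.Random(seed)
    if p_mutation is None:
        p_mutation = 1.0 / n_attributes  # assumed default, not taken from the operator

    # Step 1: initial population of p random individuals.
    population = [[rng.random() < 0.5 for _ in range(n_attributes)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    for _ in range(max_generations):
        # Step 2a: mutation switches each attribute on or off with probability p_m.
        population = [[(not bit) if rng.random() < p_mutation else bit
                       for bit in individual] for individual in population]

        # Step 2b: with probability p_c, cross an individual with a random partner
        # (uniform crossover: each position is swapped with probability 0.5).
        for a in range(population_size):
            if population_size > 1 and rng.random() < p_crossover:
                b = rng.choice([i for i in range(population_size) if i != a])
                for i in range(n_attributes):
                    if rng.random() < 0.5:
                        population[a][i], population[b][i] = (
                            population[b][i], population[a][i])

        # Step 3: draw p individuals with probability proportional to fitness.
        # The small constant keeps the draw valid when every fitness is zero.
        scores = [fitness(individual) + 1e-9 for individual in population]
        population = [list(individual) for individual in
                      rng.choices(population, weights=scores, k=population_size)]

        # Step 4: continue only while the best fitness improves.
        generation_best = max(population, key=fitness)
        if fitness(generation_best) <= fitness(best):
            break
        best = list(generation_best)

    return best


# Hypothetical fitness: count the positions that agree with a made-up target mask.
target = [True, False, True, False, False]
print(evolve(lambda mask: sum(x == t for x, t in zip(mask, target)), n_attributes=5))
```

In the operator itself, the fitness of an individual is the performance vector returned by the subprocess, and the loop is additionally governed by the termination parameters described in the parameters section.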

If the ExampleSet contains value series attributes with block numbers, the whole block will be switched on and off. Exact, minimum or maximum number of attributes in combinations to be tested can be specified by the appropriate parameters. Many other options are also available for this operator. Please study the parameters section for more information.

Input

  • example set in (IOObject)

    This input port expects an ExampleSet. This ExampleSet is available at the first port of the nested chain (inside the subprocess) for processing in the subprocess.

  • attribute weights in (Average Vector)

    This port expects attribute weights. It is not compulsory to use this port.

  • through (IOObject)

    This operator can have multiple through ports. When one input is connected with the through port, another through port becomes available which is ready to accept another input (if any). The order of inputs remains the same. The Object supplied at the first through port of this operator is available at the first through port of the nested chain (inside the subprocess). Do not forget to connect all inputs in correct order. Make sure that you have connected the right number of ports at the subprocess level.

Output

  • example set out (IOObject)

    The genetic algorithm is applied to the input ExampleSet. The resultant ExampleSet with reduced attributes is delivered through this port.

  • weights (Average Vector)

    The attribute weights are delivered through this port.

  • performance (Performance Vector)

    This port delivers the Performance Vector for the selected attributes. A Performance Vector is a list of performance criteria values.

Parameters

  • population_size

    This parameter specifies the population size, i.e. the number of individuals per generation. Range: integer

  • maximum_number_of_generations

    This parameter specifies the number of generations after which the algorithm should be terminated. Range: integer

  • use_early_stopping

    This parameter enables early stopping. If not set to true, the maximum number of generations is always performed. Range: boolean

  • generations_without_improval

    This parameter is only available when the use early stopping parameter is set to true. It specifies the stop criterion for early stopping: the optimization stops after n generations without improvement in the performance, where n is the value of this parameter. Range: integer

  • normalize_weights

    This parameter indicates if the final weights should be normalized. If set to true, the final weights are normalized such that the maximum weight is 1 and the minimum weight is 0. Range: boolean

  • use_local_random_seed

    This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization. Range: boolean

  • local_random_seed

    This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer

  • show_stop_dialog

    This parameter determines if a dialog with a stop button should be displayed which stops the search for the best feature space. If the search for the best feature space is stopped, the best individual found so far will be returned. Range: boolean

  • user_result_individual_selection

    If this parameter is set to true, it allows the user to select the final result individual from the last population. Range: boolean

  • show_population_plotter

    This parameter determines if the current population should be displayed in the performance space. Range: boolean

  • population_criteria_data_file

    This parameter specifies the path to the file in which the criteria data of the final population should be saved. Range: filename

  • maximal_fitness

    This parameter specifies the maximal fitness. The optimization will stop if the fitness reaches this value. Range: real

  • selection_scheme

    This parameter specifies the selection scheme of this evolutionary algorithm. Range: selection

  • tournament_size

    This parameter is only available when the selection scheme parameter is set to 'tournament'. It specifies the fraction of the current population which should be used as tournament members. Range: real

  • start_temperature

    This parameter is only available when the selection scheme parameter is set to 'Boltzmann'. It specifies the scaling temperature. Range: real

  • dynamic_selection_pressure

    This parameter is only available when the selection scheme parameter is set to 'Boltzmann' or 'tournament'. If set to true, the selection pressure is increased to maximum during the complete optimization run. Range: boolean

  • keep_best_individual

    If set to true, the best individual of each generation is guaranteed to be selected for the next generation. Range: boolean

  • save_intermediate_weights

    This parameter determines if the intermediate best results should be saved. Range: boolean

  • intermediate_weights_generations

    This parameter is only available when the save intermediate weights parameter is set to true. The intermediate best results are saved every k generations, where k is specified by this parameter. Range: integer

  • intermediate_weights_file

    This parameter specifies the file into which the intermediate weights should be saved. Range: filename

  • mutation_variance

    This parameter specifies the (initial) variance for each mutation. Range: real

  • 1_5_rule

    This parameter determines if the 1/5 rule for variance adaptation should be used. Range: boolean

  • bounded_mutation

    If this parameter is set to true, the weights are bounded between 0 and 1. Range: boolean

  • p_crossover

    The probability for an individual to be selected for crossover is specified by this parameter. Range: real

  • crossover_type

    The type of the crossover can be selected by this parameter. Range: selection

  • use_default_mutation_rate

    This parameter determines if the default mutation rate should be used for nominal attributes. Range: boolean

  • nominal_mutation_rate

    This parameter specifies the probability to switch nominal attributes between 0 and 1. Range: real

  • initialize_with_input_weights

    This parameter indicates if this operator should look for attribute weights in the given input and use them as a starting point for the optimization. Range: boolean
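
Two of the behaviours described in the parameters above can be illustrated with a short Python sketch: the normalization performed when normalize weights is true, and the way the termination parameters might interact. The function names, the argument best_history (best fitness of each generation so far) and the handling of the case where all weights are equal are assumptions made for this example; they are not taken from the operator.

```python
def normalize_weights(weights):
    """Rescale weights so that the minimum becomes 0 and the maximum becomes 1."""
    lowest, highest = min(weights.values()), max(weights.values())
    if highest == lowest:
        # Assumption for the degenerate case: every attribute gets weight 1.
        return {name: 1.0 for name in weights}
    return {name: (w - lowest) / (highest - lowest) for name, w in weights.items()}


def should_stop(generation, best_history, maximum_number_of_generations,
                use_early_stopping=False, generations_without_improval=2,
                maximal_fitness=float("inf")):
    """Return True if any of the described termination conditions holds."""
    # maximum_number_of_generations: hard limit on the number of generations.
    if generation >= maximum_number_of_generations:
        return True
    # maximal_fitness: stop as soon as the fitness reaches this value.
    if best_history and best_history[-1] >= maximal_fitness:
        return True
    # use_early_stopping: stop after n generations without improvement.
    n = generations_without_improval
    if use_early_stopping and len(best_history) > n:
        return max(best_history[-n:]) <= max(best_history[:-n])
    return False


# Hypothetical weights for three attributes.
print(normalize_weights({"a1": 2.0, "a2": 4.0, "a3": 6.0}))
# Three generations at 0.7 with n = 2: the last two brought no improvement.
print(should_stop(3, [0.5, 0.7, 0.7, 0.7], 30, use_early_stopping=True))
```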

Tutorial Processes

Calculating the weights of the attributes of the Polynomial data set

The 'Polynomial' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the ExampleSet has 5 regular attributes other than the label attribute.

The Optimize Weights (Evolutionary) operator is applied on the ExampleSet. It is a nested operator, i.e. it has a subprocess. It is necessary for the subprocess to deliver a performance vector, which is used by the underlying Genetic Algorithm. Have a look at the subprocess of this operator. The Split Validation operator has been used there, which is itself a nested operator. Have a look at the subprocesses of the Split Validation operator. The SVM operator is used in the 'Training' subprocess to train a model. The trained model is applied using the Apply Model operator in the 'Testing' subprocess. The performance is measured through the Performance operator and the resultant performance vector is used by the underlying algorithm.

Run the process and switch to the Results Workspace. You can see that the ExampleSet that had 5 attributes has now been reduced to 2 attributes. Also take a look at the weights of the attributes in the Results Workspace. You can see that two attributes have weight 1 and the remaining attributes have weight 0.
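
The relationship between the weights and the reduced ExampleSet in this tutorial can be pictured with a small Python sketch. It assumes that attributes with weight 0 are the ones removed, which is consistent with the result described above but is not stated explicitly. The function name, the attribute names a1 to a5, the choice of which two attributes keep weight 1 and the example values are all hypothetical.

```python
def select_attributes(rows, weights):
    """Keep columns with a positive weight; columns without a weight (e.g. the label) are kept."""
    return [{name: value for name, value in row.items()
             if name not in weights or weights[name] > 0}
            for row in rows]


# Hypothetical example with one row, five regular attributes and a label.
rows = [{"a1": 1.5, "a2": 0.3, "a3": 2.0, "a4": 4.1, "a5": 0.9, "label": 7.2}]
weights = {"a1": 1.0, "a2": 0.0, "a3": 1.0, "a4": 0.0, "a5": 0.0}
print(select_attributes(rows, weights))
```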