Logistic Regression (Evolutionary) (RapidMiner Studio Core)

Synopsis

This operator is a kernel logistic regression learner for binary classification tasks.

Description

Logistic regression is a type of regression analysis used for predicting the outcome of a categorical criterion variable (a variable that can take on a limited number of categories) based on one or more predictor variables. The probabilities describing the possible outcomes of a single trial are modeled as a function of the explanatory variables using a logistic function. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually continuous, by converting the dependent variable to probability scores.
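
As a minimal sketch of the logistic function mentioned above, the following Python snippet maps a real-valued score to a probability. The names logistic and predict_proba, and the plain linear score, are illustrative assumptions and are not taken from the operator itself, which works with kernels.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Probability of the positive class for one example x.

    Assumes a plain linear score (hypothetical, for illustration only).
    """
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(score)
```

A score of 0 corresponds to a probability of 0.5, and scores of equal magnitude but opposite sign give probabilities that sum to 1.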

This operator supports various kernel types including dot, radial, polynomial, sigmoid, anova, epachnenikov, gaussian combination and multiquadric. An explanation of these kernel types is given in the parameters section.

Input

training set (Data Table)
This input port expects an ExampleSet. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. You may therefore often need to apply the Nominal to Numerical operator before this operator.
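
To illustrate what converting nominal attributes to numeric ones can look like, here is a small Python sketch of dummy coding, one common conversion strategy. The function dummy_code and the generated column names are hypothetical; how the Nominal to Numerical operator actually behaves depends on its own settings.

```python
def dummy_code(rows, nominal_columns):
    """Replace each nominal column with one 0/1 column per observed value.

    rows is a list of dicts mapping attribute names to values.
    """
    # Collect the distinct values of every nominal column, in sorted order.
    values = {c: sorted({row[c] for row in rows}) for c in nominal_columns}
    encoded = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in values:
                # One indicator column per distinct nominal value.
                for v in values[name]:
                    new_row[f"{name} = {v}"] = 1.0 if value == v else 0.0
            else:
                new_row[name] = float(value)
        encoded.append(new_row)
    return encoded
```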

Output

model (Kernel Model)
The Logistic Regression model is delivered from this output port. This model can now be applied to unseen data sets.
example set (Data Table)
The ExampleSet that was given as input is passed to the output unchanged through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

kernel_type: The type of the kernel function is selected through this parameter. The following kernel types are supported: dot, radial, polynomial, sigmoid, anova, epachnenikov, gaussian combination and multiquadric.
- dot: The dot kernel is defined by k(x,y) = x*y, i.e. it is the inner product of x and y.
- radial: The radial kernel is defined by k(x,y) = exp(-g ||x-y||^2), where g is gamma, which is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
- polynomial: The polynomial kernel is defined by k(x,y) = (x*y+1)^d, where d is the degree of the polynomial, which is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
- sigmoid: The sigmoid kernel is defined by a two-layered neural net, k(x,y) = tanh(a x*y + b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- anova: The anova kernel is defined by the summation of exp(-g (x-y)) raised to the power d, where g is gamma and d is degree. Gamma and degree are adjusted by the kernel gamma and kernel degree parameters respectively.
- epachnenikov: The epachnenikov (Epanechnikov) kernel is the function (3/4)(1-u^2) for u between -1 and 1, and zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
- gaussian_combination: This is the gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
- multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has the adjustable parameters kernel sigma1 and kernel shift.
Range: selection
kernel_gamma: This is the kernel parameter gamma. This is only available when the kernel type parameter is set to radial or anova. Range: real
kernel_sigma1: This is the kernel parameter sigma1. This is only available when the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
kernel_sigma2: This is the kernel parameter sigma2. This is only available when the kernel type parameter is set to gaussian combination. Range: real
kernel_sigma3: This is the kernel parameter sigma3. This is only available when the kernel type parameter is set to gaussian combination. Range: real
kernel_shift: This is the kernel parameter shift. This is only available when the kernel type parameter is set to multiquadric. Range: real
kernel_degree: This is the kernel parameter degree. This is only available when the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
kernel_a: This is the kernel parameter a. This is only available when the kernel type parameter is set to sigmoid. Range: real
kernel_b: This is the kernel parameter b. This is only available when the kernel type parameter is set to sigmoid. Range: real
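
The kernel formulas given for the kernel type parameter can be written out directly. The Python sketch below covers only the kernels whose formulas are stated explicitly; the function names and default parameter values are illustrative assumptions, not the operator's implementation or its defaults. The Epanechnikov function is shown only as the one-dimensional profile in u, since the text does not state how u is computed from x, y, sigma1 and degree.

```python
import math

def dot(x, y):
    """Dot kernel: the inner product of x and y."""
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    """Squared Euclidean distance ||x-y||^2 (helper)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def radial(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x-y||^2)."""
    return math.exp(-gamma * sq_dist(x, y))

def polynomial(x, y, degree=2):
    """Polynomial kernel: (x*y + 1)^degree."""
    return (dot(x, y) + 1) ** degree

def sigmoid(x, y, a=1.0, b=0.0):
    """Sigmoid kernel: tanh(a * x*y + b)."""
    return math.tanh(a * dot(x, y) + b)

def multiquadric(x, y, c=1.0):
    """Multiquadric kernel: sqrt(||x-y||^2 + c^2)."""
    return math.sqrt(sq_dist(x, y) + c ** 2)

def epanechnikov_profile(u):
    """(3/4)(1 - u^2) for u in [-1, 1], zero outside that range."""
    return 0.75 * (1 - u * u) if -1 <= u <= 1 else 0.0
```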
C: This is the complexity constant, which sets the tolerance for misclassification: higher C values penalize misclassification more strongly and create 'harder' boundaries, while lower values allow for 'softer' boundaries. A complexity constant that is too large can lead to over-fitting, while values that are too small may result in over-generalization. Range: real
start_population_type: This parameter specifies the type of start population initialization. Range: selection
max_generations: This parameter specifies the number of generations after which the algorithm should be terminated. Range: integer
generations_without_improval: This parameter specifies the criterion for early stopping, i.e. the algorithm stops after n generations without improvement in performance, where n is the value of this parameter. Range: integer
population_size: This parameter specifies the population size, i.e. the number of individuals per generation. If set to -1, all examples are selected. Range: integer
tournament_fraction: This parameter specifies the fraction of the current population which should be used as tournament members. Range: real
keep_best: This parameter specifies if the best individual should survive. Retaining the best individuals of a generation unchanged in the next generation is called elitism or elitist selection. Range: boolean
mutation_type: This parameter specifies the type of the mutation operator. Range: selection
selection_type: This parameter specifies the selection scheme of this evolutionary algorithm. Range: selection
crossover_prob: The probability for an individual to be selected for crossover is specified by this parameter. Range: real
use_local_random_seed: This parameter indicates if a local random seed should be used for randomization. Using the same value of the local random seed will produce the same randomization. Range: boolean
local_random_seed: This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
show_convergence_plot: This parameter indicates if a dialog with a convergence plot should be drawn. Range: boolean
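
To show how the evolutionary parameters above might interact, here is a rough Python sketch of a generic evolutionary loop. It is not the operator's actual algorithm: the function evolve, the uniform crossover, the Gaussian mutation with step size 0.1, the start population drawn from [-1, 1] and all default values are assumptions chosen for illustration.

```python
import random

def evolve(fitness, dimension, population_size=10, max_generations=100,
           generations_without_improval=10, tournament_fraction=0.25,
           keep_best=True, crossover_prob=0.9, local_random_seed=1992):
    """Maximize fitness over real-valued vectors (illustrative sketch)."""
    rng = random.Random(local_random_seed)  # local seed makes runs repeatable
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dimension)]
                  for _ in range(population_size)]
    tournament_size = max(1, round(tournament_fraction * population_size))

    def select():
        # Tournament selection: best of a random subset of the population.
        return max(rng.sample(population, tournament_size), key=fitness)

    best = max(population, key=fitness)
    stale = 0  # generations since the last improvement
    for _ in range(max_generations):
        # Elitism: carry the best individual over unchanged if requested.
        children = [list(best)] if keep_best else []
        while len(children) < population_size:
            child = list(select())
            if rng.random() < crossover_prob:
                # Uniform crossover with a second tournament winner.
                other = select()
                child = [a if rng.random() < 0.5 else b
                         for a, b in zip(child, other)]
            # Gaussian mutation; the step size is an arbitrary choice.
            children.append([g + rng.gauss(0.0, 0.1) for g in child])
        population = children
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0
        else:
            stale += 1
            if stale >= generations_without_improval:
                break  # early stopping
    return best
```

Calling evolve twice with the same local_random_seed returns the same individual, which mirrors the purpose of the use local random seed parameter.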

Tutorial Processes

Introduction to the Logistic Regression (Evolutionary) operator

The 'Sonar' data set is loaded using the Retrieve operator. The Split Validation operator is applied to it for training and testing a model. The Logistic Regression (Evolutionary) operator is applied in the training subprocess of the Split Validation operator, with all parameters left at their default values, and generates a logistic regression model. The Apply Model operator is used in the testing subprocess to apply this model to the testing data set. The resulting labeled ExampleSet is used by the Performance operator to measure the performance of the model. The model and its performance vector are connected to the output and can be seen in the Results Workspace.
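
The flow of this process (split, train, apply, measure) can be sketched outside RapidMiner as follows. This is a hedged illustration: split_validation, the split ratio of 0.7, the use of accuracy as the performance measure and the stand-in majority_learner are all assumptions, and the learner is a placeholder rather than the Logistic Regression (Evolutionary) operator.

```python
import random

def split_validation(examples, labels, train, split_ratio=0.7, seed=0):
    """Train on one part of the data and measure accuracy on the rest."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(split_ratio * len(idx))
    train_idx, test_idx = idx[:cut], idx[cut:]
    # Training subprocess: build a model from the training part.
    model = train([examples[i] for i in train_idx],
                  [labels[i] for i in train_idx])
    # Testing subprocess: apply the model, then compare with true labels.
    predictions = [model(examples[i]) for i in test_idx]
    correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return model, correct / len(test_idx)

def majority_learner(xs, ys):
    """Placeholder learner: always predicts the most frequent label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority
```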