Support Vector Machine (RapidMiner Studio Core)

Synopsis

This operator is an SVM (Support Vector Machine) Learner. It is based on the internal Java implementation of the mySVM by Stefan Rueping.

Description

This learner uses the Java implementation of the support vector machine mySVM by Stefan Rueping. This learning method can be used for both regression and classification and provides a fast algorithm and good results for many learning tasks. mySVM works with linear or quadratic and even asymmetric loss functions.

This operator supports various kernel types including dot, radial, polynomial, neural, anova, epachnenikov, gaussian combination and multiquadric. An explanation of these kernel types is given in the Parameters section.

Here is a basic description of the SVM. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
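The "which side of the gap" rule can be sketched in a few lines of Python. This is only an illustration of the linear case: the weights and bias below are hypothetical values chosen by hand, not the output of this operator.

```python
def predict_side(weights, bias, example):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, example)) + bias
    return 1 if score >= 0 else -1

# Hypothetical separating hyperplane x1 + x2 - 3 = 0 (not learned from data).
weights = [1.0, 1.0]
bias = -3.0

label_a = predict_side(weights, bias, [2.5, 2.0])  # score is  1.5
label_b = predict_side(weights, bias, [0.5, 1.0])  # score is -1.5
```

The two examples lie on opposite sides of the hyperplane, so they receive opposite labels.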

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products can be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function K(x,y) selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose inner product with a vector in that space is constant.
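The point about computing dot products through a kernel can be checked numerically. The sketch below assumes two-dimensional inputs and the degree-2 polynomial kernel (x*y+1)^2; the function names are made up for this illustration. It compares the kernel value with the dot product in the explicit six-dimensional feature space that this kernel corresponds to.

```python
import math

def polynomial_kernel(x, y, degree=2):
    """Kernel value computed directly from the original attributes."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree

def explicit_features(x):
    """Explicit feature map matching (x*y+1)^2 for two-dimensional x."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x = [1.0, 2.0]
y = [3.0, -1.0]

kernel_value = polynomial_kernel(x, y)
explicit_value = sum(a * b for a, b in zip(explicit_features(x),
                                           explicit_features(y)))
```

Both values agree up to floating-point rounding, but the kernel never builds the six-dimensional vectors. That saving is what keeps high-dimensional mappings affordable.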

Input

training set (Data Table)
This input port expects an ExampleSet. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. You will therefore often need to apply the Nominal to Numerical operator before this operator.

Output

model (Kernel Model)
The SVM model is delivered from this output port. This model can now be applied on unseen data sets.
example set (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
estimated performance (Performance Vector)
This port delivers a performance vector of the SVM model which gives an estimation of statistical performance of this model.
weights (Attribute Weights)
This port delivers the attribute weights. This is possible only when the dot kernel type is used; it is not possible with other kernel types.

Parameters

kernel_type: The type of the kernel function is selected through this parameter. The following kernel types are supported: dot, radial, polynomial, neural, anova, epachnenikov, gaussian combination, multiquadric
- dot: The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
- radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is gamma, which is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
- polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial, which is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
- neural: The neural kernel is defined by a two-layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- anova: The anova kernel is defined by the summation of exp(-g (x-y)) raised to the power d, where g is gamma and d is degree. gamma and degree are adjusted by the kernel gamma and kernel degree parameters respectively.
- epachnenikov: The epachnenikov kernel is defined by the function (3/4)(1-u^2) for u between -1 and 1, and is zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
- gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
- multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel shift.
Range: selection
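Several of the kernel formulas listed above can be written out directly. The following is a minimal sketch of those formulas in plain Python, not the operator's internal code. The function names are invented for this example, and how the constant c of the multiquadric kernel relates to the kernel sigma1 and kernel shift parameters is not stated here, so c is passed in directly.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dot_kernel(x, y):
    return dot(x, y)                                    # x*y

def radial_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * squared_distance(x, y))    # exp(-g ||x-y||^2)

def polynomial_kernel(x, y, degree=2):
    return (dot(x, y) + 1.0) ** degree                  # (x*y+1)^d

def neural_kernel(x, y, a=1.0, b=0.0):
    return math.tanh(a * dot(x, y) + b)                 # tanh(a x*y+b)

def multiquadric_kernel(x, y, c=1.0):
    return math.sqrt(squared_distance(x, y) + c ** 2)   # sqrt(||x-y||^2 + c^2)

def epanechnikov_profile(u):
    # (3/4)(1-u^2) on [-1, 1], zero elsewhere
    return 0.75 * (1.0 - u * u) if -1.0 <= u <= 1.0 else 0.0
```

For example, the radial kernel of any point with itself is 1, and it decays towards 0 as the two points move apart; a larger gamma makes that decay faster.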
kernel_gamma: This is the SVM kernel parameter gamma. This is available only when the kernel type parameter is set to radial or anova. Range: real
kernel_sigma1: This is the SVM kernel parameter sigma1. This is available only when the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
kernel_sigma2: This is the SVM kernel parameter sigma2. This is available only when the kernel type parameter is set to gaussian combination. Range: real
kernel_sigma3: This is the SVM kernel parameter sigma3. This is available only when the kernel type parameter is set to gaussian combination. Range: real
kernel_shift: This is the SVM kernel parameter shift. This is available only when the kernel type parameter is set to multiquadric. Range: real
kernel_degree: This is the SVM kernel parameter degree. This is available only when the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
kernel_a: This is the SVM kernel parameter a. This is available only when the kernel type parameter is set to neural. Range: real
kernel_b: This is the SVM kernel parameter b. This is available only when the kernel type parameter is set to neural. Range: real
kernel_cache: This is an expert parameter. It specifies the size of the cache for kernel evaluations in megabytes. Range: real
C: This is the SVM complexity constant, which sets the tolerance for misclassification: higher C values penalize misclassification more strongly and create 'harder' boundaries, while lower values allow for 'softer' boundaries. A complexity constant that is too large can lead to over-fitting, while values that are too small may result in over-generalization. Range: real
convergence_epsilon: This is an optimizer parameter. It specifies the precision on the KKT conditions. Range: real
max_iterations: This is an optimizer parameter. The optimizer stops after this number of iterations. Range: integer
scale: This is a global parameter. If checked, the example values are scaled and the scaling parameters are stored for use with a test set. Range: boolean
L_pos: A factor for the SVM complexity constant for positive examples. This parameter is part of the loss function. Range: real
L_neg: A factor for the SVM complexity constant for negative examples. This parameter is part of the loss function. Range: real
epsilon: This parameter specifies the insensitivity constant. There is no loss if the prediction lies this close to the true value. This parameter is part of the loss function. Range: real
epsilon_plus: This parameter is part of the loss function. It specifies epsilon for positive deviation only. Range: real
epsilon_minus: This parameter is part of the loss function. It specifies epsilon for negative deviation only. Range: real
balance_cost: If checked, adapts Cpos and Cneg (the complexity constants for positive and negative examples) to the relative size of the classes. Range: boolean
quadratic_loss_pos: Use quadratic loss for positive deviation. This parameter is part of the loss function. Range: boolean
quadratic_loss_neg: Use quadratic loss for negative deviation. This parameter is part of the loss function. Range: boolean
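To make the loss-function parameters more concrete, here is one plausible reading of an asymmetric epsilon-insensitive loss. It is a sketch under stated assumptions, not the operator's implementation: the function name is hypothetical, "deviation" is assumed to mean prediction minus true value, and the L_pos and L_neg factors are left out because they apply to positive and negative examples rather than to deviations.

```python
def epsilon_insensitive_loss(deviation, epsilon_plus=0.0, epsilon_minus=0.0,
                             quadratic_pos=False, quadratic_neg=False):
    """Loss for a single prediction; deviation = prediction - true value (assumed)."""
    if deviation >= 0:
        # Positive deviation: free of charge up to epsilon_plus.
        excess = max(0.0, deviation - epsilon_plus)
        return excess ** 2 if quadratic_pos else excess
    # Negative deviation: free of charge up to epsilon_minus.
    excess = max(0.0, -deviation - epsilon_minus)
    return excess ** 2 if quadratic_neg else excess

inside_tube = epsilon_insensitive_loss(0.05, epsilon_plus=0.1)
linear_over = epsilon_insensitive_loss(0.5, epsilon_plus=0.1)
quadratic_under = epsilon_insensitive_loss(-0.5, epsilon_minus=0.1,
                                           quadratic_neg=True)
```

With these settings a prediction inside the insensitive zone costs nothing, an over-prediction is charged linearly, and an under-prediction is charged quadratically. That combination is the kind of asymmetry the separate plus and minus parameters allow.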

Tutorial Processes

Getting started with SVM

This is a simple Example Process which gets you started with the SVM operator. The Retrieve operator is used to load the 'Golf' data set. The Nominal to Numerical operator is applied to it to convert its nominal attributes to numerical form. This step is necessary because the SVM operator cannot take nominal attributes; it can only classify using numerical attributes. The model generated by the SVM operator is then applied to the 'Golf-Testset' data set. The Nominal to Numerical operator was applied to this data set as well, because the testing and training data sets should be in the same format. The statistical performance of the model is measured using the Performance operator. This is a very basic process. It is recommended that you develop a deeper understanding of SVM to get better results from this operator. The support vector machine (SVM) is a popular classification technique. However, beginners who are not familiar with SVM often get unsatisfactory results because they miss some easy but significant steps.

Using 'm' numbers to represent an m-category attribute is recommended. Only one of the 'm' numbers is 1, the others are 0. For example, a three-category attribute such as Outlook {overcast, sunny, rain} can be represented as (0,0,1), (0,1,0), and (1,0,0). This can be achieved by setting the coding type parameter to 'dummy coding' in the Nominal to Numerical operator. Generally, if the number of values in an attribute is not too large, this coding might be more stable than using a single number.
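The coding described above can be sketched in Python as follows. This only illustrates the idea; it is not how the Nominal to Numerical operator is implemented, and the order of the categories (and therefore which position carries the 1) is an arbitrary choice made here to match the example vectors.

```python
def dummy_code(value, categories):
    """Represent one nominal value as m numbers, exactly one of which is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if value == category else 0 for category in categories)

# Category order chosen so that overcast -> (0, 0, 1), as in the text.
outlook_categories = ["rain", "sunny", "overcast"]

overcast_code = dummy_code("overcast", outlook_categories)
sunny_code = dummy_code("sunny", outlook_categories)
rain_code = dummy_code("rain", outlook_categories)
```

Each value maps to a vector with a single 1, so no artificial ordering between categories is introduced, unlike coding them as 1, 2, 3 in a single attribute.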

To get a more accurate classification model from SVM, scaling is recommended. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. Scaling should be performed on both training and testing data sets. In this process the scale parameter is checked. Uncheck the scale parameter and run the process again. You will see that this time the process takes much longer than it did with scaling.
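A minimal sketch of scaling training and test data consistently is shown below. It assumes z-score scaling (subtract the mean, divide by the standard deviation); which scaling method the scale parameter applies internally is not stated here, and the attribute values are made-up numbers. The key point is that the scaling parameters are computed on the training data only and then reused for the test data.

```python
import statistics

def fit_scaler(rows):
    """Compute per-attribute mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    # Fall back to 1.0 for constant attributes to avoid division by zero.
    stdevs = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stdevs

def apply_scaler(rows, scaler):
    """Scale rows with previously stored parameters."""
    means, stdevs = scaler
    return [[(value - mean) / stdev
             for value, mean, stdev in zip(row, means, stdevs)]
            for row in rows]

# Two attributes with very different numeric ranges (illustrative values).
train = [[85.0, 0.1], [70.0, 0.3], [65.0, 0.2]]
test = [[75.0, 0.2]]

scaler = fit_scaler(train)                 # stored scaling parameters
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)   # same parameters, not refitted
```

After scaling, both attributes have mean 0 and unit spread on the training data, so neither dominates the kernel value merely because of its units.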

You should have a good understanding of the kernel types and the parameters associated with each kernel type in order to get better results from this operator. The gaussian combination kernel was used in this Example Process, with default values for all parameters. The accuracy of this model was just 35.71%. Try changing different parameters to get better results. If you change the parameter C to 1 instead of 0, you will see that the accuracy of the model rises to 64.29%. Small changes in parameters can therefore have a significant effect on overall results, so it is important to understand the parameters of the kernel type in use. It is equally important to understand the different kernel types and to choose the most suitable one for your ExampleSet. Try using the polynomial kernel in this Example Process (also set the parameter C to 0); you will see that the accuracy is around 71.43% with default values for all parameters. Now change the value of the parameter C to 1 instead of 0. Doing this increased the accuracy of the model with the gaussian combination kernel, but here you will see that the accuracy of the model drops.

We used default values for most of the parameters. To get more accurate results these values should be carefully selected. Usually techniques like cross-validation are used to find the best values of these parameters for the ExampleSet under consideration.
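The idea of choosing a parameter value by cross-validation can be sketched as follows. All names here are hypothetical, and `toy_score` is only a stand-in for "train an SVM with this C on the training folds and measure accuracy on the held-out fold"; it does not train anything. The number of examples and folds are arbitrary.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(n_examples) if i not in held_out_set]
        yield train, held_out

def select_by_cross_validation(candidates, n_examples, k, train_and_score):
    """Return the candidate value with the best mean score across folds."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        scores = [train_and_score(value, train_idx, test_idx)
                  for train_idx, test_idx in k_fold_indices(n_examples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score

def toy_score(c_value, train_idx, test_idx):
    # Stand-in scoring function: pretends that C = 1.0 is ideal.
    return -abs(c_value - 1.0)

best_c, best_score = select_by_cross_validation(
    [0.0, 0.1, 1.0, 10.0], n_examples=14, k=7, train_and_score=toy_score)
```

In a real process the scoring function would build and evaluate a model for each fold, so every candidate value is judged on examples it was not trained on.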