Support Vector Machine (LibSVM) (RapidMiner Studio Core)

Synopsis

This operator is a Support Vector Machine (SVM) learner. It is based on the Java implementation of libSVM.

Description

This operator applies the libsvm learner (http://www.csie.ntu.edu.tw/~cjlin/libsvm) by Chih-Chung Chang and Chih-Jen Lin. SVM is a powerful method for both classification and regression. This operator supports the C-SVC and nu-SVC SVM types for classification tasks as well as the epsilon-SVR and nu-SVR SVM types for regression tasks. Additionally, the one-class SVM type is supported for distribution estimation. The one-class SVM type makes it possible to learn from examples of a single class and later test whether new examples match the known ones. In contrast to other SVM learners, libsvm supports internal multiclass learning as well as probability estimation based on Platt scaling, which yields proper confidence values when the learned model is applied to a classification data set.
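As an illustration of the one-class setting described above, here is a minimal sketch using scikit-learn's OneClassSVM, which wraps the same libsvm implementation; the data and parameter values are invented for the example and do not correspond to this operator's defaults.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Train on examples of a single "known" class only
train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
model = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.5).fit(train)

# predict returns +1 for examples matching the known class, -1 for outliers
inlier = model.predict([[0.1, -0.2]])   # a point near the training cloud
outlier = model.predict([[8.0, 8.0]])   # a point far away from it
print(inlier[0], outlier[0])
```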

Here is a basic description of the SVM. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products can be computed easily in terms of the variables in the original space, by defining them via a kernel function K(x,y) selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose inner product with a vector in that space is constant.
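The kernel idea above can be sketched in a few lines: the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) evaluates an inner product in a high-dimensional feature space without ever computing the mapping explicitly. This is plain NumPy for illustration, not part of the operator itself.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel value K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-gamma * np.dot(diff, diff))

# Identical points have kernel value exactly 1; distant points approach 0
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))    # 1.0
print(rbf_kernel([0.0, 0.0], [10.0, 10.0]))  # ~0
```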

For more information regarding libsvm you can visit http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Input

  • training set (Data Table)

    This input port expects an ExampleSet. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. Thus you may often have to use the Nominal to Numerical operator before applying this operator.

Output

  • model (Model)

    The SVM model is delivered from this output port. This model can now be applied to unseen data sets.

  • example set (Data Table)

    The ExampleSet that was given as input is passed through without changes to the output of this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • svm_type: The SVM type is selected through this parameter. This operator supports the C-SVC and nu-SVC SVM types for classification tasks. The epsilon-SVR and nu-SVR SVM types are for regression tasks. The one-class SVM type is for distribution estimation; it makes it possible to learn from examples of a single class and later test whether new examples match the known ones. Range: selection
  • kernel_type: The type of the kernel function is selected through this parameter. The following kernel types are supported: linear, poly, rbf, sigmoid, precomputed. The rbf kernel type is the default value. In general, the rbf kernel is a reasonable first choice. Here are a few guidelines regarding the different kernel types.
    • the rbf kernel nonlinearly maps samples into a higher dimensional space
    • the rbf kernel, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear
    • the linear kernel is a special case of the rbf kernel
    • the sigmoid kernel behaves like the rbf kernel for certain parameters
    • the number of hyperparameters influences the complexity of model selection. The poly kernel has more hyperparameters than the rbf kernel
    • the rbf kernel has fewer numerical difficulties
    • the sigmoid kernel is not valid under some parameters
    • There are some situations where the rbf kernel is not suitable. In particular, when the number of features is very large, one may just use the linear kernel.
    Range: selection
  • degree: This parameter is only available when the kernel type parameter is set to 'poly'. It specifies the degree of the polynomial kernel function. Range: real
  • gamma: This parameter is only available when the kernel type parameter is set to 'poly', 'rbf' or 'sigmoid'. It specifies gamma for the 'poly', 'rbf' and 'sigmoid' kernel functions. The value of gamma may play an important role in the SVM model; changing it may change the accuracy of the resulting model. It is therefore good practice to use cross-validation to find the optimal value of gamma. Range: real
  • coef0: This parameter is only available when the kernel type parameter is set to 'poly' or 'sigmoid'. It specifies coef0 for the 'poly' and 'sigmoid' kernel functions. Range: real
  • C: This parameter is only available when the svm type parameter is set to 'c-SVC', 'epsilon-SVR' or 'nu-SVR'. It specifies the cost parameter C for 'c-SVC', 'epsilon-SVR' and 'nu-SVR'. C is the penalty parameter of the error term. Range: real
  • nu: This parameter is only available when the svm type parameter is set to 'nu-SVC', 'one-class' or 'nu-SVR'. It specifies the nu parameter for 'nu-SVC', 'one-class' and 'nu-SVR'. Its value should be between 0.0 and 0.5. Range: real
  • cache_size: This is an expert parameter. It specifies the cache size in megabytes. Range: real
  • epsilon: This parameter specifies the tolerance of the termination criterion. Range: real
  • p: This parameter is only available when the svm type parameter is set to 'epsilon-SVR'. It specifies the tolerance of the loss function of 'epsilon-SVR'. Range: real
  • class_weights: This is an expert parameter. It specifies the weights 'w' for all classes. The Edit List button opens a new window with two columns: the first column specifies the class name and the second column specifies the weight for that class. The effective parameter C for a class is calculated as the weight of that class multiplied by C. If the weight of a class is not specified, that class is assigned weight = 1. Range: list
  • shrinking: This is an expert parameter. It specifies whether to use the shrinking heuristics. Range: boolean
  • calculate_confidences: This parameter indicates whether proper confidence values should be calculated. Range: boolean
  • confidence_for_multiclass: This is an expert parameter. It indicates whether the class with the highest confidence should be selected in the multiclass setting. Otherwise a binary majority vote over all one-vs-one classifiers is used (in that case the selected class need not be the one with the highest confidence). Range: boolean
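For orientation, the sketch below maps several of these parameters onto scikit-learn's SVC, which is also built on libsvm. The parameter names shown are sklearn's, not this operator's, and the values are arbitrary placeholders rather than recommended settings.

```python
from sklearn.svm import SVC

clf = SVC(
    C=1.0,                       # cost parameter C (c-SVC)
    kernel="rbf",                # kernel_type
    gamma=0.5,                   # gamma for the poly/rbf/sigmoid kernels
    coef0=0.0,                   # coef0
    tol=1e-3,                    # epsilon: tolerance of the termination criterion
    cache_size=200,              # cache_size in megabytes
    class_weight={0: 1, 1: 2},   # class_weights list
    shrinking=True,              # shrinking heuristics
    probability=True,            # calculate_confidences (Platt scaling)
)
print(clf.get_params()["kernel"])
```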

Tutorial Processes

SVM with rbf kernel

This is a simple Example Process which gets you started with the SVM (LibSVM) operator. The Retrieve operator is used to load the 'Golf' data set. The Nominal to Numerical operator is applied to it to convert its nominal attributes to numerical form. This step is necessary because the SVM (LibSVM) operator cannot handle nominal attributes; it can only classify using numerical attributes. The model generated by the SVM (LibSVM) operator is then applied to the 'Golf-Testset' data set using the Apply Model operator. The Nominal to Numerical operator was also applied to this data set, because the testing and training data sets should be in the same format. The statistical performance of this model is measured using the Performance operator. This is a very basic process; to get better results from this operator, it is recommended that you develop a deeper understanding of the SVM (LibSVM). The support vector machine (SVM) is a popular classification technique. However, beginners who are not familiar with SVM often get unsatisfactory results because they miss some easy but significant steps.

Using 'm' numbers to represent an m-category attribute is recommended. Only one of the 'm' numbers is 1, and others are 0. For example, a three-category attribute such as Outlook {overcast, sunny, rain} can be represented as (0,0,1), (0,1,0), and (1,0,0). This can be achieved by setting the coding type parameter to 'dummy coding' in the Nominal to Numerical operator. Generally, if the number of values in an attribute is not too large, this coding might be more stable than using a single number.
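The dummy-coding scheme above can be sketched with pandas' get_dummies, which produces the same one-indicator-column-per-value representation; the Outlook values are taken from the example in the text.

```python
import pandas as pd

df = pd.DataFrame({"Outlook": ["overcast", "sunny", "rain", "sunny"]})
dummies = pd.get_dummies(df["Outlook"])  # one indicator column per category
print(list(dummies.columns))             # ['overcast', 'rain', 'sunny']
# Each row has exactly one indicator set
print(dummies.sum(axis=1).tolist())      # [1, 1, 1, 1]
```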

This basic process omits various steps that are essential for getting acceptable results from this operator. For example, to get a more accurate classification model from the SVM, scaling is recommended. The main advantage of scaling is that it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges. Another advantage is that it avoids numerical difficulties during the calculation: because kernel values usually depend on the inner products of feature vectors (e.g. the linear kernel and the polynomial kernel), large attribute values might cause numerical problems. Scaling should be performed on both the training and testing data sets.
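The scaling recommendation can be sketched as follows, using scikit-learn's StandardScaler as one possible scaler on invented data; the key point is that the statistics are fitted on the training set and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_test = np.array([[2.0, 250.0]])

scaler = StandardScaler().fit(X_train)  # statistics from the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # reuse the same mean and std

print(X_train_s.mean(axis=0))           # ~[0, 0] after scaling
```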

We have used the default values of the parameters C, gamma and epsilon. To get more accurate results these values should be carefully selected. Usually techniques like cross-validation are used to find the best values of these parameters for the ExampleSet under consideration.
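The parameter search described above can be sketched with scikit-learn's GridSearchCV over C and gamma (in RapidMiner this corresponds to wrapping the learner in an Optimize Parameters loop); the grid values and the Iris data set are arbitrary examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
# 5-fold cross-validation over all 9 (C, gamma) combinations
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_)
```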