Naive Bayes (Kernel) (RapidMiner Studio Core)
Synopsis
This operator generates a Kernel Naive Bayes classification model using estimated kernel densities.
Description
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be the 'independent feature model'. In simple terms, a Naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple. The Naive Bayes classifier often performs reasonably well even when the independence assumption does not hold.
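Written out, the independence assumption leads to the decision rule below. The symbols are introduced here for illustration only: $y$ is a class label, $x_1, \dots, x_n$ are the attribute values of one example, and $\hat{y}$ is the predicted label.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

In the apple example, each of "red", "round", and "about 4 inches" contributes its own factor $P(x_i \mid \text{apple})$, regardless of how the three features relate to each other.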
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix. In contrast to the Naive Bayes operator, the Naive Bayes (Kernel) operator can be applied to numerical attributes.
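The point about means and variances can be made concrete with a minimal sketch. It is not RapidMiner code: the function name `fit_gaussian_parameters` and the toy data are hypothetical, and it only shows that one (mean, variance) pair per attribute and label is all that has to be estimated when independence is assumed.

```python
from collections import defaultdict
from statistics import mean, variance


def fit_gaussian_parameters(rows, labels):
    """Return {label: [(mean, variance), ...]} with one pair per attribute."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)

    params = {}
    for label, members in by_label.items():
        columns = zip(*members)  # transpose: one tuple of values per attribute
        # Sample variance per attribute; no covariances between attributes.
        params[label] = [(mean(col), variance(col)) for col in columns]
    return params


# Hypothetical toy data: two numerical attributes per example.
rows = [(85.0, 85.0), (80.0, 90.0), (83.0, 78.0),
        (70.0, 96.0), (68.0, 80.0), (65.0, 70.0)]
labels = ["no", "no", "yes", "yes", "yes", "no"]

params = fit_gaussian_parameters(rows, labels)
for label, pairs in params.items():
    print(label, pairs)
```

With two attributes and two labels, this stores four means and four variances. A full covariance matrix per label would additionally need the covariance between the two attributes.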
A kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable.
Kernel density estimators belong to a class of estimators called non-parametric density estimators. Parametric estimators have a fixed functional form (structure), and the parameters of this function are the only information that needs to be stored. In contrast, non-parametric estimators have no fixed structure and depend upon all the data points to reach an estimate.
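A minimal sketch of a kernel density estimate follows. It assumes a Gaussian kernel and a single bandwidth shared by all data points; which kernel the operator uses internally is not stated here, so treat both choices as illustrative. The names `gaussian_kernel` and `kernel_density` are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the weighting function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(x, samples, bandwidth):
    """Estimate f(x) = 1 / (n * h) * sum_i K((x - x_i) / h)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in samples)
    return total / (len(samples) * bandwidth)


# Hypothetical one-dimensional sample: a cluster near 68 and two points near 84.
samples = [64.0, 65.0, 68.0, 69.0, 70.0, 83.0, 85.0]

density_near_cluster = kernel_density(68.0, samples, bandwidth=2.0)
density_in_gap = kernel_density(76.0, samples, bandwidth=2.0)
print(density_near_cluster, density_in_gap)
```

Every call to `kernel_density` loops over all stored samples, which is what "depend upon all the data points" means in practice: the model is the data plus a bandwidth, not a handful of fitted parameters.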
Input
- training set (Data Table)
The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
Output
- model
The Kernel Naive Bayes classification model is delivered from this output port. This classification model can now be applied to unseen data sets to predict the label attribute.
- example set (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
- laplace_correction: This parameter indicates whether Laplace correction should be used to prevent zero probabilities from having a high influence. The trick is simple: assume that the training set is so large that adding one to each required count makes only a negligible difference to the estimated probabilities, yet avoids zero probability values. This technique is known as Laplace correction. Range: boolean
- estimation_mode: This parameter specifies the kernel density estimation mode. Two options are available:
  - full: If this option is selected, the bandwidth can be chosen by a heuristic or specified as a fixed value.
  - greedy: If this option is selected, you have to specify the minimum bandwidth and the number of kernels.
- bandwidth_selection: This parameter is only available when the estimation mode parameter is set to 'full'. It specifies the method used to set the kernel bandwidth: the bandwidth can be chosen by a heuristic, or a fixed bandwidth can be specified ('fix'). Please note that the bandwidth of the kernel is a free parameter with a strong influence on the resulting estimate. A bandwidth that is too small produces a spiky estimate that follows individual data points, while one that is too large smooths away the structure of the data. Range: selection
- bandwidth: This parameter is only available when the estimation mode parameter is set to 'full' and the bandwidth selection parameter is set to 'fix'. This parameter specifies the kernel bandwidth. Range: real
- minimum_bandwidth: This parameter is only available when the estimation mode parameter is set to 'greedy'. This parameter specifies the minimum kernel bandwidth. Range: real
- number_of_kernels: This parameter is only available when the estimation mode parameter is set to 'greedy'. This parameter specifies the number of kernels. Range: integer
- use_application_grid: This parameter indicates if the kernel density function grid should be used in the model application. It speeds up model application at the expense of the density function precision. Range: boolean
- application_grid_size: This parameter is only available when the use application grid parameter is set to true. This parameter specifies the size of the application grid. Range: integer
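The add-one idea behind the laplace correction parameter can be sketched for a nominal attribute as follows. How the operator applies the correction internally is not described here, so this is only an illustration of the counting trick; the function name `nominal_likelihoods` and the data are hypothetical.

```python
from collections import Counter


def nominal_likelihoods(values, categories, laplace_correction=True):
    """Estimate P(category | label) from the values observed for one label."""
    counts = Counter(values)
    add = 1 if laplace_correction else 0
    # Adding one per category keeps the probabilities summing to 1.
    total = len(values) + add * len(categories)
    return {c: (counts[c] + add) / total for c in categories}


# Hypothetical values of one nominal attribute among the examples of one label.
# 'sunny' never occurs together with this label in the training data.
observed = ["overcast", "overcast", "rain", "rain", "overcast"]
categories = ["sunny", "overcast", "rain"]

raw = nominal_likelihoods(observed, categories, laplace_correction=False)
corrected = nominal_likelihoods(observed, categories, laplace_correction=True)
print(raw["sunny"], corrected["sunny"])  # 0.0 0.125
```

Without the correction, a single zero factor forces the whole product of probabilities for that label to zero, no matter what the other attributes say. With it, the unseen value gets a small but non-zero probability.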
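The trade-off described for the application grid can be illustrated with a small sketch: the density is evaluated once at a fixed number of grid points, and later lookups interpolate between neighbouring points instead of summing over all samples. The grid layout (three bandwidths of padding) and the linear interpolation are assumptions made for this example, not a description of the operator's implementation; all function names are hypothetical.

```python
import math


def kernel_density(x, samples, bandwidth):
    """Exact Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def build_grid(samples, bandwidth, grid_size):
    """Precompute the density at grid_size evenly spaced points."""
    if grid_size < 2:
        raise ValueError("grid_size must be at least 2")
    low = min(samples) - 3.0 * bandwidth   # assumed padding around the data
    high = max(samples) + 3.0 * bandwidth
    step = (high - low) / (grid_size - 1)
    values = [kernel_density(low + i * step, samples, bandwidth)
              for i in range(grid_size)]
    return low, step, values


def grid_density(x, grid):
    """Approximate the density at x by linear interpolation on the grid."""
    low, step, values = grid
    position = (x - low) / step
    if position < 0 or position > len(values) - 1:
        return 0.0  # outside the grid the density is treated as negligible
    index = min(int(position), len(values) - 2)
    fraction = position - index
    return values[index] * (1.0 - fraction) + values[index + 1] * fraction


samples = [64.0, 65.0, 68.0, 69.0, 70.0, 83.0, 85.0]
grid = build_grid(samples, bandwidth=2.0, grid_size=200)

exact = kernel_density(68.5, samples, bandwidth=2.0)
approx = grid_density(68.5, grid)
print(exact, approx)
```

A lookup on the grid costs the same no matter how many training examples there are, whereas the exact estimate grows with the sample size. A smaller grid size makes the table cheaper to build but coarsens the approximation.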
Tutorial Processes
Introduction to the Naive Bayes (Kernel) operator
The 'Golf' data set is loaded using the Retrieve operator. The Naive Bayes (Kernel) operator is applied to it. All parameters of the Naive Bayes (Kernel) operator are used with default values. The model generated by the Naive Bayes (Kernel) operator is applied to the 'Golf-Testset' data set using the Apply Model operator. The results of the process can be seen in the Results Workspace. Please note that the parameters of this operator should be chosen carefully to obtain good performance. In particular, the bandwidth should be selected carefully.
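The train-then-apply flow of this process, and the sensitivity to the bandwidth, can be mimicked outside RapidMiner with the sketch below. It is not the operator's implementation: it assumes a Gaussian kernel with one fixed bandwidth for every attribute, and the functions `fit` and `predict` as well as the toy data (which is not the 'Golf' data set) are hypothetical.

```python
import math
from collections import defaultdict


def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def fit(rows, labels):
    """Store the class prior and the raw attribute columns for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        columns = [list(col) for col in zip(*members)]
        model[label] = (prior, columns)
    return model


def predict(model, row, bandwidth):
    """Pick the label maximising log P(y) + sum_i log f_i(x_i | y)."""
    scores = {}
    for label, (prior, columns) in model.items():
        log_score = math.log(prior)
        for x, column in zip(row, columns):
            density = kernel_density(x, column, bandwidth)
            # Clamp to avoid log(0) when the density underflows.
            log_score += math.log(max(density, 1e-300))
        scores[label] = log_score
    return max(scores, key=scores.get)


# Hypothetical training data with two well-separated classes.
train_rows = [(1.0, 10.0), (1.5, 11.0), (2.0, 9.5),
              (6.0, 20.0), (6.5, 21.0), (7.0, 19.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]
model = fit(train_rows, train_labels)

prediction_low = predict(model, (1.8, 10.2), bandwidth=1.0)
prediction_high = predict(model, (6.4, 20.5), bandwidth=1.0)

# The same unseen point under a reasonable and a far too small bandwidth.
smooth = kernel_density(1.8, [1.0, 1.5, 2.0], bandwidth=1.0)
spiky = kernel_density(1.8, [1.0, 1.5, 2.0], bandwidth=0.01)
print(prediction_low, prediction_high, smooth, spiky)
```

With the very small bandwidth, the estimated density at 1.8 is practically zero even though the point lies between two training values, which shows why the bandwidth deserves attention.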